From the HA Trenches

This is a place to collect some example success stories and failure post-mortems.

CGGVeritas (2009)

CGGVeritas, a global provider of geophysical services and equipment, set up several clusters to provide seismic data to users. They attached 37 TB JBODs to each node of the cluster, giving each node a total of 72 TB of XFS filesystem. Ten of these clusters run Linux-HA version 2.1.3 (the equivalent of Heartbeat 2.99 + Pacemaker 0.6) and export the data over NFS in an active/active setup.

Each node of the clusters has 16 GB of RAM, a 10 Gbit network interface toward the clients and a 4 Gbit HBA for the direct-attached storage. Each cluster serves more than 500 clients. The systems went into production in 2006.

Minor hiccups caused by file system corruption were resolved after a failover and a reboot of the node. Special hint: the admins set up a unique fsid; otherwise the clients might get confused.
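The story does not say how the fsid was configured. One common place to set it is the fsid= option in /etc/exports, so a quick check for missing or duplicated values might look like the sketch below. This is only an illustration under that assumption: the function name, the sample paths and the sample network are hypothetical, and the parser ignores quoted paths and line continuations.

```python
import re
from collections import defaultdict

# Matches an fsid=<value> option inside an export's option list.
FSID_RE = re.compile(r"fsid=([^,)\s]+)")


def find_fsid_problems(exports_text):
    """Return (duplicates, missing) for /etc/exports-style text.

    duplicates: dict mapping an fsid value to the export paths sharing it
    missing:    list of export paths that carry no fsid option at all
    """
    by_fsid = defaultdict(list)
    missing = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path = line.split()[0]  # first field is the exported directory
        fsids = FSID_RE.findall(line)
        if not fsids:
            missing.append(path)
        for value in set(fsids):
            by_fsid[value].append(path)
    duplicates = {v: p for v, p in by_fsid.items() if len(p) > 1}
    return duplicates, missing


# Hypothetical export list, for illustration only.
SAMPLE = """
/data/seis1 10.0.0.0/24(rw,sync,fsid=101)
/data/seis2 10.0.0.0/24(rw,sync,fsid=101)
/data/seis3 10.0.0.0/24(rw,sync)
"""

if __name__ == "__main__":
    dups, missing = find_fsid_problems(SAMPLE)
    print("duplicate fsid values:", dups)
    print("exports without fsid:", missing)
```

In a failover setup the same check would presumably be run against the export definitions of both nodes, since clients keep using the file handles they obtained before the failover.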

Thanks to Sachin Patel for this story.

Heilig-Geist-Hospital, Bingen (2009)

The Heilig-Geist-Hospital in Bingen on the Rhine uses a highly available clustered firewall with state synchronization to separate several internal networks from each other. One of their applications is a PACS (Picture Archiving and Communication System) for their central radiography laboratories. All departments access the data through terminal sessions. When an error occurs, the cluster fails over. Because the firewalls' connection tables are synchronized, users notice only a short delay on the line and can continue working after about 3 seconds.

System: two ordinary PCs running Debian lenny, with Pacemaker and fwbuilder to manage the setup. They use about 20 different VLANs, plus some routing controlled by the cluster. A HOWTO for setting up the HA firewall is available here.

Thanks to Matthias Thiele for this story.

GupShup, Free Group SMS (2009)

GupShup is India’s largest social messaging platform. Based in Mumbai, it is a mobile group SMS service that allows users to create mobile communities and broadcast messages to them. GupShup is growing rapidly, with thousands of groups on topics such as finance, entertainment, lifestyle, health, sports and technology.

The cluster, two Ubuntu 8.04 servers configured with Linux-HA version 2.1.3-2 (the equivalent of Heartbeat 2.99 + Pacemaker 0.6), runs a Shorewall firewall in an active/active configuration. Each node of the cluster has 4 GB of RAM and a 250 GB hard drive, and serves more than 12 million outgoing SMS messages daily at a rate of 150 SMS/sec.

Thanks to Kaushal Shriyan for this story.

GitHub (2012 incidents)

GoCardless (2017)

GoCardless/2017: Postgres, Pacemaker, default-resource-stickiness, partial resource crash (via Adam Spiers on the users ML)

Press

An article in Linux Technical Review offers an overview all the way from Heartbeat to Pacemaker with OpenAIS or Corosync. Note that the article is in German and requires a subscription.

The German-language book "Clusterbau", published by O'Reilly, describes Pacemaker, OpenAIS, Corosync and LVS. It explains how to set up clusters from the basics and includes many useful examples.

Pacemaker was number 6 on ZDNet's list of 10 Open Source Projects Worth Checking Out (Dec 2009).

Miscellaneous links

These are all old, but some may still be of value.

Step-by-step clustering guides from linode.com for Ubuntu, Debian, and Fedora. Guides include basic IP failover, DRBD and web applications.
Tips on avoiding STONITH Death-matches
Setup details for common Pacemaker use cases on Ubuntu
Setup guide for Pacemaker and OpenNebula (DRBD, MySQL, LVM)
Pacemaker project statistics
Using Pacemaker with Lustre
Options for clustering MySQL (slidedeck)
High Availability in 37 Easy Steps (slidedeck) (audio: ogg, mp3) (also available on slideshare.net)
MySQL with Pacemaker (Linbit webinar, requires registration)
RabbitMQ - High Availability with Pacemaker and DRBD
Making OpenNMS highly available with Pacemaker
Nice walkthrough of Xen+DRBD on Debian
Evoluzione dell’alta affidabilità su Linux (An Italian article series from miamammausalinux.org):

Event Timeline

kgaillot moved this document from From the HA Trenches. Jan 4 2024, 1:31 PM

kgaillot edited the content of this document. Jan 22 2024, 1:23 PM

There are a couple of dead links here. I didn't see how to raise an issue about it, so I hope the comment section is a good place:

Step-by-step clustering guides from linode.com
Pacemaker project statistics
MySQL with Pacemaker
RabbitMQ - High Availability with Pacemaker and DRBD
Making OpenNMS highly available with Pacemaker
Nice walkthrough of Xen+DRBD on Debian (server does not reply)
