This is a place to collect some example success stories and failure post-mortems.
== CGGVeritas (2009) ==
[[http://www.cggveritas.com|CGGVeritas]], a global provider of geophysical services and equipment, set up several clusters to provide seismic data to users. They attached 37 TB JBODs to each node of the cluster, for a total of 72 TB of XFS filesystem on each node. Ten of these clusters run Linux-HA version 2.1.3 (the equivalent of Heartbeat 2.99 + Pacemaker 0.6) and export the data over NFS in an active/active setup.
Each node of the clusters has 16 GB of RAM, a 10 Gbit network interface toward the clients, and a 4 Gbit HBA for direct-attached storage. Each cluster serves more than 500 clients. The systems went into production in 2006.
Minor hiccups caused by file system corruption were resolved after a failover and a reboot of the node. Special hint: the admins set a unique fsid on the NFS exports. Otherwise the clients might get confused.
Thanks to Sachin Patel for this story.
== Heilig-Geist-Hospital, Bingen (2009) ==
The [[http://www.heilig-geist-hospital.de|Heilig-Geist-Hospital]] in Bingen on the Rhine uses a highly available clustered firewall with state synchronization to separate several internal networks from each other. One of their applications is a PACS (Picture Archiving and Communication System) for their central radiography laboratories. All departments use a terminal session to access the data. If an error occurs, the cluster fails over. Because the firewalls' connection tables are synchronized, users experience a short delay on the line but can continue working after about 3 seconds.
System: two ordinary PCs running [[http://www.debian.org|Debian]] lenny and Pacemaker, with [[http://www.fwbuilder.org|fwbuilder]] to manage the setup. They use about 20 different VLANs, and some routing is also controlled by the cluster. A HOWTO for setting up the HA firewall is available [[http://www.multinet.de/HAFirewall/HAFirewall.pdf|here]].
Thanks to Matthias Thiele for this story.
== GupShup, Free Group SMS (2009) ==
[[http://www.smsgupshup.com|GupShup]] is India’s largest social messaging platform. Based in Mumbai, it is a mobile group SMS service that allows users to create mobile communities and broadcast messages to them. GupShup is growing rapidly, with thousands of groups on topics such as finance, entertainment, lifestyle, health, sports and technology.
The cluster, two Ubuntu 8.04 servers configured with Linux-HA version 2.1.3-2 (the equivalent of Heartbeat 2.99 + Pacemaker 0.6), runs a Shorewall firewall in an active/active configuration. Each node of the cluster has 4 GB of RAM and a 250 GB hard drive and serves more than 12 million outgoing SMS messages daily at a rate of 150 SMS/sec.
Thanks to Kaushal Shriyan for this story.
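The two throughput figures in the GupShup story can be cross-checked with a little arithmetic. The sketch below assumes the 150 SMS/sec rate is sustained around the clock, which the story does not state; the function name `messages_per_day` is made up for illustration.

```python
# Rough consistency check of the quoted figures.
# Assumption: the per-second rate holds for a full 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60


def messages_per_day(rate_per_second: float) -> int:
    """Return how many messages a sustained per-second rate yields in one day."""
    return int(rate_per_second * SECONDS_PER_DAY)


daily_total = messages_per_day(150)
print(f"{daily_total:,} messages per day")  # prints "12,960,000 messages per day"
```

Under that assumption the rate works out to just under 13 million messages per day, which is in line with the "more than 12 million" figure.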
== GitHub (2012 incidents) ==
* [[https://github.blog/2012-09-14-github-availability-this-week/ | GitHub/2012: multi-role MySQL, pacemaker segfault]]
* [[https://github.blog/2012-12-26-downtime-last-saturday/|GitHub/2012: pacemaker, fence races]]
== GoCardless (2017) ==
* [[https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/|GoCardless/2017: Postgres, pacemaker, default-resource-stickiness, partial resource crash]] (via [[https://lists.clusterlabs.org/pipermail/users/2017-December/014210.html|Adam Spiers on the users ML]])
== Press ==
* [[http://www.linuxtechnicalreview.de/Vorschau/(show)/Themen/High-Availability/Hochverfuegbarkeit-unter-Linux|Linux Technical Review]] has an article that gives an overview all the way from heartbeat to pacemaker with openais or corosync. The article is in German and requires a subscription.
* The German book "[[http://www.oreilly.de/catalog/linuxhacluster2ger/index.html|Clusterbau]]", published by O'Reilly, describes pacemaker, openais, corosync and LVS. It explains how to set up clusters from the basics and includes many useful examples.
* Pacemaker was number 6 on ZDNet's list of [[http://www.zdnetasia.com/10-open-source-projects-worth-checking-out-62059820.htm|10 Open Source Projects Worth Checking out]] (Dec 2009)
== Miscellaneous links ==
These are all old, but some may still be of value.
* [[http://library.linode.com/linux-ha/ | Step-by-step clustering guides from linode.com]] for Ubuntu, Debian, and Fedora. Guides include basic IP failover, DRBD and web applications.
* [[http://ourobengr.com/ha | Tips on avoiding STONITH Death-matches]]
* [[https://wiki.ubuntu.com/ClusterStack/LucidTesting | Setup details for common Pacemaker use cases on Ubuntu]]
* [[http://blog.opennebula.org/?p=1523 | Setup guide for Pacemaker and OpenNebula (DRBD, MySQL, LVM)]]
* [[http://cia.vc/stats/project/pacemaker | Pacemaker project statistics]]
* [[http://wiki.lustre.org/index.php/Using_Pacemaker_with_Lustre | Using Pacemaker with Lustre]]
* [[http://www.krisbuytaert.be/presentations/MySQL-PaceMaker.odp | Options for clustering MySQL (slidedeck)]]
* [[http://ourobengr.com/high-availability-in-37-easy-steps.odp | High Availability in 37 Easy Steps (slidedeck)]] (audio: [[http://ourobengr.com/high-availability-in-37-easy-steps.ogg | ogg]], [[http://ourobengr.com/high-availability-in-37-easy-steps.mp3 | mp3]]) (also available [[http://www.slideshare.net/tserong/high-availability-in-37-easy-steps | on slideshare.net]])
* [[https://linbit.webex.com/linbit-en/lsr.php?AT=pb&SP=EC&rID=7851587&rKey=3F2411374F8FC107 | MySQL with Pacemaker]] (Linbit webinar, requires registration)
* [[http://www.rabbitmq.com/pacemaker.html | RabbitMQ - High Availability with Pacemaker and DRBD]]
* [[http://opennms.org/wiki/Making_OpenNMS_highly_available | Making OpenNMS highly available with Pacemaker]]
* [[http://publications.jbfavre.org/virtualisation/cluster-xen-corosync-pacemaker-drbd-ocfs2.en.xhtml | Nice walkthrough of Xen+DRBD on Debian]]
* Evoluzione dell’alta affidabilità su Linux (An Italian article series from [[http://www.miamammausalinux.org | miamammausalinux.org]]):
** [[http://www.miamammausalinux.org/2010/04/evoluzione-dellalta-affidabilita-su-linux-come-orientarsi-fra-hertbeat-pacemaker-openais-e-corosync/ | Come orientarsi fra Hertbeat, Pacemaker, OpenAIS e Corosync]]
** [[http://www.miamammausalinux.org/2010/06/evoluzione-dellalta-affidabilita-su-linux-confronto-pratico-tra-heartbeat-classico-ed-heartbeat-con-pacemaker/ | Confronto pratico tra Heartbeat classico ed Heartbeat con Pacemaker]]
** [[http://www.miamammausalinux.org/2010/09/evoluzione-dellalta-affidabilita-su-linux-realizzare-un-nas-con-pacemaker-drbd-ed-exportfs/ | Realizzare un NAS con Pacemaker, DRBD ed exportfs]]