Users of the services provided by the cluster require an unchanging address with which to access it. Additionally, we cloned the address so it will be active on both nodes. An iptables rule (created as part of the resource agent) is used to ensure that each request is only processed by one of the two clone instances. The additional meta options tell the cluster that we want two instances of the clone (one “request bucket” for each node) and that if one node fails, then the remaining node should hold both.
</para>
<screen>
        meta globally-unique="true" clone-max="2" clone-node-max="2"
</screen>
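<para>
The clone definition itself is not shown above; a minimal sketch of what it might look like (the clone name WebIP is illustrative) is:
</para>
<screen>
clone WebIP ClusterIP \
        meta globally-unique="true" clone-max="2" clone-node-max="2"
</screen>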
<note>
<para>
TODO: The RA should check for globally-unique=true when cloned
</para>
</note>
</section>
<section>
<title>Distributed lock manager</title>
<para>
Cluster filesystems like GFS2 require a lock manager. This service starts the daemon that provides user-space applications (such as the GFS2 daemon) with access to the in-kernel lock manager. Since we need it to be available on all nodes in the cluster, we have it cloned.
</para>
<screen>
primitive dlm ocf:pacemaker:controld \
op monitor interval="120s"
clone dlm-clone dlm \
        meta interleave="true"
</screen>
<note>
<para>
TODO: Confirm <literal>interleave</literal> is no longer needed
</para>
</note>
</section>
<section>
<title>GFS control daemon</title>
<para>
GFS2 also needs a user-space/kernel bridge that runs on every node. So here we have another clone; however, this time we must also specify that it can only run on machines that are also running the DLM (colocation constraint) and that it can only be started after the DLM is running (order constraint). Additionally, the gfs-control clone should only care about the DLM instances it is paired with, so we need to set the interleave option.
</para>
<screen>
primitive gfs-control ocf:pacemaker:controld \
        params daemon="gfs_controld.pcmk" args="-g 0" \
op monitor interval="120s"
clone gfs-clone gfs-control \
meta interleave="true"
colocation gfs-with-dlm inf: gfs-clone dlm-clone
order start-gfs-after-dlm inf: dlm-clone gfs-clone
</screen>
</section>
<section>
<title>DRBD - Shared Storage</title>
<para>
Here we define the DRBD service and specify which DRBD resource (from drbd.conf) it should manage. We make it a master/slave resource and, in order to have an active/active setup, allow both instances to be promoted by specifying master-max=2. We also set the notify option so that the cluster will tell the DRBD agent when its peer changes state.
</para>
<screen>
primitive WebData ocf:linbit:drbd \
params drbd_resource="wwwdata" \
op monitor interval="60s"
ms WebDataClone WebData \
meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
</screen>
</section>
<section>
<title>Cluster Filesystem</title>
<para>
The cluster filesystem ensures that files are read and written correctly. We need to specify the block device (provided by DRBD), where we want it mounted and that we are using GFS2. Again it is a clone because it is intended to be active on both nodes. The additional constraints ensure that it can only be started on nodes with active gfs-control and drbd instances.
</para>
<screen>
order WebFS-after-WebData inf: WebDataClone:promote WebFSClone:start
order start-WebFS-after-gfs-control inf: gfs-clone WebFSClone
</screen>
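<para>
The filesystem definition, its clone and the colocation constraints described above are not shown in the listing; a sketch of what they might look like (the primitive name WebFS and the mount point are assumptions, while the device comes from the drbd.conf shown later) is:
</para>
<screen>
primitive WebFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/var/www/html" fstype="gfs2" \
        op monitor interval="120s"
clone WebFSClone WebFS
colocation WebFS-with-gfs-control inf: WebFSClone gfs-clone
colocation fs_on_drbd inf: WebFSClone WebDataClone:Master
</screen>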
</section>
<section>
<title>Apache</title>
<para>
Lastly we have the actual service, Apache. We need only tell the cluster where to find its main configuration file and restrict it to running on nodes that have the required filesystem mounted and the IP address active.
</para>
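<para>
The corresponding configuration is not shown here; a sketch of what it might look like (the constraint names are illustrative, and the configuration file path matches the one used elsewhere in this guide) is:
</para>
<screen>
primitive WebSite ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="1min"
colocation website-with-ip inf: WebSite ClusterIP
colocation WebSite-with-WebFS inf: WebSite WebFSClone
order WebSite-after-WebFS inf: WebFSClone WebSite
</screen>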
</section>
</chapter>
<chapter>
<title>Creating an Active/Passive Cluster</title>
<section>
<title>Exploring the Existing Configuration</title>
<para>
When Pacemaker starts up, it automatically records the number and details of the nodes in the cluster as well as which stack is being used and the version of Pacemaker being used.
</para>
<para>
This is what the base configuration should look like. Before we make any changes, it is a good idea to check the validity of the configuration with crm_verify.
</para>
<screen>
crm_verify[2195]: 2009/08/27_16:57:12 ERROR: unpack_resources: <emphasis>Resource start-up disabled since no STONITH resources have been defined</emphasis>
crm_verify[2195]: 2009/08/27_16:57:12 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[2195]: 2009/08/27_16:57:12 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
<emphasis>Errors found during check: config not valid</emphasis>
-V may provide more details
[root@pcmk-1 ~]#
</screen>
<para>
As you can see, the tool has found some errors.
</para>
<para>
In order to guarantee the safety of your data <footnote>
<para>
If the data is corrupt, there is little point in continuing to make it available
</para>
</footnote> , Pacemaker ships with STONITH <footnote>
<para>
A common node fencing mechanism. Used to ensure data integrity by powering off “bad” nodes.
</para>
</footnote> enabled. However it also knows when no STONITH configuration has been supplied and reports this as a problem (since the cluster would not be able to make progress if a situation requiring node fencing arose).
</para>
<para>
For now, we will disable this feature and configure it later in the Configuring STONITH section. It is important to note that the use of STONITH is highly encouraged; turning it off tells the cluster to simply pretend that failed nodes are safely powered off. Some vendors will even refuse to support clusters that have it disabled.
</para>
<para>
To disable STONITH, we set the stonith-enabled cluster option to false.
</para>
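<para>
A sketch of the commands involved (stonith-enabled is the real cluster option; the exact crm shell invocation may vary slightly between versions):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure property stonith-enabled=false</userinput>
[root@pcmk-1 ~]# <userinput>crm_verify -L</userinput>
</screen>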
<para>
With the new cluster option set, the configuration is now valid.
</para>
<warning>
<para>
The use of <literal>stonith-enabled=false</literal> is completely inappropriate for a production cluster.
We use it here to defer the discussion of its configuration which can differ widely from one installation to the next.
See <xref linkend="ch-stonith"/> for information on why STONITH is important and details on how to configure it.
</para>
</warning>
</section>
<section>
<title>Adding a Resource</title>
<para>
The first thing we should do is configure an IP address. Regardless of where the cluster service(s) are running, we need a consistent address to contact them on. Here I will choose and add 192.168.122.101 as the floating address, give it the imaginative name ClusterIP and tell the cluster to check that it’s running every 30 seconds.
</para>
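<para>
The command itself is not shown above; a sketch of what it might look like (the netmask value is an assumption and should match your network) is:
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.122.101" cidr_netmask="32" \
        op monitor interval="30s"</userinput>
</screen>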
<important>
<para>
The chosen address must not be one already associated with a physical node.
</para>
</important>
<para>
The other important piece of information here is ocf:heartbeat:IPaddr2. This tells Pacemaker three things about the resource you want to add. The first field, ocf, is the standard to which the resource script conforms and where to find it. The second field is specific to OCF resources and tells the cluster which namespace to find the resource script in, in this case heartbeat. The last field indicates the name of the resource script.
</para>
<para>
To obtain a list of the available resource classes, run
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm ra classes</userinput>
heartbeat
lsb
<emphasis>ocf / heartbeat pacemaker</emphasis>
stonith
</screen>
<para>
To then find all the OCF resource agents provided by Pacemaker and Heartbeat, run
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm ra list ocf pacemaker</userinput>
</screen>
<para>
There are three things to notice about the cluster’s current state. The first is that, as expected, pcmk-1 is now offline. However, we can also see that ClusterIP isn’t running anywhere!
</para>
<section>
<title>Quorum and Two-Node Clusters</title>
<para>
This is because the cluster no longer has quorum, as can be seen by the text “partition WITHOUT quorum” (emphasised green) in the output above. In order to reduce the possibility of data corruption, Pacemaker’s default behavior is to stop all resources if the cluster does not have quorum.
</para>
<para>
A cluster is said to have quorum when more than half the known or expected nodes are online, or for the mathematically inclined, whenever the following equation is true:
</para>
<para>
total_nodes < 2 * active_nodes
</para>
<para>
Therefore a two-node cluster only has quorum when both nodes are running, which is no longer the case for our cluster. This would normally make the creation of a two-node cluster pointless<footnote>
<para>
Actually some would argue that two-node clusters are always pointless, but that is an argument for another time.
</para>
</footnote>; however, it is possible to control how Pacemaker behaves when quorum is lost. In particular, we can tell the cluster to simply ignore quorum altogether.
</para>
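<para>
A sketch of how this is done (no-quorum-policy is the real cluster option; setting it to ignore tells Pacemaker to keep resources running even without quorum):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure property no-quorum-policy=ignore</userinput>
</screen>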
<para>
Here we see something that some may consider surprising: the IP is back running at its original location!
</para>
</section>
<section>
<title>Prevent Resources from Moving after Recovery</title>
<para>
In some circumstances it is highly desirable to prevent healthy resources from being moved around the cluster. Moving resources almost always requires a period of downtime and, for complex services like Oracle databases, this period can be quite long.
</para>
<para>
To address this, Pacemaker has the concept of resource stickiness which controls how much a service prefers to stay running where it is. You may like to think of it as the “cost” of any downtime. By default, Pacemaker assumes there is zero cost associated with moving resources and will do so to achieve “optimal<footnote>
<para>
It should be noted that Pacemaker’s definition of optimal may not always agree with that of a human. The order in which Pacemaker processes lists of resources and nodes creates implicit preferences (required in order to create a stable solution) in situations where the administrator has not explicitly specified any.
</para>
</footnote>” resource placement. We can specify a different stickiness for every resource, but it is often sufficient to change the default.
</para>
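<para>
A sketch of how the default might be changed (the value 100 is an arbitrary example; any positive value expresses a preference for staying put):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure rsc_defaults resource-stickiness=100</userinput>
</screen>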
</section>
</section>
</chapter>
<chapter>
<title>Apache - Adding More Services</title>
<note>
<para>
Now that we have a basic but functional active/passive two-node cluster, we’re ready to add some real services. We’re going to start with Apache because it’s a feature of many clusters and relatively simple to configure.
</para>
</note>
<section>
<title>Installation</title>
<para>
Before continuing, we need to make sure Apache is installed on <emphasis>both</emphasis> hosts.
For the moment, we will simplify things by serving up only a static site and manually synchronizing the data between the two nodes. So run the command again on pcmk-2.
</para>
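<para>
The commands referred to are not shown here; a sketch of what they might look like on a Fedora-based node (the package name httpd and the document root are assumptions) is:
</para>
<screen>
[root@pcmk-1 ~]# <userinput>yum install -y httpd</userinput>
[root@pcmk-1 ~]# <userinput>echo "My Test Site - pcmk-1" > /var/www/html/index.html</userinput>
</screen>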
<para>
In order to monitor the health of your Apache instance, and recover it if it fails, the resource agent used by Pacemaker assumes the server-status URL is available.
Look for the following in /etc/httpd/conf/httpd.conf and make sure it is not disabled or commented out:
</para>
<screen>
<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>
</screen>
</section>
<section>
<title>Update the Configuration</title>
<para>
At this point, Apache is ready to go; all that needs to be done is to add it to the cluster. Let’s call the resource WebSite. We need to use an OCF script called apache in the heartbeat namespace <footnote>
<para>
Compare the key used here ocf:heartbeat:apache with the one we used earlier for the IP address: ocf:heartbeat:IPaddr2
</para>
</footnote>. The only required parameter is the path to the main Apache configuration file, and we’ll tell the cluster to check once a minute that Apache is still running.
</para>
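<para>
The command used is not shown above; a sketch of what it might look like (the one-minute operation interval matches the check described) is:
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure primitive WebSite ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="1min"</userinput>
</screen>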
<para>
Wait a moment, the WebSite resource isn’t running on the same host as our IP address!
</para>
</section>
<section>
<title>Ensuring Resources Run on the Same Host</title>
<para>
To reduce the load on any one machine, Pacemaker will generally try to spread the configured resources across the cluster nodes. However we can tell the cluster that two resources are related and need to run on the same host (or not at all). Here we instruct the cluster that WebSite can only run on the host that ClusterIP is active on. If ClusterIP is not active anywhere, WebSite will not be permitted to run anywhere.
</para>
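<para>
A sketch of the colocation constraint described (the constraint name is illustrative; the infinite score makes the dependency mandatory):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure colocation website-with-ip inf: WebSite ClusterIP</userinput>
</screen>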
<para>
When Apache starts, it binds to the available IP addresses. It doesn’t know about any addresses we add afterwards, so not only do the two resources need to run on the same node, but we also need to make sure ClusterIP is already active before we start WebSite. We do this by adding an ordering constraint. We need to give it a name (choose something descriptive like apache-after-ip), indicate that it’s mandatory (so that any recovery for ClusterIP will also trigger recovery of WebSite) and list the two resources in the order we need them to start.
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure order apache-after-ip mandatory: ClusterIP WebSite</userinput>
</screen>
<para>
Pacemaker does not rely on any sort of hardware symmetry between nodes, so it may well be that one machine is more powerful than the other. In such cases it makes sense to host the resources there if it is available. To do this we create a location constraint. Again we give it a descriptive name (prefer-pcmk-1), specify the resource we want to run there (WebSite), how badly we’d like it to run there (we’ll use 50 for now, but in a two-node situation almost any value above 0 will do) and the host’s name.
</para>
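<para>
A sketch of the location constraint described (the name prefer-pcmk-1 and the score of 50 come from the text above):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm configure location prefer-pcmk-1 WebSite 50: pcmk-1</userinput>
</screen>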
<para>
Even though we now prefer pcmk-1 over pcmk-2, that preference is (intentionally) less than the resource stickiness (how much we preferred not to have unnecessary downtime).
</para>
<para>
To see the current placement scores, you can use a tool called ptest
</para>
<screen>
ptest -sL
</screen>
<note>
<para>
Include output
</para>
</note>
<para>
There is a way to force them to move though...
</para>
</section>
<section>
<title>Manually Moving Resources Around the Cluster</title>
<para>
There are always times when an administrator needs to override the cluster and force resources to move to a specific location. Underneath, we use location constraints like the one we created above; happily, you don’t need to care. Just provide the name of the resource and the intended location, and we’ll do the rest.
</para>
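<para>
A sketch of the command (WebSite and pcmk-1 stand in for whatever resource and destination you need):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm resource move WebSite pcmk-1</userinput>
</screen>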
<para>
Highlighted is the automated constraint used to move the resources to pcmk-1.
</para>
<section>
<title>Giving Control Back to the Cluster</title>
<para>
Once we’ve finished whatever activity required us to move the resources to pcmk-1 (in our case nothing), we can then allow the cluster to resume normal operation with the unmove command. Since we previously configured a default stickiness, the resources will remain on pcmk-1.
</para>
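<para>
A sketch of the command (again using WebSite as the example resource):
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm resource unmove WebSite</userinput>
</screen>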
<para>
Note that the automated constraint is now gone. If we check the cluster status, we can also see that, as expected, the resources are still active on pcmk-1.
</para>
</section>
</section>
</chapter>
<chapter>
<title>Replicated Storage with DRBD</title>
<para>
Even if you’re serving up static websites, having to manually synchronize the contents of those websites to all the machines in the cluster is not ideal.
For dynamic websites, such as a wiki, it’s not even an option.
Not everyone can afford network-attached storage, but somehow the data needs to be kept in sync.
Enter DRBD, which can be thought of as network-based RAID-1.
See <ulink url="http://www.drbd.org/">http://www.drbd.org/</ulink> for more details.
</para>
<section>
<title>Install the DRBD Packages</title>
<para>
Since its inclusion in the upstream 2.6.33 kernel, everything needed to use DRBD ships with &DISTRO; &DISTRO_VERSION;.
There is no series of commands for building a DRBD configuration, so simply copy the configuration below to /etc/drbd.conf
</para>
<para>
Detailed information on the directives used in this configuration (and other alternatives) is available from <ulink url="http://www.drbd.org/users-guide/ch-configure.html">http://www.drbd.org/users-guide/ch-configure.html</ulink>
</para>
<warning>
<para>
Be sure to use the names and addresses of <emphasis>your</emphasis> nodes if they differ from the ones used in this guide.
</para>
</warning>
<screen>
global {
usage-count yes;
}
common {
protocol C;
}
resource wwwdata {
meta-disk internal;
device /dev/drbd1;
syncer {
verify-alg sha1;
}
net {
allow-two-primaries;
}
<emphasis> on pcmk-1</emphasis> {
disk /dev/mapper/<emphasis>VolGroup</emphasis>-drbd--demo;
</screen>
<para>
Now that DRBD is functioning, we can configure a Filesystem resource to use it. In addition to the filesystem’s definition, we also need to tell the cluster where it can be located (only on the DRBD Primary) and when it is allowed to start (after the Primary was promoted).
</para>
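<para>
The configuration itself is not shown here; a sketch of what it might look like (the resource name WebFS, the mount point and the filesystem type are assumptions, and WebDataClone is assumed to be the name of the DRBD master/slave resource) is:
</para>
<screen>
primitive WebFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/var/www/html" fstype="ext4"
colocation fs_on_drbd inf: WebFS WebDataClone:Master
order WebFS-after-WebData inf: WebDataClone:promote WebFS:start
</screen>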
<para>
Once we’ve done everything we needed to on pcmk-1 (in this case nothing, we just wanted to see the resources move), we can allow the node to be a full cluster member again.
</para>
</section>
</chapter>
<chapter id="ch-stonith">
<title>Configure STONITH</title>
<section>
<title>Why You Need STONITH</title>
<para>
STONITH is an acronym for Shoot-The-Other-Node-In-The-Head and it protects your data from being corrupted by rogue nodes or concurrent access.
</para>
<para>
Just because a node is unresponsive, this doesn’t mean it isn’t accessing your data. The only way to be 100% sure that your data is safe is to use STONITH, so we can be certain that the node is truly offline before allowing the data to be accessed from another node.
</para>
<para>
STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the service elsewhere.
</para>
</section>
<section>
<title>What STONITH Device Should You Use</title>
<para>
It is crucial that the STONITH device allows the cluster to differentiate between a node failure and a network failure.
</para>
<para>
The biggest mistake people make in choosing a STONITH device is to use a remote power switch (such as many onboard IPMI controllers) that shares power with the node it controls. In such cases, the cluster cannot be sure if the node is really offline, or active and suffering from a network fault.
</para>
<para>
Likewise, any device that relies on the machine being active (such as SSH-based “devices” used during testing) is inappropriate.
</para>
</section>
<section>
<title>Configuring STONITH</title>
<orderedlist>
<listitem>
<para>
Find the correct driver: stonith -L
</para>
</listitem>
<listitem>
<para>
Since every device is different, the parameters needed to configure it will vary. To find out the parameters required by the device: stonith -t {type} -n
</para>
</listitem>
</orderedlist>
<para>
Hopefully the developers chose names that make sense; if not, you can query for some additional information by finding an active cluster node and running:
</para>
<screen>lrmadmin -M stonith {type} pacemaker
</screen>
<para>
The output should be XML formatted text containing additional parameter descriptions
</para>
<orderedlist>
<listitem>
<para>
Create a file called stonith.xml containing a primitive resource with a class of stonith, a type of {type} and a parameter for each of the values returned in step 2 (a rough sketch follows this list)
</para>
</listitem>
<listitem>
<para>
Create a clone from the primitive resource if the device can shoot more than one node<emphasis> and supports multiple simultaneous connections</emphasis>.
</para>
</listitem>
<listitem>
<para>
Upload it into the CIB using cibadmin: cibadmin -C -o resources --xml-file stonith.xml
</para>
</listitem>
</orderedlist>
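<para>
As a rough sketch only, a stonith.xml for a hypothetical device might look like the following; the resource id, the {type} placeholder and the parameter name/value pairs are illustrative and must be replaced with the driver and parameters reported by your own device in step 2:
</para>
<screen>
<primitive id="fencing" class="stonith" type="{type}">
 <instance_attributes id="fencing-params">
  <nvpair id="fencing-param-1" name="some_parameter" value="some_value"/>
 </instance_attributes>
 <operations>
  <op id="fencing-monitor" name="monitor" interval="120s"/>
 </operations>
</primitive>
</screen>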
<section>
<title>Example</title>
<para>
Assuming we have an IBM BladeCenter containing our two nodes and the management interface is active on 192.168.122.31, then we would choose the external/ibmrsa driver in step 2 and obtain the following list of parameters.
</para>
</section>
</section>
</chapter>
<chapter>
<title>Using Pacemaker Tools</title>
<para>
In the dark past, configuring Pacemaker required the administrator to read and write XML. In true UNIX style, there were also a number of different commands that specialized in different aspects of querying and updating the cluster.
</para>
<para>
Since Pacemaker 1.0, this has all changed and we have an integrated, scriptable, cluster shell that hides all the messy XML scaffolding. It even allows you to queue up several changes at once and commit them atomically.
</para>
<para>
Take some time to familiarize yourself with what it can do.
</para>
<screen>
Use crm without arguments for an interactive session.
Supply one or more arguments for a "single-shot" use.
Specify with -f a file which contains a script. Use '-' for
standard input or use pipe/redirection.
crm displays cli format configurations using a color scheme
and/or in uppercase. Pick one of "color" or "uppercase", or
use "-D color,uppercase" if you want colorful uppercase.
Get plain output by "-D plain". The default may be set in
user preferences (options).
Examples:
# crm -f stopapp2.cli
# crm < stopapp2.cli
# crm resource stop global_www
# crm status
</screen>
<para>
The primary tool for monitoring the status of the cluster is crm_mon (also available as crm status). It can be run in a variety of modes and has a number of output options. To find out about any of the tools that come with Pacemaker, simply invoke them with the <command>--help</command> option or consult the included man pages. Both sets of output are created from the tool, and so will always be in sync with each other and the tool itself.
</para>
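<para>
For example, to display the cluster status once and then exit, you can run something like:
</para>
<screen>
[root@pcmk-1 ~]# <userinput>crm_mon -1</userinput>
</screen>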
<para>
Additionally, the Pacemaker version and supported cluster stack(s) are available via the <command>--version</command> option.
If the SNMP and/or email options are not listed, then Pacemaker was not built to support them. This may be by the choice of your distribution or the required libraries may not have been available. Please contact whoever supplied you with the packages for more details.