diff --git a/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt b/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt
index 6dc987c24c..04d57cd4d1 100644
--- a/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt
@@ -1,450 +1,451 @@
+:compat-mode: legacy
[appendix]
== Configuration Recap ==
=== Final Cluster Configuration ===
----
[root@pcmk-1 ~]# pcs resource
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 pcmk-2 ]
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started
Clone Set: WebFS-clone [WebFS]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: WebSite-clone [WebSite]
Started: [ pcmk-1 pcmk-2 ]
----
----
[root@pcmk-1 ~]# pcs resource op defaults
timeout: 240s
----
----
[root@pcmk-1 ~]# pcs stonith
 ipmi-fencing	(stonith:fence_ipmilan) Started
----
----
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
promote WebDataClone then start WebFS-clone (kind:Mandatory)
start WebFS-clone then start WebSite-clone (kind:Mandatory)
start dlm-clone then start WebFS-clone (kind:Mandatory)
Colocation Constraints:
WebSite-clone with ClusterIP-clone (score:INFINITY)
WebFS-clone with WebDataClone (score:INFINITY) (with-rsc-role:Master)
WebSite-clone with WebFS-clone (score:INFINITY)
WebFS-clone with dlm-clone (score:INFINITY)
Ticket Constraints:
----
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 12:05:37 2018
Last change: Fri Jan 12 11:49:29 2018
2 nodes configured
11 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
 ipmi-fencing	(stonith:fence_ipmilan): Started pcmk-1
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 pcmk-2 ]
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-2
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-1
Clone Set: WebFS-clone [WebFS]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: WebSite-clone [WebSite]
Started: [ pcmk-1 pcmk-2 ]
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
----
[root@pcmk-1 ~]# pcs cluster cib
----
[source,XML]
----
----
=== Node List ===
----
[root@pcmk-1 ~]# pcs status nodes
Pacemaker Nodes:
Online: pcmk-1 pcmk-2
Standby:
Offline:
----
=== Cluster Options ===
----
[root@pcmk-1 ~]# pcs property
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: mycluster
dc-version: 1.1.16-12.el7_4.5-94ff4df
have-watchdog: false
last-lrm-refresh: 1439569053
stonith-enabled: true
----
The output shows state information automatically obtained about the cluster, including:
* *cluster-infrastructure* - the cluster communications layer in use
* *cluster-name* - the cluster name chosen by the administrator when the cluster was created
* *dc-version* - the version (including upstream source-code hash) of Pacemaker used on the Designated Controller
The output also shows options set by the administrator that control the way the cluster operates, including:
* *stonith-enabled=true* - whether the cluster is allowed to use STONITH resources
=== Resources ===
==== Default Options ====
----
[root@pcmk-1 ~]# pcs resource defaults
resource-stickiness: 100
----
This shows cluster option defaults that apply to every resource that does not
explicitly set the option itself. Above:
* *resource-stickiness* - Specify the aversion to moving healthy resources to other machines
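An individual resource can override this default through its meta attributes; for example (illustrative only, not part of the configuration built in this guide):
----
[root@pcmk-1 ~]# pcs resource meta WebSite resource-stickiness=200
----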
==== Fencing ====
----
[root@pcmk-1 ~]# pcs stonith show
ipmi-fencing (stonith:fence_ipmilan) Started
[root@pcmk-1 ~]# pcs stonith show ipmi-fencing
Resource: ipmi-fencing (class=stonith type=fence_ipmilan)
Attributes: ipaddr="10.0.0.1" login="testuser" passwd="acd123" pcmk_host_list="pcmk-1 pcmk-2"
Operations: monitor interval=60s (fence-monitor-interval-60s)
----
==== Service Address ====
Users of the services provided by the cluster require an unchanging
address with which to access it. Additionally, we cloned the address so
it will be active on both nodes. An iptables rule (created as part of the
resource agent) is used to ensure that each request only gets processed by one
of the two clone instances. The additional meta options tell the cluster
that we want two instances of the clone (one "request bucket" for each
node) and that if one node fails, then the remaining node should hold
both.
----
[root@pcmk-1 ~]# pcs resource show ClusterIP-clone
Clone: ClusterIP-clone
Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.122.120 cidr_netmask=32 clusterip_hash=sourceip
Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
monitor interval=30s (ClusterIP-monitor-interval-30s)
----
==== DRBD - Shared Storage ====
Here, we define the DRBD service and specify which DRBD resource (from
/etc/drbd.d/*.res) it should manage. We make it a promotable clone resource and, in
order to have an active/active setup, allow both instances to be promoted to master
at the same time. We also set the notify option so that the
cluster will tell the DRBD agent when its peer changes state.
----
[root@pcmk-1 ~]# pcs resource show WebDataClone
Master: WebDataClone
Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
Resource: WebData (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=wwwdata
Operations: start interval=0s timeout=240 (WebData-start-timeout-240)
promote interval=0s timeout=90 (WebData-promote-timeout-90)
demote interval=0s timeout=90 (WebData-demote-timeout-90)
stop interval=0s timeout=100 (WebData-stop-timeout-100)
monitor interval=60s (WebData-monitor-interval-60s)
[root@pcmk-1 ~]# pcs constraint ref WebDataClone
Resource: WebDataClone
colocation-WebFS-WebDataClone-INFINITY
order-WebDataClone-WebFS-mandatory
----
==== Cluster Filesystem ====
The cluster filesystem ensures that files are read and written correctly.
We need to specify the block device (provided by DRBD), where we want it
mounted and that we are using GFS2. Again, it is a clone because it is
intended to be active on both nodes. The additional constraints ensure
that it can only be started on nodes with active DLM and DRBD instances.
----
[root@pcmk-1 ~]# pcs resource show WebFS-clone
Clone: WebFS-clone
Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd1 directory=/var/www/html fstype=gfs2
Operations: start interval=0s timeout=60 (WebFS-start-timeout-60)
stop interval=0s timeout=60 (WebFS-stop-timeout-60)
monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
[root@pcmk-1 ~]# pcs constraint ref WebFS-clone
Resource: WebFS-clone
colocation-WebFS-WebDataClone-INFINITY
colocation-WebSite-WebFS-INFINITY
colocation-WebFS-clone-dlm-clone-INFINITY
order-WebDataClone-WebFS-mandatory
order-WebFS-WebSite-mandatory
order-dlm-clone-WebFS-clone-mandatory
----
==== Apache ====
Lastly, we have the actual service, Apache. We need only tell the cluster
where to find its main configuration file and restrict it to running on
nodes that have the required filesystem mounted and the IP address active.
----
[root@pcmk-1 ~]# pcs resource show WebSite-clone
Clone: WebSite-clone
Resource: WebSite (class=ocf provider=heartbeat type=apache)
Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
Operations: start interval=0s timeout=40s (WebSite-start-timeout-40s)
stop interval=0s timeout=60s (WebSite-stop-timeout-60s)
monitor interval=1min (WebSite-monitor-interval-1min)
[root@pcmk-1 ~]# pcs constraint ref WebSite-clone
Resource: WebSite-clone
colocation-WebSite-ClusterIP-INFINITY
colocation-WebSite-WebFS-INFINITY
order-ClusterIP-WebSite-mandatory
order-WebFS-WebSite-mandatory
----
diff --git a/doc/Clusters_from_Scratch/en-US/Ap-Corosync-Conf.txt b/doc/Clusters_from_Scratch/en-US/Ap-Corosync-Conf.txt
index 87f4042a85..a00e9a2e5a 100644
--- a/doc/Clusters_from_Scratch/en-US/Ap-Corosync-Conf.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ap-Corosync-Conf.txt
@@ -1,33 +1,34 @@
+:compat-mode: legacy
[appendix]
[[ap-corosync-conf]]
== Sample Corosync Configuration ==
.Sample +corosync.conf+ for two-node cluster created by `pcs`.
.....
totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}
nodelist {
node {
ring0_addr: pcmk-1
nodeid: 1
}
node {
ring0_addr: pcmk-2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_syslog: yes
}
.....
diff --git a/doc/Clusters_from_Scratch/en-US/Ap-Reading.txt b/doc/Clusters_from_Scratch/en-US/Ap-Reading.txt
index 3b9367418d..eac4ad3b37 100644
--- a/doc/Clusters_from_Scratch/en-US/Ap-Reading.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ap-Reading.txt
@@ -1,12 +1,13 @@
+:compat-mode: legacy
[appendix]
== Further Reading ==
- Project Website
http://www.clusterlabs.org/
- SuSE has a comprehensive guide to cluster commands (though using the `crmsh` command-line
shell rather than `pcs`) at:
https://www.suse.com/documentation/sle_ha/book_sleha/data/book_sleha.html
- Corosync
http://www.corosync.org/
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt b/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt
index deecca3b43..a88643e887 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt
@@ -1,374 +1,375 @@
+:compat-mode: legacy
= Convert Cluster to Active/Active =
The primary requirement for an Active/Active cluster is that the data
required for your services is available, simultaneously, on both
machines. Pacemaker makes no requirement on how this is achieved; you
could use a SAN if you had one available, but since DRBD supports
multiple Primaries, we can continue to use it here.
== Install Cluster Filesystem Software ==
The only hitch is that we need to use a cluster-aware filesystem. The
one we used earlier with DRBD, xfs, is not one of those. Both OCFS2
and GFS2 are supported; here, we will use GFS2.
On both nodes, install the GFS2 command-line utilities and the
Distributed Lock Manager (DLM) required by cluster filesystems:
----
# yum install -y gfs2-utils dlm
----
== Configure the Cluster for the DLM ==
The DLM needs to run on both nodes, so we'll start by creating a resource for
it (using the *ocf:pacemaker:controld* resource script), and clone it:
----
[root@pcmk-1 ~]# pcs cluster cib dlm_cfg
[root@pcmk-1 ~]# pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor interval=60s
[root@pcmk-1 ~]# pcs -f dlm_cfg resource clone dlm clone-max=2 clone-node-max=1
[root@pcmk-1 ~]# pcs -f dlm_cfg resource show
ClusterIP (ocf::heartbeat:IPaddr2): Started
WebSite (ocf::heartbeat:apache): Started
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-2 ]
Slaves: [ pcmk-1 ]
WebFS (ocf::heartbeat:Filesystem): Started
Clone Set: dlm-clone [dlm]
Stopped: [ pcmk-1 pcmk-2 ]
----
Activate our new configuration, and see how the cluster responds:
----
[root@pcmk-1 ~]# pcs cluster cib-push dlm_cfg
CIB updated
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 11:19:36 2018
Last change: Fri Jan 12 11:19:28 2018
2 nodes configured
8 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-2 ]
Slaves: [ pcmk-1 ]
WebFS (ocf::heartbeat:Filesystem): Started pcmk-2
ipmi-fencing (stonith:fence_ipmilan): Started pcmk-1
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
[[GFS2_prep]]
== Create and Populate GFS2 Filesystem ==
Before we do anything to the existing partition, we need to make sure it
is unmounted. We do this by telling the cluster to stop the WebFS resource.
This will ensure that other resources (in our case, Apache) using WebFS
are not only stopped, but stopped in the correct order.
----
[root@pcmk-1 ~]# pcs resource disable WebFS
[root@pcmk-1 ~]# pcs resource
ClusterIP (ocf::heartbeat:IPaddr2): Started
WebSite (ocf::heartbeat:apache): Stopped
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-2 ]
Slaves: [ pcmk-1 ]
WebFS (ocf::heartbeat:Filesystem): Stopped
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
----
You can see that both Apache and WebFS have been stopped,
and that *pcmk-2* is the current master for the DRBD device.
Now we can create a new GFS2 filesystem on the DRBD device.
[WARNING]
=========
This will erase all previous content stored on the DRBD device. Ensure
you have a copy of any important data.
=========
[IMPORTANT]
===========
Run the next command on whichever node has the DRBD Primary role.
Otherwise, you will receive the message:
-----
/dev/drbd1: Read-only file system
-----
===========
-----
[root@pcmk-2 ~]# mkfs.gfs2 -p lock_dlm -j 2 -t mycluster:web /dev/drbd1
It appears to contain an existing filesystem (xfs)
This will destroy any data on /dev/drbd1
Are you sure you want to proceed? [y/n]y
Device: /dev/drbd1
Block size: 4096
Device size: 1.00 GB (262127 blocks)
Filesystem size: 1.00 GB (262126 blocks)
Journals: 2
Resource groups: 5
Locking protocol: "lock_dlm"
Lock table: "mycluster:web"
UUID: 9a72c488-d8a7-24c9-ceee-add7a8ca52c2
-----
The `mkfs.gfs2` command required a number of additional parameters:
* `-p lock_dlm` specifies that we want to use the
kernel's DLM.
* `-j 2` indicates that the filesystem should reserve enough
space for two journals (one for each node that will access the filesystem).
* `-t mycluster:web` specifies the lock table name. The format for
this field is +pass:[clustername:fsname]+. For
+pass:[clustername]+, we need to use the same
value we specified originally with `pcs cluster setup --name` (which is also
the value of *cluster_name* in +/etc/corosync/corosync.conf+).
If you are unsure what your cluster name is, you can look in
+/etc/corosync/corosync.conf+ or execute the command
`pcs cluster corosync pcmk-1 | grep cluster_name`.
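For example, with the configuration used in this guide, checking the local file directly should show something like:
----
[root@pcmk-1 ~]# grep cluster_name /etc/corosync/corosync.conf
cluster_name: mycluster
----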
Now we can (re-)populate the new filesystem with data
(web pages). We'll create yet another variation on our home page.
-----
[root@pcmk-2 ~]# mount /dev/drbd1 /mnt
[root@pcmk-2 ~]# cat <<-END >/mnt/index.html
 <html>
 <body>My Test Site - GFS2</body>
 </html>
END
[root@pcmk-2 ~]# chcon -R --reference=/var/www/html /mnt
[root@pcmk-2 ~]# umount /dev/drbd1
[root@pcmk-2 ~]# drbdadm verify wwwdata
-----
== Reconfigure the Cluster for GFS2 ==
With the WebFS resource stopped, let's update the configuration.
----
[root@pcmk-1 ~]# pcs resource show WebFS
Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd1 directory=/var/www/html fstype=xfs
Meta Attrs: target-role=Stopped
Operations: start interval=0s timeout=60 (WebFS-start-timeout-60)
stop interval=0s timeout=60 (WebFS-stop-timeout-60)
monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
----
The fstype option needs to be updated to *gfs2* instead of *xfs*.
----
[root@pcmk-1 ~]# pcs resource update WebFS fstype=gfs2
[root@pcmk-1 ~]# pcs resource show WebFS
Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd1 directory=/var/www/html fstype=gfs2
Meta Attrs: target-role=Stopped
Operations: start interval=0s timeout=60 (WebFS-start-timeout-60)
stop interval=0s timeout=60 (WebFS-stop-timeout-60)
monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
----
GFS2 requires that DLM be running, so we also need to set up new colocation
and ordering constraints for it:
----
[root@pcmk-1 ~]# pcs constraint colocation add WebFS with dlm-clone INFINITY
[root@pcmk-1 ~]# pcs constraint order dlm-clone then WebFS
Adding dlm-clone WebFS (kind: Mandatory) (Options: first-action=start then-action=start)
----
== Clone the IP address ==
There's no point making the services active on both locations if we can't
reach them both, so let's clone the IP address.
The *IPaddr2* resource agent has built-in intelligence for when it is configured
as a clone. It will utilize a multicast MAC address to have the local switch
send the relevant packets to all nodes in the cluster, together with *iptables
clusterip* rules on the nodes so that any given packet will be grabbed by
exactly one node. This will give us a simple but effective form of
load-balancing requests between our two nodes.
Let's start a new config, and clone our IP:
----
[root@pcmk-1 ~]# pcs cluster cib loadbalance_cfg
[root@pcmk-1 ~]# pcs -f loadbalance_cfg resource clone ClusterIP \
clone-max=2 clone-node-max=2 globally-unique=true
----
* `clone-max=2` tells the resource agent to split packets this many ways. This
should equal the number of nodes that can host the IP.
* `clone-node-max=2` says that one node can run up to 2 instances
of the clone. This should also equal the number of nodes that can
host the IP, so that if any node goes down, another node can take over
the failed node's "request bucket". Otherwise, requests intended for
the failed node would be discarded.
* `globally-unique=true` tells the cluster that one clone isn't identical
to another (each handles a different "bucket"). This also tells the resource
agent to insert *iptables* rules so each host only processes packets in its
bucket(s).
Notice that when the ClusterIP becomes a clone, the constraints
referencing ClusterIP now reference the clone. This is
done automatically by pcs.
----
[root@pcmk-1 ~]# pcs -f loadbalance_cfg constraint
Location Constraints:
Ordering Constraints:
start ClusterIP-clone then start WebSite (kind:Mandatory)
promote WebDataClone then start WebFS (kind:Mandatory)
start WebFS then start WebSite (kind:Mandatory)
start dlm-clone then start WebFS (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP-clone (score:INFINITY)
WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master)
WebSite with WebFS (score:INFINITY)
WebFS with dlm-clone (score:INFINITY)
Ticket Constraints:
----
Now we must tell the resource how to decide which requests are
processed by which hosts. To do this, we specify the *clusterip_hash* parameter.
The value of *sourceip* means that the source IP address of incoming packets
will be hashed; each node will process a certain range of hashes.
----
[root@pcmk-1 ~]# pcs -f loadbalance_cfg resource update ClusterIP clusterip_hash=sourceip
----
Load our configuration to the cluster, and see how it responds.
-----
[root@pcmk-1 ~]# pcs cluster cib-push loadbalance_cfg
CIB updated
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 11:32:07 2018
Last change: Fri Jan 12 11:32:04 2018
2 nodes configured
9 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
WebSite (ocf::heartbeat:apache): Stopped
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 ]
Slaves: [ pcmk-2 ]
WebFS (ocf::heartbeat:Filesystem): Stopped
ipmi-fencing (stonith:fence_ipmilan): Started pcmk-1
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-1
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
If desired, you can demonstrate that all request buckets are working
by using a tool such as `arping` from several source hosts
to see which host responds to each.
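You can also confirm that the *IPaddr2* agent installed its *CLUSTERIP* iptables rule on each node (a quick sanity check; the exact rule details will vary):
----
[root@pcmk-1 ~]# iptables -n -L INPUT | grep CLUSTERIP
----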
== Clone the Filesystem and Apache Resources ==
Now that we have a cluster filesystem ready to go,
and our nodes can load-balance requests to a shared IP address,
we can configure the cluster so both nodes mount the filesystem
and respond to web requests.
Clone the filesystem and Apache resources in a new configuration.
Notice how pcs automatically updates the relevant constraints again.
----
[root@pcmk-1 ~]# pcs cluster cib active_cfg
[root@pcmk-1 ~]# pcs -f active_cfg resource clone WebFS
[root@pcmk-1 ~]# pcs -f active_cfg resource clone WebSite
[root@pcmk-1 ~]# pcs -f active_cfg constraint
Location Constraints:
Ordering Constraints:
start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
promote WebDataClone then start WebFS-clone (kind:Mandatory)
start WebFS-clone then start WebSite-clone (kind:Mandatory)
start dlm-clone then start WebFS-clone (kind:Mandatory)
Colocation Constraints:
WebSite-clone with ClusterIP-clone (score:INFINITY)
WebFS-clone with WebDataClone (score:INFINITY) (with-rsc-role:Master)
WebSite-clone with WebFS-clone (score:INFINITY)
WebFS-clone with dlm-clone (score:INFINITY)
Ticket Constraints:
----
Tell the cluster that it is now allowed to promote both instances to be DRBD
Primary (aka. master).
-----
[root@pcmk-1 ~]# pcs -f active_cfg resource update WebDataClone master-max=2
-----
Finally, load our configuration to the cluster, and re-enable the WebFS resource
(which we disabled earlier).
-----
[root@pcmk-1 ~]# pcs cluster cib-push active_cfg
CIB updated
[root@pcmk-1 ~]# pcs resource enable WebFS
-----
After all the processes are started, the status should look similar to this.
-----
[root@pcmk-1 ~]# pcs resource
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 pcmk-2 ]
Clone Set: dlm-clone [dlm]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: ClusterIP-clone [ClusterIP] (unique)
ClusterIP:0 (ocf::heartbeat:IPaddr2): Started
ClusterIP:1 (ocf::heartbeat:IPaddr2): Started
Clone Set: WebFS-clone [WebFS]
Started: [ pcmk-1 pcmk-2 ]
Clone Set: WebSite-clone [WebSite]
Started: [ pcmk-1 pcmk-2 ]
-----
== Test Failover ==
Testing failover is left as an exercise for the reader.
For example, you can put one node into standby mode,
use `pcs status` to confirm that its ClusterIP clone was
moved to the other node, and use `arping` to verify that
packets are not being lost from any source host.
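One possible sequence is sketched below (assuming `pcs cluster standby` is available in your pcs version; newer versions use `pcs node standby` instead). Run `arping` against 192.168.122.120 from a separate client host before and after the standby to confirm no requests are lost.
----
[root@pcmk-1 ~]# pcs cluster standby pcmk-2
[root@pcmk-1 ~]# pcs status
[root@pcmk-1 ~]# pcs cluster unstandby pcmk-2
----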
[NOTE]
====
You may find that when a failed node rejoins the cluster,
both ClusterIP clones stay on one node, due to the
resource stickiness. While this works fine, it effectively eliminates
load-balancing and returns the cluster to an active-passive setup again.
You can avoid this by disabling stickiness for the IP address resource:
----
[root@pcmk-1 ~]# pcs resource meta ClusterIP resource-stickiness=0
----
====
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt b/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt
index bb3586ab7d..31e9eac2ef 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt
@@ -1,391 +1,392 @@
+:compat-mode: legacy
= Create an Active/Passive Cluster =
== Explore the Existing Configuration ==
When Pacemaker starts up, it automatically records the number and details
of the nodes in the cluster, as well as which stack is being used and the
version of Pacemaker being used.
The first few lines of output should look like this:
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 16:15:29 2018
Last change: Fri Jan 12 15:49:47 2018
2 nodes configured
0 resources configured
Online: [ pcmk-1 pcmk-2 ]
----
For those who are not afraid of XML, you can see the raw cluster
configuration and status by using the `pcs cluster cib` command.
.The last XML you'll see in this document
======
----
[root@pcmk-1 ~]# pcs cluster cib
----
[source,XML]
----
----
======
Before we make any changes, it's a good idea to check the validity of
the configuration.
----
[root@pcmk-1 ~]# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
----
As you can see, the tool has found some errors.
In order to guarantee the safety of your data,
footnote:[If the data is corrupt, there is little point in continuing to make it available]
the default for STONITH
footnote:[A common node fencing mechanism. Used to ensure data integrity by powering off "bad" nodes]
in Pacemaker is *enabled*. However, it also knows when no STONITH configuration has been
supplied and reports this as a problem (since the cluster would not be
able to make progress if a situation requiring node fencing arose).
We will disable this feature for now and configure it later.
To disable STONITH, set the *stonith-enabled* cluster option to
false:
----
[root@pcmk-1 ~]# pcs property set stonith-enabled=false
[root@pcmk-1 ~]# crm_verify -L
----
With the new cluster option set, the configuration is now valid.
[WARNING]
=========
The use of `stonith-enabled=false` is completely inappropriate for a
production cluster. It tells the cluster to simply pretend that failed nodes
are safely powered off. Some vendors will refuse to support clusters that have
STONITH disabled.
We disable STONITH here only to defer the discussion of its
configuration, which can differ widely from one installation to the
next. See <<_what_is_stonith>> for information on why STONITH is important
and details on how to configure it.
=========
== Add a Resource ==
Our first resource will be a unique IP address that the cluster can bring up on
either node. Regardless of where any cluster service(s) are running, end
users need a consistent address to contact them on. Here, I will choose
192.168.122.120 as the floating address, give it the imaginative name ClusterIP
and tell the cluster to check whether it is running every 30 seconds.
[WARNING]
===========
The chosen address must not already be in use on the network.
Do not reuse an IP address one of the nodes already has configured.
===========
----
[root@pcmk-1 ~]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.122.120 cidr_netmask=32 op monitor interval=30s
----
Another important piece of information here is *ocf:heartbeat:IPaddr2*.
This tells Pacemaker three things about the resource you want to add:
* The first field (*ocf* in this case) is the standard to which the resource
script conforms and where to find it.
* The second field (*heartbeat* in this case) is standard-specific; for OCF
resources, it tells the cluster which OCF namespace the resource script is in.
* The third field (*IPaddr2* in this case) is the name of the resource script.
To obtain a list of the available resource standards (the *ocf* part of
*ocf:heartbeat:IPaddr2*), run:
----
[root@pcmk-1 ~]# pcs resource standards
lsb
ocf
service
systemd
----
To obtain a list of the available OCF resource providers (the *heartbeat*
part of *ocf:heartbeat:IPaddr2*), run:
----
[root@pcmk-1 ~]# pcs resource providers
heartbeat
openstack
pacemaker
----
Finally, if you want to see all the resource agents available for
a specific OCF provider (the *IPaddr2* part of *ocf:heartbeat:IPaddr2*), run:
----
[root@pcmk-1 ~]# pcs resource agents ocf:heartbeat
apache
clvm
conntrackd
CTDB
db2
Delay
.
. (skipping lots of resources to save space)
.
symlink
tomcat
VirtualDomain
Xinetd
----
Now, verify that the IP resource has been added, and display the cluster's
status to see that it is now active:
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 17:44:40 2018
Last change: Fri Jan 12 17:44:26 2018
2 nodes configured
1 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
== Perform a Failover ==
Since our ultimate goal is high availability, we should test failover of
our new resource before moving on.
First, find the node on which the IP address is running.
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 17:44:40 2018
Last change: Fri Jan 12 17:44:26 2018
2 nodes configured
1 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
----
You can see that the status of the *ClusterIP* resource
is *Started* on a particular node (in this example, *pcmk-1*).
Shut down Pacemaker and Corosync on that machine to trigger a failover.
----
[root@pcmk-1 ~]# pcs cluster stop pcmk-1
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
----
[NOTE]
======
A cluster command such as +pcs cluster stop pass:[nodename]+ can be run
from any node in the cluster, not just the affected node.
======
Verify that pacemaker and corosync are no longer running:
----
[root@pcmk-1 ~]# pcs status
Error: cluster is not currently running on this node
----
Go to the other node, and check the cluster status.
----
[root@pcmk-2 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 18:30:56 2018
Last change: Fri Jan 12 17:44:26 2018
2 nodes configured
1 resources configured
Online: [ pcmk-2 ]
OFFLINE: [ pcmk-1 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
Notice that *pcmk-1* is *OFFLINE* for cluster purposes (its *pcsd* is still
active, allowing it to receive `pcs` commands, but it is not participating in
the cluster).
Also notice that *ClusterIP* is now running on *pcmk-2* -- failover happened
automatically, and no errors are reported.
[IMPORTANT]
.Quorum
====
If a cluster splits into two (or more) groups of nodes that can no longer
communicate with each other (aka. _partitions_), _quorum_ is used to prevent
resources from starting on more nodes than desired, which would risk
data corruption.
A cluster has quorum when more than half of all known nodes are online in
the same partition, or for the mathematically inclined, whenever the following
equation is true:
....
total_nodes < 2 * active_nodes
....
For example, if a 5-node cluster split into 3- and 2-node partitions,
the 3-node partition would have quorum and could continue serving resources.
If a 6-node cluster split into two 3-node partitions, neither partition
would have quorum; pacemaker's default behavior in such cases is to
stop all resources, in order to prevent data corruption.
Two-node clusters are a special case. By the above definition,
a two-node cluster would only have quorum when both nodes are
running. This would make the creation of a two-node cluster pointless,
footnote:[Some would argue that two-node clusters are always pointless, but that is an argument for another time]
but corosync has the ability to treat two-node clusters as if only one node
is required for quorum.
The `pcs cluster setup` command will automatically configure *two_node: 1*
in +corosync.conf+, so a two-node cluster will "just work".
If you are using a different cluster shell, you will have to configure
+corosync.conf+ appropriately yourself.
====
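For reference, the +quorum+ stanza that `pcs cluster setup` generates in +/etc/corosync/corosync.conf+ (also shown in <<ap-corosync-conf>>) looks like this:
....
quorum {
    provider: corosync_votequorum
    two_node: 1
}
....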
Now, simulate node recovery by restarting the cluster stack on *pcmk-1*, and
check the cluster's status. (It may take a little while before the cluster
gets going on the node, but it eventually will look like the below.)
----
[root@pcmk-1 ~]# pcs cluster start pcmk-1
pcmk-1: Starting Cluster...
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 18:50:11 2018
Last change: Fri Jan 12 17:44:26 2018
2 nodes configured
1 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
== Prevent Resources from Moving after Recovery ==
In most circumstances, it is highly desirable to prevent healthy
resources from being moved around the cluster. Moving resources almost
always requires a period of downtime. For complex services such as
databases, this period can be quite long.
To address this, Pacemaker has the concept of resource _stickiness_,
which controls how strongly a service prefers to stay running where it
is. You may like to think of it as the "cost" of any downtime. By
default, Pacemaker assumes there is zero cost associated with moving
resources and will do so to achieve "optimal"
footnote:[Pacemaker's definition of optimal may not always agree with that of a
human. The order in which Pacemaker processes lists of resources and nodes
creates implicit preferences in situations where the administrator has not
explicitly specified them.]
resource placement. We can specify a different stickiness for every
resource, but it is often sufficient to change the default.
----
[root@pcmk-1 ~]# pcs resource defaults resource-stickiness=100
[root@pcmk-1 ~]# pcs resource defaults
resource-stickiness: 100
----
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt b/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt
index f460015de3..5d73526b83 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt
@@ -1,415 +1,416 @@
+:compat-mode: legacy
= Add Apache HTTP Server as a Cluster Service =
indexterm:[Apache HTTP Server]
Now that we have a basic but functional active/passive two-node cluster,
we're ready to add some real services. We're going to start with
Apache HTTP Server because it is a feature of many clusters and relatively
simple to configure.
== Install Apache ==
Before continuing, we need to make sure Apache is installed on both
hosts. We also need the wget tool in order for the cluster to be able to check
the status of the Apache server.
----
# yum install -y httpd wget
# firewall-cmd --permanent --add-service=http
# firewall-cmd --reload
----
[IMPORTANT]
====
Do *not* enable the httpd service. Services that are intended to
be managed via the cluster software should never be managed by the OS.
It is often useful, however, to manually start the service, verify that
it works, then stop it again, before adding it to the cluster. This
allows you to resolve any non-cluster-related problems before continuing.
Since this is a simple example, we'll skip that step here.
====
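If you do choose to perform that manual check, a minimal sketch would be (run on each node; `httpd` should remain disabled afterwards):
----
# systemctl start httpd
# systemctl status httpd
# systemctl stop httpd
# systemctl is-enabled httpd
----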
== Create Website Documents ==
We need to create a page for Apache to serve. On &DISTRO; &DISTRO_VERSION;, the
default Apache document root is /var/www/html, so we'll create an index file
there. For the moment, we will simplify things by serving a static site
and manually synchronizing the data between the two nodes, so run this command
on both nodes:
-----
# cat <<-END >/var/www/html/index.html
 <html>
 <body>My Test Site - $(hostname)</body>
 </html>
END
-----
== Enable the Apache status URL ==
indexterm:[Apache HTTP Server,/server-status]
In order to monitor the health of your Apache instance, and recover it if
it fails, the resource agent used by Pacemaker assumes the server-status
URL is available. On both nodes, enable the URL with:
----
# cat <<-END >/etc/httpd/conf.d/status.conf
 <Location /server-status>
    SetHandler server-status
    Require local
 </Location>
END
----
[NOTE]
======
If you are using a different operating system, server-status may already be
enabled or may be configurable in a different location. If you are using
a version of Apache HTTP Server less than 2.4, the syntax will be different.
======
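For example, on Apache HTTP Server 2.2, an equivalent +status.conf+ would use the older access-control directives (a sketch):
----
<Location /server-status>
   SetHandler server-status
   Order deny,allow
   Deny from all
   Allow from 127.0.0.1
</Location>
----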
== Configure the Cluster ==
indexterm:[Apache HTTP Server,Apache resource configuration]
At this point, Apache is ready to go, and all that needs to be done is to
add it to the cluster. Let's call the resource WebSite. We need to use
an OCF resource script called apache in the heartbeat namespace.
footnote:[Compare the key used here, *ocf:heartbeat:apache*, with the one we
used earlier for the IP address, *ocf:heartbeat:IPaddr2*]
The script's only required parameter is the path to the main Apache
configuration file, and we'll tell the cluster to check once a
minute that Apache is still running.
----
[root@pcmk-1 ~]# pcs resource create WebSite ocf:heartbeat:apache \
configfile=/etc/httpd/conf/httpd.conf \
statusurl="http://localhost/server-status" \
op monitor interval=1min
----
By default, the operation timeout for all resources' start, stop, and monitor
operations is 20 seconds. In many cases, this timeout period is less than
a particular resource's advised timeout period. For the purposes of this
tutorial, we will adjust the global operation timeout default to 240 seconds.
----
[root@pcmk-1 ~]# pcs resource op defaults timeout=240s
[root@pcmk-1 ~]# pcs resource op defaults
timeout: 240s
----
[NOTE]
======
In a production cluster, it is usually better to adjust each resource's
start, stop, and monitor timeouts to values that are appropriate to
the behavior observed in your environment, rather than adjust
the global default.
======
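For example, to adjust WebSite's own timeouts instead of the global default (we do not do this here; the values shown simply match the operation timeouts that appear in the final configuration recap), something like the following could be used:
----
[root@pcmk-1 ~]# pcs resource update WebSite op start timeout=40s op stop timeout=60s
----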
After a short delay, we should see the cluster start Apache.
-----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 12:40:41 2018
Last change: Fri Jan 12 12:40:05 2018
2 nodes configured
2 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
Wait a moment, the WebSite resource isn't running on the same host as our
IP address!
[NOTE]
======
If, in the `pcs status` output, you see the WebSite resource has
failed to start, then you've likely not enabled the status URL correctly.
You can check whether this is the problem by running:
....
wget -O - http://localhost/server-status
....
If you see *Not Found* or *Forbidden* in the output, then this is likely the
problem. Ensure that the *<Location /server-status>* block is correct.
======
== Ensure Resources Run on the Same Host ==
To reduce the load on any one machine, Pacemaker will generally try to
spread the configured resources across the cluster nodes. However, we
can tell the cluster that two resources are related and need to run on
the same host (or not at all). Here, we instruct the cluster that
WebSite can only run on the host that ClusterIP is active on.
To achieve this, we use a _colocation constraint_ that indicates it is
mandatory for WebSite to run on the same node as ClusterIP. The
"mandatory" part of the colocation constraint is indicated by using a
score of INFINITY. The INFINITY score also means that if ClusterIP is not
active anywhere, WebSite will not be permitted to run.
[NOTE]
=======
If ClusterIP is not active anywhere, WebSite will not be permitted to run
anywhere.
=======
[IMPORTANT]
===========
Colocation constraints are "directional", in that they imply certain
things about the order in which the two resources will have a location
chosen. In this case, we're saying that *WebSite* needs to be placed on the
same machine as *ClusterIP*, which implies that the cluster must know the
location of *ClusterIP* before choosing a location for *WebSite*.
===========
-----
[root@pcmk-1 ~]# pcs constraint colocation add WebSite with ClusterIP INFINITY
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
Ticket Constraints:
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 13:57:58 2018
Last change: Fri Jan 12 13:57:22 2018
2 nodes configured
2 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
== Ensure Resources Start and Stop in Order ==
Like many services, Apache can be configured to bind to specific
IP addresses on a host or to the wildcard IP address. If Apache
binds to the wildcard, it doesn't matter whether an IP address
is added before or after Apache starts; Apache will respond on
that IP just the same. However, if Apache binds only to certain IP
address(es), the order matters: If the address is added after Apache
starts, Apache won't respond on that address.
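For illustration only (this guide keeps Apache's default wildcard binding), binding Apache to just the floating address would use a directive like this in +httpd.conf+:
----
Listen 192.168.122.120:80
----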
To be sure our WebSite responds regardless of Apache's address configuration,
we need to make sure ClusterIP not only runs on the same node,
but starts before WebSite. A colocation constraint only ensures the
resources run together, not the order in which they are started and stopped.
We do this by adding an ordering constraint. By default, all order constraints
are mandatory, which means that the recovery of ClusterIP will also trigger the
recovery of WebSite.
-----
[root@pcmk-1 ~]# pcs constraint order ClusterIP then WebSite
Adding ClusterIP WebSite (kind: Mandatory) (Options: first-action=start then-action=start)
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
Ticket Constraints:
-----
== Prefer One Node Over Another ==
Pacemaker does not rely on any sort of hardware symmetry between nodes,
so it may well be that one machine is more powerful than the other. In
such cases, it makes sense to host the resources on the more powerful node if
it is available. To do this, we create a location constraint.
In the location constraint below, we are saying the WebSite resource
prefers the node pcmk-1 with a score of 50. Here, the score indicates
how badly we'd like the resource to run at this location.
-----
[root@pcmk-1 ~]# pcs constraint location WebSite prefers pcmk-1=50
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Resource: WebSite
Enabled on: pcmk-1 (score:50)
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
Ticket Constraints:
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 14:11:49 2018
Last change: Fri Jan 12 14:11:20 2018
2 nodes configured
2 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
Wait a minute, the resources are still on pcmk-2!
Even though WebSite now prefers to run on pcmk-1, that preference is
(intentionally) less than the resource stickiness (how much we
preferred not to have unnecessary downtime).
To see the current placement scores, you can use a tool called crm_simulate.
----
[root@pcmk-1 ~]# crm_simulate -sL
Current cluster status:
Online: [ pcmk-1 pcmk-2 ]
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Allocation scores:
native_color: ClusterIP allocation score on pcmk-1: 50
native_color: ClusterIP allocation score on pcmk-2: 200
native_color: WebSite allocation score on pcmk-1: -INFINITY
native_color: WebSite allocation score on pcmk-2: 100
Transition Summary:
----
== Move Resources Manually ==
There are always times when an administrator needs to override the
cluster and force resources to move to a specific location. In this example,
we will force the WebSite to move to pcmk-1 by
updating our previous location constraint with a score of INFINITY.
-----
[root@pcmk-1 ~]# pcs constraint location WebSite prefers pcmk-1=INFINITY
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Resource: WebSite
Enabled on: pcmk-1 (score:INFINITY)
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
Ticket Constraints:
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 14:19:34 2018
Last change: Fri Jan 12 14:18:37 2018
2 nodes configured
2 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
WebSite (ocf::heartbeat:apache): Started pcmk-1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
Once we've finished whatever activity required us to move the
resources to pcmk-1 (in our case nothing), we can then allow the cluster
to resume normal operation by removing the new constraint. Since we previously
configured a default stickiness, the resources will remain on pcmk-1.
First, use the `--full` option to get the constraint's ID:
-----
[root@pcmk-1 ~]# pcs constraint --full
Location Constraints:
Resource: WebSite
Enabled on: pcmk-1 (score:INFINITY) (id:location-WebSite-pcmk-1-INFINITY)
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory) (id:order-ClusterIP-WebSite-mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY) (id:colocation-WebSite-ClusterIP-INFINITY)
Ticket Constraints:
-----
Then remove the desired constraint using its ID:
-----
[root@pcmk-1 ~]# pcs constraint remove location-WebSite-pcmk-1-INFINITY
[root@pcmk-1 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
Ticket Constraints:
-----
Note that the location constraint is now gone. If we check the cluster
status, we can also see that (as expected) the resources are still active
on pcmk-1.
-----
# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 14:25:21 2018
Last change: Fri Jan 12 14:24:29 2018
2 nodes configured
2 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
WebSite (ocf::heartbeat:apache): Started pcmk-1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
-----
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt b/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt
index 974b8ff331..98d8f93bed 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt
@@ -1,489 +1,490 @@
+:compat-mode: legacy
= Installation =
== Install &DISTRO; &DISTRO_VERSION; ==
=== Boot the Install Image ===
Download the 4GB
http://isoredirect.centos.org/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1708.iso[&DISTRO;
&DISTRO_VERSION; DVD ISO]. Use the image to boot a virtual machine, or
burn it to a DVD or USB drive and boot a physical server from that.
After starting the installation, select your language and keyboard layout at
the welcome screen.
.&DISTRO; &DISTRO_VERSION; Installation Welcome Screen
image::images/Welcome.png["Welcome to &DISTRO; &DISTRO_VERSION;",align="center",scaledwidth="100%"]
=== Installation Options ===
At this point, you get a chance to tweak the default installation options.
.&DISTRO; &DISTRO_VERSION; Installation Summary Screen
image::images/Installer.png["&DISTRO; &DISTRO_VERSION; Installation Summary",align="center",scaledwidth="100%"]
Ignore the *SOFTWARE SELECTION* section (try saying that 10 times quickly). The
*Infrastructure Server* environment does have add-ons with much of the software
we need, but we will leave it as a *Minimal Install* here, so that we can see
exactly what software is required later.
=== Configure Network ===
In the *NETWORK & HOSTNAME* section:
- Edit *Host Name:* as desired. For this example, we will use
*pcmk-1.localdomain*.
- Select your network device, press *Configure...*, and manually assign a fixed
IP address. For this example, we'll use 192.168.122.101 under *IPv4 Settings*
(with an appropriate netmask, gateway and DNS server).
- Flip the switch to turn your network device on.
[IMPORTANT]
===========
Do not accept the default network settings.
Cluster machines should never obtain an IP address via DHCP, because
DHCP's periodic address renewal will interfere with corosync.
===========
=== Configure Disk ===
By default, the installer's automatic partitioning will use LVM (which allows
us to dynamically change the amount of space allocated to a given partition).
However, it allocates all free space to the +/+ (aka. *root*) partition, which
cannot be reduced in size later (dynamic increases are fine).
In order to follow the DRBD and GFS2 portions of this guide, we need to reserve
space on each machine for a replicated volume.
Enter the *INSTALLATION DESTINATION* section, ensure the hard drive you want to
install to is selected, select *I will configure partitioning*, and press *Done*.
In the *MANUAL PARTITIONING* screen that comes next, click the option to create
mountpoints automatically. Select the +/+ mountpoint, and reduce the desired
capacity by 1GiB or so. Select *Modify...* by the volume group name, and change
the *Size policy:* to *As large as possible*, to make the reclaimed space
available inside the LVM volume group. We'll add the additional volume later.
=== Configure Time Synchronization ===
It is highly recommended to enable NTP on your cluster nodes. Doing so
ensures all nodes agree on the current time and makes reading log files
significantly easier.
&DISTRO; will enable NTP automatically. If you want to change any time-related
settings (such as time zone or NTP server), you can do this in the
*TIME & DATE* section.
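Once the installed system is up, you can verify that time synchronization is working with commands such as these (a sketch; chronyd is the default NTP daemon on &DISTRO; &DISTRO_VERSION;):
----
[root@pcmk-1 ~]# timedatectl | grep NTP
[root@pcmk-1 ~]# chronyc sources
----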
=== Finish Install ===
Select *Begin Installation*. Once it completes, set a root password, and reboot
as instructed. For the purposes of this document, it is not necessary to create
any additional users. After the node reboots, you'll see a login prompt on
the console. Login using *root* and the password you created earlier.
.&DISTRO; &DISTRO_VERSION; Console Prompt
image::images/Console.png["&DISTRO; &DISTRO_VERSION; Console",align="center",scaledwidth="100%"]
[NOTE]
======
From here on, we're going to be working exclusively from the terminal.
======
== Configure the OS ==
=== Verify Networking ===
Ensure that the machine has the static IP address you configured earlier.
-----
[root@pcmk-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:d7:d6:08 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.101/24 brd 192.168.122.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fed7:d608/64 scope link
valid_lft forever preferred_lft forever
-----
[NOTE]
=====
If you ever need to change the node's IP address from the command line, follow
these instructions, replacing *${device}* with the name of your network device:
....
[root@pcmk-1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-${device} # manually edit as desired
[root@pcmk-1 ~]# nmcli dev disconnect ${device}
[root@pcmk-1 ~]# nmcli con reload ${device}
[root@pcmk-1 ~]# nmcli con up ${device}
....
This makes *NetworkManager* aware that a change was made on the config file.
=====
Next, ensure that the routes are as expected:
-----
[root@pcmk-1 ~]# ip route
default via 192.168.122.1 dev eth0 proto static metric 100
192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.101 metric 100
-----
If there is no line beginning with *default via*, then you may need to add a line such as
[source,Bash]
GATEWAY="192.168.122.1"
to the device configuration using the same process as described above for
changing the IP address.
Now, check for connectivity to the outside world. Start small by
testing whether we can reach the gateway we configured.
-----
[root@pcmk-1 ~]# ping -c 1 192.168.122.1
PING 192.168.122.1 (192.168.122.1) 56(84) bytes of data.
64 bytes from 192.168.122.1: icmp_req=1 ttl=64 time=0.249 ms
--- 192.168.122.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.249/0.249/0.249/0.000 ms
-----
Now try something external; choose a location you know should be available.
-----
[root@pcmk-1 ~]# ping -c 1 www.google.com
PING www.l.google.com (173.194.72.106) 56(84) bytes of data.
64 bytes from tf-in-f106.1e100.net (173.194.72.106): icmp_req=1 ttl=41 time=167 ms
--- www.l.google.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 167.618/167.618/167.618/0.000 ms
-----
=== Login Remotely ===
The console isn't a very friendly place to work from, so we will now
switch to accessing the machine remotely via SSH where we can
use copy and paste, etc.
From another host, check whether we can see the new host at all:
-----
beekhof@f16 ~ # ping -c 1 192.168.122.101
PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
64 bytes from 192.168.122.101: icmp_req=1 ttl=64 time=1.01 ms
--- 192.168.122.101 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.012/1.012/1.012/0.000 ms
-----
Next, login as root via SSH.
-----
beekhof@f16 ~ # ssh -l root 192.168.122.101
The authenticity of host '192.168.122.101 (192.168.122.101)' can't be established.
ECDSA key fingerprint is 6e:b7:8f:e2:4c:94:43:54:a8:53:cc:20:0f:29:a4:e0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.122.101' (ECDSA) to the list of known hosts.
root@192.168.122.101's password:
Last login: Tue Aug 11 13:14:39 2015
[root@pcmk-1 ~]#
-----
=== Apply Updates ===
Apply any package updates released since your installation image was created:
----
[root@pcmk-1 ~]# yum update
----
=== Use Short Node Names ===
During installation, we filled in the machine's fully qualified domain
name (FQDN), which can be rather long when it appears in cluster logs and
status output. See for yourself how the machine identifies itself:
(((Nodes, short name)))
----
[root@pcmk-1 ~]# uname -n
pcmk-1.localdomain
----
(((Nodes, Domain name (Query))))
We can use the `hostnamectl` tool to strip off the domain name:
----
[root@pcmk-1 ~]# hostnamectl set-hostname $(uname -n | sed s/\\..*//)
----
(((Nodes, Domain name (Remove from host name))))
Now, check that the machine is using the correct name:
----
[root@pcmk-1 ~]# uname -n
pcmk-1
----
== Repeat for Second Node ==
Repeat the Installation steps so far, so that you have two
nodes ready to have the cluster software installed.
For the purposes of this document, the additional node is called
pcmk-2 with address 192.168.122.102.
== Configure Communication Between Nodes ==
=== Configure Host Name Resolution ===
Confirm that you can communicate between the two new nodes:
----
[root@pcmk-1 ~]# ping -c 3 192.168.122.102
PING 192.168.122.102 (192.168.122.102) 56(84) bytes of data.
64 bytes from 192.168.122.102: icmp_seq=1 ttl=64 time=0.343 ms
64 bytes from 192.168.122.102: icmp_seq=2 ttl=64 time=0.402 ms
64 bytes from 192.168.122.102: icmp_seq=3 ttl=64 time=0.558 ms
--- 192.168.122.102 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.343/0.434/0.558/0.092 ms
----
Now we need to make sure we can communicate with the machines by their
name. If you have a DNS server, add additional entries for the two
machines. Otherwise, you'll need to add the machines to +/etc/hosts+
on both nodes. Below are the entries for my cluster nodes:
----
[root@pcmk-1 ~]# grep pcmk /etc/hosts
192.168.122.101 pcmk-1.clusterlabs.org pcmk-1
192.168.122.102 pcmk-2.clusterlabs.org pcmk-2
----
We can now verify the setup by again using ping:
----
[root@pcmk-1 ~]# ping -c 3 pcmk-2
PING pcmk-2.clusterlabs.org (192.168.122.102) 56(84) bytes of data.
64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=1 ttl=64 time=0.164 ms
64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=2 ttl=64 time=0.475 ms
64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=3 ttl=64 time=0.186 ms
--- pcmk-2.clusterlabs.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.164/0.275/0.475/0.141 ms
----
=== Configure SSH ===
SSH is a convenient and secure way to copy files and perform commands
remotely. For the purposes of this guide, we will create a key without a
password (using the -N option) so that we can perform remote actions
without being prompted.
(((SSH)))
[WARNING]
=========
Unprotected SSH keys (those without a password) are not recommended for servers exposed to the outside world.
We use them here only to simplify the demo.
=========
Create a new key and allow anyone with that key to log in:
.Creating and Activating a new SSH Key
----
[root@pcmk-1 ~]# ssh-keygen -t dsa -f ~/.ssh/id_dsa -N ""
Generating public/private dsa key pair.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
91:09:5c:82:5a:6a:50:08:4e:b2:0c:62:de:cc:74:44 root@pcmk-1.clusterlabs.org
The key's randomart image is:
+--[ DSA 1024]----+
|==.ooEo.. |
|X O + .o o |
| * A + |
| + . |
| . S |
| |
| |
| |
| |
+-----------------+
[root@pcmk-1 ~]# cp ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
----
(((Creating and Activating a new SSH Key)))
Install the key on the other node:
----
[root@pcmk-1 ~]# scp -r ~/.ssh pcmk-2:
The authenticity of host 'pcmk-2 (192.168.122.102)' can't be established.
ECDSA key fingerprint is a4:f5:b2:34:9d:86:2b:34:a2:87:37:b9:ca:68:52:ec.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'pcmk-2,192.168.122.102' (ECDSA) to the list of known hosts.
root@pcmk-2's password:
id_dsa.pub 100% 616 0.6KB/s 00:00
id_dsa 100% 672 0.7KB/s 00:00
known_hosts 100% 400 0.4KB/s 00:00
authorized_keys 100% 616 0.6KB/s 00:00
----
Test that you can now run commands remotely, without being prompted:
----
[root@pcmk-1 ~]# ssh pcmk-2 -- uname -n
pcmk-2
----
== Install the Cluster Software ==
Fire up a shell on both nodes and run the following to install pacemaker, and while
we're at it, some command-line tools to make our lives easier:
----
# yum install -y pacemaker pcs psmisc policycoreutils-python
----
[IMPORTANT]
===========
This document will show commands that need to be executed on both nodes
with a simple `#` prompt. Be sure to run them on each node individually.
===========
[NOTE]
===========
This document uses `pcs` for cluster management. Other alternatives,
such as `crmsh`, are available, but their syntax
will differ from the examples used here.
===========
== Configure the Cluster Software ==
=== Allow cluster services through firewall ===
On each node, allow cluster-related services through the local firewall:
----
# firewall-cmd --permanent --add-service=high-availability
success
# firewall-cmd --reload
success
----
[NOTE]
======
If you are using iptables directly, or some other firewall solution besides
firewalld, simply open the following ports, which can be used by various
clustering components: TCP ports 2224, 3121, and 21064, and UDP port 5405.
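For example, if you are managing iptables by hand, rules along these lines would
open the required ports (they are not persistent across reboots unless you save
them with your distribution's usual mechanism):
----
# iptables -I INPUT -p tcp -m multiport --dports 2224,3121,21064 -j ACCEPT
# iptables -I INPUT -p udp --dport 5405 -j ACCEPT
----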
If you run into any problems during testing, you might want to disable
the firewall and SELinux entirely until you have everything working.
This may create significant security issues and should not be performed on
machines that will be exposed to the outside world, but may be appropriate
during development and testing on a protected host.
To disable security measures:
----
[root@pcmk-1 ~]# setenforce 0
[root@pcmk-1 ~]# sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
[root@pcmk-1 ~]# systemctl mask firewalld.service
[root@pcmk-1 ~]# systemctl stop firewalld.service
[root@pcmk-1 ~]# iptables --flush
----
======
=== Enable pcs Daemon ===
Before the cluster can be configured, the pcs daemon must be started and enabled
to start at boot time on each node. This daemon works with the pcs command-line interface
to manage synchronizing the corosync configuration across all nodes in the cluster.
Start and enable the daemon by issuing the following commands on each node:
----
# systemctl start pcsd.service
# systemctl enable pcsd.service
ln -s '/usr/lib/systemd/system/pcsd.service' '/etc/systemd/system/multi-user.target.wants/pcsd.service'
----
The installed packages will create a *hacluster* user with a disabled password.
While this is fine for running `pcs` commands locally,
the account needs a login password in order to perform such tasks as syncing
the corosync configuration, or starting and stopping the cluster on other nodes.
This tutorial will make use of such commands,
so now we will set a password for the *hacluster* user, using the same password
on both nodes:
----
# passwd hacluster
Changing password for user hacluster.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
----
[NOTE]
===========
Alternatively, to script this process or set the password on a
different machine from the one you're logged into, you can use
the `--stdin` option for `passwd`:
----
[root@pcmk-1 ~]# ssh pcmk-2 -- 'echo mysupersecretpassword | passwd --stdin hacluster'
----
===========
=== Configure Corosync ===
On either node, use `pcs cluster auth` to authenticate as the *hacluster* user:
----
[root@pcmk-1 ~]# pcs cluster auth pcmk-1 pcmk-2
Username: hacluster
Password:
pcmk-1: Authorized
pcmk-2: Authorized
----
Next, use `pcs cluster setup` on the same node to generate and synchronize the
corosync configuration:
----
[root@pcmk-1 ~]# pcs cluster setup --name mycluster pcmk-1 pcmk-2
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
pcmk-1: Succeeded
pcmk-2: Succeeded
----
If you received an authorization error for either of those commands, make
sure you configured the *hacluster* user account on each node
with the same password.
[NOTE]
======
If you are not using `pcs` for cluster administration,
follow whatever procedures are appropriate for your tools
to create a corosync.conf and copy it to all nodes.
The `pcs` command will configure corosync to use UDP unicast transport; if you
choose to use multicast instead, choose a multicast address carefully.
footnote:[For some subtle issues, see
http://web.archive.org/web/20101211210054/http://29west.com/docs/THPM/multicast-address-assignment.html[Topics
in High-Performance Messaging: Multicast Address Assignment] or the more detailed treatment in
https://www.cisco.com/c/dam/en/us/support/docs/ip/ip-multicast/ipmlt_wp.pdf[Cisco's
Guidelines for Enterprise IP Multicast Address Allocation].]
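For illustration only, the +totem+ section of a multicast-based configuration
might look something like the following; the bind network and multicast address
shown here are placeholders you would choose for your own environment:
----
totem {
    version: 2
    cluster_name: mycluster
    transport: udp
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.122.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}
----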
======
The final corosync.conf configuration on each node should look
something like the sample in the Sample Corosync Configuration appendix.
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt b/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt
index d8582b77e6..60ca19e900 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt
@@ -1,27 +1,28 @@
+:compat-mode: legacy
= Read-Me-First =
== The Scope of this Document ==
Computer clusters can be used to provide highly available services or
resources. The redundancy of multiple machines is used to guard
against failures of many types.
This document will walk through the installation and setup of simple
clusters using the &DISTRO; distribution, version &DISTRO_VERSION;.
The clusters described here will use Pacemaker and Corosync to provide
resource management and messaging. Required packages and modifications
to their configuration files are described along with the use of the
Pacemaker command line tool for generating the XML used for cluster
control.
Pacemaker is a central component and provides the resource management
required in these systems. This management includes detecting and
recovering from the failure of various nodes, resources and services
under its control.
When more in-depth information is required, and for real-world usage,
please refer to the
https://www.clusterlabs.org/pacemaker/doc/[Pacemaker Explained] manual.
include::../../shared/en-US/pacemaker-intro.txt[]
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt b/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt
index d756fa2d63..2481bad389 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt
@@ -1,529 +1,530 @@
+:compat-mode: legacy
= Replicate Storage Using DRBD =
Even if you're serving up static websites, having to manually synchronize
the contents of that website to all the machines in the cluster is not
ideal. For dynamic websites, such as a wiki, it's not even an option. Not
everyone can afford network-attached storage, but somehow the data needs
to be kept in sync.
Enter DRBD, which can be thought of as network-based RAID-1.
footnote:[See http://www.drbd.org/ for details.]
== Install the DRBD Packages ==
DRBD itself is included in the upstream kernel,footnote:[Since version 2.6.33]
but we do need some utilities to use it effectively.
CentOS does not ship these utilities, so we need to enable a third-party
repository to get them. Supported packages for many OSes are available from
DRBD's maker http://www.linbit.com/[LINBIT], but here we'll use the free
http://elrepo.org/[ELRepo] repository.
On both nodes, import the ELRepo package signing key, and enable the
repository:
----
# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
----
Now, we can install the DRBD kernel module and utilities:
----
# yum install -y kmod-drbd84 drbd84-utils
----
DRBD will not be able to run under the default SELinux security policies.
If you are familiar with SELinux, you can modify the policies in a more
fine-grained manner, but here we will simply exempt DRBD processes from SELinux
control:
----
# semanage permissive -a drbd_t
----
We will configure DRBD to use port 7789, so allow that port from each host to
the other:
----
[root@pcmk-1 ~]# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.122.102" port port="7789" protocol="tcp" accept'
success
[root@pcmk-1 ~]# firewall-cmd --reload
success
----
----
[root@pcmk-2 ~]# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.122.101" port port="7789" protocol="tcp" accept'
success
[root@pcmk-2 ~]# firewall-cmd --reload
success
----
[NOTE]
======
In this example, we have only two nodes, and all network traffic is on the same LAN.
In production, it is recommended to use a dedicated, isolated network for cluster-related traffic,
so the firewall configuration would likely be different; one approach would be to
add the dedicated network interfaces to the trusted zone.
======
== Allocate a Disk Volume for DRBD ==
DRBD will need its own block device on each node. This can be
a physical disk partition or logical volume, of whatever size
you need for your data. For this document, we will use a
1GiB logical volume, which is more than sufficient for a single HTML file and
(later) GFS2 metadata.
----
[root@pcmk-1 ~]# vgdisplay | grep -e Name -e Free
VG Name centos_pcmk-1
Free PE / Size 382 / 1.49 GiB
[root@pcmk-1 ~]# lvcreate --name drbd-demo --size 1G centos_pcmk-1
Logical volume "drbd-demo" created
[root@pcmk-1 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
drbd-demo centos_pcmk-1 -wi-a----- 1.00g
root centos_pcmk-1 -wi-ao---- 5.00g
swap centos_pcmk-1 -wi-ao---- 1.00g
----
Repeat for the second node, making sure to use the same size:
----
[root@pcmk-1 ~]# ssh pcmk-2 -- lvcreate --name drbd-demo --size 1G centos_pcmk-2
Logical volume "drbd-demo" created
----
== Configure DRBD ==
There is no series of commands for building a DRBD configuration, so simply
run this on both nodes to use this sample configuration:
----
# cat <<END >/etc/drbd.d/wwwdata.res
resource wwwdata {
 protocol C;
 meta-disk internal;
 device /dev/drbd1;
 syncer {
  verify-alg sha1;
 }
 net {
  allow-two-primaries;
 }
 on pcmk-1 {
  disk   /dev/centos_pcmk-1/drbd-demo;
  address  192.168.122.101:7789;
 }
 on pcmk-2 {
  disk   /dev/centos_pcmk-2/drbd-demo;
  address  192.168.122.102:7789;
 }
}
END
----
[IMPORTANT]
=========
Edit the file to use the hostnames, IP addresses and logical volume paths
of your nodes if they differ from the ones used in this guide.
=========
[NOTE]
=======
Detailed information on the directives used in this configuration (and
other alternatives) is available at
http://www.drbd.org/users-guide/ch-configure.html
The *allow-two-primaries* option would not normally be used in
an active/passive cluster. We are adding it here for the convenience
of changing to an active/active cluster later.
=======
== Initialize DRBD ==
With the configuration in place, we can now get DRBD running.
These commands create the local metadata for the DRBD resource,
ensure the DRBD kernel module is loaded, and bring up the DRBD resource.
Run them on one node:
----
[root@pcmk-1 ~]# drbdadm create-md wwwdata
initializing activity log
NOT initializing bitmap
Writing meta data...
New drbd meta data block successfully created.
[root@pcmk-1 ~]# modprobe drbd
[root@pcmk-1 ~]# drbdadm up wwwdata
----
We can confirm DRBD's status on this node:
----
[root@pcmk-1 ~]# cat /proc/drbd
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52
1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----s
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1048508
----
Because we have not yet initialized the data, this node's data
is marked as *Inconsistent*. Because we have not yet initialized
the second node, the local state is *WFConnection* (waiting for connection),
and the partner node's status is marked as *Unknown*.
Now, repeat the above commands on the second node. This time,
when we check the status, it shows:
----
[root@pcmk-2 ~]# cat /proc/drbd
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1048508
----
You can see the state has changed to *Connected*, meaning the two DRBD nodes
are communicating properly, and both nodes are in *Secondary* role
with *Inconsistent* data.
To make the data consistent, we need to tell DRBD which node should be
considered to have the correct data. In this case, since we are creating
a new resource, both have garbage, so we'll just pick pcmk-1
and run this command on it:
----
[root@pcmk-1 ~]# drbdadm primary --force wwwdata
----
[NOTE]
======
If you are using a different version of DRBD, the required syntax may be different.
See the documentation for your version for how to perform these commands.
======
If we check the status immediately, we'll see something like this:
----
[root@pcmk-1 ~]# cat /proc/drbd
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52
1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
ns:2872 nr:0 dw:0 dr:3784 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1045636
[>....................] sync'ed: 0.4% (1045636/1048508)K
finish: 0:10:53 speed: 1,436 (1,436) K/sec
----
We can see that this node has the *Primary* role, the partner node has
the *Secondary* role, this node's data is now considered *UpToDate*,
the partner node's data is still *Inconsistent*, and a progress bar
shows how far along the partner node is in synchronizing the data.
After a while, the sync should finish, and you'll see something like:
----
[root@pcmk-1 ~]# cat /proc/drbd
version: 8.4.6 (api:1/proto:86-101)
GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:1048508 nr:0 dw:0 dr:1049420 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
----
Both sets of data are now *UpToDate*, and we can proceed to creating
and populating a filesystem for our WebSite resource's documents.
== Populate the DRBD Disk ==
On the node with the primary role (pcmk-1 in this example),
create a filesystem on the DRBD device:
----
[root@pcmk-1 ~]# mkfs.xfs /dev/drbd1
meta-data=/dev/drbd1 isize=256 agcount=4, agsize=65532 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=262127, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=853, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
----
[NOTE]
====
In this example, we create an xfs filesystem with no special options.
In a production environment, you should choose a filesystem type and
options that are suitable for your application.
====
Mount the newly created filesystem, populate it with our web document,
give it the same SELinux policy as the web document root,
then unmount it (the cluster will handle mounting and unmounting it later):
----
[root@pcmk-1 ~]# mount /dev/drbd1 /mnt
[root@pcmk-1 ~]# cat <<-END >/mnt/index.html
<html>
 <body>My Test Site - DRBD</body>
</html>
END
[root@pcmk-1 ~]# chcon -R --reference=/var/www/html /mnt
[root@pcmk-1 ~]# umount /dev/drbd1
----
== Configure the Cluster for the DRBD device ==
One handy feature `pcs` has is the ability to queue up several changes
into a file and commit those changes all at once. To do this, start by
populating the file with the current raw XML config from the CIB.
----
[root@pcmk-1 ~]# pcs cluster cib drbd_cfg
----
Using the `pcs -f` option, make changes to the configuration saved
in the +drbd_cfg+ file. These changes will not be seen by the cluster until
the +drbd_cfg+ file is pushed into the live cluster's CIB later.
Here, we create a cluster resource for the DRBD device, and an additional _clone_
resource to allow the resource to run on both nodes at the same time.
----
[root@pcmk-1 ~]# pcs -f drbd_cfg resource create WebData ocf:linbit:drbd \
drbd_resource=wwwdata op monitor interval=60s
[root@pcmk-1 ~]# pcs -f drbd_cfg resource master WebDataClone WebData \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
notify=true
[root@pcmk-1 ~]# pcs -f drbd_cfg resource show
ClusterIP (ocf::heartbeat:IPaddr2): Started
WebSite (ocf::heartbeat:apache): Started
Master/Slave Set: WebDataClone [WebData]
Stopped: [ pcmk-1 pcmk-2 ]
----
After you are satisfied with all the changes, you can commit
them all at once by pushing the drbd_cfg file into the live CIB.
----
[root@pcmk-1 ~]# pcs cluster cib-push drbd_cfg
CIB updated
----
Let's see what the cluster did with the new configuration:
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 09:29:41 2018
Last change: Fri Jan 12 09:29:25 2018
2 nodes configured
4 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
WebSite (ocf::heartbeat:apache): Started pcmk-1
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 ]
Slaves: [ pcmk-2 ]
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
We can see that *WebDataClone* (our DRBD device) is running as master (DRBD's
primary role) on *pcmk-1* and slave (DRBD's secondary role) on *pcmk-2*.
[IMPORTANT]
====
The resource agent should load the DRBD module when needed if it's not already
loaded. If that does not happen, configure your operating system to load the
module at boot time. For &DISTRO; &DISTRO_VERSION;, you would run this on both
nodes:
----
# echo drbd >/etc/modules-load.d/drbd.conf
----
====
== Configure the Cluster for the Filesystem ==
Now that we have a working DRBD device, we need to mount its filesystem.
In addition to defining the filesystem, we also need to
tell the cluster where it can be located (only on the DRBD Primary)
and when it is allowed to start (after the Primary was promoted).
We are going to take a shortcut when creating the resource this time.
Instead of explicitly saying we want the *ocf:heartbeat:Filesystem* script, we
are only going to ask for *Filesystem*. We can do this because we know there is only
one resource script named *Filesystem* available to pacemaker, and that pcs is smart
enough to fill in the *ocf:heartbeat:* portion for us correctly in the configuration.
If there were multiple *Filesystem* scripts from different OCF providers, we would need
to specify the exact one we wanted.
Once again, we will queue our changes to a file and then push the
new configuration to the cluster as the final step.
----
[root@pcmk-1 ~]# pcs cluster cib fs_cfg
[root@pcmk-1 ~]# pcs -f fs_cfg resource create WebFS Filesystem \
device="/dev/drbd1" directory="/var/www/html" fstype="xfs"
[root@pcmk-1 ~]# pcs -f fs_cfg constraint colocation add WebFS with WebDataClone INFINITY with-rsc-role=Master
[root@pcmk-1 ~]# pcs -f fs_cfg constraint order promote WebDataClone then start WebFS
Adding WebDataClone WebFS (kind: Mandatory) (Options: first-action=promote then-action=start)
----
We also need to tell the cluster that Apache needs to run on the same
machine as the filesystem and that it must be active before Apache can
start.
----
[root@pcmk-1 ~]# pcs -f fs_cfg constraint colocation add WebSite with WebFS INFINITY
[root@pcmk-1 ~]# pcs -f fs_cfg constraint order WebFS then WebSite
Adding WebFS WebSite (kind: Mandatory) (Options: first-action=start then-action=start)
----
Review the updated configuration.
----
[root@pcmk-1 ~]# pcs -f fs_cfg constraint
Location Constraints:
Ordering Constraints:
start ClusterIP then start WebSite (kind:Mandatory)
promote WebDataClone then start WebFS (kind:Mandatory)
start WebFS then start WebSite (kind:Mandatory)
Colocation Constraints:
WebSite with ClusterIP (score:INFINITY)
WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master)
WebSite with WebFS (score:INFINITY)
Ticket Constraints:
----
----
[root@pcmk-1 ~]# pcs -f fs_cfg resource show
ClusterIP (ocf::heartbeat:IPaddr2): Started
WebSite (ocf::heartbeat:apache): Started
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 ]
Slaves: [ pcmk-2 ]
WebFS (ocf::heartbeat:Filesystem): Stopped
----
After reviewing the new configuration, upload it and watch the
cluster put it into effect.
----
[root@pcmk-1 ~]# pcs cluster cib-push fs_cfg
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 09:34:11 2018
Last change: Fri Jan 12 09:34:09 2018
2 nodes configured
5 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1
WebSite (ocf::heartbeat:apache): Started pcmk-1
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-1 ]
Slaves: [ pcmk-2 ]
WebFS (ocf::heartbeat:Filesystem): Started pcmk-1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
== Test Cluster Failover ==
Previously, we used `pcs cluster stop pcmk-1` to stop all cluster
services on *pcmk-1*, failing over the cluster resources, but there is another
way to safely simulate node failure.
We can put the node into _standby mode_. Nodes in this state continue to
run corosync and pacemaker but are not allowed to run resources. Any resources
found active there will be moved elsewhere. This feature can be particularly
useful when performing system administration tasks such as updating packages
used by cluster resources.
Put the active node into standby mode, and observe the cluster move all
the resources to the other node. The node's status will
change to indicate that it can no longer host resources.
----
[root@pcmk-1 ~]# pcs cluster standby pcmk-1
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 09:36:49 2018
Last change: Fri Jan 12 09:36:43 2018
2 nodes configured
5 resources configured
Node pcmk-1 (1): standby
Online: [ pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-2 ]
Stopped: [ pcmk-1 ]
WebFS (ocf::heartbeat:Filesystem): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
Once we've done everything we needed to on pcmk-1 (in this case nothing,
we just wanted to see the resources move), we can allow the node to be a
full cluster member again.
----
[root@pcmk-1 ~]# pcs cluster unstandby pcmk-1
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 09:38:02 2018
Last change: Fri Jan 12 09:37:56 2018
2 nodes configured
5 resources configured
Online: [ pcmk-1 pcmk-2 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2
WebSite (ocf::heartbeat:apache): Started pcmk-2
Master/Slave Set: WebDataClone [WebData]
Masters: [ pcmk-2 ]
Slaves: [ pcmk-1 ]
WebFS (ocf::heartbeat:Filesystem): Started pcmk-2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
Notice that *pcmk-1* is back to the *Online* state, and that the cluster resources
stay where they are due to our resource stickiness settings configured earlier.
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt b/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt
index baaebeff52..51eb5a1a1a 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt
@@ -1,165 +1,166 @@
+:compat-mode: legacy
= Configure STONITH =
== What is STONITH? ==
STONITH (Shoot The Other Node In The Head aka. fencing) protects your data from
being corrupted by rogue nodes or unintended concurrent access.
Just because a node is unresponsive doesn't mean it has stopped
accessing your data. The only way to be 100% sure that your data is
safe, is to use STONITH to ensure that the node is truly
offline before allowing the data to be accessed from another node.
STONITH also has a role to play in the event that a clustered service
cannot be stopped. In this case, the cluster uses STONITH to force the
whole node offline, thereby making it safe to start the service
elsewhere.
== Choose a STONITH Device ==
It is crucial that your STONITH device can allow the cluster to
differentiate between a node failure and a network failure.
A common mistake people make when choosing a STONITH device is to use a remote
power switch (such as many on-board IPMI controllers) that shares power with
the node it controls. If the power fails in such a case, the cluster cannot be
sure whether the node is really offline, or active and suffering from a network
fault, so the cluster will stop all resources to avoid a possible split-brain
situation.
Likewise, any device that relies on the machine being active (such as
SSH-based "devices" sometimes used during testing) is inappropriate.
== Configure the Cluster for STONITH ==
. Install the STONITH agent(s). To see what packages are available, run `yum
search fence-`. Be sure to install the package(s) on all cluster nodes.
. Configure the STONITH device itself to be able to fence your nodes and accept
fencing requests. This includes any necessary configuration on the device and
on the nodes, and any firewall or SELinux changes needed. Test the
communication between the device and your nodes.
. Find the correct STONITH agent script: `pcs stonith list`
. Find the parameters associated with the device: +pcs stonith describe pass:[agent_name]+
. Create a local copy of the CIB: `pcs cluster cib stonith_cfg`
. Create the fencing resource: +pcs -f stonith_cfg stonith create pass:[stonith_id
stonith_device_type [stonith_device_options]]+
+
Any flags that do not take arguments, such as +--ssl+, should be passed as +ssl=1+.
. Enable STONITH in the cluster: `pcs -f stonith_cfg property set stonith-enabled=true`
. If the device does not know how to fence nodes based on their uname,
you may also need to set the special *pcmk_host_map* parameter. See
`man pacemaker-fenced` for details.
. If the device does not support the *list* command, you may also need
to set the special *pcmk_host_list* and/or *pcmk_host_check*
parameters. See `man pacemaker-fenced` for details.
. If the device does not expect the victim to be specified with the
*port* parameter, you may also need to set the special
*pcmk_host_argument* parameter. See `man pacemaker-fenced` for details.
. Commit the new configuration: `pcs cluster cib-push stonith_cfg`
. Once the STONITH resource is running, test it (you might want to stop
the cluster on that machine first): +stonith_admin --reboot pass:[nodename]+
== Example ==
For this example, assume we have a chassis containing four nodes
and an IPMI device active on 10.0.0.1. Following the steps above
would go something like this:
Step 1: Install the *fence-agents-ipmilan* package on both nodes.
Step 2: Configure the IP address, authentication credentials, etc. in the IPMI device itself.
Step 3: Choose the *fence_ipmilan* STONITH agent.
Step 4: Obtain the agent's possible parameters:
----
[root@pcmk-1 ~]# pcs stonith describe fence_ipmilan
fence_ipmilan - Fence agent for IPMI
fence_ipmilan is an I/O Fencing agentwhich can be used with machines controlled by IPMI.This agent calls support software ipmitool (http://ipmitool.sf.net/). WARNING! This fence agent might report success before the node is powered off. You should use -m/method onoff if your fence device works correctly with that option.
Stonith options:
ipport: TCP/UDP port to use for connection with device
port: IP address or hostname of fencing device (together with --port-as-ip)
inet6_only: Forces agent to use IPv6 addresses only
ipaddr: IP Address or Hostname
passwd_script: Script to retrieve password
method: Method to fence (onoff|cycle)
inet4_only: Forces agent to use IPv4 addresses only
passwd: Login password or passphrase
lanplus: Use Lanplus to improve security of connection
auth: IPMI Lan Auth type.
action: Fencing Action WARNING: specifying 'action' is deprecated and not necessary with current Pacemaker versions.
cipher: Ciphersuite to use (same as ipmitool -C parameter)
target: Bridge IPMI requests to the remote target address
privlvl: Privilege level on IPMI device
timeout: Timeout (sec) for IPMI operation
login: Login Name
power_wait: Wait X seconds after issuing ON/OFF
login_timeout: Wait X seconds for cmd prompt after login
delay: Wait X seconds before fencing is started
power_timeout: Test X seconds for status change after ON/OFF
ipmitool_path: Path to ipmitool binary
shell_timeout: Wait X seconds for cmd prompt after issuing command
port_as_ip: Make "port/plug" to be an alias to IP address
retry_on: Count of attempts to retry power on
sudo: Use sudo (without password) when calling 3rd party sotfware.
priority: The priority of the stonith resource. Devices are tried in order of highest priority to lowest.
pcmk_host_map: A mapping of host names to ports numbers for devices that do not support host names. Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports
2 and 3 for node2
pcmk_host_list: A list of machines controlled by this device (Optional unless pcmk_host_check=static-list).
pcmk_host_check: How to determine which machines are controlled by the device. Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute),
none (assume every device can fence every machine)
pcmk_delay_max: Enable random delay for stonith actions and specify the maximum of random delay This prevents double fencing when using slow devices such as sbd. Use this to
enable random delay for stonith actions and specify the maximum of random delay.
pcmk_action_limit: The maximum number of actions can be performed in parallel on this device Cluster property concurrent-fencing=true needs to be configured first. Then use this
to specify the maximum number of actions can be performed in parallel on this device. -1 is unlimited.
Default operations:
monitor: interval=60s
----
Step 5: `pcs cluster cib stonith_cfg`
Step 6: Here are example parameters for creating our STONITH resource:
----
[root@pcmk-1 ~]# pcs -f stonith_cfg stonith create ipmi-fencing fence_ipmilan \
pcmk_host_list="pcmk-1 pcmk-2" ipaddr=10.0.0.1 login=testuser \
passwd=acd123 op monitor interval=60s
[root@pcmk-1 ~]# pcs -f stonith_cfg stonith
ipmi-fencing (stonith:fence_ipmilan): Stopped
----
Steps 7-10: Enable STONITH in the cluster:
----
[root@pcmk-1 ~]# pcs -f stonith_cfg property set stonith-enabled=true
[root@pcmk-1 ~]# pcs -f stonith_cfg property
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: mycluster
dc-version: 1.1.16-12.el7_4.5-94ff4df
have-watchdog: false
stonith-enabled: true
----
Step 11: `pcs cluster cib-push stonith_cfg`
Step 12: Test:
----
[root@pcmk-1 ~]# pcs cluster stop pcmk-2
[root@pcmk-1 ~]# stonith_admin --reboot pcmk-2
----
After a successful test, log in to any rebooted nodes, and start the cluster
(with `pcs cluster start`).
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt b/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt
index fda3476caa..c396c0010f 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt
@@ -1,131 +1,132 @@
+:compat-mode: legacy
= Pacemaker Tools =
== Simplify administration using a cluster shell ==
In the dark past, configuring Pacemaker required the administrator to
read and write XML. In true UNIX style, there were also a number of
different commands that specialized in different aspects of querying
and updating the cluster.
All of that has been greatly simplified with the creation of unified
command-line shells (and GUIs) that hide all the messy XML
scaffolding.
These shells take all the individual aspects required for managing and
configuring a cluster, and pack them into one simple-to-use command
line tool.
They even allow you to queue up several changes and commit
them all at once.
Two popular command-line shells are `pcs` and
`crmsh`. This edition of Clusters from Scratch is based on `pcs`.
[NOTE]
===========
The two shells share many concepts but the scope, layout and syntax
do differ, so make sure you read the version of this guide that
corresponds to the software installed on your system.
===========
== Explore pcs ==
Start by taking some time to familiarize yourself with
what `pcs` can do.
----
[root@pcmk-1 ~]# pcs
Usage: pcs [-f file] [-h] [commands]...
Control and configure pacemaker and corosync.
Options:
-h, --help Display usage and exit.
-f file Perform actions on file instead of active CIB.
--debug Print all network traffic and external commands run.
--version Print pcs version information.
--request-timeout Timeout for each outgoing request to another node in
seconds. Default is 60s.
Commands:
cluster Configure cluster options and nodes.
resource Manage cluster resources.
stonith Manage fence devices.
constraint Manage resource constraints.
property Manage pacemaker properties.
acl Manage pacemaker access control lists.
qdevice Manage quorum device provider on the local host.
quorum Manage cluster quorum settings.
booth Manage booth (cluster ticket manager).
status View cluster status.
config View and manage cluster configuration.
pcsd Manage pcs daemon.
node Manage cluster nodes.
alert Manage pacemaker alerts.
----
As you can see, the different aspects of cluster management are separated
into categories. To discover the functionality available in each of these
categories, one can issue the command +pcs pass:[category] help+. Below
is an example of all the options available under the status category.
----
[root@pcmk-1 ~]# pcs status help
Usage: pcs status [commands]...
View current cluster and resource status
Commands:
[status] [--full | --hide-inactive]
View all information about the cluster and resources (--full provides
more details, --hide-inactive hides inactive resources).
resources [<resource id> | --full | --groups | --hide-inactive]
Show all currently configured resources or if a resource is specified
show the options for the configured resource. If --full is specified,
all configured resource options will be displayed. If --groups is
specified, only show groups (and their resources). If --hide-inactive
is specified, only show active resources.
groups
View currently configured groups and their resources.
cluster
View current cluster status.
corosync
View current membership information as seen by corosync.
quorum
View current quorum status.
qdevice [--full] [<cluster name>]
Show runtime status of specified model of quorum device provider. Using
--full will give more detailed output. If <cluster name> is specified,
only information about the specified cluster will be displayed.
nodes [corosync | both | config]
View current status of nodes from pacemaker. If 'corosync' is
specified, view current status of nodes from corosync instead. If
'both' is specified, view current status of nodes from both corosync &
pacemaker. If 'config' is specified, print nodes from corosync &
pacemaker configuration.
pcsd [<node>]...
Show current status of pcsd on nodes specified, or on all nodes
configured in the local cluster if no nodes are specified.
xml
View xml version of status (output from crm_mon -r -1 -X).
----
Additionally, if you are interested in the version and
supported cluster stack(s) available with your Pacemaker
installation, run:
----
[root@pcmk-1 ~]# pacemakerd --features
Pacemaker 1.1.16-12.el7_4.5 (Build: 94ff4df)
Supporting v3.0.12: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
----
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt b/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt
index b13f228754..19fcdf172e 100644
--- a/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt
+++ b/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt
@@ -1,147 +1,148 @@
+:compat-mode: legacy
= Start and Verify Cluster =
== Start the Cluster ==
Now that corosync is configured, it is time to start the cluster.
The command below will start corosync and pacemaker on both nodes
in the cluster. If you are issuing the start command from a different
node than the one you ran the `pcs cluster auth` command on earlier, you
must authenticate on the current node you are logged into before you will
be allowed to start the cluster.
----
[root@pcmk-1 ~]# pcs cluster start --all
pcmk-1: Starting Cluster...
pcmk-2: Starting Cluster...
----
[NOTE]
======
An alternative to using the `pcs cluster start --all` command
is to issue either of the below command sequences on each node in the
cluster separately:
----
# pcs cluster start
Starting Cluster...
----
or
----
# systemctl start corosync.service
# systemctl start pacemaker.service
----
======
[IMPORTANT]
====
In this example, we are not enabling the corosync and pacemaker services
to start at boot. If a cluster node fails or is rebooted, you will need to run
+pcs cluster start pass:[nodename]+ (or `--all`) to start the cluster on it.
While you could enable the services to start at boot, requiring a manual
start of cluster services gives you the opportunity to do a post-mortem investigation
of a node failure before returning it to the cluster.
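If you later decide that automatic startup is appropriate for your environment,
something along these lines, run once from any node, would enable it:
----
[root@pcmk-1 ~]# pcs cluster enable --all
----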
====
== Verify Corosync Installation ==
First, use `corosync-cfgtool` to check whether cluster communication is happy:
----
[root@pcmk-1 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.122.101
status = ring 0 active with no faults
----
We can see here that everything appears normal with our fixed IP
address (not a 127.0.0.x loopback address) listed as the *id*, and *no
faults* for the status.
If you see something different, you might want to start by checking
the node's network, firewall and selinux configurations.
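As a starting point, commands along these lines can help narrow down where the
problem lies (the exact firewall and SELinux tooling may differ on your platform):
----
[root@pcmk-1 ~]# ip addr show
[root@pcmk-1 ~]# firewall-cmd --list-services
[root@pcmk-1 ~]# getenforce
----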
Next, check the membership and quorum APIs:
----
[root@pcmk-1 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.122.101)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.122.102)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@pcmk-1 ~]# pcs status corosync
Membership information
--------------------------
Nodeid Votes Name
1 1 pcmk-1 (local)
2 1 pcmk-2
----
You should see both nodes have joined the cluster.
== Verify Pacemaker Installation ==
Now that we have confirmed that Corosync is functional, we can check
the rest of the stack. Pacemaker has already been started, so verify
the necessary processes are running:
----
[root@pcmk-1 ~]# ps axf
PID TTY STAT TIME COMMAND
2 ? S 0:00 [kthreadd]
...lots of processes...
1362 ? Ssl 0:35 corosync
1379 ? Ss 0:00 /usr/sbin/pacemakerd -f
1380 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-based
1381 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-fenced
1382 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-execd
1383 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-attrd
1384 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-schedulerd
1385 ? Ss 0:00 \_ /usr/libexec/pacemaker/pacemaker-controld
----
If that looks OK, check the `pcs status` output:
----
[root@pcmk-1 ~]# pcs status
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Fri Jan 12 16:15:29 2018
Last change: Fri Jan 12 15:49:47 2018
2 nodes configured
0 resources configured
Online: [ pcmk-1 pcmk-2 ]
No active resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
----
Finally, ensure there are no startup errors (aside from messages relating
to not having STONITH configured, which are OK at this point):
----
[root@pcmk-1 ~]# journalctl | grep -i error
----
[NOTE]
======
Other operating systems may report startup errors in other locations,
for example +/var/log/messages+.
======
Repeat these checks on the other node. The results should be the same.
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Agents.txt b/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
index 1cb2e252a3..c5afcb6b4a 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
@@ -1,337 +1,338 @@
+:compat-mode: legacy
= Resource Agents =
== OCF Resource Agents ==
=== Location of Custom Scripts ===
indexterm:[OCF Resource Agents]
OCF Resource Agents are found in +/usr/lib/ocf/resource.d/pass:[provider]+
When creating your own agents, you are encouraged to create a new
directory under +/usr/lib/ocf/resource.d/+ so that they are not
confused with (or overwritten by) the agents shipped by existing providers.
So, for example, if you choose the provider name of bigCorp and want
a new resource named bigApp, you would create a resource agent called
+/usr/lib/ocf/resource.d/bigCorp/bigApp+ and define a resource:
[source,XML]
----
<primitive id="custom-app" class="ocf" provider="bigCorp" type="bigApp"/>
----
=== Actions ===
All OCF resource agents are required to implement the following actions.
.Required Actions for OCF Agents
[width="95%",cols="3m,3,7",options="header",align="center"]
|=========================================================
|Action
|Description
|Instructions
|start
|Start the resource
|Return 0 on success and an appropriate error code otherwise. Must not
report success until the resource is fully active.
indexterm:[start,OCF Action]
indexterm:[OCF,Action,start]
|stop
|Stop the resource
|Return 0 on success and an appropriate error code otherwise. Must not
report success until the resource is fully stopped.
indexterm:[stop,OCF Action]
indexterm:[OCF,Action,stop]
|monitor
|Check the resource's state
|Exit 0 if the resource is running, 7 if it is stopped, and anything
else if it is failed.
indexterm:[monitor,OCF Action]
indexterm:[OCF,Action,monitor]
NOTE: The monitor script should test the state of the resource on the local machine only.
|meta-data
|Describe the resource
|Provide information about this resource as an XML snippet. Exit with 0.
indexterm:[meta-data,OCF Action]
indexterm:[OCF,Action,meta-data]
NOTE: This is _not_ performed as root.
|validate-all
|Verify the supplied parameters
|Return 0 if parameters are valid, 2 if not valid, and 6 if resource is not configured.
indexterm:[validate-all,OCF Action]
indexterm:[OCF,Action,validate-all]
|=========================================================
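To make the contract above concrete, here is a deliberately minimal, purely
illustrative shell skeleton of an agent. The resource it "manages" (a hypothetical
bigApp marker file) is invented for the example; a real agent would also emit
proper XML metadata and source the OCF shell functions shipped with Pacemaker.
----
#!/bin/sh
# Hypothetical skeleton showing the required OCF actions.
STATE=/var/run/bigApp.active

case "$1" in
  start)
    # Must not report success until the resource is fully active.
    touch "$STATE" || exit 1
    exit 0 ;;
  stop)
    # Must not report success until the resource is fully stopped.
    rm -f "$STATE"
    exit 0 ;;
  monitor)
    # 0 = running, 7 = cleanly stopped, anything else = failed.
    [ -f "$STATE" ] && exit 0
    exit 7 ;;
  meta-data)
    # A real agent prints its XML metadata here and exits 0.
    exit 0 ;;
  validate-all)
    # Check the supplied parameters; 0 = valid.
    exit 0 ;;
  *)
    # Unimplemented action: OCF_ERR_UNIMPLEMENTED.
    exit 3 ;;
esac
----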
Additional requirements (not part of the OCF specification) are placed on
agents that will be used for advanced concepts such as clone resources.
.Optional Actions for OCF Resource Agents
[width="95%",cols="2m,6,3",options="header",align="center"]
|=========================================================
|Action
|Description
|Instructions
|promote
|Promote the local instance of a promotable clone resource to the master (primary) state.
|Return 0 on success
indexterm:[promote,OCF Action]
indexterm:[OCF,Action,promote]
|demote
|Demote the local instance of a promotable clone resource to the slave (secondary) state.
|Return 0 on success
indexterm:[demote,OCF Action]
indexterm:[OCF,Action,demote]
|notify
|Used by the cluster to send the agent pre- and post-notification
events telling the resource what has happened and will happen.
|Must not fail. Must exit with 0
indexterm:[notify,OCF Action]
indexterm:[OCF,Action,notify]
|=========================================================
One action specified in the OCF specs, +recover+, is not currently used by the
cluster. It is intended to be a variant of the +start+ action that tries to
recover a resource locally.
[IMPORTANT]
====
If you create a new OCF resource agent, use indexterm:[ocf-tester]`ocf-tester`
to verify that the agent complies with the OCF standard properly.
====
=== How are OCF Return Codes Interpreted? ===
The first thing the cluster does is to check the return code against
the expected result. If the result does not match the expected value,
then the operation is considered to have failed, and recovery action is
initiated.
There are three types of failure recovery:
.Types of recovery performed by the cluster
[width="95%",cols="1m,4,4",options="header",align="center"]
|=========================================================
|Type
|Description
|Action Taken by the Cluster
|soft
|A transient error occurred
|Restart the resource or move it to a new location
indexterm:[soft,OCF error]
indexterm:[OCF,error,soft]
|hard
|A non-transient error that may be specific to the current node occurred
|Move the resource elsewhere and prevent it from being retried on the current node
indexterm:[hard,OCF error]
indexterm:[OCF,error,hard]
|fatal
|A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified)
|Stop the resource and prevent it from being started on any cluster node
indexterm:[fatal,OCF error]
indexterm:[OCF,error,fatal]
|=========================================================
[[s-ocf-return-codes]]
=== OCF Return Codes ===
The following table outlines the different OCF return codes and the type of
recovery the cluster will initiate when a failure code is received.
Although counterintuitive, even actions that return 0
(aka. +OCF_SUCCESS+) can be considered to have failed, if 0 was not
the expected return value.
.OCF Return Codes and their Recovery Types
[width="95%",cols="1m,<4m,<6,1m",options="header",align="center"]
|=========================================================
|RC
|OCF Alias
|Description
|RT
|0
|OCF_SUCCESS
|Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.
indexterm:[Return Code,OCF_SUCCESS]
indexterm:[Return Code,0,OCF_SUCCESS]
|soft
|1
|OCF_ERR_GENERIC
|Generic "there was a problem" error code.
indexterm:[Return Code,OCF_ERR_GENERIC]
indexterm:[Return Code,1,OCF_ERR_GENERIC]
|soft
|2
|OCF_ERR_ARGS
|The resource's configuration is not valid on this machine. E.g. it refers to a location not found on the node.
indexterm:[Return Code,OCF_ERR_ARGS]
indexterm:[Return Code,2,OCF_ERR_ARGS]
|hard
|3
|OCF_ERR_UNIMPLEMENTED
|The requested action is not implemented.
indexterm:[Return Code,OCF_ERR_UNIMPLEMENTED]
indexterm:[Return Code,3,OCF_ERR_UNIMPLEMENTED]
|hard
|4
|OCF_ERR_PERM
|The resource agent does not have sufficient privileges to complete the task.
indexterm:[Return Code,OCF_ERR_PERM]
indexterm:[Return Code,4,OCF_ERR_PERM]
|hard
|5
|OCF_ERR_INSTALLED
|The tools required by the resource are not installed on this machine.
indexterm:[Return Code,OCF_ERR_INSTALLED]
indexterm:[Return Code,5,OCF_ERR_INSTALLED]
|hard
|6
|OCF_ERR_CONFIGURED
|The resource's configuration is invalid. E.g. required parameters are missing.
indexterm:[Return Code,OCF_ERR_CONFIGURED]
indexterm:[Return Code,6,OCF_ERR_CONFIGURED]
|fatal
|7
|OCF_NOT_RUNNING
|The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action.
indexterm:[Return Code,OCF_NOT_RUNNING]
indexterm:[Return Code,7,OCF_NOT_RUNNING]
|N/A
|8
|OCF_RUNNING_MASTER
|The resource is running in master mode.
indexterm:[Return Code,OCF_RUNNING_MASTER]
indexterm:[Return Code,8,OCF_RUNNING_MASTER]
|soft
|9
|OCF_FAILED_MASTER
|The resource is in master mode but has failed. The resource will be demoted,
stopped and then started (and possibly promoted) again.
indexterm:[Return Code,OCF_FAILED_MASTER]
indexterm:[Return Code,9,OCF_FAILED_MASTER]
|soft
|other
|N/A
|Custom error code.
indexterm:[Return Code,other]
|soft
|=========================================================
Exceptions to the recovery handling described above:
* Probes (non-recurring monitor actions) that find a resource active
(or in master mode) will not result in recovery action unless it is
also found active elsewhere.
* The recovery action taken when a resource is found active more than
once is determined by the resource's +multiple-active+ property.
* Recurring actions that return +OCF_ERR_UNIMPLEMENTED+
do not cause any type of recovery.
== Init Script LSB Compliance ==
The relevant part of the
http://refspecs.linuxfoundation.org/lsb.shtml[LSB specifications]
includes a description of all the return codes listed here.
Assuming `some_service` is configured correctly and currently
inactive, the following sequence will help you determine if it is
LSB-compatible:
. Start (stopped):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
.. Did the service start?
.. Did the command print *result: 0* (in addition to its usual output)?
+
. Status (running):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
.. Did the script accept the command?
.. Did the script indicate the service was running?
.. Did the command print *result: 0* (in addition to its usual output)?
+
. Start (running):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
.. Is the service still running?
.. Did the command print *result: 0* (in addition to its usual output)?
+
. Stop (running):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
.. Was the service stopped?
.. Did the command print *result: 0* (in addition to its usual output)?
+
. Status (stopped):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
.. Did the script accept the command?
.. Did the script indicate the service was not running?
.. Did the command print *result: 3* (in addition to its usual output)?
+
. Stop (stopped):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
.. Is the service still stopped?
.. Did the command print *result: 0* (in addition to its usual output)?
+
. Status (failed):
+
.. This step is not readily testable and relies on manual inspection of the script.
+
The script can use one of the error codes (other than 3) listed in the
LSB spec to indicate that it is active but failed. This tells the
cluster that before moving the resource to another node, it needs to
stop it on the existing one first.
If the answer to any of the above questions is no, then the script is
not LSB-compliant. Your options are then to either fix the script or
write an OCF agent based on the existing script.
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Cluster.txt b/doc/Pacemaker_Administration/en-US/Ch-Cluster.txt
index 3a14d7cdf3..c346d1ab7f 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Cluster.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Cluster.txt
@@ -1,58 +1,59 @@
+:compat-mode: legacy
= The Cluster Layer =
== Pacemaker and the Cluster Layer ==
Pacemaker utilizes an underlying cluster layer for two purposes:
* obtaining quorum
* messaging between nodes
Currently, only Corosync 2 and later is supported for this layer.
== Managing Nodes in a Corosync-Based Cluster ==
=== Adding a New Corosync Node ===
indexterm:[Corosync,Add Cluster Node]
indexterm:[Add Cluster Node,Corosync]
To add a new node:
. Install Corosync and Pacemaker on the new host.
. Copy +/etc/corosync/corosync.conf+ and +/etc/corosync/authkey+ (if it exists)
from an existing node, as shown in the example after this list. You may need to
modify the *mcastaddr* option to match the new node's IP address.
. Start the cluster software on the new host. If a log message containing
"Invalid digest" appears from Corosync, the keys are not consistent between
the machines.
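For example, assuming the new host is reachable as *new-node* (a placeholder name)
and already has the cluster packages installed, the copy in the second step might
be done from an existing node like this:
----
# scp /etc/corosync/corosync.conf /etc/corosync/authkey new-node:/etc/corosync/
----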
=== Removing a Corosync Node ===
indexterm:[Corosync,Remove Cluster Node]
indexterm:[Remove Cluster Node,Corosync]
Because the messaging and membership layers are the authoritative
source for cluster nodes, deleting them from the CIB is not a complete
solution. First, one must arrange for corosync to forget about the
node (*pcmk-1* in the example below).
. Stop the cluster on the host to be removed. How to do this will vary with
your operating system and installed versions of cluster software, for example,
`pcs cluster stop` if you are using pcs for cluster management.
. From one of the remaining active cluster nodes, tell Pacemaker to forget
about the removed host, which will also delete the node from the CIB:
+
----
# crm_node -R pcmk-1
----
=== Replacing a Corosync Node ===
indexterm:[Corosync,Replace Cluster Node]
indexterm:[Replace Cluster Node,Corosync]
To replace an existing cluster node:
. Make sure the old node is completely stopped.
. Give the new machine the same hostname and IP address as the old one.
. Follow the procedure above for adding a node.
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Configuring.txt b/doc/Pacemaker_Administration/en-US/Ch-Configuring.txt
index 473e5b5299..5ca9dfc32e 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Configuring.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Configuring.txt
@@ -1,435 +1,436 @@
+:compat-mode: legacy
= Configuring Pacemaker =
== How Should the Configuration be Updated? ==
There are three basic rules for updating the cluster configuration:
* Rule 1 - Never edit the +cib.xml+ file manually. Ever. I'm not making this up.
* Rule 2 - Read Rule 1 again.
* Rule 3 - The cluster will notice if you ignored rules 1 & 2 and refuse to use the configuration.
Now that it is clear how 'not' to update the configuration, we can begin
to explain how you 'should'.
=== Editing the CIB Using XML ===
The most powerful tool for modifying the configuration is the
+cibadmin+ command. With +cibadmin+, you can query, add, remove, update
or replace any part of the configuration. All changes take effect immediately,
so there is no need to perform a reload-like operation.
The simplest way of using `cibadmin` is to use it to save the current
configuration to a temporary file, edit that file with your favorite
text or XML editor, and then upload the revised configuration. footnote:[This
process might appear to risk overwriting changes that happen after the initial
cibadmin call, but pacemaker will reject any update that is "too old". If the
CIB is updated in some other fashion after the initial cibadmin, the second
cibadmin will be rejected because the version number will be too low.]
.Safely using an editor to modify the cluster configuration
======
--------
# cibadmin --query > tmp.xml
# vi tmp.xml
# cibadmin --replace --xml-file tmp.xml
--------
======
Some of the better XML editors can make use of a Relax NG schema to
help make sure any changes you make are valid. The schema describing
the configuration can be found in +pacemaker.rng+, which may be
deployed in a location such as +/usr/share/pacemaker+ or
+/usr/lib/heartbeat+ depending on your operating system and how you
installed the software.
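For instance, assuming the schema is installed under +/usr/share/pacemaker+, a
saved copy of the configuration could be checked from the command line with
`xmllint`:
----
# xmllint --relaxng /usr/share/pacemaker/pacemaker.rng tmp.xml
----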
If you want to modify just one section of the configuration, you can
query and replace just that section to avoid modifying any others.
.Safely using an editor to modify only the resources section
======
--------
# cibadmin --query --scope resources > tmp.xml
# vi tmp.xml
# cibadmin --replace --scope resources --xml-file tmp.xml
--------
======
=== Quickly Deleting Part of the Configuration ===
Identify the object you wish to delete by XML tag and id. For example,
you might search the CIB for all STONITH-related configuration:
.Searching for STONITH-related configuration items
======
----
# cibadmin -Q | grep stonith
----
======
If you wanted to delete the +primitive+ tag with id +child_DoFencing+,
you would run:
----
# cibadmin --delete --xml-text '<primitive id="child_DoFencing"/>'
----
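Alternatively, recent versions of `cibadmin` can address the object with an
XPath expression instead of an XML snippet (the CIB accepts XPath-based
operations); a sketch equivalent to the command above:
----
# cibadmin --delete --xpath '//primitive[@id="child_DoFencing"]'
----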
=== Updating the Configuration Without Using XML ===
Most tasks can be performed with one of the other command-line
tools provided with pacemaker, avoiding the need to read or edit XML.
To enable STONITH for example, one could run:
----
# crm_attribute --name stonith-enabled --update 1
----
Or, to check whether *somenode* is allowed to run resources, there is:
----
# crm_standby --query --node somenode
----
Or, to find the current location of *my-test-rsc*, one can use:
----
# crm_resource --locate --resource my-test-rsc
----
Examples of using these tools for specific cases will be given throughout this
document where appropriate.
[[s-config-sandboxes]]
== Making Configuration Changes in a Sandbox ==
Often it is desirable to preview the effects of a series of changes
before updating the configuration all at once. For this purpose, we
have created `crm_shadow` which creates a
"shadow" copy of the configuration and arranges for all the command
line tools to use it.
To begin, simply invoke `crm_shadow --create` with
the name of a configuration to create footnote:[Shadow copies are
identified with a name, making it possible to have more than one.],
and follow the simple on-screen instructions.
[WARNING]
====
Read this section and the on-screen instructions carefully; failure to do so could
result in destroying the cluster's active configuration!
====
.Creating and displaying the active sandbox
======
----
# crm_shadow --create test
Setting up shadow instance
Type Ctrl-D to exit the crm_shadow shell
shadow[test]:
shadow[test] # crm_shadow --which
test
----
======
From this point on, all cluster commands will automatically use the
shadow copy instead of talking to the cluster's active configuration.
Once you have finished experimenting, you can either make the
changes active via the `--commit` option, or discard them using the `--delete`
option. Again, be sure to follow the on-screen instructions carefully!
For a full list of `crm_shadow` options and
commands, invoke it with the `--help` option.
.Use sandbox to make multiple changes all at once, discard them, and verify real configuration is untouched
======
----
shadow[test] # crm_failcount -r rsc_c001n01 -G
scope=status name=fail-count-rsc_c001n01 value=0
shadow[test] # crm_standby --node c001n02 -v on
shadow[test] # crm_standby --node c001n02 -G
scope=nodes name=standby value=on
shadow[test] # cibadmin --erase --force
shadow[test] # cibadmin --query
shadow[test] # crm_shadow --delete test --force
Now type Ctrl-D to exit the crm_shadow shell
shadow[test] # exit
# crm_shadow --which
No active shadow configuration defined
# cibadmin -Q
----
======
[[s-config-testing-changes]]
== Testing Your Configuration Changes ==
We saw previously how to make a series of changes to a "shadow" copy
of the configuration. Before loading the changes back into the
cluster (e.g. `crm_shadow --commit mytest --force`), it is often
advisable to simulate the effect of the changes with +crm_simulate+.
For example:
----
# crm_simulate --live-check -VVVVV --save-graph tmp.graph --save-dotfile tmp.dot
----
This tool uses the same library as the live cluster to show what it
would have done given the supplied input. Its output, in addition to
a significant amount of logging, is stored in two files +tmp.graph+
and +tmp.dot+. Both files are representations of the same thing: the
cluster's response to your changes.
The graph file stores the complete transition from the existing cluster state
to your desired new state, containing a list of all the actions, their
parameters and their pre-requisites. Because the transition graph is not
terribly easy to read, the tool also generates a Graphviz
footnote:[Graph visualization software. See http://www.graphviz.org/ for details.]
dot-file representing the same information.
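If Graphviz is installed, the dot file can be rendered to an image with the
standard Graphviz tools, for example:
----
# dot -Tsvg tmp.dot -o tmp.svg
----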
For information on the options supported by `crm_simulate`, use
its `--help` option.
.Interpreting the Graphviz output
* Arrows indicate ordering dependencies
* Dashed arrows indicate dependencies that are not present in the transition graph
* Actions with a dashed border of any color do not form part of the transition graph
* Actions with a green border form part of the transition graph
* Actions with a red border are ones the cluster would like to execute but cannot run
* Actions with a blue border are ones the cluster does not feel need to be executed
* Actions with orange text are pseudo/pretend actions that the cluster uses to simplify the graph
* Actions with black text are sent to the LRM
* Resource actions have text of the form 'rsc'_'action'_'interval' 'node'
* Any action depending on an action with a red border will not be able to execute.
* Loops are _really_ bad. Please report them to the development team.
=== Small Cluster Transition ===
image::images/Policy-Engine-small.png["An example transition graph as represented by Graphviz",width="16cm",height="6cm",align="center"]
In the above example, it appears that a new node, *pcmk-2*, has come
online and that the cluster is checking to make sure *rsc1*, *rsc2*
and *rsc3* are not already running there (indicated by the
*rscN_monitor_0* entries). Once it has done that, and assuming the resources
are not active there, it would like to stop *rsc1* and *rsc2*
on *pcmk-1* and move them to *pcmk-2*. However, there appears to be
some problem, and the cluster cannot or is not permitted to perform the
stop actions, which implies it also cannot perform the start actions.
For some reason, the cluster does not want to start *rsc3* anywhere.
=== Complex Cluster Transition ===
image::images/Policy-Engine-big.png["Another, slightly more complex, transition graph that you're not expected to be able to read",width="16cm",height="20cm",align="center"]
== Do I Need to Update the Configuration on All Cluster Nodes? ==
No. Any changes are immediately synchronized to the other active
members of the cluster.
To reduce bandwidth, the cluster only broadcasts the incremental
updates that result from your changes and uses MD5 checksums to ensure
that each copy is completely consistent.
== Working with CIB Properties ==
Although these fields can be written to by the user, in
most cases the cluster will overwrite any values specified by the
user with the "correct" ones.
To change the ones that can be specified by the user,
for example +admin_epoch+, one should use:
----
# cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
----
A complete set of CIB properties will look something like this:
.Attributes set for a cib object
======
[source,XML]
-------
-------
======
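Because these properties live on the top-level +cib+ tag, a quick way to see
their current values is to query just the first line of the CIB (the exact
attributes present will vary by cluster and version):
----
# cibadmin --query | head -n 1
----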
== Querying and Setting Cluster Options ==
indexterm:[Querying,Cluster Option]
indexterm:[Setting,Cluster Option]
indexterm:[Cluster,Querying Options]
indexterm:[Cluster,Setting Options]
Cluster options can be queried and modified using the `crm_attribute` tool. To
get the current value of +cluster-delay+, you can run:
----
# crm_attribute --query --name cluster-delay
----
which is more simply written as
----
# crm_attribute -G -n cluster-delay
----
If a value is found, you'll see a result like this:
----
# crm_attribute -G -n cluster-delay
scope=crm_config name=cluster-delay value=60s
----
If no value is found, the tool will display an error:
----
# crm_attribute -G -n clusta-deway
scope=crm_config name=clusta-deway value=(null)
Error performing operation: No such device or address
----
To use a different value (for example, 30 seconds), simply run:
----
# crm_attribute --name cluster-delay --update 30s
----
To go back to the cluster's default value, you can delete the value, for example:
----
# crm_attribute --name cluster-delay --delete
Deleted crm_config option: id=cib-bootstrap-options-cluster-delay name=cluster-delay
----
=== When Options are Listed More Than Once ===
If you ever see something like the following, it means that the option you're modifying is present more than once.
.Deleting an option that is listed twice
=======
------
# crm_attribute --name batch-limit --delete
Multiple attributes match name=batch-limit in crm_config:
Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit)
Value: 100 (set=custom, id=custom-batch-limit)
Please choose from one of the matches above and supply the 'id' with --id
-------
=======
In such cases, follow the on-screen instructions to perform the
requested action. To determine which value is currently being used by
the cluster, refer to the 'Rules' chapter of 'Pacemaker Explained'.
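For example, following the hint above and using the ids from the listing, the
duplicate defined in the +custom+ set could be removed with:
----
# crm_attribute --name batch-limit --delete --id custom-batch-limit
----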
[[s-remote-connection]]
== Connecting from a Remote Machine ==
indexterm:[Cluster,Remote connection]
indexterm:[Cluster,Remote administration]
Provided Pacemaker is installed on a machine, it is possible to
connect to the cluster even if the machine itself is not in the same
cluster. To do this, one simply sets up a number of environment
variables and runs the same commands as when working on a cluster
node.
.Environment Variables Used to Connect to Remote Instances of the CIB
[width="95%",cols="1m,1,<3",options="header",align="center"]
|=========================================================
|Environment Variable
|Default
|Description
|CIB_user
|$USER
|The user to connect as. Needs to be part of the +haclient+ group on
the target host.
indexterm:[Environment Variable,CIB_user]
|CIB_passwd
|
|The user's password. Read from the command line if unset.
indexterm:[Environment Variable,CIB_passwd]
|CIB_server
|localhost
|The host to contact
indexterm:[Environment Variable,CIB_server]
|CIB_port
|
|The port on which to contact the server; required.
indexterm:[Environment Variable,CIB_port]
|CIB_encrypted
|TRUE
|Whether to encrypt network traffic
indexterm:[Environment Variable,CIB_encrypted]
|=========================================================
So, if *c001n01* is an active cluster node and is listening on port 1234
for connections, and *someuser* is a member of the *haclient* group,
then the following would prompt for *someuser*'s password and return
the cluster's current configuration:
----
# export CIB_port=1234; export CIB_server=c001n01; export CIB_user=someuser;
# cibadmin -Q
----
For security reasons, the cluster does not listen for remote
connections by default. If you wish to allow remote access, you need
to set the +remote-tls-port+ (encrypted) or +remote-clear-port+
(unencrypted) CIB properties (i.e., those kept in the +cib+ tag, like
+num_updates+ and +epoch+).
.Extra top-level CIB properties for remote access
[width="95%",cols="1m,1,<3",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|remote-tls-port
|_none_
|Listen for encrypted remote connections on this port.
indexterm:[remote-tls-port,Remote Connection Option]
indexterm:[Remote Connection,Option,remote-tls-port]
|remote-clear-port
|_none_
|Listen for plaintext remote connections on this port.
indexterm:[remote-clear-port,Remote Connection Option]
indexterm:[Remote Connection,Option,remote-clear-port]
|=========================================================
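For example, to accept encrypted remote connections on TCP port 1234 (matching
the example port above), one could set the corresponding property directly on
the +cib+ tag:
----
# cibadmin --modify --xml-text '<cib remote-tls-port="1234"/>'
----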
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Installing.txt b/doc/Pacemaker_Administration/en-US/Ch-Installing.txt
index dd227b32d8..75aa566c2d 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Installing.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Installing.txt
@@ -1,104 +1,105 @@
+:compat-mode: legacy
= Installing Cluster Software =
== Installing the Software ==
Most major Linux distributions have pacemaker packages in their standard
package repositories, or the software can be built from source code.
See the http://clusterlabs.org/wiki/Install[Install wiki page] for details.
== Enabling Pacemaker ==
=== Enabling Pacemaker For Corosync version 2 and greater ===
High-level cluster management tools are available that can configure
corosync for you. This document focuses on the lower-level details
if you want to configure corosync yourself.
Corosync configuration is normally located in
+/etc/corosync/corosync.conf+.
.Corosync configuration file for two nodes *myhost1* and *myhost2*
====
----
totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}
nodelist {
node {
ring0_addr: myhost1
nodeid: 1
}
node {
ring0_addr: myhost2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_syslog: yes
}
----
====
.Corosync configuration file for three nodes *myhost1*, *myhost2* and *myhost3*
====
----
totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}
nodelist {
node {
ring0_addr: myhost1
nodeid: 1
}
node {
ring0_addr: myhost2
nodeid: 2
}
node {
ring0_addr: myhost3
nodeid: 3
}
}
quorum {
provider: corosync_votequorum
}
logging {
to_syslog: yes
}
----
====
In the above examples, the +totem+ section defines what protocol version and
options (including encryption) to use,
footnote:[
Please consult the Corosync website (http://www.corosync.org/) and
documentation for details on enabling encryption and peer authentication for
the cluster.
]
and gives the cluster a unique name (+mycluster+ in these examples).
The +nodelist+ section lists the nodes in this cluster.
The +quorum+ section defines how the cluster uses quorum.
The important thing is that two-node clusters must be handled specially,
so +two_node: 1+ must be defined for two-node clusters (and only for two-node
clusters).
The +logging+ section should be self-explanatory.
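The configuration file should normally be identical on every node. As a minimal
sketch, assuming a systemd-based system and that the file was edited on
*myhost1*, copy it to the other node(s) and then start the cluster services on
each node:
----
# scp /etc/corosync/corosync.conf myhost2:/etc/corosync/corosync.conf
# systemctl start corosync pacemaker
----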
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Intro.txt b/doc/Pacemaker_Administration/en-US/Ch-Intro.txt
index 60b750761c..2686733e2c 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Intro.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Intro.txt
@@ -1,19 +1,20 @@
+:compat-mode: legacy
= Read-Me-First =
== The Scope of this Document ==
The purpose of this document is to help system administrators learn how to
manage a Pacemaker cluster.
System administrators may be interested in other parts of the
https://www.clusterlabs.org/pacemaker/doc/[Pacemaker documentation set],
such as 'Clusters from Scratch', a step-by-step guide to setting up an
example cluster, and 'Pacemaker Explained', an exhaustive reference for
cluster configuration.
Multiple higher-level tools (both command-line and GUI) are available to
simplify cluster management. However, this document focuses on the lower-level
command-line tools that come with Pacemaker itself. The concepts are applicable
to the higher-level tools, though the syntax would differ.
include::../../shared/en-US/pacemaker-intro.txt[]
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Monitoring.txt b/doc/Pacemaker_Administration/en-US/Ch-Monitoring.txt
index b9edabae2a..9792d5ceff 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Monitoring.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Monitoring.txt
@@ -1,60 +1,61 @@
+:compat-mode: legacy
= Monitoring a Pacemaker Cluster =
== Using crm_mon ==
The `crm_mon` utility displays the current state of an active cluster. It can
show the cluster status organized by node or by resource, and can be used in
either single-shot or dynamically updating mode. It can also display operations
performed and information about failures.
Using this tool, you can examine the state of the cluster for irregularities,
and see how it responds when you cause or simulate failures.
See the manual page or the output of `crm_mon --help` for a full description of
its many options.
.Sample output from crm_mon -1
======
-------
Stack: corosync
Current DC: node2 (version 2.0.0-1) - partition with quorum
Last updated: Mon Jan 29 12:18:42 2018
Last change: Mon Jan 29 12:18:40 2018 by root via crm_attribute on node3
5 nodes configured
2 resources configured
Online: [ node1 node2 node3 node4 node5 ]
Active resources:
Fencing (stonith:fence_xvm): Started node1
IP (ocf:heartbeat:IPaddr2): Started node2
-------
======
.Sample output from crm_mon -n -1
======
-------
Stack: corosync
Current DC: node2 (version 2.0.0-1) - partition with quorum
Last updated: Mon Jan 29 12:21:48 2018
Last change: Mon Jan 29 12:18:40 2018 by root via crm_attribute on node3
5 nodes configured
2 resources configured
Node node1: online
Fencing (stonith:fence_xvm): Started
Node node2: online
IP (ocf:heartbeat:IPaddr2): Started
Node node3: online
Node node4: online
Node node5: online
-------
======
As mentioned in an earlier chapter, the DC is the node where decisions are
made. The cluster elects a node to be DC as needed. The only significance of
the choice of DC to an administrator is the fact that its logs will have the
most information about why decisions were made.
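For example, the following commands give a one-shot status view that includes
resource fail counts, and a continuously updating console display,
respectively:
----
# crm_mon -1 --failcounts
# crm_mon
----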
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt b/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
index e6c7ecc38a..166a98c4f7 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
@@ -1,454 +1,455 @@
+:compat-mode: legacy
= Upgrading a Pacemaker Cluster =
== Pacemaker Versioning ==
Pacemaker has an overall release version, plus separate version numbers for
certain internal components.
* *Pacemaker release version:* This version consists of three numbers
(_x.y.z_).
+
The major version number (the _x_ in _x.y.z_) increases when at least some
rolling upgrades are not possible from the previous major version. For example,
a rolling upgrade from 1.0.8 to 1.1.15 should always be supported, but a
rolling upgrade from 1.0.8 to 2.0.0 may not be possible.
+
The minor version (the _y_ in _x.y.z_) increases when there are significant
changes in cluster default behavior, tool behavior, and/or the API interface
(for software that utilizes Pacemaker libraries). The main benefit is to alert
you to pay closer attention to the release notes, to see if you might be
affected.
+
The release counter (the _z_ in _x.y.z_) is increased with all public releases
of Pacemaker, which typically include both bug fixes and new features.
* *CRM feature set:* This version number applies to the communication between
full cluster nodes, and is used to avoid problems in mixed-version clusters.
+
The major version number increases when nodes with different versions would not
work (rolling upgrades are not allowed). The minor version number increases
when mixed-version clusters are allowed only during rolling upgrades. The
minor-minor version number is ignored, but allows resource agents to detect
cluster support for various features. footnote:[
Before CRM feature set 3.1.0 (Pacemaker 2.0.0), the minor-minor
version number was treated the same as the minor version.
]
+
Pacemaker ensures that the longest-running node is the cluster's DC. This
ensures new features are not enabled until all nodes are upgraded to support
them.
* *LRMD protocol version:* This version applies to communication between a
Pacemaker Remote node and the cluster. It increases when an older cluster
node would have problems hosting the connection to a newer Pacemaker Remote
node. To avoid these problems, Pacemaker Remote nodes will accept connections
only from cluster nodes with the same or newer LRMD protocol version.
+
Unlike with CRM feature set differences between full cluster nodes,
mixed LRMD protocol versions between Pacemaker Remote nodes and full cluster
nodes are fine, as long as the Pacemaker Remote nodes have the older version.
This can be useful, for example, to host a legacy application in an
older operating system version used as a Pacemaker Remote node.
* *XML schema version:* Pacemaker’s configuration syntax — what's allowed in
the Configuration Information Base (CIB) — has its own version. This allows
the configuration syntax to evolve over time while still allowing clusters
with older configurations to work without change.
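Before planning an upgrade, it can be useful to check which of these versions a
node is actually running; for example (a minimal sketch):
----
# pacemakerd --version
# cibadmin --query | grep validate-with
----
The first command reports the installed Pacemaker release, and the second shows
which CIB schema version the configuration currently uses.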
== Upgrading Cluster Software ==
There are three approaches to upgrading a cluster, each with advantages and
disadvantages.
.Upgrade Methods
[width="95%",cols="s,6*",options="header",align="center"]
|=========================================================
|Method
|Available between all versions
|Can be used with Pacemaker Remote nodes
|Service outage during upgrade
|Service recovery during upgrade
|Exercises failover logic
|Allows change of messaging layer
indexterm:[Cluster,switching between stacks]
indexterm:[Changing cluster stack]
footnote:[Currently, Corosync version 2 and greater is the only supported
cluster stack, but other stacks have been supported by past versions, and may
be supported by future versions.]
|Complete cluster shutdown
indexterm:[upgrade,shutdown]
indexterm:[shutdown upgrade]
|yes
|yes
|always
|N/A
|no
|yes
|Rolling (node by node)
indexterm:[upgrade,rolling]
indexterm:[rolling upgrade]
|no
|yes
|always
footnote:[Any active resources will be moved off the node being upgraded,
so there will be at least a brief outage unless all resources can be
migrated "live".]
|yes
|yes
|no
|Detach and reattach
indexterm:[upgrade,reattach]
indexterm:[reattach upgrade]
|yes
|no
|only due to failure
|no
|no
|yes
|=========================================================
=== Complete Cluster Shutdown ===
In this scenario, one shuts down all cluster nodes and resources,
then upgrades all the nodes before restarting the cluster.
. On each node:
.. Shut down the cluster software (pacemaker and the messaging layer).
.. Upgrade the Pacemaker software. This may also include upgrading the
messaging layer and/or the underlying operating system.
.. Check the configuration with the `crm_verify` tool.
. On each node:
.. Start the cluster software.
Currently, only Corosync version 2 and greater is supported as the cluster
layer, but if another stack is supported in the future, the stack does not
need to be the same one as before the upgrade.
One variation of this approach is to build a new cluster on new hosts.
This allows the new version to be tested beforehand, and minimizes downtime by
having the new nodes ready to be placed in production as soon as the old nodes
are shut down.
=== Rolling (node by node) ===
In this scenario, each node is removed from the cluster, upgraded, and then
brought back online, until all nodes are running the newest version.
Special considerations when planning a rolling upgrade:
* If you plan to upgrade other cluster software -- such as the messaging layer --
at the same time, consult that software's documentation for its compatibility
with a rolling upgrade.
* If the major version number is changing in the Pacemaker version you are
upgrading to, a rolling upgrade may not be possible. Read the new version's
release notes (as well the information here) for what limitations may exist.
* If the CRM feature set is changing in the Pacemaker version you are upgrading
to, you should run a mixed-version cluster only during a small rolling
upgrade window. If one of the older nodes drops out of the cluster for any
reason, it will not be able to rejoin until it is upgraded.
* If the LRMD protocol version is changing, all cluster nodes should be
upgraded before upgrading any Pacemaker Remote nodes.
See the ClusterLabs wiki's
http://clusterlabs.org/wiki/ReleaseCalendar[Release Calendar] to figure out
whether the CRM feature set and/or LRMD protocol version changed between the
Pacemaker release versions in your rolling upgrade.
To perform a rolling upgrade, on each node in turn (a condensed command sketch
follows this list):
. Put the node into standby mode, and wait for any active resources
to be moved cleanly to another node. (This step is optional, but
allows you to deal with any resource issues before the upgrade.)
. Shut down the cluster software (pacemaker and the messaging layer) on the node.
. Upgrade the Pacemaker software. This may also include upgrading the
messaging layer and/or the underlying operating system.
. If this is the first node to be upgraded, check the configuration
with the `crm_verify` tool.
. Start the messaging layer.
This must be the same messaging layer (currently only Corosync version 2 and
greater is supported) that the rest of the cluster is using.
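Condensed into commands, one iteration might look like the following sketch,
run on (or against) the node being upgraded (*pcmk-1* here); the package
manager and package names are examples and will vary by distribution. The first
and last commands put the node into and out of standby; the middle three stop
the cluster software, upgrade it, and rejoin the cluster:
----
# crm_standby --node pcmk-1 -v on
# pcs cluster stop
# yum update pacemaker corosync
# pcs cluster start
# crm_standby --node pcmk-1 -v off
----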
[NOTE]
====
Even if a rolling upgrade from the current version of the cluster to the newest
version is not directly possible, it may be possible to perform a rolling
upgrade in multiple steps, by upgrading to an intermediate version first.
.Version Compatibility Table
[width="95%",cols="2*",options="header",align="center"]
|=========================================================
|Version being Installed
|Oldest Compatible Version
|Pacemaker 2.y.z
|Pacemaker 1.1.11
footnote:[Rolling upgrades from Pacemaker 1.1.z to 2.y.z are possible only if
the cluster uses corosync version 2 or greater as its messaging layer, and the
Cluster Information Base (CIB) uses schema 1.0 or higher in its validate-with
property.]
|Pacemaker 1.y.z
|Pacemaker 1.0.0
|Pacemaker 0.7.z
|Pacemaker 0.6.z
|=========================================================
====
=== Detach and Reattach ===
The reattach method is a variant of a complete cluster shutdown, where the
resources are left active and get re-detected when the cluster is restarted.
This method may not be used if the cluster contains any Pacemaker Remote nodes.
. Tell the cluster to stop managing services. This is required to allow the
services to remain active after the cluster shuts down.
+
----
# crm_attribute --name maintenance-mode --update true
----
. On each node, shut down the cluster software (pacemaker and the messaging
layer), and upgrade the Pacemaker software. This may also include upgrading
the messaging layer. While the underlying operating system may be upgraded
at the same time, that will be more likely to cause outages in the detached
services (certainly, if a reboot is required).
. Check the configuration with the `crm_verify` tool.
. On each node, start the cluster software.
Currently, only Corosync version 2 and greater is supported as the cluster
layer, but if another stack is supported in the future, the stack does not
need to be the same one as before the upgrade.
. Verify that the cluster re-detected all resources correctly.
. Allow the cluster to resume managing resources again:
+
----
# crm_attribute --name maintenance-mode --delete
----
== Upgrading the Configuration ==
indexterm:[upgrade,Configuration]
indexterm:[Configuration,upgrading]
The CIB schema version can change from one Pacemaker version to another.
After cluster software is upgraded, the cluster will continue to use
the older schema version that it was previously using. This can be useful, for
example, when administrators have written tools that modify the configuration,
and are based on the older syntax.
footnote:[As of Pacemaker 2.0.0, only schema versions pacemaker-1.0 and higher
are supported (excluding pacemaker-1.1, which was an experimental schema
now known as pacemaker-next).]
However, when using an older syntax, new features may be unavailable, and there
is a performance impact, since the cluster must do a non-persistent
configuration upgrade before each transition. So while using the old syntax is
possible, it is not advisable to continue using it indefinitely.
Even if you wish to continue using the old syntax, it is a good idea to
follow the upgrade procedure outlined below, except for the last step, to ensure
that the new software has no problems with your existing configuration (since it
will perform much the same task internally).
If you are brave, it is sufficient simply to run `cibadmin --upgrade`.
A more cautious approach would proceed like this:
. Create a shadow copy of the configuration. The later commands will automatically
operate on this copy, rather than the live configuration.
+
-----
# crm_shadow --create shadow
-----
. Verify the configuration is valid with the new software (which may be
stricter about syntax mistakes, or may have dropped support for deprecated
features):
indexterm:[Configuration,verify]
indexterm:[verify,Configuration]
+
-----
# crm_verify --live-check
-----
. Fix any errors or warnings.
. Perform the upgrade:
+
-----
# cibadmin --upgrade
-----
. If this step fails, there are three main possibilities:
.. The configuration was not valid to start with (did you do steps 2 and 3?).
.. The transformation failed - http://bugs.clusterlabs.org/[report a bug] or
mailto:users@clusterlabs.org?subject=Transformation%20failed%20during%20upgrade[email the project].
.. The transformation was successful but produced an invalid result.
+
If the result of the transformation is invalid, you may see a number of errors
from the validation library. If these are not helpful, visit the
http://clusterlabs.org/wiki/Validation_FAQ[Validation FAQ wiki page] and/or try
the manual upgrade procedure described below.
+
. Check the changes:
+
-----
# crm_shadow --diff
-----
+
If at this point there is anything about the upgrade that you wish to fine-tune
(for example, to change some of the automatic IDs), now is the time to do so:
+
-----
# crm_shadow --edit
-----
+
This will open the configuration in your favorite editor (whichever is
specified by the standard *$EDITOR* environment variable).
+
. Preview how the cluster will react:
+
------
# crm_simulate --live-check --save-dotfile shadow.dot -S
# dot -Tsvg shadow.dot -o shadow.svg
------
+
Verify that either no resource actions will occur or that you are
happy with any that are scheduled. If the output contains actions you
do not expect (possibly due to changes to the score calculations), you
may need to make further manual changes. See
<<s-config-testing-changes>> for further details on how to interpret
the output of `crm_simulate` and Graphviz.
+
. Upload the changes:
+
-----
# crm_shadow --commit shadow --force
-----
+
In the unlikely event this step fails, please report a bug.
[NOTE]
====
indexterm:[Configuration,upgrade manually]
It is also possible to perform the configuration upgrade steps manually:
. Locate the +upgrade*.xsl+ conversion scripts provided with the source code. These will often
be installed in a location such as +/usr/share/pacemaker+, or may be obtained from
the https://github.com/ClusterLabs/pacemaker/tree/master/xml[source repository].
. Run the conversion scripts that apply to your older version, for example:
indexterm:[XML,convert]
+
-----
# xsltproc /path/to/upgrade06.xsl config06.xml > config10.xml
-----
+
. Locate the +pacemaker.rng+ schema (in the same location as the xsl files).
. Check the XML validity: indexterm:[validate configuration]indexterm:[Configuration,validate XML]
+
----
# xmllint --relaxng /path/to/pacemaker.rng config10.xml
----
The advantage of this method is that it can be performed without the
cluster running, and any validation errors are often more informative.
====
== What Changed in 2.0 ==
The main goal of the 2.0 release was to remove support for deprecated syntax,
along with some small changes in default configuration behavior and tool
behavior. Highlights:
* Only Corosync version 2 and greater is now supported as the underlying
cluster layer. Support for Heartbeat and Corosync 1 (including CMAN) is
removed.
* The Pacemaker detail log file is now stored in
/var/log/pacemaker/pacemaker.log by default.
* The record-pending cluster property now defaults to true, which
allows status tools such as crm_mon to show operations that are in
progress.
* Support for a number of deprecated build options, environment variables,
and configuration settings has been removed.
* The +master+ tag has been deprecated in favor of using a +clone+ tag with the
new +promotable+ meta-attribute set to +true+. "Master/slave" clone resources
are now referred to as "promotable" clone resources, though it will take
longer for the full terminology change to be completed.
* The public API for Pacemaker libraries that software applications can use
has changed significantly.
For a detailed list of changes, see the release notes and the
https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes[Pacemaker 2.0 Changes]
page on the ClusterLabs wiki.
== What Changed in 1.0 ==
=== New ===
* Failure timeouts.
* New section for resource and operation defaults.
* Tool for making offline configuration changes.
* +Rules, instance_attributes, meta_attributes+ and sets of operations can be defined once and referenced in multiple places.
* The CIB now accepts XPath-based create/modify/delete operations. See the `cibadmin` help text.
* Multi-dimensional colocation and ordering constraints.
* The ability to connect to the CIB from non-cluster machines.
* Allow recurring actions to be triggered at known times.
=== Changed ===
* Syntax
** All resource and cluster options now use dashes (-) instead of underscores (_)
** +master_slave+ was renamed to +master+
** The +attributes+ container tag was removed
** The operation field +pre-req+ has been renamed +requires+
** All operations must have an +interval+, +start+/+stop+ must have it set to zero
* The +stonith-enabled+ option now defaults to true.
* The cluster will refuse to start resources if +stonith-enabled+ is true (or unset) and no STONITH resources have been defined
* The attributes of colocation and ordering constraints were renamed for clarity.
* +resource-failure-stickiness+ has been replaced by +migration-threshold+.
* The parameters for command-line tools have been made consistent
* Switched to 'RelaxNG' schema validation and 'libxml2' parser
** id fields are now XML IDs which have the following limitations:
*** id's cannot contain colons (:)
*** id's cannot begin with a number
*** id's must be globally unique (not just unique for that tag)
** Some fields (such as those in constraints that refer to resources) are IDREFs.
+
This means that they must reference existing resources or objects in
order for the configuration to be valid. Removing an object which is
referenced elsewhere will therefore fail.
+
** The CIB representation, from which a MD5 digest is calculated to verify CIBs on the nodes, has changed.
+
This means that every CIB update will require a full refresh on any
upgraded nodes until the cluster is fully upgraded to 1.0. This will
result in significant performance degradation and it is therefore
highly inadvisable to run a mixed 1.0/0.6 cluster for any longer than
absolutely necessary.
+
* Ping node information no longer needs to be added to _ha.cf_.
+
Simply include the lists of hosts in your ping resource(s).
=== Removed ===
* Syntax
** It is no longer possible to set resource meta options as top-level
attributes. Use meta attributes instead.
** Resource and operation defaults are no longer read from
+crm_config+.
diff --git a/doc/Pacemaker_Development/en-US/Ch-Coding.txt b/doc/Pacemaker_Development/en-US/Ch-Coding.txt
index ecb228ae39..c0bfde984c 100644
--- a/doc/Pacemaker_Development/en-US/Ch-Coding.txt
+++ b/doc/Pacemaker_Development/en-US/Ch-Coding.txt
@@ -1,198 +1,199 @@
+:compat-mode: legacy
= C Coding Guidelines =
////
We prefer [[ch-NAME]], but older versions of asciidoc don't deal well
with that construct for chapter headings
////
anchor:ch-c-coding[Chapter 2, C Coding Guidelines]
== C Boilerplate ==
indexterm:[C,boilerplate]
indexterm:[licensing,C boilerplate]
Every C file should start like this:
====
[source,C]
----
/*
 * Copyright <YYYY[-YYYY]> Andrew Beekhof <andrew@beekhof.net>
 *
 * This source code is licensed under <LICENSE> WITHOUT ANY WARRANTY.
 */
----
====
+<YYYY>+ is the year the code was 'originally' created.
footnote:[
See the U.S. Copyright Office's https://www.copyright.gov/comp3/["Compendium
of U.S. Copyright Office Practices"], particularly "Chapter 2200: Notice of
Copyright", sections 2205.1(A) and 2205.1(F), or
https://techwhirl.com/updating-copyright-notices/["Updating Copyright
Notices"] for a more readable summary.
]
If the code is modified in later years, add +-YYYY+ with the most recent year
of modification.
+<LICENSE>+ should follow the policy set forth in the
https://github.com/ClusterLabs/pacemaker/blob/master/COPYING[+COPYING+] file,
generally one of "GNU General Public License version 2 or later (GPLv2+)"
or "GNU Lesser General Public License version 2.1 or later (LGPLv2.1+)".
== Formatting ==
=== Whitespace ===
indexterm:[C,whitespace]
- Indentation must be 4 spaces, no tabs.
- Do not leave trailing whitespace.
=== Line Length ===
- Lines should be no longer than 80 characters unless limiting line length
significantly impacts readability.
=== Pointers ===
indexterm:[C,pointers]
- The +*+ goes by the variable name, not the type:
====
[source,C]
----
char *foo;
----
====
- Use a space before the +*+ and after the closing parenthesis in a cast:
====
[source,C]
----
char *foo = (char *) bar;
----
====
=== Functions ===
indexterm:[C,functions]
- In the function definition, put the return type on its own line, and place
the opening brace by itself on a line:
====
[source,C]
----
static int
foo(void)
{
----
====
- For functions with enough arguments that they must break to the next line,
align arguments with the first argument:
====
[source,C]
----
static int
function_name(int bar, const char *a, const char *b,
const char *c, const char *d)
{
----
====
- If a function name gets really long, start the arguments on their own line
with 8 spaces of indentation:
====
[source,C]
----
static int
really_really_long_function_name_this_is_getting_silly_now(
int bar, const char *a, const char *b,
const char *c, const char *d)
{
----
====
=== Control Statements (if, else, while, for, switch) ===
- The keyword is followed by one space, then left parenthesis without space,
condition, right parenthesis, space, opening bracket on the same line.
+else+ and +else if+ are on the same line with the ending brace and opening
brace, separated by a space:
====
[source,C]
----
if (condition1) {
statement1;
} else if (condition2) {
statement2;
} else {
statement3;
}
----
====
- In a +switch+ statement, +case+ is indented one level, and the body of each
+case+ is indented by another level. The opening brace is on the same line as
+switch+.
====
[source,C]
----
switch (expression) {
case 0:
command1;
break;
case 1:
command2;
break;
default:
command3;
}
----
====
=== Operators ===
indexterm:[C,operators]
- Operators have spaces from both sides. Do not rely on operator precedence;
use parentheses when mixing operators with different priority.
- No space is used after opening parenthesis and before closing parenthesis.
====
[source,C]
----
x = a + b - (c * d);
----
====
== Naming Conventions ==
indexterm:[C,naming]
- Any exposed symbols in libraries (non-+static+ function names, type names,
etc.) must begin with a prefix appropriate to the library, for example,
+crm_+, +pe_+, +st_+, +lrm_+.
== vim Settings ==
indexterm:[vim]
Developers who use +vim+ to edit source code can add the following settings to
their +~/.vimrc+ file to follow Pacemaker C coding guidelines:
----
" follow Pacemaker coding guidelines when editing C source code files
filetype plugin indent on
au FileType c setlocal expandtab tabstop=4 softtabstop=4 shiftwidth=4 textwidth=80
autocmd BufNewFile,BufRead *.h set filetype=c
let c_space_errors = 1
----
diff --git a/doc/Pacemaker_Development/en-US/Ch-FAQ.txt b/doc/Pacemaker_Development/en-US/Ch-FAQ.txt
index 065ba04d94..26490e5a84 100644
--- a/doc/Pacemaker_Development/en-US/Ch-FAQ.txt
+++ b/doc/Pacemaker_Development/en-US/Ch-FAQ.txt
@@ -1,112 +1,113 @@
+:compat-mode: legacy
= Frequently Asked Questions =
[qanda]
Who is this document intended for?::
Anyone who wishes to read and/or edit the Pacemaker source code.
Casual contributors should feel free to read just this FAQ, and
consult other chapters as needed.
Where is the source code for Pacemaker?::
indexterm:[downloads]
indexterm:[source code]
indexterm:[git,GitHub]
The https://github.com/ClusterLabs/pacemaker[source code for Pacemaker] is
kept on https://github.com/[GitHub], as are all software projects under the
https://github.com/ClusterLabs[ClusterLabs] umbrella. Pacemaker uses
https://git-scm.com/[Git] for source code management. If you are a Git newbie,
the http://schacon.github.io/git/gittutorial.html[gittutorial(7) man page]
is an excellent starting point. If you're familiar with using Git from the
command line, you can create a local copy of the Pacemaker source code with:
`git clone https://github.com/ClusterLabs/pacemaker.git pacemaker`
What are the different Git branches and repositories used for?::
indexterm:[branches]
* The https://github.com/ClusterLabs/pacemaker/tree/master[master branch]
is the primary branch used for development.
* The https://github.com/ClusterLabs/pacemaker/tree/1.1[1.1 branch] contains
the latest official release, and normally does not receive any changes.
During the release cycle, it will contain release candidates for the
next official release, and will receive only bug fixes.
* The https://github.com/ClusterLabs/pacemaker-1.0[1.0 repository] is a
frozen snapshot of the 1.0 release series, and is no longer developed.
* Messages will be posted to the
http://clusterlabs.org/mailman/listinfo/developers[developers@clusterlabs.org]
mailing list during the release cycle, with instructions about which
branches to use when submitting requests.
How do I build from the source code?::
See https://github.com/ClusterLabs/pacemaker/blob/master/INSTALL.md[INSTALL.md]
in the main checkout directory.
What coding style should I follow?::
You'll be mostly fine if you simply follow the example of existing code.
When unsure, see the relevant chapter of this document for language-specific
recommendations. Pacemaker has grown and evolved organically over many years,
so you will see much code that doesn't conform to the current guidelines. We
discourage making changes solely to bring code into conformance, as any change
requires developer time for review and opens the possibility of adding bugs.
However, new code should follow the guidelines, and it is fine to bring lines
of older code into conformance when modifying that code for other reasons.
How should I format my Git commit messages?::
indexterm:[git,commit messages]
See existing examples in the git log. The first line should look like
+change-type: affected-code: explanation+ where +change-type+ can be
+Fix+ or +Bug+ for most bug fixes, +Feature+ for new features, +Log+ for
changes to log messages or handling, +Doc+ for changes to documentation or
comments, or +Test+ for changes in CTS and regression tests. You will
sometimes see +Low+, +Med+ (or +Mid+) and +High+ used instead for bug fixes,
to indicate the severity. The important thing is that only commits with
+Feature+, +Fix+, +Bug+, or +High+ will automatically be included in the
change log for the next release. The +affected-code+ is the name of the
component(s) being changed, for example, +pacemaker-controld+ or
+libcrmcommon+ (it's more free-form, so don't sweat getting it exact). The
+explanation+ briefly describes the change. The git project recommends the
entire summary line stay under 50 characters, but more is fine if needed for
clarity. Except for the most simple and obvious of changes, the summary should
be followed by a blank line and then a longer explanation of 'why' the change
was made.
How can I test my changes?::
Most importantly, Pacemaker has regression tests for most major components;
these will automatically be run for any pull requests submitted through
GitHub. Additionally, Pacemaker's Cluster Test Suite (CTS) can be used to set
up a test cluster and run a wide variety of complex tests. This document will
have more detail on testing in the future.
What is Pacemaker's license?::
indexterm:[licensing]
Except where noted otherwise in the file itself, the source code for all
Pacemaker programs is licensed under version 2 or later of the GNU General
Public License (https://www.gnu.org/licenses/gpl-2.0.html[GPLv2+]), its
headers and libraries under version 2.1 or later of the less restrictive
GNU Lesser General Public License
(https://www.gnu.org/licenses/lgpl-2.1.html[LGPLv2.1+]),
its documentation under version 4.0 or later of the
Creative Commons Attribution-ShareAlike International Public License
(https://creativecommons.org/licenses/by-sa/4.0/legalcode[CC-BY-SA]),
and its init scripts under the
https://opensource.org/licenses/BSD-3-Clause[Revised BSD] license. If you find
any deviations from this policy, or wish to inquire about alternate licensing
arrangements, please e-mail mailto:andrew@beekhof.net[andrew@beekhof.net].
Licensing issues are also discussed on the
http://clusterlabs.org/wiki/License[ClusterLabs wiki].
How can I contribute my changes to the project?::
Contributions of bug fixes or new features are very much appreciated!
Patches can be submitted as
https://help.github.com/articles/using-pull-requests/[pull requests]
via GitHub (the preferred method, due to its excellent
https://github.com/features/[features]), or e-mailed to the
http://clusterlabs.org/mailman/listinfo/developers[developers@clusterlabs.org]
mailing list as an attachment in a format Git can import.
What if I still have questions?::
indexterm:[mailing lists]
Ask on the
http://clusterlabs.org/mailman/listinfo/developers[developers@clusterlabs.org]
mailing list for development-related questions, or on the
http://clusterlabs.org/mailman/listinfo/users[users@clusterlabs.org]
mailing list for general questions about using Pacemaker.
Developers often also hang out on http://freenode.net/[freenode's]
#clusterlabs IRC channel.
diff --git a/doc/Pacemaker_Development/en-US/Ch-Python.txt b/doc/Pacemaker_Development/en-US/Ch-Python.txt
index f372dd87d8..bd450fc3c6 100644
--- a/doc/Pacemaker_Development/en-US/Ch-Python.txt
+++ b/doc/Pacemaker_Development/en-US/Ch-Python.txt
@@ -1,154 +1,155 @@
+:compat-mode: legacy
= Python Coding Guidelines =
////
We prefer [[ch-NAME]], but older versions of asciidoc don't deal well
with that construct for chapter headings
////
anchor:ch-python-coding[Chapter 3, Python Coding Guidelines]
[[s-python-boilerplate]]
== Python Boilerplate ==
indexterm:[Python,boilerplate]
indexterm:[licensing,Python boilerplate]
If a Python file is meant to be executed (as opposed to imported), it should
have a +.in+ extension, and its first line should be:
====
----
#!@PYTHON@
----
====
which will be replaced with the appropriate python executable when Pacemaker is
built. To make that happen, add an AC_CONFIG_FILES() line to configure.ac, and
add the file name without .in to .gitignore (see existing examples).
After the above line if any, every Python file should start like this:
====
[source,Python]
----
"""
"""
# Pacemaker targets compatibility with Python 2.7 and 3.2+
from __future__ import print_function, unicode_literals, absolute_import, division
__copyright__ = "Copyright Andrew Beekhof "
__license__ = " WITHOUT ANY WARRANTY"
----
====
If the file is meant to be directly executed, the first line (the shebang)
should be +#!/usr/bin/python+. If it is meant to be imported, omit this line.
+<BRIEF DESCRIPTION>+ is obviously a brief description of the file's
purpose. The string may contain any other information typically used in
a Python file https://www.python.org/dev/peps/pep-0257/[docstring].
The +import+ statement is discussed further in <<s-python-future-imports>>.
+<YYYY>+ is the year the code was 'originally' created.
footnote:[
See the U.S. Copyright Office's https://www.copyright.gov/comp3/["Compendium
of U.S. Copyright Office Practices"], particularly "Chapter 2200: Notice of
Copyright", sections 2205.1(A) and 2205.1(F), or
https://techwhirl.com/updating-copyright-notices/["Updating Copyright
Notices"] for a more readable summary.
]
If the code is modified in later years, add +-YYYY+ with the most recent year
of modification.
+<LICENSE>+ should follow the policy set forth in the
https://github.com/ClusterLabs/pacemaker/blob/master/COPYING[+COPYING+] file,
generally one of "GNU General Public License version 2 or later (GPLv2+)"
or "GNU Lesser General Public License version 2.1 or later (LGPLv2.1+)".
== Python Compatibility ==
indexterm:[Python,2]
indexterm:[Python,3]
indexterm:[Python,versions]
Pacemaker targets compatibility with Python 2.7, and Python 3.2 and
later. These versions have added features to be more compatible with each
other, allowing us to support both the 2 and 3 series with the same code. It is
a good idea to test any changes with both Python 2 and 3.
[[s-python-future-imports]]
=== Python Future Imports ===
The future imports used in <<s-python-boilerplate>> mean:
* All print statements must use parentheses, and printing without a newline
is accomplished with the +end=' '+ parameter rather than a trailing comma.
* All string literals will be treated as Unicode (the +u+ prefix is
unnecessary, and must not be used, because it is not available in Python 3.2).
* Local modules must be imported using +from . import+ (rather than just
+import+). To import one item from a local module, use
+from .modulename import+ (rather than +from modulename import+).
* Division using +/+ will always return a floating-point result (use +//+ if
you want the integer floor instead).
=== Other Python Compatibility Requirements ===
* When specifying an exception variable, always use +as+ instead of a comma
(e.g. +except Exception as e+ or +except (TypeError, IOError) as e+).
Use +e.args+ to access the error arguments (instead of iterating over or
subscripting +e+).
* Use +in+ (not +has_key()+) to determine if a dictionary has a particular key.
* Always use the I/O functions from the +io+ module rather than the native
I/O functions (e.g. +io.open()+ rather than +open()+).
* When opening a file, always use the +t+ (text) or +b+ (binary) mode flag.
* When creating classes, always specify a parent class to ensure that it is a
"new-style" class (e.g. +class Foo(object):+ rather than +class Foo:+)
* Be aware of the bytes type added in Python 3. Many places where strings are
used in Python 2 use bytes or bytearrays in Python 3 (for example, the pipes
used with +subprocess.Popen()+). Code should handle both possibilities.
* Be aware that the +items()+, +keys()+, and +values()+ methods of dictionaries
return lists in Python 2 and views in Python 3. In many cases, no special
handling is required, but if the code needs to use list methods on the
result, cast the result to list first.
* Do not raise or catch strings as exceptions (e.g. +raise "Bad thing"+).
* Do not use the +cmp+ parameter of sorting functions (use +key+ instead, if
needed) or the +$$__cmp__()$$+ method of classes (implement rich comparison
methods such as +$$__lt__()$$+ instead, if needed).
* Do not use the +buffer+ type.
* Do not use features not available in all targeted Python versions. Common
examples include:
** The +html+, +ipaddress+, and +UserDict+ modules
** The +subprocess.run()+ function
** The +subprocess.DEVNULL+ constant
** +subprocess+ module-specific exceptions
=== Python Usages to Avoid ===
Avoid the following if possible, otherwise research the compatibility issues
involved (hacky workarounds are often available):
* long integers
* octal integer literals
* mixed binary and string data in one data file or variable
* metaclasses
* +locale.strcoll+ and +locale.strxfrm+
* the +configparser+ and +ConfigParser+ modules
* importing compatibility modules such as +six+ (so we don't have
to add them to Pacemaker's dependencies)
== Formatting Python Code ==
indexterm:[Python,formatting]
* Indentation must be 4 spaces, no tabs.
* Do not leave trailing whitespace.
* Lines should be no longer than 80 characters unless limiting line length
significantly impacts readability. For Python, this limitation is
flexible since breaking a line often impacts readability, but
definitely keep it under 120 characters.
* Where not conflicting with this style guide, it is recommended (but not
required) to follow https://www.python.org/dev/peps/pep-0008/[PEP 8].
* It is recommended (but not required) to format Python code such that
`pylint --disable=line-too-long,too-many-lines,too-many-instance-attributes,too-many-arguments,too-many-statements`
produces minimal complaints (even better if you don't need to disable all
those checks).
diff --git a/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt b/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt
index 2e4228f541..b89bf4af04 100644
--- a/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt
+++ b/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt
@@ -1,59 +1,60 @@
+:compat-mode: legacy
[appendix]
[[ap-faq]]
== FAQ ==
[qanda]
Why is the Project Called Pacemaker?::
indexterm:[Pacemaker]
First of all, the reason it's not called the CRM is because of the abundance
of terms footnote:[http://en.wikipedia.org/wiki/CRM] that are commonly
abbreviated to those three letters. The Pacemaker name came from Kham,
footnote:[http://khamsouk.souvanlasy.com/] a good friend of Pacemaker
developer Andrew Beekhof's, and was originally used by a Java GUI that Beekhof
was prototyping in early 2007. Alas, other commitments prevented the GUI from
progressing much and, when it came time to choose a name for this project,
Lars Marowsky-Bree suggested it was an even better fit for an independent CRM.
The idea stems from the analogy between the role of this software and that of
the little device that keeps the human heart pumping. Pacemaker monitors the
cluster and intervenes when necessary to ensure the smooth operation of the
services it provides.
There were a number of other names (and acronyms) tossed around, but suffice it to
say "Pacemaker" was the best.
Why was the Pacemaker Project Created?::
Pacemaker was spun off from an earlier project called
http://linux-ha.org/[Heartbeat], which combined a cluster layer and a cluster
resource manager. The CRM was made into its own project, Pacemaker, in order to:
* support both the Corosync and Heartbeat cluster stacks equally (Heartbeat
support was dropped in Pacemaker 2.0, as the project had faded out by then)
* decouple the release cycles of the cluster layer and the cluster resource
manager, which were at very different stages of their life-cycles
* foster clearer package boundaries, thus leading to better and more stable interfaces
What Messaging Layers are Supported?::
indexterm:[Messaging Layers]
* http://www.corosync.org/[Corosync] version 2 and greater
* Historically, Pacemaker 1 also supported Corosync version 1 (with either
CMAN or a pacemaker plugin) and Heartbeat. Support for these legacy stacks
was dropped with Pacemaker 2.0.
Where Can I Get Pre-built Packages?::
Most major Linux distributions have pacemaker packages in their standard
package repositories. See the http://clusterlabs.org/wiki/Install[Install wiki
page] for details.
What Versions of Pacemaker Are Supported?::
Some Linux distributions (such as Red Hat Enterprise Linux and SUSE Linux
Enterprise) offer technical support for their customers; contact them
for details of such support.
For help within the community (mailing lists, IRC, etc.) from Pacemaker developers
and users, refer to the http://clusterlabs.org/wiki/Releases[Releases wiki page]
for an up-to-date list of versions considered to be supported by the project.
When seeking assistance, please try to ensure you have one of these versions.
diff --git a/doc/Pacemaker_Explained/en-US/Ap-Samples.txt b/doc/Pacemaker_Explained/en-US/Ap-Samples.txt
index 4494c18d55..f1dadec145 100644
--- a/doc/Pacemaker_Explained/en-US/Ap-Samples.txt
+++ b/doc/Pacemaker_Explained/en-US/Ap-Samples.txt
@@ -1,152 +1,153 @@
+:compat-mode: legacy
[appendix]
== Sample Configurations ==
=== Empty ===
.An Empty Configuration
=======
[source,XML]
-------
-------
=======
=== Simple ===
.A simple configuration with two nodes, some cluster options and a resource
=======
[source,XML]
-------
-------
=======
In the above example, we have one resource (an IP address) that we check
every five minutes and will run on host +c001n01+ until either the
resource fails 10 times or the host shuts down.
=== Advanced Configuration ===
.An advanced configuration with groups, clones and STONITH
=======
[source,XML]
-------
-------
=======
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt
index c662c60a49..d0aba3914f 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt
@@ -1,728 +1,729 @@
+:compat-mode: legacy
= Advanced Configuration =
[[s-recurring-start]]
== Specifying When Recurring Actions are Performed ==
By default, recurring actions are scheduled relative to when the
resource started. So if your resource was last started at 14:32 and
you have a backup set to be performed every 24 hours, then the backup
will always run in the middle of the business day -- hardly
desirable.
To specify a date and time that the operation should be relative to, set
the operation's +interval-origin+. The cluster uses this point to
calculate the correct +start-delay+ such that the operation will occur
at _origin + (interval * N)_.
So, if the operation's interval is 24h, its interval-origin is set to
02:00 and it is currently 14:32, then the cluster would initiate
the operation with a start delay of 11 hours and 28 minutes. If the
resource is moved to another node before 2am, then the operation is
cancelled.
The value specified for +interval+ and +interval-origin+ can be any
date/time conforming to the
http://en.wikipedia.org/wiki/ISO_8601[ISO8601 standard]. By way of
example, to specify an operation that would run on the first Monday of
2009 and every Monday after that, you would add:
.Specifying a Base for Recurring Action Intervals
=====
[source,XML]
=====
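As a rough sketch (the operation name and id are illustrative; the +op+ element
belongs inside the resource's +operations+ section), such an operation
definition might look like:
[source,XML]
-------
<op id="my-weekly-action" name="custom-action" interval="P7D" interval-origin="2009-W01-1"/>
-------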
[[s-failure-handling]]
== Handling Resource Failure ==
By default, Pacemaker will attempt to recover failed resources by restarting
them. However, failure recovery is highly configurable.
=== Failure Counts ===
Pacemaker tracks resource failures for each combination of node, resource, and
operation (start, stop, monitor, etc.).
You can query the fail count for a particular node, resource, and/or operation
using the `crm_failcount` command. For example, to see how many times the
10-second monitor for +myrsc+ has failed on +node1+, run:
----
# crm_failcount --query -r myrsc -N node1 -n monitor -I 10s
----
If you omit the node, `crm_failcount` will use the local node. If you omit the
operation and interval, `crm_failcount` will display the sum of the fail counts
for all operations on the resource.
You can use `crm_resource --cleanup` or `crm_failcount --delete` to clear
fail counts. For example, to clear the above monitor failures, run:
----
# crm_resource --cleanup -r myrsc -N node1 -n monitor -I 10s
----
If you omit the resource, `crm_resource --cleanup` will clear failures for all
resources. If you omit the node, it will clear failures on all nodes. If you
omit the operation and interval, it will clear the failures for all operations
on the resource.
[NOTE]
====
Even when cleaning up only a single operation, all failed operations will
disappear from the status display. This allows us to trigger a re-check of the
resource's current status.
====
Higher-level tools may provide other commands for querying and clearing
fail counts.
The `crm_mon` tool shows the current cluster status, including any failed
operations. To see the current fail counts for any failed resources, call
`crm_mon` with the `--failcounts` option. This shows the fail counts per
resource (that is, the sum of any operation fail counts for the resource).
=== Failure Response ===
Normally, if a running resource fails, pacemaker will try to stop it and start
it again. Pacemaker will choose the best location to start it each time, which
may be the same node that it failed on.
However, if a resource fails repeatedly, it is possible that there is an
underlying problem on that node, and you might desire trying a different node
in such a case. Pacemaker allows you to set your preference via the
+migration-threshold+ resource meta-attribute.
footnote:[
The naming of this option was perhaps unfortunate as it is easily
confused with live migration, the process of moving a resource from
one node to another without stopping it. Xen virtual guests are the
most common example of resources that can be migrated in this manner.
]
If you define +migration-threshold=pass:[N]+ for a
resource, it will be banned from the original node after 'N' failures.
[NOTE]
====
The +migration-threshold+ is per 'resource', even though fail counts are
tracked per 'operation'. The operation fail counts are added together
to compare against the +migration-threshold+.
====
By default, fail counts remain until manually cleared by an administrator
using `crm_resource --cleanup` or `crm_failcount --delete` (hopefully after
first fixing the failure's cause). It is possible to have fail counts expire
automatically by setting the +failure-timeout+ resource meta-attribute.
[IMPORTANT]
====
A successful operation does not clear past failures. If a recurring monitor
operation fails once, succeeds many times, then fails again days later, its
fail count is 2. Fail counts are cleared only by manual intervention or
failure timeout.
====
For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+
would cause the resource to move to a new node after 2 failures, and
allow it to move back (depending on stickiness and constraint scores) after one
minute.
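As a sketch of how these meta-attributes might be set on a resource (the
resource name, agent, and ids are assumed):
[source,XML]
-------
<primitive id="myrsc" class="ocf" provider="pacemaker" type="Dummy">
  <meta_attributes id="myrsc-meta_attributes">
    <!-- ban the resource from a node after 2 failures there -->
    <nvpair id="myrsc-migration-threshold" name="migration-threshold" value="2"/>
    <!-- expire fail counts after one minute without new failures -->
    <nvpair id="myrsc-failure-timeout" name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
-------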
[NOTE]
====
+failure-timeout+ is measured since the most recent failure. That is, older
failures do not individually time out and lower the fail count. Instead, all
failures are timed out simultaneously (and the fail count is reset to 0) if
there is no new failure for the timeout period.
====
There are two exceptions to the migration threshold concept:
when a resource either fails to start or fails to stop.
If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the
default), start failures cause the fail count to be set to +INFINITY+ and thus
always cause the resource to move immediately.
Stop failures are slightly different and crucial. If a resource fails
to stop and STONITH is enabled, then the cluster will fence the node
in order to be able to start the resource elsewhere. If STONITH is
not enabled, then the cluster has no way to continue and will not try
to start the resource elsewhere, but will try to stop it again after
the failure timeout.
[IMPORTANT]
Please read <> to understand how timeouts work
before configuring a +failure-timeout+.
== Moving Resources ==
indexterm:[Moving,Resources]
indexterm:[Resource,Moving]
=== Moving Resources Manually ===
There are primarily two occasions when you would want to move a
resource from its current location: when the whole node is under
maintenance, and when a single resource needs to be moved.
==== Standby Mode ====
Since everything eventually comes down to a score, you could create
constraints for every resource to prevent them from running on one
node. While pacemaker configuration can seem convoluted at times, not even
we would require this of administrators.
Instead, one can set a special node attribute which tells the cluster
"don't let anything run here". There is even a helpful tool to help
query and set it, called `crm_standby`. To check the standby status
of the current machine, run:
----
# crm_standby -G
----
A value of +on+ indicates that the node is _not_ able to host any
resources, while a value of +off+ says that it _can_.
You can also check the status of other nodes in the cluster by
specifying the `--node` option:
----
# crm_standby -G --node sles-2
----
To change the current node's standby status, use `-v` instead of `-G`:
----
# crm_standby -v on
----
Again, you can change another host's value by supplying a hostname with `--node`.
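For example, to put a node named +sles-2+ into standby from any cluster node:
----
# crm_standby -v on --node sles-2
----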
==== Moving One Resource ====
When only one resource is required to move, we could do this by creating
location constraints. However, once again we provide a user-friendly
shortcut as part of the `crm_resource` command, which creates and
modifies the extra constraints for you. If +Email+ were running on
+sles-1+ and you wanted it moved to a specific location, the command
would look something like:
----
# crm_resource -M -r Email -H sles-2
----
Behind the scenes, the tool will create the following location constraint:
[source,XML]
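-------
<!-- a sketch of the generated constraint; the exact id chosen by crm_resource may differ -->
<rsc_location id="cli-prefer-Email" rsc="Email" node="sles-2" score="INFINITY"/>
-------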
It is important to note that subsequent invocations of `crm_resource
-M` are not cumulative. So, if you ran these commands
----
# crm_resource -M -r Email -H sles-2
# crm_resource -M -r Email -H sles-3
----
then it is as if you had never performed the first command.
To allow the resource to move back again, use:
----
# crm_resource -U -r Email
----
Note the use of the word _allow_. The resource can move back to its
original location but, depending on +resource-stickiness+, it might
stay where it is. To be absolutely certain that it moves back to
+sles-1+, move it there before issuing the call to `crm_resource -U`:
----
# crm_resource -M -r Email -H sles-1
# crm_resource -U -r Email
----
Alternatively, if you only care that the resource should be moved from
its current location, try:
----
# crm_resource -B -r Email
----
This will instead create a negative constraint, like:
[source,XML]
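-------
<!-- a sketch of the generated constraint; the exact id chosen by crm_resource may differ -->
<rsc_location id="cli-ban-Email-on-sles-1" rsc="Email" node="sles-1" score="-INFINITY"/>
-------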
This will achieve the desired effect, but will also have long-term
consequences. As the tool will warn you, the creation of a
+-INFINITY+ constraint will prevent the resource from running on that
node until `crm_resource -U` is used. This includes the situation
where every other cluster node is no longer available!
In some cases, such as when +resource-stickiness+ is set to
+INFINITY+, it is possible that you will end up with the problem
described in <>. The tool can detect
some of these cases and deals with them by creating both
positive and negative constraints. E.g.
+Email+ prefers +sles-1+ with a score of +-INFINITY+
+Email+ prefers +sles-2+ with a score of +INFINITY+
which has the same long-term consequences as discussed earlier.
=== Moving Resources Due to Connectivity Changes ===
You can configure the cluster to move resources when external connectivity is
lost in two steps.
==== Tell Pacemaker to Monitor Connectivity ====
First, add an *ocf:pacemaker:ping* resource to the cluster. The
*ping* resource uses the system utility of the same name to test whether a
list of machines (specified by DNS hostname or IPv4/IPv6 address) is
reachable, and uses the results to maintain a node attribute called +pingd+
by default.
footnote:[
The attribute name is customizable, in order to allow multiple ping groups to be defined.
]
[NOTE]
===========
Older versions of Pacemaker used a different agent *ocf:pacemaker:pingd* which
is now deprecated in favor of *ping*. If your version of Pacemaker does not
contain the *ping* resource agent, download the latest version from
https://github.com/ClusterLabs/pacemaker/tree/master/extra/resources/ping
===========
Normally, the ping resource should run on all cluster nodes, which means that
you'll need to create a clone. A template for this can be found below
along with a description of the most interesting parameters.
.Common Options for a 'ping' Resource
[width="95%",cols="1m,<4",options="header",align="center"]
|=========================================================
|Field
|Description
|dampen
|The time to wait (dampening) for further changes to occur. Use this
to prevent a resource from bouncing around the cluster when cluster
nodes notice the loss of connectivity at slightly different times.
indexterm:[dampen,Ping Resource Option]
indexterm:[Ping Resource,Option,dampen]
|multiplier
|The number of connected ping nodes gets multiplied by this value to
get a score. Useful when there are multiple ping nodes configured.
indexterm:[multiplier,Ping Resource Option]
indexterm:[Ping Resource,Option,multiplier]
|host_list
|The machines to contact in order to determine the current
connectivity status. Allowed values include resolvable DNS host
names, IPv4 and IPv6 addresses.
indexterm:[host_list,Ping Resource Option]
indexterm:[Ping Resource,Option,host_list]
|=========================================================
.An example ping cluster resource that checks node connectivity once every minute
=====
[source,XML]
------------
------------
=====
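A rough sketch of such a cloned ping resource (the ids and host list are
illustrative):
[source,XML]
-------
<clone id="Connected">
  <primitive id="ping" class="ocf" provider="pacemaker" type="ping">
    <instance_attributes id="ping-attrs">
      <nvpair id="ping-dampen" name="dampen" value="5s"/>
      <nvpair id="ping-multiplier" name="multiplier" value="1000"/>
      <nvpair id="ping-hosts" name="host_list" value="192.0.2.1 192.0.2.2"/>
    </instance_attributes>
    <operations>
      <op id="ping-monitor-60s" name="monitor" interval="60s"/>
    </operations>
  </primitive>
</clone>
-------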
[IMPORTANT]
===========
You're only half done. The next section deals with telling Pacemaker
how to deal with the connectivity status that +ocf:pacemaker:ping+ is
recording.
===========
==== Tell Pacemaker How to Interpret the Connectivity Data ====
[IMPORTANT]
======
Before attempting the following, make sure you understand
<>.
======
There are a number of ways to use the connectivity data.
The most common setup is for people to have a single ping
target (e.g. the service network's default gateway), to prevent the cluster
from running a resource on any unconnected node.
.Don't run a resource on unconnected nodes
=====
[source,XML]
-------
-------
=====
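A sketch of such a constraint (the resource name +Webserver+ and the ids are
assumed), banning any node where the +pingd+ attribute is undefined or 0:
[source,XML]
-------
<rsc_location id="Webserver-no-connectivity" rsc="Webserver">
  <rule id="Webserver-no-connectivity-rule" score="-INFINITY" boolean-op="or">
    <expression id="Webserver-pingd-undefined" attribute="pingd" operation="not_defined"/>
    <expression id="Webserver-pingd-zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
-------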
A more complex setup is to have a number of ping targets configured.
You can require the cluster to only run resources on nodes that can
connect to all (or a minimum subset) of them.
.Run only on nodes connected to three or more ping targets.
=====
[source,XML]
-------
...
...
...
-------
=====
Alternatively, you can tell the cluster only to _prefer_ nodes with the best
connectivity. Just be sure to set +multiplier+ to a value higher than
that of +resource-stickiness+ (and don't set either of them to
+INFINITY+).
.Prefer the node with the most connected ping nodes
=====
[source,XML]
-------
-------
=====
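A sketch of such a constraint, using the +pingd+ attribute's value itself as
the node's score (the resource name and ids are assumed):
[source,XML]
-------
<rsc_location id="Webserver-connectivity" rsc="Webserver">
  <rule id="Webserver-connectivity-rule" score-attribute="pingd">
    <expression id="Webserver-pingd-defined" attribute="pingd" operation="defined"/>
  </rule>
</rsc_location>
-------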
It is perhaps easier to think of this in terms of the simple
constraints that the cluster translates it into. For example, if
*sles-1* is connected to all five ping nodes but *sles-2* is only
connected to two, then it would be as if you instead had the following
constraints in your configuration:
.How the cluster translates the above location constraint
=====
[source,XML]
-------
-------
=====
The advantage is that you don't have to manually update any
constraints whenever your network connectivity changes.
You can also combine the concepts above into something even more
complex. The example below shows how you can prefer the node with the
most connected ping nodes provided they have connectivity to at least
three (again assuming that +multiplier+ is set to 1000).
.A more complex example of choosing a location based on connectivity
=====
[source,XML]
-------
-------
=====
[[s-migrating-resources]]
=== Migrating Resources ===
Normally, when the cluster needs to move a resource, it fully restarts
the resource (i.e. stops the resource on the current node
and starts it on the new node).
However, some types of resources, such as Xen virtual guests, are able to move to
another location without loss of state (often referred to as live migration
or hot migration). In pacemaker, this is called resource migration.
Pacemaker can be configured to migrate a resource when moving it,
rather than restarting it.
Not all resources are able to migrate (see the Migration Checklist
below), and those that can won't do so in all situations.
Conceptually, there are two requirements from which the other
prerequisites follow:
* The resource must be active and healthy at the old location; and
* everything required for the resource to run must be available on
both the old and new locations.
The cluster is able to accommodate both 'push' and 'pull' migration models
by requiring the resource agent to support two special actions:
+migrate_to+ (performed on the current location) and +migrate_from+
(performed on the destination).
In push migration, the process on the current location transfers the
resource to the new location where it is later activated. In this
scenario, most of the work would be done in the +migrate_to+ action
and, if anything, the activation would occur during +migrate_from+.
Conversely for pull, the +migrate_to+ action is practically empty and
+migrate_from+ does most of the work, extracting the relevant resource
state from the old location and activating it.
There is no wrong or right way for a resource agent to implement migration,
as long as it works.
.Migration Checklist
* The resource may not be a clone.
* The resource must use an OCF style agent.
* The resource must not be in a failed or degraded state.
* The resource agent must support +migrate_to+ and
+migrate_from+ actions, and advertise them in its metadata.
* The resource must have the +allow-migrate+ meta-attribute set to
+true+ (which is not the default); see the sketch after this list.
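A sketch of enabling migration on a resource (the resource name and parameter
values are assumed; +ocf:heartbeat:VirtualDomain+ is one agent that supports
the migration actions):
[source,XML]
-------
<primitive id="myVM" class="ocf" provider="heartbeat" type="VirtualDomain">
  <instance_attributes id="myVM-params">
    <nvpair id="myVM-config" name="config" value="/etc/libvirt/qemu/myVM.xml"/>
  </instance_attributes>
  <meta_attributes id="myVM-meta_attributes">
    <nvpair id="myVM-allow-migrate" name="allow-migrate" value="true"/>
  </meta_attributes>
</primitive>
-------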
If an otherwise migratable resource depends on another resource
via an ordering constraint, there are special situations in which it will be
restarted rather than migrated.
For example, if the resource depends on a clone, and at the time the resource
needs to be moved, the clone has instances that are stopping and instances
that are starting, then the resource will be restarted. The scheduler is not
yet able to model this situation correctly and so takes the safer (if less
optimal) path.
Also, if a migratable resource depends on a non-migratable resource, and both
need to be moved, the migratable resource will be restarted.
[[s-node-health]]
== Tracking Node Health ==
A node may be functioning adequately as far as cluster membership is concerned,
and yet be "unhealthy" in some respect that makes it an undesirable location
for resources. For example, a disk drive may be reporting SMART errors, or the
CPU may be highly loaded.
Pacemaker offers a way to automatically move resources off unhealthy nodes.
=== Node Health Attributes ===
Pacemaker will treat any node attribute whose name starts with +#health+ as an
indicator of node health. Node health attributes may have one of the following
values:
.Allowed Values for Node Health Attributes
[width="95%",cols="1,<3",options="header",align="center"]
|=========================================================
|Value
|Intended significance
|+red+
|This indicator is unhealthy
indexterm:[Node health,red]
|+yellow+
|This indicator is becoming unhealthy
indexterm:[Node health,yellow]
|+green+
|This indicator is healthy
indexterm:[Node health,green]
|'integer'
|A numeric score to apply to all resources on this node
(0 or positive is healthy, negative is unhealthy)
indexterm:[Node health,score]
|=========================================================
=== Node Health Strategy ===
Pacemaker assigns a node health score to each node, as the sum of the values of
all its node health attributes. This score will be used as a location
constraint applied to this node for all resources.
The +node-health-strategy+ cluster option controls how Pacemaker responds to
changes in node health attributes, and how it translates +red+, +yellow+, and
+green+ to scores.
Allowed values are:
.Node Health Strategies
[width="95%",cols="1m,<3",options="header",align="center"]
|=========================================================
|Value
|Effect
|none
|Do not track node health attributes at all.
indexterm:[Node health,none]
|migrate-on-red
|Assign the value of +-INFINITY+ to +red+, and 0 to +yellow+ and +green+.
This will cause all resources to move off the node if any attribute is +red+.
indexterm:[Node health,migrate-on-red]
|only-green
|Assign the value of +-INFINITY+ to +red+ and +yellow+, and 0 to +green+.
This will cause all resources to move off the node if any attribute is +red+
or +yellow+.
indexterm:[Node health,only-green]
|progressive
|Assign the value of the +node-health-red+ cluster option to +red+, the value
of +node-health-yellow+ to +yellow+, and the value of +node-health-green+ to
+green+. Each node is additionally assigned a score of +node-health-base+
(this allows resources to start even if some attributes are +yellow+). This
strategy gives the administrator finer control over how important each value
is.
indexterm:[Node health,progressive]
|custom
|Track node health attributes using the same values as +progressive+ for
+red+, +yellow+, and +green+, but do not take them into account.
The administrator is expected to implement a policy by defining rules
(see <>) referencing node health attributes.
indexterm:[Node health,custom]
|=========================================================
=== Measuring Node Health ===
Since Pacemaker calculates node health based on node attributes,
any method that sets node attributes may be used to measure node
health. The most common ways are resource agents or separate daemons.
Pacemaker provides examples that can be used directly or as a basis for
custom code. The +ocf:pacemaker:HealthCPU+ and +ocf:pacemaker:HealthSMART+
resource agents set node health attributes based on CPU and disk parameters.
The +ipmiservicelogd+ daemon sets node health attributes based on IPMI
values (the +ocf:pacemaker:SystemHealth+ resource agent can be used to manage
the daemon as a cluster resource).
== Reloading Services After a Definition Change ==
The cluster automatically detects changes to the definition of
services it manages. The normal response is to stop the
service (using the old definition) and start it again (with the new
definition). This works well, but some services are smarter and can
be told to use a new set of options without restarting.
To take advantage of this capability, the resource agent must:
. Accept the +reload+ operation and perform any required actions.
_The actions here depend completely on your application!_
+
.The DRBD agent's logic for supporting +reload+
=====
[source,Bash]
-------
case $1 in
start)
drbd_start
;;
stop)
drbd_stop
;;
reload)
drbd_reload
;;
monitor)
drbd_monitor
;;
*)
drbd_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
exit $?
-------
=====
. Advertise the +reload+ operation in the +actions+ section of its metadata
+
.The DRBD Agent Advertising Support for the +reload+ Operation
=====
[source,XML]
-------
1.1
Master/Slave OCF Resource Agent for DRBD
...
-------
=====
. Advertise one or more parameters that can take effect using +reload+.
+
Any parameter with its +unique+ attribute set to 0 is eligible to be used in this way.
+
.Parameter that can be changed using reload
=====
[source,XML]
-------
Full path to the drbd.conf file.Path to drbd.conf
-------
=====
Once these requirements are satisfied, the cluster will automatically
know to reload the resource (instead of restarting) when a non-unique
field changes.
[NOTE]
======
Metadata will not be re-read unless the resource needs to be started. This may
mean that the resource will be restarted the first time, even though you
changed a parameter with +unique=0+.
======
[NOTE]
======
If both a unique and non-unique field are changed simultaneously, the
resource will still be restarted.
======
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt
index 4c401d1dd1..c41be61a6f 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt
@@ -1,1454 +1,1455 @@
+:compat-mode: legacy
= Advanced Resource Types =
[[group-resources]]
== Groups - A Syntactic Shortcut ==
indexterm:[Group Resources]
indexterm:[Resource,Groups]
One of the most common elements of a cluster is a set of resources
that need to be located together, start sequentially, and stop in the
reverse order. To simplify this configuration, we support the concept
of groups.
.A group of two primitive resources
======
[source,XML]
-------
-------
======
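A rough sketch of such a group (the resource names match those discussed
below; the agents and attribute values are illustrative):
[source,XML]
-------
<group id="shortcut">
  <primitive id="Public-IP" class="ocf" provider="heartbeat" type="IPaddr2">
    <instance_attributes id="Public-IP-params">
      <nvpair id="Public-IP-ip" name="ip" value="192.0.2.2"/>
    </instance_attributes>
  </primitive>
  <primitive id="Email" class="lsb" type="exim"/>
</group>
-------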
Although the example above contains only two resources, there is no
limit to the number of resources a group can contain. The example is
also sufficient to explain the fundamental properties of a group:
* Resources are started in the order they appear in (+Public-IP+
first, then +Email+)
* Resources are stopped in the reverse order to which they appear in
(+Email+ first, then +Public-IP+)
If a resource in the group can't run anywhere, then nothing listed after it
is allowed to run, either.
* If +Public-IP+ can't run anywhere, neither can +Email+;
* but if +Email+ can't run anywhere, this does not affect +Public-IP+
in any way
The group above is logically equivalent to writing:
.How the cluster sees a group resource
======
[source,XML]
-------
-------
======
Obviously as the group grows bigger, the reduced configuration effort
can become significant.
Another (typical) example of a group is a DRBD volume, the filesystem
mount, an IP address, and an application that uses them.
=== Group Properties ===
.Properties of a Group Resource
[width="95%",cols="3m,<5",options="header",align="center"]
|=========================================================
|Field
|Description
|id
|A unique name for the group
indexterm:[id,Group Resource Property]
indexterm:[Resource,Group Property,id]
|=========================================================
=== Group Options ===
Groups inherit the +priority+, +target-role+, and +is-managed+ properties
from primitive resources. See <> for information about
those properties.
=== Group Instance Attributes ===
Groups have no instance attributes. However, any that are set for the group
object will be inherited by the group's children.
=== Group Contents ===
Groups may only contain a collection of cluster resources (see
<>). To refer to a child of a group resource, just use
the child's +id+ instead of the group's.
=== Group Constraints ===
Although it is possible to reference a group's children in
constraints, it is usually preferable to reference the group itself.
.Some constraints involving groups
======
[source,XML]
-------
-------
======
=== Group Stickiness ===
indexterm:[resource-stickiness,Groups]
Stickiness, the measure of how much a resource wants to stay where it
is, is additive in groups. Every active resource of the group will
contribute its stickiness value to the group's total. So if the
default +resource-stickiness+ is 100, and a group has seven members,
five of which are active, then the group as a whole will prefer its
current location with a score of 500.
[[s-resource-clone]]
== Clones - Resources That Can Have Multiple Active Instances ==
indexterm:[Clone Resources]
indexterm:[Resource,Clones]
'Clone' resources are resources that can have more than one copy active at the
same time. This allows you, for example, to run a copy of a daemon on every
node. You can clone any primitive or group resource.
footnote:[
Of course, the service must support running multiple instances.
]
=== Anonymous versus Unique Clones ===
A clone resource is configured to be either 'anonymous' or 'globally unique'.
Anonymous clones are the simplest. These behave completely identically
everywhere they are running. Because of this, there can be only one instance of
an anonymous clone active per node.
The instances of globally unique clones are distinct entities. All instances
are launched identically, but one instance of the clone is not identical to any
other instance, whether running on the same node or a different node. As an
example, a cloned IP address can use special kernel functionality such that
each instance handles a subset of requests for the same IP address.
[[s-resource-promotable]]
=== Promotable clones ===
indexterm:[Promotable Clone Resources]
indexterm:[Resource,Promotable]
If a clone is 'promotable', its instances can perform a special role that
Pacemaker will manage via the +promote+ and +demote+ actions of the resource
agent.
Services that support such a special role have various terms for the special
role and the default role: primary and secondary, master and replica,
controller and worker, etc. Pacemaker uses the terms 'master' and 'slave',
footnote:[
These are historical terms that will eventually be replaced, but the extensive
use of them and the need for backward compatibility makes it a long process.
You may see examples using a +master+ tag instead of a +clone+ tag with the
+promotable+ meta-attribute set to +true+; the +master+ tag is supported, but
deprecated, and will be removed in a future version. You may also see such
services referred to as 'multi-state' or 'stateful'; these mean the same thing
as 'promotable'.
]
but is agnostic to what the service calls them or what they do.
All that Pacemaker cares about is that an instance comes up in the default role
when started, and the resource agent supports the +promote+ and +demote+ actions
to manage entering and exiting the special role.
=== Clone Properties ===
.Properties of a Clone Resource
[width="95%",cols="3m,<5",options="header",align="center"]
|=========================================================
|Field
|Description
|id
|A unique name for the clone
indexterm:[id,Clone Property]
indexterm:[Clone,Property,id]
|=========================================================
=== Clone Options ===
<> inherited from primitive resources:
+priority, target-role, is-managed+
.Clone-specific configuration options
[width="95%",cols="1m,1,<3",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|globally-unique
|false
|If +true+, each clone instance performs a distinct function
indexterm:[globally-unique,Clone Option]
indexterm:[Clone,Option,globally-unique]
|clone-max
|number of nodes in cluster
|The maximum number of clone instances that can be started across the entire
cluster
indexterm:[clone-max,Clone Option]
indexterm:[Clone,Option,clone-max]
|clone-node-max
|1
|If +globally-unique+ is +true+, the maximum number of clone instances that can
be started on a single node
indexterm:[clone-node-max,Clone Option]
indexterm:[Clone,Option,clone-node-max]
|clone-min
|0
|Require at least this number of clone instances to be runnable before allowing
resources depending on the clone to be runnable. A value of 0 means require
all clone instances to be runnable.
indexterm:[clone-min,Clone Option]
indexterm:[Clone,Option,clone-min]
|notify
|false
|Call the resource agent's +notify+ action for all active instances, before and
after starting or stopping any clone instance. The resource agent must support
this action. Allowed values: +false+, +true+
indexterm:[notify,Clone Option]
indexterm:[Clone,Option,notify]
|ordered
|false
|If +true+, clone instances must be started sequentially instead of in parallel.
Allowed values: +false+, +true+
indexterm:[ordered,Clone Option]
indexterm:[Clone,Option,ordered]
|interleave
|false
|When this clone is ordered relative to another clone, if this option is
+false+ (the default), the ordering is relative to 'all' instances of the
other clone, whereas if this option is +true+, the ordering is relative only
to instances on the same node.
Allowed values: +false+, +true+
indexterm:[interleave,Clone Option]
indexterm:[Clone,Option,interleave]
|promotable
|false
|If +true+, clone instances can perform a special role that Pacemaker will
manage via the resource agent's +promote+ and +demote+ actions. The resource
agent must support these actions.
Allowed values: +false+, +true+
indexterm:[promotable,Clone Option]
indexterm:[Clone,Option,promotable]
|promoted-max
|1
|If +promotable+ is +true+, the number of instances that can be promoted at one
time across the entire cluster
indexterm:[promoted-max,Clone Option]
indexterm:[Clone,Option,promoted-max]
|promoted-node-max
|1
|If +promotable+ is +true+ and +globally-unique+ is +false+, the number of
clone instances that can be promoted at one time on a single node
indexterm:[promoted-node-max,Clone Option]
indexterm:[Clone,Option,promoted-node-max]
|=========================================================
For backward compatibility, +master-max+ and +master-node-max+ are accepted as
aliases for +promoted-max+ and +promoted-node-max+, but are deprecated since
2.0.0, and support for them will be removed in a future version.
=== Clone Contents ===
Clones must contain exactly one primitive or group resource.
.A clone that runs a web server on all nodes
====
[source,XML]
----
----
====
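A rough sketch of such a clone (the agent and ids are assumed):
[source,XML]
-------
<clone id="apache-clone">
  <primitive id="apache" class="ocf" provider="heartbeat" type="apache">
    <operations>
      <op id="apache-monitor" name="monitor" interval="30s"/>
    </operations>
  </primitive>
</clone>
-------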
[WARNING]
You should never reference the name of a clone's child (the primitive or group
resource being cloned). If you think you need to do this, you probably need to
re-evaluate your design.
=== Clone Instance Attributes ===
Clones have no instance attributes; however, any that are set here will be
inherited by the clone's child.
=== Clone Constraints ===
In most cases, a clone will have a single instance on each active cluster
node. If this is not the case, you can indicate which nodes the
cluster should preferentially assign copies to with resource location
constraints. These constraints are written no differently from those
for primitive resources except that the clone's +id+ is used.
.Some constraints involving clones
======
[source,XML]
-------
-------
======
Ordering constraints behave slightly differently for clones. In the
example above, +apache-stats+ will wait until all copies of +apache-clone+
that need to be started have done so before being started itself.
Only if _no_ copies can be started will +apache-stats+ be prevented
from being active. Additionally, the clone will wait for
+apache-stats+ to be stopped before stopping itself.
Colocation of a primitive or group resource with a clone means that
the resource can run on any node with an active instance of the clone.
The cluster will choose an instance based on where the clone is running and
the resource's own location preferences.
Colocation between clones is also possible. If one clone +A+ is colocated
with another clone +B+, the set of allowed locations for +A+ is limited to
nodes on which +B+ is (or will be) active. Placement is then performed
normally.
==== Promotable Clone Constraints ====
For promotable clone resources, the +first-action+ and/or +then-action+ fields
for ordering constraints may be set to +promote+ or +demote+ to constrain the
master role, and colocation constraints may contain +rsc-role+ and/or
+with-rsc-role+ fields.
.Additional colocation constraint options for promotable clone resources
[width="95%",cols="1m,1,<3",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|rsc-role
|Started
|An additional attribute of colocation constraints that specifies the
role that +rsc+ must be in. Allowed values: +Started+, +Master+,
+Slave+.
indexterm:[rsc-role,Ordering Constraints]
indexterm:[Constraints,Ordering,rsc-role]
|with-rsc-role
|Started
|An additional attribute of colocation constraints that specifies the
role that +with-rsc+ must be in. Allowed values: +Started+,
+Master+, +Slave+.
indexterm:[with-rsc-role,Ordering Constraints]
indexterm:[Constraints,Ordering,with-rsc-role]
|=========================================================
.Constraints involving promotable clone resources
======
[source,XML]
-------
-------
======
In the example above, +myApp+ will wait until one of the database
copies has been started and promoted to master before being started
itself on the same node. Only if no copies can be promoted will +myApp+ be
prevented from being active. Additionally, the cluster will wait for
+myApp+ to be stopped before demoting the database.
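A sketch of such constraints (the resource names +database+ and +myApp+ follow
the discussion above; the ids are assumed):
[source,XML]
-------
<rsc_order id="database-before-myApp" first="database" first-action="promote"
           then="myApp" then-action="start"/>
<rsc_colocation id="myApp-with-database-master" rsc="myApp"
                with-rsc="database" with-rsc-role="Master" score="INFINITY"/>
-------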
Colocation of a primitive or group resource with a promotable clone
resource means that it can run on any node with an active instance of
the promotable clone resource that has the specified role (+master+ or
+slave+). In the example above, the cluster will choose a location based on
where database is running as a +master+, and if there are multiple
+master+ instances it will also factor in +myApp+'s own location
preferences when deciding which location to choose.
Colocation with regular clones and other promotable clone resources is also
possible. In such cases, the set of allowed locations for the +rsc+
clone is (after role filtering) limited to nodes on which the
+with-rsc+ promotable clone resource is (or will be) in the specified role.
Placement is then performed as normal.
==== Using Promotable Clone Resources in Colocation Sets ====
.Additional colocation set options relevant to promotable clone resources
[width="95%",cols="1m,1,<6",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|role
|Started
|The role that 'all members' of the set must be in. Allowed values: +Started+, +Master+,
+Slave+.
indexterm:[role,Ordering Constraints]
indexterm:[Constraints,Ordering,role]
|=========================================================
In the following example +B+'s master must be located on the same node as +A+'s master.
Additionally, resources +C+ and +D+ must be located on the same node as +A+'s
and +B+'s masters.
.Colocate C and D with A's and B's master instances
======
[source,XML]
-------
-------
======
==== Using Promotable Clone Resources in Ordered Sets ====
.Additional ordered set options relevant to promotable clone resources
[width="95%",cols="1m,1,<3",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|action
|value of +first-action+
|An additional attribute of ordering constraint sets that specifies the
action that applies to 'all members' of the set. Allowed
values: +start+, +stop+, +promote+, +demote+.
indexterm:[action,Ordering Constraints]
indexterm:[Constraints,Ordering,action]
|=========================================================
.Start C and D after first promoting A and B
======
[source,XML]
-------
-------
======
In the above example, +B+ cannot be promoted to a master role until +A+ has
been promoted. Additionally, resources +C+ and +D+ must wait until +A+ and +B+
have been promoted before they can start.
[[s-clone-stickiness]]
=== Clone Stickiness ===
indexterm:[resource-stickiness,Clones]
To achieve a stable allocation pattern, clones are slightly sticky by
default. If no value for +resource-stickiness+ is provided, the clone
will use a value of 1. Being a small value, it causes minimal
disturbance to the score calculations of other resources but is enough
to prevent Pacemaker from needlessly moving copies around the cluster.
[NOTE]
====
For globally unique clones, this may result in multiple instances of the
clone staying on a single node, even after another eligible node becomes
active (for example, after being put into standby mode then made active again).
If you do not want this behavior, specify a +resource-stickiness+ of 0
for the clone temporarily and let the cluster adjust, then set it back
to 1 if you want the default behavior to apply again.
====
=== Clone Resource Agent Requirements ===
Any resource can be used as an anonymous clone, as it requires no
additional support from the resource agent. Whether it makes sense to
do so depends on your resource and its resource agent.
==== Resource Agent Requirements for Globally Unique Clones ====
Globally unique clones require additional support in the resource agent. In
particular, it must only respond with +$\{OCF_SUCCESS}+ if the node has that
exact instance active. All other probes for instances of the clone should
result in +$\{OCF_NOT_RUNNING}+ (or one of the other OCF error codes if
they are failed).
Individual instances of a clone are identified by appending a colon and a
numerical offset, e.g. +apache:2+.
Resource agents can find out how many copies there are by examining
the +OCF_RESKEY_CRM_meta_clone_max+ environment variable and which
instance it is by examining +OCF_RESKEY_CRM_meta_clone+.
The resource agent must not make any assumptions (based on
+OCF_RESKEY_CRM_meta_clone+) about which numerical instances are active. In
particular, the list of active copies will not always be an unbroken
sequence, nor always start at 0.
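A sketch of how a resource agent might consult these variables (illustrative
shell only):
[source,Bash]
-------
# Which instance of the clone is this, and how many can exist cluster-wide?
my_instance="${OCF_RESKEY_CRM_meta_clone:-0}"
total_instances="${OCF_RESKEY_CRM_meta_clone_max:-1}"
echo "running as clone instance ${my_instance} of ${total_instances}"
-------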
==== Resource Agent Requirements for Promotable Clones ====
Promotable clone resources require two extra actions, +demote+ and +promote+,
which are responsible for changing the state of the resource. Like +start+ and
+stop+, they should return +$\{OCF_SUCCESS}+ if they completed successfully or
a relevant error code if they did not.
The states can mean whatever you wish, but when the resource is
started, it must come up in the mode called +slave+. From there the
cluster will decide which instances to promote to +master+.
In addition to the clone requirements for monitor actions, agents must
also _accurately_ report which state they are in. The cluster relies
on the agent to report its status (including role) accurately and does
not indicate to the agent what role it currently believes it to be in.
.Role implications of OCF return codes
[width="95%",cols="1,<1",options="header",align="center"]
|=========================================================
|Monitor Return Code
|Description
|OCF_NOT_RUNNING
|Stopped
indexterm:[Return Code,OCF_NOT_RUNNING]
|OCF_SUCCESS
|Running (Slave)
indexterm:[Return Code,OCF_SUCCESS]
|OCF_RUNNING_MASTER
|Running (Master)
indexterm:[Return Code,OCF_RUNNING_MASTER]
|OCF_FAILED_MASTER
|Failed (Master)
indexterm:[Return Code,OCF_FAILED_MASTER]
|Other
|Failed (Slave)
|=========================================================
==== Clone Notifications ====
If the clone has the +notify+ meta-attribute set to +true+, and the resource
agent supports the +notify+ action, Pacemaker will call the action when
appropriate, passing a number of extra variables which, when combined with
additional context, can be used to calculate the current state of the cluster
and what is about to happen to it.
.Environment variables supplied with Clone notify actions
[width="95%",cols="5,<3",options="header",align="center"]
|=========================================================
|Variable
|Description
|OCF_RESKEY_CRM_meta_notify_type
|Allowed values: +pre+, +post+
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,type]
indexterm:[type,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_operation
|Allowed values: +start+, +stop+
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,operation]
indexterm:[operation,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_start_resource
|Resources to be started
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_resource]
indexterm:[start_resource,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_stop_resource
|Resources to be stopped
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_resource]
indexterm:[stop_resource,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_active_resource
|Resources that are running
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_resource]
indexterm:[active_resource,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_inactive_resource
|Resources that are not running
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,inactive_resource]
indexterm:[inactive_resource,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_start_uname
|Nodes on which resources will be started
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_uname]
indexterm:[start_uname,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_stop_uname
|Nodes on which resources will be stopped
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_uname]
indexterm:[stop_uname,Notification Environment Variable]
|OCF_RESKEY_CRM_meta_notify_active_uname
|Nodes on which resources are running
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_uname]
indexterm:[active_uname,Notification Environment Variable]
|=========================================================
The variables come in pairs, such as
+OCF_RESKEY_CRM_meta_notify_start_resource+ and
+OCF_RESKEY_CRM_meta_notify_start_uname+ and should be treated as an
array of whitespace-separated elements.
+OCF_RESKEY_CRM_meta_notify_inactive_resource+ is an exception as the
matching +uname+ variable does not exist since inactive resources
are not running on any node.
Thus in order to indicate that +clone:0+ will be started on +sles-1+,
+clone:2+ will be started on +sles-3+, and +clone:3+ will be started
on +sles-2+, the cluster would set
.Notification variables
======
[source,Bash]
-------
OCF_RESKEY_CRM_meta_notify_start_resource="clone:0 clone:2 clone:3"
OCF_RESKEY_CRM_meta_notify_start_uname="sles-1 sles-3 sles-2"
-------
======
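A sketch of how an agent's +notify+ action might walk such a pair of variables
(shell handling only; error checking omitted):
[source,Bash]
-------
# Pair up the resources to be started with the nodes they will start on
read -r -a start_rscs  <<< "${OCF_RESKEY_CRM_meta_notify_start_resource}"
read -r -a start_nodes <<< "${OCF_RESKEY_CRM_meta_notify_start_uname}"
for i in "${!start_rscs[@]}"; do
    echo "${start_rscs[$i]} will be started on ${start_nodes[$i]}"
done
-------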
==== Interpretation of Notification Variables ====
.Pre-notification (stop):
* Active resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+
* Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
.Post-notification (stop) / Pre-notification (start):
* Active resources
** +$OCF_RESKEY_CRM_meta_notify_active_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Inactive resources
** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
.Post-notification (start):
* Active resources:
** +$OCF_RESKEY_CRM_meta_notify_active_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Inactive resources:
** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
==== Extra Notifications for Promotable Clones ====
.Extra environment variables supplied for promotable clones
[width="95%",cols="5,<3",options="header",align="center"]
|=========================================================
|Variable
|Description
|_OCF_RESKEY_CRM_meta_notify_master_resource_
|Resources that are running in +Master+ mode
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,master_resource]
indexterm:[master_resource,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_slave_resource_
|Resources that are running in +Slave+ mode
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,slave_resource]
indexterm:[slave_resource,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_promote_resource_
|Resources to be promoted
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,promote_resource]
indexterm:[promote_resource,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_demote_resource_
|Resources to be demoted
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,demote_resource]
indexterm:[demote_resource,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_promote_uname_
|Nodes on which resources will be promoted
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,promote_uname]
indexterm:[promote_uname,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_demote_uname_
|Nodes on which resources will be demoted
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,demote_uname]
indexterm:[demote_uname,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_master_uname_
|Nodes on which resources are running in +Master+ mode
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,master_uname]
indexterm:[master_uname,Notification Environment Variable]
|_OCF_RESKEY_CRM_meta_notify_slave_uname_
|Nodes on which resources are running in +Slave+ mode
indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,slave_uname]
indexterm:[slave_uname,Notification Environment Variable]
|=========================================================
==== Interpretation of Promotable Notification Variables ====
.Pre-notification (demote):
* +Active+ resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+
* +Master+ resources: +$OCF_RESKEY_CRM_meta_notify_master_resource+
* +Slave+ resources: +$OCF_RESKEY_CRM_meta_notify_slave_resource+
* Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
.Post-notification (demote) / Pre-notification (stop):
* +Active+ resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+
* +Master+ resources:
** +$OCF_RESKEY_CRM_meta_notify_master_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* +Slave+ resources: +$OCF_RESKEY_CRM_meta_notify_slave_resource+
* Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
.Post-notification (stop) / Pre-notification (start)
* +Active+ resources:
** +$OCF_RESKEY_CRM_meta_notify_active_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* +Master+ resources:
** +$OCF_RESKEY_CRM_meta_notify_master_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* +Slave+ resources:
** +$OCF_RESKEY_CRM_meta_notify_slave_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Inactive resources:
** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
.Post-notification (start) / Pre-notification (promote)
* +Active+ resources:
** +$OCF_RESKEY_CRM_meta_notify_active_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* +Master+ resources:
** +$OCF_RESKEY_CRM_meta_notify_master_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* +Slave+ resources:
** +$OCF_RESKEY_CRM_meta_notify_slave_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Inactive resources:
** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
.Post-notification (promote)
* +Active+ resources:
** +$OCF_RESKEY_CRM_meta_notify_active_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* +Master+ resources:
** +$OCF_RESKEY_CRM_meta_notify_master_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* +Slave+ resources:
** +$OCF_RESKEY_CRM_meta_notify_slave_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Inactive resources:
** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+
** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+
** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
* Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+
* Resources that were promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+
* Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+
* Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+
=== Monitoring Promotable Clone Resources ===
The usual monitor actions are insufficient to monitor a promotable clone
resource, because Pacemaker needs to verify not only that the resource is
active, but also that its actual role matches its intended one.
Define two monitoring actions: the usual one will cover the slave role,
and an additional one with +role="master"+ will cover the master role.
.Monitoring both states of a promotable clone resource
======
[source,XML]
-------
-------
======
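A sketch of such a pair of monitor operations (the agent and intervals are
illustrative; note that the two intervals differ):
[source,XML]
-------
<primitive id="myPromotable" class="ocf" provider="pacemaker" type="Stateful">
  <operations>
    <op id="myPromotable-monitor-slave" name="monitor" interval="11s"/>
    <op id="myPromotable-monitor-master" name="monitor" interval="10s" role="Master"/>
  </operations>
</primitive>
-------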
[IMPORTANT]
===========
It is crucial that _every_ monitor operation has a different interval!
Pacemaker currently differentiates between operations
only by resource and interval; so if (for example) a promotable clone resource
had the same monitor interval for both roles, Pacemaker would ignore the
role when checking the status -- which would cause unexpected return
codes, and therefore unnecessary complications.
===========
[[s-promotion-scores]]
=== Determining Which Instance is Promoted ===
Pacemaker can choose a promotable clone instance to be promoted in one of two
ways:
* Promotion scores: These are node attributes set via the `crm_master` utility,
which generally would be called by the resource agent's start action if it
supports promotable clones. This tool automatically detects both the resource
and host, and should be used to set a preference for being promoted. Based on
this, +promoted-max+, and +promoted-node-max+, the instance(s) with the
highest preference will be promoted.
* Constraints: Location constraints can indicate which nodes are most preferred
as masters.
.Explicitly preferring node1 to be promoted to master
======
[source,XML]
-------
-------
======
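A sketch of such a location constraint (the resource name and ids are
assumed), giving +node1+ a positive preference for the master role:
[source,XML]
-------
<rsc_location id="myPromotable-master-on-node1" rsc="myPromotable">
  <rule id="myPromotable-master-rule" role="Master" score="100">
    <expression id="myPromotable-master-exp" attribute="#uname" operation="eq" value="node1"/>
  </rule>
</rsc_location>
-------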
[[s-resource-bundle]]
== Bundles - Isolated Environments ==
indexterm:[bundle]
indexterm:[Resource,bundle]
indexterm:[Docker,bundle]
indexterm:[rkt,bundle]
Pacemaker supports a special syntax for launching a
https://en.wikipedia.org/wiki/Operating-system-level_virtualization[container]
with any infrastructure it requires: the 'bundle'.
Pacemaker bundles support https://www.docker.com/[Docker] and
https://coreos.com/rkt/[rkt] container technologies.
footnote:[Docker is a trademark of Docker, Inc. No endorsement by or
association with Docker, Inc. is implied.]
.A bundle for a containerized web server
====
[source,XML]
----
----
====
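A rough sketch of such a bundle (the image name, addresses, and paths are
illustrative):
[source,XML]
-------
<bundle id="httpd-bundle">
  <docker image="pcmk:httpd" replicas="3"/>
  <network ip-range-start="192.0.2.131" host-netmask="24">
    <port-mapping id="httpd-port" port="80"/>
  </network>
  <storage>
    <storage-mapping id="httpd-root" source-dir="/srv/html"
                     target-dir="/var/www/html" options="rw"/>
  </storage>
  <primitive id="httpd" class="ocf" provider="heartbeat" type="apache"/>
</bundle>
-------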
=== Bundle Properties ===
.Properties of a Bundle
[width="95%",cols="3m,<5",options="header",align="center"]
|=========================================================
|Field
|Description
|id
|A unique name for the bundle (required)
indexterm:[id,bundle]
indexterm:[bundle,Property,id]
|description
|Arbitrary text (not used by Pacemaker)
indexterm:[description,bundle]
indexterm:[bundle,Property,description]
|=========================================================
A bundle must contain exactly one +docker+ or +rkt+ element.
=== Docker Properties ===
Before configuring a Docker bundle in Pacemaker, the user must install Docker
and supply a fully configured Docker image on every node allowed to run the
bundle.
Pacemaker will create an implicit +ocf:heartbeat:docker+ resource to manage
a bundle's Docker container. The user must ensure that resource agent is
installed on every node allowed to run the bundle.
.Properties of a Bundle's Docker Element
[width="95%",cols="3m,4,<5",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|image
|
|Docker image tag (required)
indexterm:[image,Docker]
indexterm:[Docker,Property,image]
|replicas
|Value of +promoted-max+ if that is positive, else 1
|A positive integer specifying the number of container instances to launch
indexterm:[replicas,Docker]
indexterm:[Docker,Property,replicas]
|replicas-per-host
|1
|A positive integer specifying the number of container instances allowed to run
on a single node
indexterm:[replicas-per-host,Docker]
indexterm:[Docker,Property,replicas-per-host]
|promoted-max
|0
|A non-negative integer that, if positive, indicates that the containerized
service should be treated as a promotable service, with this many replicas
allowed to run the service in the master role
indexterm:[promoted-max,Docker]
indexterm:[Docker,Property,promoted-max]
|network
|
|If specified, this will be passed to +docker run+ as the
https://docs.docker.com/engine/reference/run/#network-settings[network setting]
for the Docker container.
indexterm:[network,Docker]
indexterm:[Docker,Property,network]
|run-command
|`/usr/sbin/pacemaker-remoted` if bundle contains a +primitive+, otherwise none
|This command will be run inside the container when launching it ("PID 1"). If
the bundle contains a +primitive+, this command 'must' start pacemaker-remoted
(but could, for example, be a script that does other stuff, too). If the
container image has a pre-2.0.0 version of Pacemaker, set this to
+/usr/sbin/pacemaker_remoted+ (note the underbar instead of dash).
indexterm:[run-command,Docker]
indexterm:[Docker,Property,run-command]
|options
|
|Extra command-line options to pass to `docker run`
indexterm:[options,Docker]
indexterm:[Docker,Property,options]
|=========================================================
For backward compatibility, +masters+ is accepted as an alias for
+promoted-max+, but is deprecated since 2.0.0, and support for it will be
removed in a future version.
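For reference, a bundle's ++<docker>++ element might look like the following
sketch; the image tag, replica counts, and extra option shown are illustrative
placeholders rather than values required by Pacemaker.

.Example docker element (illustrative sketch)
====
[source,XML]
----
<!-- Sketch only: the image tag, counts, and options are placeholders -->
<docker image="example.com/httpd:latest" replicas="2" replicas-per-host="1"
        options="--log-driver=journald"/>
----
====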
=== rkt Properties ===
Before configuring a rkt bundle in Pacemaker, the user must install rkt
and supply a fully configured container image on every node allowed to run the
bundle.
Pacemaker will create an implicit +ocf:heartbeat:rkt+ resource to manage
a bundle's rkt container. The user must ensure that resource agent is
installed on every node allowed to run the bundle.
.Properties of a Bundle's rkt Element
[width="95%",cols="3m,4,<5",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|image
|
|Container image tag (required)
indexterm:[image,rkt]
indexterm:[rkt,Property,image]
|replicas
|Value of +promoted-max+ if that is positive, else 1
|A positive integer specifying the number of container instances to launch
indexterm:[replicas,rkt]
indexterm:[rkt,Property,replicas]
|replicas-per-host
|1
|A positive integer specifying the number of container instances allowed to run
on a single node
indexterm:[replicas-per-host,rkt]
indexterm:[rkt,Property,replicas-per-host]
|promoted-max
|0
|A non-negative integer that, if positive, indicates that the containerized
service should be treated as a promotable service, with this many replicas
allowed to run the service in the master role
indexterm:[promoted-max,rkt]
indexterm:[rkt,Property,promoted-max]
|network
|
|If specified, this will be passed to +rkt run+ as the
network setting for the rkt container.
indexterm:[network,rkt]
indexterm:[rkt,Property,network]
|run-command
|`/usr/sbin/pacemaker-remoted` if the bundle contains a +primitive+, otherwise none
|This command will be run inside the container when launching it ("PID 1"). If
the bundle contains a +primitive+, this command 'must' start pacemaker-remoted
(but could, for example, be a script that does other stuff, too). If the
container image has a pre-2.0.0 version of Pacemaker, set this to
+/usr/sbin/pacemaker_remoted+ (note the underbar instead of dash).
indexterm:[run-command,rkt]
indexterm:[rkt,Property,run-command]
|options
|
|Extra command-line options to pass to `rkt run`
indexterm:[options,rkt]
indexterm:[rkt,Property,options]
|=========================================================
For backward compatibility, +masters+ is accepted as an alias for
+promoted-max+, but is deprecated since 2.0.0, and support for it will be
removed in a future version.
=== Bundle Network Properties ===
A bundle may optionally contain one ++<network>++ element.
indexterm:[bundle,network]
.Properties of a Bundle's Network Element
[width="95%",cols="2m,1,<4",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|add-host
|TRUE
|If TRUE, and +ip-range-start+ is used, Pacemaker will automatically ensure
that +/etc/hosts+ inside the containers has entries for each
<<s-resource-bundle-note-replica-names,replica name>> and its assigned IP.
indexterm:[add-host,network]
indexterm:[network,Property,add-host]
|ip-range-start
|
|If specified, Pacemaker will create an implicit +ocf:heartbeat:IPaddr2+
resource for each container instance, starting with this IP address,
using up to +replicas+ sequential addresses. These addresses can be used
from the host's network to reach the service inside the container, though
the address is not visible within the container itself. Only IPv4 addresses are
currently supported.
indexterm:[ip-range-start,network]
indexterm:[network,Property,ip-range-start]
|host-netmask
|32
|If +ip-range-start+ is specified, the IP addresses are created with this
CIDR netmask (as a number of bits).
indexterm:[host-netmask,network]
indexterm:[network,Property,host-netmask]
|host-interface
|
|If +ip-range-start+ is specified, the IP addresses are created on this
host interface (by default, it will be determined from the IP address).
indexterm:[host-interface,network]
indexterm:[network,Property,host-interface]
|control-port
|3121
|If the bundle contains a +primitive+, the cluster will use this integer TCP
port for communication with Pacemaker Remote inside the container. Changing
this is useful when the container is unable to listen on the default port,
for example, when the container uses the host's network rather than
+ip-range-start+ (in which case +replicas-per-host+ must be 1), or when the
bundle may run on a Pacemaker Remote node that is already listening on the
default port. Any PCMK_remote_port environment variable set on the host or in
the container is ignored for bundle connections.
indexterm:[control-port,network]
indexterm:[network,Property,control-port]
|=========================================================
[[s-resource-bundle-note-replica-names]]
[NOTE]
====
Replicas are named by the bundle id plus a dash and an integer counter starting
with zero. For example, if a bundle named +httpd-bundle+ has +replicas=2+, its
containers will be named +httpd-bundle-0+ and +httpd-bundle-1+.
====
Additionally, a ++<network>++ element may optionally contain one or more
++<port-mapping>++ elements.
indexterm:[bundle,network,port-mapping]
.Properties of a Bundle's Port-Mapping Element
[width="95%",cols="2m,1,<4",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|id
|
|A unique name for the port mapping (required)
indexterm:[id,port-mapping]
indexterm:[port-mapping,Property,id]
|port
|
|If this is specified, connections to this TCP port number on the host network
(on the container's assigned IP address, if +ip-range-start+ is specified)
will be forwarded to the container network. Exactly one of +port+ or +range+
must be specified in a +port-mapping+.
indexterm:[port,port-mapping]
indexterm:[port-mapping,Property,port]
|internal-port
|value of +port+
|If +port+ and this are specified, connections to +port+ on the host's network
will be forwarded to this port on the container network.
indexterm:[internal-port,port-mapping]
indexterm:[port-mapping,Property,internal-port]
|range
|
|If this is specified, connections to these TCP port numbers (expressed as
'first_port'-'last_port') on the host network (on the container's assigned IP
address, if +ip-range-start+ is specified) will be forwarded to the same ports
in the container network. Exactly one of +port+ or +range+ must be specified
in a +port-mapping+.
indexterm:[range,port-mapping]
indexterm:[port-mapping,Property,range]
|=========================================================
[NOTE]
====
If the bundle contains a +primitive+, Pacemaker will automatically map the
+control-port+, so it is not necessary to specify that port in a
+port-mapping+.
====
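As a sketch, a ++<network>++ element that assigns per-replica IP addresses and
forwards the web port might look like the following; the addresses, interface
name, and IDs are illustrative.

.Example network element with a port mapping (illustrative sketch)
====
[source,XML]
----
<!-- Sketch only: addresses, interface name, and ids are placeholders -->
<network ip-range-start="192.168.122.131" host-netmask="24" host-interface="eth0">
   <port-mapping id="httpd-port" port="80"/>
</network>
----
====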
=== Bundle Storage Properties ===
A bundle may optionally contain one ++<storage>++ element. A ++<storage>++ element
has no properties of its own, but may contain one or more ++<storage-mapping>++
elements.
indexterm:[bundle,storage,storage-mapping]
.Properties of a Bundle's Storage-Mapping Element
[width="95%",cols="2m,1,<4",options="header",align="center"]
|=========================================================
|Field
|Default
|Description
|id
|
|A unique name for the storage mapping (required)
indexterm:[id,storage-mapping]
indexterm:[storage-mapping,Property,id]
|source-dir
|
|The absolute path on the host's filesystem that will be mapped into the
container. Exactly one of +source-dir+ and +source-dir-root+ must be specified
in a +storage-mapping+.
indexterm:[source-dir,storage-mapping]
indexterm:[storage-mapping,Property,source-dir]
|source-dir-root
|
|The start of a path on the host's filesystem that will be mapped into the
container, using a different subdirectory on the host for each container
instance. The subdirectory will be named the same as the
<<s-resource-bundle-note-replica-names,replica name>>.
Exactly one of +source-dir+ and +source-dir-root+ must be specified in a
+storage-mapping+.
indexterm:[source-dir-root,storage-mapping]
indexterm:[storage-mapping,Property,source-dir-root]
|target-dir
|
|The path name within the container where the host storage will be mapped
(required)
indexterm:[target-dir,storage-mapping]
indexterm:[storage-mapping,Property,target-dir]
|options
|
|File system mount options to use when mapping the storage
indexterm:[options,storage-mapping]
indexterm:[storage-mapping,Property,options]
|=========================================================
[NOTE]
====
Pacemaker does not define the behavior if the source directory does not already
exist on the host. However, it is expected that the container technology and/or
its resource agent will create the source directory in that case.
====
[NOTE]
====
If the bundle contains a +primitive+,
Pacemaker will automatically map the equivalent of
+source-dir=/etc/pacemaker/authkey target-dir=/etc/pacemaker/authkey+
and +source-dir-root=/var/log/pacemaker/bundles target-dir=/var/log+ into the
container, so it is not necessary to specify those paths in a
+storage-mapping+.
====
[IMPORTANT]
====
The +PCMK_authkey_location+ environment variable must not be set to anything
other than the default of `/etc/pacemaker/authkey` on any node in the cluster.
====
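For illustration, a ++<storage>++ element that maps one fixed host directory and
one per-replica directory into the container might look like the following; the
paths and IDs are placeholders.

.Example storage element (illustrative sketch)
====
[source,XML]
----
<!-- Sketch only: paths and ids are placeholders -->
<storage>
   <storage-mapping id="httpd-root" source-dir="/srv/html"
                    target-dir="/var/www/html" options="rw"/>
   <storage-mapping id="httpd-logs" source-dir-root="/var/log/pacemaker/bundles"
                    target-dir="/etc/httpd/logs" options="rw"/>
</storage>
----
====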
=== Bundle Primitive ===
A bundle may optionally contain one ++<primitive>++ resource
(see <>). The primitive may have operations,
instance attributes and meta-attributes defined, as usual.
If a bundle contains a primitive resource, the container image must include
the Pacemaker Remote daemon, and at least one of +ip-range-start+ or
+control-port+ must be configured in the bundle. Pacemaker will create an
implicit +ocf:pacemaker:remote+ resource for the connection, launch
Pacemaker Remote within the container, and monitor and manage the primitive
resource via Pacemaker Remote.
If the bundle has more than one container instance (replica), the primitive
resource will function as an implicit clone (see <>) --
a promotable clone if the bundle has +promoted-max+ greater than zero
(see <>).
[IMPORTANT]
====
Containers in bundles with a +primitive+ must have an accessible networking
environment, so that Pacemaker on the cluster nodes can contact
Pacemaker Remote inside the container. For example, the Docker option
`--net=none` should not be used with a +primitive+. The default (using a
distinct network space inside the container) works in combination with
+ip-range-start+. If the Docker option `--net=host` is used (making the
container share the host's network space), a unique +control-port+ should be
specified for each bundle. Any firewall must allow access to the
+control-port+.
====
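As a sketch, a bundle whose containers share the host's network space, and which
therefore specifies a unique +control-port+, might look like the following; the
image, IDs, resource agent, and port number are hypothetical.

.Example bundle using the host's network (illustrative sketch)
====
[source,XML]
----
<!-- Sketch only: image, ids, agent, and port number are placeholders -->
<bundle id="db-bundle">
   <docker image="example.com/db:latest" replicas="2" options="--net=host"/>
   <network control-port="61342"/>
   <primitive id="db" class="ocf" provider="pacemaker" type="Dummy"/>
</bundle>
----
====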
[[s-bundle-attributes]]
=== Bundle Node Attributes ===
If the bundle has a +primitive+, the primitive's resource agent may want to set
node attributes such as promotion scores. However, with
containers, it is not apparent which node should get the attribute.
If the container uses shared storage that is the same no matter which node the
container is hosted on, then it is appropriate to use the promotion score on the
bundle node itself.
On the other hand, if the container uses storage exported from the underlying host,
then it may be more appropriate to use the promotion score on the underlying host.
Since this depends on the particular situation, the
+container-attribute-target+ resource meta-attribute allows the user to specify
which approach to use. If it is set to +host+, then user-defined node attributes
will be checked on the underlying host. If it is anything else, the local node
(in this case the bundle node) is used as usual.
This only applies to user-defined attributes; the cluster will always check the
local node for cluster-defined attributes such as +#uname+.
If +container-attribute-target+ is +host+, the cluster will pass additional
environment variables to the primitive's resource agent that allow it to set
node attributes appropriately: +CRM_meta_container_attribute_target+ (identical
to the meta-attribute value) and +CRM_meta_physical_host+ (the name of the
underlying host).
[NOTE]
====
When called by a resource agent, the attrd_updater and crm_attribute commands
will automatically check those environment variables and set attributes
appropriately.
====
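For example, +container-attribute-target+ can be set like any other bundle
meta-attribute; a sketch of the relevant fragment inside a ++<bundle>++ element
(the IDs are illustrative) is:

.Setting container-attribute-target on a bundle (illustrative sketch)
====
[source,XML]
----
<!-- Sketch only: ids are placeholders; this fragment goes inside the bundle element -->
<meta_attributes id="httpd-bundle-meta_attributes">
   <nvpair id="httpd-bundle-meta_attributes-target"
           name="container-attribute-target" value="host"/>
</meta_attributes>
----
====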
=== Bundle Meta-Attributes ===
Any meta-attribute set on a bundle will be inherited by the bundle's
primitive and any resources implicitly created by Pacemaker for the bundle.
This includes options such as +priority+, +target-role+, and +is-managed+. See
<> for more information.
=== Limitations of Bundles ===
Restarting Pacemaker while a bundle is unmanaged or the cluster is in
maintenance mode may cause the bundle to fail.
Bundles may not be explicitly cloned or included in groups. This includes the
bundle's primitive and any resources implicitly created by Pacemaker for the
bundle. (If +replicas+ is greater than 1, the bundle will behave like a clone
implicitly.)
Bundles do not have instance attributes, utilization attributes, or operations,
though a bundle's primitive may have them.
A bundle with a primitive can run on a Pacemaker Remote node only if the bundle
uses a distinct +control-port+.
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt b/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt
index 34daeece5f..34efbb284b 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt
@@ -1,423 +1,424 @@
+:compat-mode: legacy
= Alerts =
////
We prefer [[ch-alerts]], but older versions of asciidoc don't deal well
with that construct for chapter headings
////
anchor:ch-alerts[Chapter 7, Alerts]
indexterm:[Resource,Alerts]
'Alerts' may be configured to take some external action when a cluster event
occurs (node failure, resource starting or stopping, etc.).
== Alert Agents ==
As with resource agents, the cluster calls an external program (an
'alert agent') to handle alerts. The cluster passes information about the event
to the agent via environment variables. Agents can do anything
desired with this information (send an e-mail, log to a file,
update a monitoring system, etc.).
.Simple alert configuration
=====
[source,XML]
-----
<!-- Sketch of the example described below; the id and path prefix are illustrative -->
<configuration>
   <alerts>
      <alert id="my-alert" path="/path/to/my-script.sh" />
   </alerts>
</configuration>
-----
=====
In the example above, the cluster will call +my-script.sh+ for each event.
Multiple alert agents may be configured; the cluster will call all of them for
each event.
Alert agents will be called only on cluster nodes. They will be called for
events involving Pacemaker Remote nodes, but they will never be called _on_
those nodes.
== Alert Recipients ==
Usually, alerts are directed toward a recipient. Thus, each alert may be
additionally configured with one or more recipients.
The cluster will call the agent separately for each recipient.
.Alert configuration with recipient
=====
[source,XML]
-----
<!-- Sketch of the example described below; ids and path prefix are illustrative -->
<configuration>
   <alerts>
      <alert id="my-alert" path="/path/to/my-script.sh">
         <recipient id="my-alert-recipient" value="some-address"/>
      </alert>
   </alerts>
</configuration>
-----
=====
In the above example, the cluster will call +my-script.sh+ for each event,
passing the recipient +some-address+ as an environment variable.
The recipient may be anything the alert agent can recognize --
an IP address, an e-mail address, a file name, whatever the particular
agent supports.
== Alert Meta-Attributes ==
As with resource agents, meta-attributes can be configured for alert agents
to affect how Pacemaker calls them.
.Meta-Attributes of an Alert
[width="95%",cols="m,1,<2",options="header",align="center"]
|=========================================================
|Meta-Attribute
|Default
|Description
|timestamp-format
|%H:%M:%S.%06N
|Format the cluster will use when sending the event's timestamp to the agent.
This is a string as used with the `date(1)` command.
indexterm:[Alert,Option,timestamp-format]
|timeout
|30s
|If the alert agent does not complete within this amount of time, it will be
terminated.
indexterm:[Alert,Option,timeout]
|=========================================================
Meta-attributes can be configured per alert agent and/or per recipient.
.Alert configuration with meta-attributes
=====
[source,XML]
-----
<!-- Sketch of the example described below; ids and path prefix are illustrative -->
<configuration>
   <alerts>
      <alert id="my-alert" path="/path/to/my-script.sh">
         <meta_attributes id="my-alert-attributes">
            <nvpair id="my-alert-attributes-timeout" name="timeout" value="15s"/>
         </meta_attributes>
         <recipient id="my-alert-recipient1" value="someuser@example.com">
            <meta_attributes id="my-alert-recipient1-attributes">
               <nvpair id="my-alert-recipient1-timestamp-format"
                       name="timestamp-format" value="%D %H:%M"/>
            </meta_attributes>
         </recipient>
         <recipient id="my-alert-recipient2" value="otheruser@example.com">
            <meta_attributes id="my-alert-recipient2-attributes">
               <nvpair id="my-alert-recipient2-timestamp-format"
                       name="timestamp-format" value="%c"/>
            </meta_attributes>
         </recipient>
      </alert>
   </alerts>
</configuration>
-----
=====
In the above example, +my-script.sh+ will be called twice for each event,
with each call using a 15-second timeout. One call will be passed the recipient
+someuser@example.com+ and a timestamp in the format +%D %H:%M+, while the
other call will be passed the recipient +otheruser@example.com+ and a timestamp
in the format +%c+.
== Alert Instance Attributes ==
As with resource agents, agent-specific configuration values may be configured
as instance attributes. These will be passed to the agent as additional
environment variables. The number, names and allowed values of these
instance attributes are completely up to the particular agent.
.Alert configuration with instance attributes
=====
[source,XML]
-----
<!-- Sketch only: the "debug" attribute name is hypothetical; valid names depend on the agent -->
<configuration>
   <alerts>
      <alert id="my-alert" path="/path/to/my-script.sh">
         <instance_attributes id="my-alert-options">
            <nvpair id="my-alert-options-debug" name="debug" value="false"/>
         </instance_attributes>
         <recipient id="my-alert-recipient" value="some-address"/>
      </alert>
   </alerts>
</configuration>
-----
=====
== Alert Filters ==
By default, an alert agent will be called for node events, fencing events, and
resource events. An agent may choose to ignore certain types of events, but
there is still the overhead of calling it for those events. To eliminate that
overhead, you may select which types of events the agent should receive.
.Alert configuration to receive only node events and fencing events
=====
[source,XML]
-----
<!-- Sketch of the example described below; ids and path prefix are illustrative -->
<configuration>
   <alerts>
      <alert id="my-alert" path="/path/to/my-script.sh">
         <select>
            <select_nodes />
            <select_fencing />
         </select>
         <recipient id="my-alert-recipient" value="some-address"/>
      </alert>
   </alerts>
</configuration>
-----
=====
The possible options within +