Clusters from Scratch

diff --git a/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt b/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt index 18ed7c432d..bd42ca38da 100644 --- a/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt +++ b/doc/Clusters_from_Scratch/en-US/Ap-Configuration.txt @@ -1,454 +1,450 @@ [appendix] == Configuration Recap == === Final Cluster Configuration === ---- [root@pcmk-1 ~]# pcs resource Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 pcmk-2 ] Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] Clone Set: ClusterIP-clone [ClusterIP] (unique) ClusterIP:0 (ocf::heartbeat:IPaddr2): Started ClusterIP:1 (ocf::heartbeat:IPaddr2): Started Clone Set: WebFS-clone [WebFS] Started: [ pcmk-1 pcmk-2 ] Clone Set: WebSite-clone [WebSite] Started: [ pcmk-1 pcmk-2 ] ---- ---- [root@pcmk-1 ~]# pcs resource op defaults timeout: 240s ---- ---- [root@pcmk-1 ~]# pcs stonith ipmi-fencing (stonith:fence_ipmilan) Started ---- ---- [root@pcmk-1 ~]# pcs constraint Location Constraints: Ordering Constraints: start ClusterIP-clone then start WebSite-clone (kind:Mandatory) promote WebDataClone then start WebFS-clone (kind:Mandatory) start WebFS-clone then start WebSite-clone (kind:Mandatory) start dlm-clone then start WebFS-clone (kind:Mandatory) Colocation Constraints: WebSite-clone with ClusterIP-clone (score:INFINITY) WebFS-clone with WebDataClone (score:INFINITY) (with-rsc-role:Master) WebSite-clone with WebFS-clone (score:INFINITY) WebFS-clone with dlm-clone (score:INFINITY) +Ticket Constraints: ---- ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 12:05:37 2015 -Last change: Fri Aug 14 11:49:29 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -11 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 12:05:37 2018 +Last change: Fri Jan 12 11:49:29 2018 +2 nodes configured +11 resources configured 
Online: [ pcmk-1 pcmk-2 ] Full list of resources: ipmi-fencing (stonith:fence_ipmilan): Started pcmk-1 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 pcmk-2 ] Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] Clone Set: ClusterIP-clone [ClusterIP] (unique) ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-2 ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-1 Clone Set: WebFS-clone [WebFS] Started: [ pcmk-1 pcmk-2 ] Clone Set: WebSite-clone [WebSite] Started: [ pcmk-1 pcmk-2 ] -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- ---- [root@pcmk-1 ~]# pcs cluster cib ---- [source,XML] ---- - + - + - + - - + + - + - - + + - - + + - - + + - + - - + + - - + + - + - - + + - - + + ---- === Node List === ---- [root@pcmk-1 ~]# pcs status nodes Pacemaker Nodes: Online: pcmk-1 pcmk-2 Standby: Offline: ---- === Cluster Options === ---- [root@pcmk-1 ~]# pcs property Cluster Properties: cluster-infrastructure: corosync cluster-name: mycluster - dc-version: 1.1.12-a14efad + dc-version: 1.1.16-12.el7_4.5-94ff4df have-watchdog: false last-lrm-refresh: 1439569053 stonith-enabled: true ---- The output shows state information automatically obtained about the cluster, including: * *cluster-infrastructure* - the cluster communications layer in use * *cluster-name* - the cluster name chosen by the administrator when the cluster was created * *dc-version* - the version (including upstream source-code hash) of Pacemaker used on the Designated Controller The output also shows options set by the administrator that control the way the cluster operates, including: * *stonith-enabled=true* - whether the cluster is allowed to use STONITH resources === Resources === ==== Default Options ==== ---- [root@pcmk-1 ~]# pcs resource defaults resource-stickiness: 100 ---- This shows cluster option defaults that apply to every resource that does not explicitly set the option itself. 
Above: * *resource-stickiness* - Specify the aversion to moving healthy resources to other machines ==== Fencing ==== ---- [root@pcmk-1 ~]# pcs stonith show ipmi-fencing (stonith:fence_ipmilan) Started [root@pcmk-1 ~]# pcs stonith show ipmi-fencing Resource: ipmi-fencing (class=stonith type=fence_ipmilan) Attributes: ipaddr="10.0.0.1" login="testuser" passwd="acd123" pcmk_host_list="pcmk-1 pcmk-2" Operations: monitor interval=60s (fence-monitor-interval-60s) ---- ==== Service Address ==== Users of the services provided by the cluster require an unchanging address with which to access it. Additionally, we cloned the address so it will be active on both nodes. An iptables rule (created as part of the resource agent) is used to ensure that each request only gets processed by one of the two clone instances. The additional meta options tell the cluster that we want two instances of the clone (one "request bucket" for each node) and that if one node fails, then the remaining node should hold both. ---- [root@pcmk-1 ~]# pcs resource show ClusterIP-clone Clone: ClusterIP-clone Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2) Attributes: ip=192.168.122.120 cidr_netmask=32 clusterip_hash=sourceip Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s) stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s) monitor interval=30s (ClusterIP-monitor-interval-30s) ---- ==== DRBD - Shared Storage ==== Here, we define the DRBD service and specify which DRBD resource (from /etc/drbd.d/*.res) it should manage. We make it a master/slave resource and, in order to have an active/active setup, allow both instances to be promoted to master at the same time. We also set the notify option so that the cluster will tell DRBD agent when its peer changes state. 
---- [root@pcmk-1 ~]# pcs resource show WebDataClone Master: WebDataClone Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true Resource: WebData (class=ocf provider=linbit type=drbd) Attributes: drbd_resource=wwwdata Operations: start interval=0s timeout=240 (WebData-start-timeout-240) promote interval=0s timeout=90 (WebData-promote-timeout-90) demote interval=0s timeout=90 (WebData-demote-timeout-90) stop interval=0s timeout=100 (WebData-stop-timeout-100) monitor interval=60s (WebData-monitor-interval-60s) [root@pcmk-1 ~]# pcs constraint ref WebDataClone Resource: WebDataClone colocation-WebFS-WebDataClone-INFINITY order-WebDataClone-WebFS-mandatory ---- ==== Cluster Filesystem ==== The cluster filesystem ensures that files are read and written correctly. We need to specify the block device (provided by DRBD), where we want it mounted and that we are using GFS2. Again, it is a clone because it is intended to be active on both nodes. The additional constraints ensure that it can only be started on nodes with active DLM and DRBD instances. ---- [root@pcmk-1 ~]# pcs resource show WebFS-clone Clone: WebFS-clone Resource: WebFS (class=ocf provider=heartbeat type=Filesystem) Attributes: device=/dev/drbd1 directory=/var/www/html fstype=gfs2 Operations: start interval=0s timeout=60 (WebFS-start-timeout-60) stop interval=0s timeout=60 (WebFS-stop-timeout-60) monitor interval=20 timeout=40 (WebFS-monitor-interval-20) [root@pcmk-1 ~]# pcs constraint ref WebFS-clone Resource: WebFS-clone colocation-WebFS-WebDataClone-INFINITY colocation-WebSite-WebFS-INFINITY colocation-WebFS-clone-dlm-clone-INFINITY order-WebDataClone-WebFS-mandatory order-WebFS-WebSite-mandatory order-dlm-clone-WebFS-clone-mandatory ---- ==== Apache ==== Lastly, we have the actual service, Apache. We need only tell the cluster where to find its main configuration file and restrict it to running on nodes that have the required filesystem mounted and the IP address active. 
---- [root@pcmk-1 ~]# pcs resource show WebSite-clone Clone: WebSite-clone Resource: WebSite (class=ocf provider=heartbeat type=apache) Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status Operations: start interval=0s timeout=40s (WebSite-start-timeout-40s) stop interval=0s timeout=60s (WebSite-stop-timeout-60s) monitor interval=1min (WebSite-monitor-interval-1min) [root@pcmk-1 ~]# pcs constraint ref WebSite-clone Resource: WebSite-clone colocation-WebSite-ClusterIP-INFINITY colocation-WebSite-WebFS-INFINITY order-ClusterIP-WebSite-mandatory order-WebFS-WebSite-mandatory ---- diff --git a/doc/Clusters_from_Scratch/en-US/Book_Info.xml b/doc/Clusters_from_Scratch/en-US/Book_Info.xml index cf24b7f423..54b4b8b623 100644 --- a/doc/Clusters_from_Scratch/en-US/Book_Info.xml +++ b/doc/Clusters_from_Scratch/en-US/Book_Info.xml @@ -1,67 +1,71 @@ %BOOK_ENTITIES; ]> Clusters from Scratch Step-by-Step Instructions for Building Your First High-Availability Cluster Pacemaker - 1.1 + 2.0 9 - 0 + 1 - The purpose of this document is to provide a start-to-finish guide to building an example active/passive cluster with Pacemaker and show how it can be converted to an active/active one. + This document provides a step-by-step guide to building a simple high-availability cluster using Pacemaker. The example cluster will use: &DISTRO; &DISTRO_VERSION; as the host operating system Corosync to provide messaging and membership services, - Pacemaker to perform resource management, + Pacemaker 1.1.16 + While this guide is part of the document set for + Pacemaker 2.0, it demonstrates the version available in + the standard &DISTRO; repositories + to perform resource management, DRBD as a cost-effective alternative to shared storage, GFS2 as the cluster filesystem (in active/active mode) Given the graphical nature of the install process, a number of screenshots are included. 
However the guide is primarily composed of commands, the reasons for executing them and their expected outputs. diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt b/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt index 334267a44e..deecca3b43 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Active-Active.txt @@ -1,382 +1,374 @@ = Convert Cluster to Active/Active = The primary requirement for an Active/Active cluster is that the data required for your services is available, simultaneously, on both machines. Pacemaker makes no requirement on how this is achieved; you could use a SAN if you had one available, but since DRBD supports multiple Primaries, we can continue to use it here. == Install Cluster Filesystem Software == The only hitch is that we need to use a cluster-aware filesystem. The one we used earlier with DRBD, xfs, is not one of those. Both OCFS2 and GFS2 are supported; here, we will use GFS2. On both nodes, install the GFS2 command-line utilities and the Distributed Lock Manager (DLM) required by cluster filesystems: ---- # yum install -y gfs2-utils dlm ---- == Configure the Cluster for the DLM == The DLM needs to run on both nodes, so we'll start by creating a resource for it (using the *ocf:pacemaker:controld* resource script), and clone it: ---- [root@pcmk-1 ~]# pcs cluster cib dlm_cfg [root@pcmk-1 ~]# pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor interval=60s [root@pcmk-1 ~]# pcs -f dlm_cfg resource clone dlm clone-max=2 clone-node-max=1 [root@pcmk-1 ~]# pcs -f dlm_cfg resource show ClusterIP (ocf::heartbeat:IPaddr2): Started WebSite (ocf::heartbeat:apache): Started Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-2 ] Slaves: [ pcmk-1 ] WebFS (ocf::heartbeat:Filesystem): Started Clone Set: dlm-clone [dlm] Stopped: [ pcmk-1 pcmk-2 ] ---- Activate our new configuration, and see how the cluster responds: ---- [root@pcmk-1 ~]# pcs cluster cib-push 
dlm_cfg CIB updated [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 11:19:36 2015 -Last change: Fri Aug 14 11:19:28 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -8 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 11:19:36 2018 +Last change: Fri Jan 12 11:19:28 2018 +2 nodes configured +8 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-2 ] Slaves: [ pcmk-1 ] WebFS (ocf::heartbeat:Filesystem): Started pcmk-2 ipmi-fencing (stonith:fence_ipmilan): Started pcmk-1 Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- [[GFS2_prep]] == Create and Populate GFS2 Filesystem == Before we do anything to the existing partition, we need to make sure it is unmounted. We do this by telling the cluster to stop the WebFS resource. This will ensure that other resources (in our case, Apache) using WebFS are not only stopped, but stopped in the correct order. ---- [root@pcmk-1 ~]# pcs resource disable WebFS [root@pcmk-1 ~]# pcs resource ClusterIP (ocf::heartbeat:IPaddr2): Started WebSite (ocf::heartbeat:apache): Stopped Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-2 ] Slaves: [ pcmk-1 ] WebFS (ocf::heartbeat:Filesystem): Stopped Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] ---- You can see that both Apache and WebFS have been stopped, and that *pcmk-2* is the current master for the DRBD device. Now we can create a new GFS2 filesystem on the DRBD device. [WARNING] ========= This will erase all previous content stored on the DRBD device. 
Ensure you have a copy of any important data. ========= [IMPORTANT] =========== Run the next command on whichever node has the DRBD Primary role. Otherwise, you will receive the message: ----- /dev/drbd1: Read-only file system ----- =========== ----- [root@pcmk-2 ~]# mkfs.gfs2 -p lock_dlm -j 2 -t mycluster:web /dev/drbd1 It appears to contain an existing filesystem (xfs) This will destroy any data on /dev/drbd1 Are you sure you want to proceed? [y/n]y Device: /dev/drbd1 Block size: 4096 Device size: 1.00 GB (262127 blocks) Filesystem size: 1.00 GB (262126 blocks) Journals: 2 Resource groups: 5 Locking protocol: "lock_dlm" Lock table: "mycluster:web" UUID: 9a72c488-d8a7-24c9-ceee-add7a8ca52c2 ----- The `mkfs.gfs2` command required a number of additional parameters: * `-p lock_dlm` specifies that we want to use the kernel's DLM. * `-j 2` indicates that the filesystem should reserve enough space for two journals (one for each node that will access the filesystem). * `-t mycluster:web` specifies the lock table name. The format for this field is +pass:[clustername:fsname]+. For +pass:[clustername]+, we need to use the same value we specified originally with `pcs cluster setup --name` (which is also the value of *cluster_name* in +/etc/corosync/corosync.conf+). If you are unsure what your cluster name is, you can look in +/etc/corosync/corosync.conf+ or execute the command `pcs cluster corosync pcmk-1 | grep cluster_name`. Now we can (re-)populate the new filesystem with data (web pages). We'll create yet another variation on our home page. ----- [root@pcmk-2 ~]# mount /dev/drbd1 /mnt [root@pcmk-2 ~]# cat <<-END >/mnt/index.html My Test Site - GFS2 END [root@pcmk-2 ~]# chcon -R --reference=/var/www/html /mnt [root@pcmk-2 ~]# umount /dev/drbd1 [root@pcmk-2 ~]# drbdadm verify wwwdata ----- == Reconfigure the Cluster for GFS2 == With the WebFS resource stopped, let's update the configuration. 
---- [root@pcmk-1 ~]# pcs resource show WebFS Resource: WebFS (class=ocf provider=heartbeat type=Filesystem) Attributes: device=/dev/drbd1 directory=/var/www/html fstype=xfs Meta Attrs: target-role=Stopped Operations: start interval=0s timeout=60 (WebFS-start-timeout-60) stop interval=0s timeout=60 (WebFS-stop-timeout-60) monitor interval=20 timeout=40 (WebFS-monitor-interval-20) ---- The fstype option needs to be updated to *gfs2* instead of *xfs*. ---- [root@pcmk-1 ~]# pcs resource update WebFS fstype=gfs2 [root@pcmk-1 ~]# pcs resource show WebFS Resource: WebFS (class=ocf provider=heartbeat type=Filesystem) Attributes: device=/dev/drbd1 directory=/var/www/html fstype=gfs2 Meta Attrs: target-role=Stopped Operations: start interval=0s timeout=60 (WebFS-start-timeout-60) stop interval=0s timeout=60 (WebFS-stop-timeout-60) monitor interval=20 timeout=40 (WebFS-monitor-interval-20) ---- GFS2 requires that DLM be running, so we also need to set up new colocation and ordering constraints for it: ---- [root@pcmk-1 ~]# pcs constraint colocation add WebFS with dlm-clone INFINITY [root@pcmk-1 ~]# pcs constraint order dlm-clone then WebFS Adding dlm-clone WebFS (kind: Mandatory) (Options: first-action=start then-action=start) ---- == Clone the IP address == There's no point making the services active on both locations if we can't reach them both, so let's clone the IP address. The *IPaddr2* resource agent has built-in intelligence for when it is configured as a clone. It will utilize a multicast MAC address to have the local switch send the relevant packets to all nodes in the cluster, together with *iptables clusterip* rules on the nodes so that any given packet will be grabbed by exactly one node. This will give us a simple but effective form of load-balancing requests between our two nodes. 
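The idea that each packet is "grabbed by exactly one node" can be illustrated with a small Python sketch. This is only a conceptual model: the real decision is made per packet by the *iptables clusterip* rules, and `zlib.crc32` below is a stand-in hash chosen for the illustration, not the hash the kernel uses. The helper names `bucket_for` and `node_for` and the client addresses are hypothetical.

```python
import ipaddress
import zlib


def bucket_for(source_ip, bucket_count):
    """Map a source IP address to a bucket number in [0, bucket_count).

    zlib.crc32 is only a stand-in for illustration; it is NOT the hash
    used by the iptables CLUSTERIP target.
    """
    packed = ipaddress.ip_address(source_ip).packed
    return zlib.crc32(packed) % bucket_count


def node_for(source_ip, bucket_owners):
    """Return the node that owns the bucket this source IP hashes into."""
    return bucket_owners[bucket_for(source_ip, len(bucket_owners))]


# Hypothetical client addresses, used only for this demonstration.
clients = ["192.168.122.10", "192.168.122.11",
           "192.168.122.12", "192.168.122.13"]

# Normal operation: each node owns one "request bucket".
owners = ["pcmk-1", "pcmk-2"]
for ip in clients:
    print(ip, "->", node_for(ip, owners))

# If pcmk-1 goes down, pcmk-2 holds both buckets, so every
# request still has exactly one node that answers it.
owners_after_failure = ["pcmk-2", "pcmk-2"]
for ip in clients:
    print(ip, "->", node_for(ip, owners_after_failure))
```

Because the bucket depends only on the source address, a given client is always answered by the same node for as long as bucket ownership does not change.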
Let's start a new config, and clone our IP: ---- [root@pcmk-1 ~]# pcs cluster cib loadbalance_cfg [root@pcmk-1 ~]# pcs -f loadbalance_cfg resource clone ClusterIP \ clone-max=2 clone-node-max=2 globally-unique=true ---- * `clone-max=2` tells the resource agent to split packets this many ways. This should equal the number of nodes that can host the IP. * `clone-node-max=2` says that one node can run up to 2 instances of the clone. This should also equal the number of nodes that can host the IP, so that if any node goes down, another node can take over the failed node's "request bucket". Otherwise, requests intended for the failed node would be discarded. * `globally-unique=true` tells the cluster that one clone isn't identical to another (each handles a different "bucket"). This also tells the resource agent to insert *iptables* rules so each host only processes packets in its bucket(s). Notice that when the ClusterIP becomes a clone, the constraints referencing ClusterIP now reference the clone. This is done automatically by pcs. ---- [root@pcmk-1 ~]# pcs -f loadbalance_cfg constraint Location Constraints: Ordering Constraints: start ClusterIP-clone then start WebSite (kind:Mandatory) promote WebDataClone then start WebFS (kind:Mandatory) start WebFS then start WebSite (kind:Mandatory) start dlm-clone then start WebFS (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP-clone (score:INFINITY) WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master) WebSite with WebFS (score:INFINITY) WebFS with dlm-clone (score:INFINITY) +Ticket Constraints: ---- Now we must tell the resource how to decide which requests are processed by which hosts. To do this, we specify the *clusterip_hash* parameter. The value of *sourceip* means that the source IP address of incoming packets will be hashed; each node will process a certain range of hashes. 
---- [root@pcmk-1 ~]# pcs -f loadbalance_cfg resource update ClusterIP clusterip_hash=sourceip ---- Load our configuration to the cluster, and see how it responds. ----- [root@pcmk-1 ~]# pcs cluster cib-push loadbalance_cfg CIB updated [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 11:32:07 2015 -Last change: Fri Aug 14 11:32:04 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -9 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 11:32:07 2018 +Last change: Fri Jan 12 11:32:04 2018 +2 nodes configured +9 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: WebSite (ocf::heartbeat:apache): Stopped Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 ] Slaves: [ pcmk-2 ] WebFS (ocf::heartbeat:Filesystem): Stopped ipmi-fencing (stonith:fence_ipmilan): Started pcmk-1 Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] Clone Set: ClusterIP-clone [ClusterIP] (unique) ClusterIP:0 (ocf::heartbeat:IPaddr2): Started pcmk-1 ClusterIP:1 (ocf::heartbeat:IPaddr2): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- If desired, you can demonstrate that all request buckets are working by using a tool such as `arping` from several source hosts to see which host responds to each. == Clone the Filesystem and Apache Resources == Now that we have a cluster filesystem ready to go, and our nodes can load-balance requests to a shared IP address, we can configure the cluster so both nodes mount the filesystem and respond to web requests. Clone the filesystem and Apache resources in a new configuration. Notice how pcs automatically updates the relevant constraints again. 
---- [root@pcmk-1 ~]# pcs cluster cib active_cfg [root@pcmk-1 ~]# pcs -f active_cfg resource clone WebFS [root@pcmk-1 ~]# pcs -f active_cfg resource clone WebSite [root@pcmk-1 ~]# pcs -f active_cfg constraint Location Constraints: Ordering Constraints: start ClusterIP-clone then start WebSite-clone (kind:Mandatory) promote WebDataClone then start WebFS-clone (kind:Mandatory) start WebFS-clone then start WebSite-clone (kind:Mandatory) start dlm-clone then start WebFS-clone (kind:Mandatory) Colocation Constraints: WebSite-clone with ClusterIP-clone (score:INFINITY) WebFS-clone with WebDataClone (score:INFINITY) (with-rsc-role:Master) WebSite-clone with WebFS-clone (score:INFINITY) WebFS-clone with dlm-clone (score:INFINITY) +Ticket Constraints: ---- Tell the cluster that it is now allowed to promote both instances to be DRBD Primary (aka. master). ----- [root@pcmk-1 ~]# pcs -f active_cfg resource update WebDataClone master-max=2 ----- Finally, load our configuration to the cluster, and re-enable the WebFS resource (which we disabled earlier). ----- [root@pcmk-1 ~]# pcs cluster cib-push active_cfg CIB updated [root@pcmk-1 ~]# pcs resource enable WebFS ----- After all the processes are started, the status should look similar to this. ----- [root@pcmk-1 ~]# pcs resource Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 pcmk-2 ] Clone Set: dlm-clone [dlm] Started: [ pcmk-1 pcmk-2 ] Clone Set: ClusterIP-clone [ClusterIP] (unique) ClusterIP:0 (ocf::heartbeat:IPaddr2): Started ClusterIP:1 (ocf::heartbeat:IPaddr2): Started Clone Set: WebFS-clone [WebFS] Started: [ pcmk-1 pcmk-2 ] Clone Set: WebSite-clone [WebSite] Started: [ pcmk-1 pcmk-2 ] ----- == Test Failover == Testing failover is left as an exercise for the reader. For example, you can put one node into standby mode, use `pcs status` to confirm that its ClusterIP clone was moved to the other node, and use `arping` to verify that packets are not being lost from any source host. 
[NOTE] ==== You may find that when a failed node rejoins the cluster, both ClusterIP clones stay on one node, due to the resource stickiness. While this works fine, it effectively eliminates load-balancing and returns the cluster to an active-passive setup again. You can avoid this by disabling stickiness for the IP address resource: ---- [root@pcmk-1 ~]# pcs resource meta ClusterIP resource-stickiness=0 ---- ==== diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt b/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt index 8ceccd1d9f..bb3586ab7d 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Active-Passive.txt @@ -1,423 +1,391 @@ = Create an Active/Passive Cluster = == Explore the Existing Configuration == When Pacemaker starts up, it automatically records the number and details of the nodes in the cluster, as well as which stack is being used and the version of Pacemaker being used. The first few lines of output should look like this: ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster WARNING: no stonith devices and stonith-enabled is not false -Last updated: Tue Dec 16 16:15:29 2014 -Last change: Tue Dec 16 15:49:47 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -0 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 16:15:29 2018 +Last change: Fri Jan 12 15:49:47 2018 +2 nodes configured +0 resources configured Online: [ pcmk-1 pcmk-2 ] ---- If you are not afraid of XML, you can see the raw cluster configuration and status by using the `pcs cluster cib` command. .The last XML you'll see in this document ====== ---- [root@pcmk-1 ~]# pcs cluster cib ---- [source,XML] ---- - + - + ---- ====== Before we make any changes, it's a good idea to check the validity of the configuration. 
---- [root@pcmk-1 ~]# crm_verify -L -V error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity Errors found during check: config not valid ---- As you can see, the tool has found some errors. In order to guarantee the safety of your data, footnote:[If the data is corrupt, there is little point in continuing to make it available] the default for STONITH footnote:[A common node fencing mechanism. Used to ensure data integrity by powering off "bad" nodes] in Pacemaker is *enabled*. However, it also knows when no STONITH configuration has been supplied and reports this as a problem (since the cluster would not be able to make progress if a situation requiring node fencing arose). We will disable this feature for now and configure it later. To disable STONITH, set the *stonith-enabled* cluster option to false: ---- [root@pcmk-1 ~]# pcs property set stonith-enabled=false [root@pcmk-1 ~]# crm_verify -L ---- With the new cluster option set, the configuration is now valid. [WARNING] ========= The use of `stonith-enabled=false` is completely inappropriate for a production cluster. It tells the cluster to simply pretend that failed nodes are safely powered off. Some vendors will refuse to support clusters that have STONITH disabled. We disable STONITH here only to defer the discussion of its configuration, which can differ widely from one installation to the next. See <<_what_is_stonith>> for information on why STONITH is important and details on how to configure it. ========= == Add a Resource == Our first resource will be a unique IP address that the cluster can bring up on either node. Regardless of where any cluster service(s) are running, end users need a consistent address to contact them on. 
Here, I will choose 192.168.122.120 as the floating address, give it the imaginative name ClusterIP and tell the cluster to check whether it is running every 30 seconds. [WARNING] =========== The chosen address must not already be in use on the network. Do not reuse an IP address one of the nodes already has configured. =========== ---- [root@pcmk-1 ~]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \ ip=192.168.122.120 cidr_netmask=32 op monitor interval=30s ---- Another important piece of information here is *ocf:heartbeat:IPaddr2*. This tells Pacemaker three things about the resource you want to add: * The first field (*ocf* in this case) is the standard to which the resource script conforms and where to find it. * The second field (*heartbeat* in this case) is standard-specific; for OCF resources, it tells the cluster which OCF namespace the resource script is in. * The third field (*IPaddr2* in this case) is the name of the resource script. To obtain a list of the available resource standards (the *ocf* part of *ocf:heartbeat:IPaddr2*), run: ---- [root@pcmk-1 ~]# pcs resource standards -ocf lsb +ocf service systemd -stonith ---- To obtain a list of the available OCF resource providers (the *heartbeat* part of *ocf:heartbeat:IPaddr2*), run: ---- [root@pcmk-1 ~]# pcs resource providers heartbeat openstack pacemaker ---- Finally, if you want to see all the resource agents available for a specific OCF provider (the *IPaddr2* part of *ocf:heartbeat:IPaddr2*), run: ---- [root@pcmk-1 ~]# pcs resource agents ocf:heartbeat +apache +clvm +conntrackd CTDB +db2 Delay -Dummy -Filesystem -IPaddr -IPaddr2 . . (skipping lots of resources to save space) . 
-rsyncd -slapd symlink tomcat +VirtualDomain +Xinetd ---- Now, verify that the IP resource has been added, and display the cluster's status to see that it is now active: ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Tue Dec 16 17:44:40 2014 -Last change: Tue Dec 16 17:44:26 2014 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -1 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 17:44:40 2018 +Last change: Fri Jan 12 17:44:26 2018 +2 nodes configured +1 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- == Perform a Failover == Since our ultimate goal is high availability, we should test failover of our new resource before moving on. First, find the node on which the IP address is running. ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Tue Dec 16 17:44:40 2014 -Last change: Tue Dec 16 17:44:26 2014 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -1 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 17:44:40 2018 +Last change: Fri Jan 12 17:44:26 2018 +2 nodes configured +1 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 ---- You can see that the status of the *ClusterIP* resource is *Started* on a particular node (in this example, *pcmk-1*). Shut down Pacemaker and Corosync on that machine to trigger a failover. ---- [root@pcmk-1 ~]# pcs cluster stop pcmk-1 -Stopping Cluster... +Stopping Cluster (pacemaker)... +Stopping Cluster (corosync)... 
---- [NOTE] ====== A cluster command such as +pcs cluster stop pass:[nodename]+ can be run from any node in the cluster, not just the affected node. ====== Verify that pacemaker and corosync are no longer running: ---- [root@pcmk-1 ~]# pcs status Error: cluster is not currently running on this node ---- Go to the other node, and check the cluster status. ---- [root@pcmk-2 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 10:30:56 2014 -Last change: Tue Dec 16 17:44:26 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -1 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 18:30:56 2018 +Last change: Fri Jan 12 17:44:26 2018 +2 nodes configured +1 resources configured Online: [ pcmk-2 ] OFFLINE: [ pcmk-1 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- -Notice that *pcmk-1* is *OFFLINE* for cluster purposes (its *PCSD* is still -*Online*, allowing it to receive `pcs` commands, but it is not participating in +Notice that *pcmk-1* is *OFFLINE* for cluster purposes (its *pcsd* is still +active, allowing it to receive `pcs` commands, but it is not participating in the cluster). Also notice that *ClusterIP* is now running on *pcmk-2* -- failover happened automatically, and no errors are reported. [IMPORTANT] .Quorum ==== If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (aka. _partitions_), _quorum_ is used to prevent resources from starting on more nodes than desired, which would risk data corruption. A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true: .... 
total_nodes < 2 * active_nodes .... For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; pacemaker's default behavior in such cases is to stop all resources, in order to prevent data corruption. Two-node clusters are a special case. By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless, footnote:[Some would argue that two-node clusters are always pointless, but that is an argument for another time] but corosync has the ability to treat two-node clusters as if only one node is required for quorum. The `pcs cluster setup` command will automatically configure *two_node: 1* in +corosync.conf+, so a two-node cluster will "just work". If you are using a different cluster shell, you will have to configure -+corosync.conf+ appropriately yourself. If you are using older versions of -corosync, you will have to ignore quorum at the pacemaker level, using `pcs -property set no-quorum-policy=ignore` (or the equivalent command if you are -using a different cluster shell). ++corosync.conf+ appropriately yourself. ==== Now, simulate node recovery by restarting the cluster stack on *pcmk-1*, and check the cluster's status. (It may take a little while before the cluster gets going on the node, but it eventually will look like the below.) ---- [root@pcmk-1 ~]# pcs cluster start pcmk-1 pcmk-1: Starting Cluster...
[root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 10:50:11 2014 -Last change: Tue Dec 16 17:44:26 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -1 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 18:50:11 2018 +Last change: Fri Jan 12 17:44:26 2018 +2 nodes configured +1 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- -[NOTE] -====== -With older versions of pacemaker, the cluster might move the IP back to its -original location (*pcmk-1*). Usually, this is no longer the case. -====== - == Prevent Resources from Moving after Recovery == In most circumstances, it is highly desirable to prevent healthy resources from being moved around the cluster. Moving resources almost always requires a period of downtime. For complex services such as databases, this period can be quite long. To address this, Pacemaker has the concept of resource _stickiness_, which controls how strongly a service prefers to stay running where it is. You may like to think of it as the "cost" of any downtime. By default, Pacemaker assumes there is zero cost associated with moving resources and will do so to achieve "optimal" footnote:[Pacemaker's definition of optimal may not always agree with that of a human's. The order in which Pacemaker processes lists of resources and nodes creates implicit preferences in situations where the administrator has not explicitly specified them.] resource placement. We can specify a different stickiness for every resource, but it is often sufficient to change the default. 
---- [root@pcmk-1 ~]# pcs resource defaults resource-stickiness=100 [root@pcmk-1 ~]# pcs resource defaults resource-stickiness: 100 ---- - -[NOTE] -====== -Older versions of `pcs` required that `rsc` be added after `resource` in the -above commands. -====== diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt b/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt index cee112a9a7..f460015de3 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Apache.txt @@ -1,434 +1,415 @@ = Add Apache HTTP Server as a Cluster Service = indexterm:[Apache HTTP Server] Now that we have a basic but functional active/passive two-node cluster, we're ready to add some real services. We're going to start with Apache HTTP Server because it is a feature of many clusters and relatively simple to configure. == Install Apache == Before continuing, we need to make sure Apache is installed on both hosts. We also need the wget tool in order for the cluster to be able to check the status of the Apache server. ---- # yum install -y httpd wget # firewall-cmd --permanent --add-service=http # firewall-cmd --reload ---- [IMPORTANT] ==== Do *not* enable the httpd service. Services that are intended to be managed via the cluster software should never be managed by the OS. It is often useful, however, to manually start the service, verify that it works, then stop it again, before adding it to the cluster. This allows you to resolve any non-cluster-related problems before continuing. Since this is a simple example, we'll skip that step here. ==== == Create Website Documents == We need to create a page for Apache to serve. On &DISTRO; &DISTRO_VERSION;, the default Apache document root is /var/www/html, so we'll create an index file there. 
For the moment, we will simplify things by serving a static site and manually synchronizing the data between the two nodes, so run this command on both nodes: ----- # cat <<-END >/var/www/html/index.html My Test Site - $(hostname) END ----- == Enable the Apache status URL == indexterm:[Apache HTTP Server,/server-status] In order to monitor the health of your Apache instance, and recover it if it fails, the resource agent used by Pacemaker assumes the server-status URL is available. On both nodes, enable the URL with: ---- # cat <<-END >/etc/httpd/conf.d/status.conf <Location /server-status> SetHandler server-status Require local </Location> END ---- [NOTE] ====== If you are using a different operating system, server-status may already be enabled or may be configurable in a different location. If you are using a version of Apache HTTP Server less than 2.4, the syntax will be different. ====== == Configure the Cluster == indexterm:[Apache HTTP Server,Apache resource configuration] At this point, Apache is ready to go, and all that needs to be done is to add it to the cluster. Let's call the resource WebSite. We need to use an OCF resource script called apache in the heartbeat namespace. footnote:[Compare the key used here, *ocf:heartbeat:apache*, with the one we used earlier for the IP address, *ocf:heartbeat:IPaddr2*] The script's only required parameter is the path to the main Apache configuration file, and we'll tell the cluster to check once a minute that Apache is still running. ---- [root@pcmk-1 ~]# pcs resource create WebSite ocf:heartbeat:apache \ configfile=/etc/httpd/conf/httpd.conf \ statusurl="http://localhost/server-status" \ op monitor interval=1min ---- By default, the operation timeout for all resources' start, stop, and monitor operations is 20 seconds. In many cases, this timeout period is less than a particular resource's advised timeout period. For the purposes of this tutorial, we will adjust the global operation timeout default to 240 seconds.
---- [root@pcmk-1 ~]# pcs resource op defaults timeout=240s [root@pcmk-1 ~]# pcs resource op defaults timeout: 240s ---- [NOTE] ====== In a production cluster, it is usually better to adjust each resource's start, stop, and monitor timeouts to values that are appropriate to the behavior observed in your environment, rather than adjust the global default. ====== After a short delay, we should see the cluster start Apache. ----- [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 12:40:41 2014 -Last change: Wed Dec 17 12:40:05 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -2 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 12:40:41 2018 +Last change: Fri Jan 12 12:40:05 2018 +2 nodes configured +2 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-1 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- Wait a moment, the WebSite resource isn't running on the same host as our IP address! [NOTE] ====== If, in the `pcs status` output, you see the WebSite resource has failed to start, then you've likely not enabled the status URL correctly. You can check whether this is the problem by running: .... wget -O - http://localhost/server-status .... If you see *Not Found* or *Forbidden* in the output, then this is likely the problem. Ensure that the *<Location /server-status>* block is correct. ====== == Ensure Resources Run on the Same Host == To reduce the load on any one machine, Pacemaker will generally try to spread the configured resources across the cluster nodes. However, we can tell the cluster that two resources are related and need to run on the same host (or not at all).
Here, we instruct the cluster that WebSite can only run on the host that ClusterIP is active on. To achieve this, we use a _colocation constraint_ that indicates it is mandatory for WebSite to run on the same node as ClusterIP. The "mandatory" part of the colocation constraint is indicated by using a score of INFINITY. The INFINITY score also means that if ClusterIP is not active anywhere, WebSite will not be permitted to run. [NOTE] ======= If ClusterIP is not active anywhere, WebSite will not be permitted to run anywhere. ======= [IMPORTANT] =========== Colocation constraints are "directional", in that they imply certain things about the order in which the two resources will have a location chosen. In this case, we're saying that *WebSite* needs to be placed on the same machine as *ClusterIP*, which implies that the cluster must know the location of *ClusterIP* before choosing a location for *WebSite*. =========== ----- [root@pcmk-1 ~]# pcs constraint colocation add WebSite with ClusterIP INFINITY [root@pcmk-1 ~]# pcs constraint Location Constraints: Ordering Constraints: Colocation Constraints: WebSite with ClusterIP (score:INFINITY) +Ticket Constraints: [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 13:57:58 2014 -Last change: Wed Dec 17 13:57:22 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -2 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 13:57:58 2018 +Last change: Fri Jan 12 13:57:22 2018 +2 nodes configured +2 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- == Ensure Resources Start and Stop in Order == Like many 
services, Apache can be configured to bind to specific IP addresses on a host or to the wildcard IP address. If Apache binds to the wildcard, it doesn't matter whether an IP address is added before or after Apache starts; Apache will respond on that IP just the same. However, if Apache binds only to certain IP address(es), the order matters: If the address is added after Apache starts, Apache won't respond on that address. To be sure our WebSite responds regardless of Apache's address configuration, we need to make sure ClusterIP not only runs on the same node, but starts before WebSite. A colocation constraint only ensures the resources run together, not the order in which they are started and stopped. We do this by adding an ordering constraint. By default, all order constraints are mandatory, which means that the recovery of ClusterIP will also trigger the recovery of WebSite. ----- [root@pcmk-1 ~]# pcs constraint order ClusterIP then WebSite Adding ClusterIP WebSite (kind: Mandatory) (Options: first-action=start then-action=start) [root@pcmk-1 ~]# pcs constraint Location Constraints: Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) +Ticket Constraints: ----- == Prefer One Node Over Another == Pacemaker does not rely on any sort of hardware symmetry between nodes, so it may well be that one machine is more powerful than the other. In such cases, it makes sense to host the resources on the more powerful node if it is available. To do this, we create a location constraint. In the location constraint below, we are saying the WebSite resource prefers the node pcmk-1 with a score of 50. Here, the score indicates how badly we'd like the resource to run at this location. 
----- [root@pcmk-1 ~]# pcs constraint location WebSite prefers pcmk-1=50 [root@pcmk-1 ~]# pcs constraint Location Constraints: Resource: WebSite Enabled on: pcmk-1 (score:50) Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) +Ticket Constraints: [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 14:11:49 2014 -Last change: Wed Dec 17 14:11:20 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -2 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 14:11:49 2018 +Last change: Fri Jan 12 14:11:20 2018 +2 nodes configured +2 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- Wait a minute, the resources are still on pcmk-2! Even though WebSite now prefers to run on pcmk-1, that preference is (intentionally) less than the resource stickiness (how much we preferred not to have unnecessary downtime). To see the current placement scores, you can use a tool called crm_simulate. 
---- [root@pcmk-1 ~]# crm_simulate -sL Current cluster status: Online: [ pcmk-1 pcmk-2 ] ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 Allocation scores: native_color: ClusterIP allocation score on pcmk-1: 50 native_color: ClusterIP allocation score on pcmk-2: 200 native_color: WebSite allocation score on pcmk-1: -INFINITY native_color: WebSite allocation score on pcmk-2: 100 Transition Summary: ---- == Move Resources Manually == There are always times when an administrator needs to override the cluster and force resources to move to a specific location. In this example, we will force the WebSite to move to pcmk-1 by updating our previous location constraint with a score of INFINITY. ----- [root@pcmk-1 ~]# pcs constraint location WebSite prefers pcmk-1=INFINITY [root@pcmk-1 ~]# pcs constraint Location Constraints: Resource: WebSite Enabled on: pcmk-1 (score:INFINITY) Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) +Ticket Constraints: [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Wed Dec 17 14:19:34 2014 -Last change: Wed Dec 17 14:18:37 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -2 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 14:19:34 2018 +Last change: Fri Jan 12 14:18:37 2018 +2 nodes configured +2 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 WebSite (ocf::heartbeat:apache): Started pcmk-1 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- Once we've finished whatever activity required us to move the resources to pcmk-1 (in our case nothing), we can then allow the cluster to resume 
normal operation by removing the new constraint. Since we previously configured a default stickiness, the resources will remain on pcmk-1. First, use the `--full` option to get the constraint's ID: ----- [root@pcmk-1 ~]# pcs constraint --full Location Constraints: Resource: WebSite Enabled on: pcmk-1 (score:INFINITY) (id:location-WebSite-pcmk-1-INFINITY) Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) (id:order-ClusterIP-WebSite-mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) (id:colocation-WebSite-ClusterIP-INFINITY) +Ticket Constraints: ----- Then remove the desired constraint using its ID: ----- [root@pcmk-1 ~]# pcs constraint remove location-WebSite-pcmk-1-INFINITY [root@pcmk-1 ~]# pcs constraint Location Constraints: Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) +Ticket Constraints: ----- Note that the location constraint is now gone. If we check the cluster status, we can also see that (as expected) the resources are still active on pcmk-1.
----- # pcs status Cluster name: mycluster -Last updated: Wed Dec 17 14:25:21 2014 -Last change: Wed Dec 17 14:24:29 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -2 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 14:25:21 2018 +Last change: Fri Jan 12 14:24:29 2018 +2 nodes configured +2 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 WebSite (ocf::heartbeat:apache): Started pcmk-1 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ----- diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt b/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt index fd82fd1194..f537b6b0e5 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Installation.txt @@ -1,489 +1,489 @@ = Installation = == Install &DISTRO; &DISTRO_VERSION; == === Boot the Install Image === Download the 4GB -http://isoredirect.centos.org/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1503-01.iso[&DISTRO; +http://isoredirect.centos.org/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1708.iso[&DISTRO; &DISTRO_VERSION; DVD ISO]. Use the image to boot a virtual machine, or burn it to a DVD or USB drive and boot a physical server from that. After starting the installation, select your language and keyboard layout at the welcome screen. .&DISTRO; &DISTRO_VERSION; Installation Welcome Screen image::images/Welcome.png["Welcome to &DISTRO; &DISTRO_VERSION;",align="center",scaledwidth="100%"] === Installation Options === At this point, you get a chance to tweak the default installation options. 
.&DISTRO; &DISTRO_VERSION; Installation Summary Screen image::images/Installer.png["&DISTRO; &DISTRO_VERSION; Installation Summary",align="center",scaledwidth="100%"] Ignore the *SOFTWARE SELECTION* section (try saying that 10 times quickly). The *Infrastructure Server* environment does have add-ons with much of the software we need, but we will leave it as a *Minimal Install* here, so that we can see exactly what software is required later. === Configure Network === In the *NETWORK & HOSTNAME* section: - Edit *Host Name:* as desired. For this example, we will use *pcmk-1.localdomain*. - Select your network device, press *Configure...*, and manually assign a fixed IP address. For this example, we'll use 192.168.122.101 under *IPv4 Settings* (with an appropriate netmask, gateway and DNS server). - Flip the switch to turn your network device on. [IMPORTANT] =========== Do not accept the default network settings. Cluster machines should never obtain an IP address via DHCP, because DHCP's periodic address renewal will interfere with corosync. =========== === Configure Disk === By default, the installer's automatic partitioning will use LVM (which allows us to dynamically change the amount of space allocated to a given partition). However, it allocates all free space to the +/+ (aka. *root*) partition, which cannot be reduced in size later (dynamic increases are fine). In order to follow the DRBD and GFS2 portions of this guide, we need to reserve space on each machine for a replicated volume. Enter the *INSTALLATION DESTINATION* section, ensure the hard drive you want to install to is selected, select *I will configure partitioning*, and press *Done*. In the *MANUAL PARTITIONING* screen that comes next, click the option to create mountpoints automatically. Select the +/+ mountpoint, and reduce the desired capacity by 1GiB or so. 
Select *Modify...* by the volume group name, and change the *Size policy:* to *As large as possible*, to make the reclaimed space available inside the LVM volume group. We'll add the additional volume later. === Configure Time Synchronization === It is highly recommended to enable NTP on your cluster nodes. Doing so ensures all nodes agree on the current time and makes reading log files significantly easier. &DISTRO; will enable NTP automatically. If you want to change any time-related settings (such as time zone or NTP server), you can do this in the *TIME & DATE* section. === Finish Install === Select *Begin Installation*. Once it completes, set a root password, and reboot as instructed. For the purposes of this document, it is not necessary to create any additional users. After the node reboots, you'll see a login prompt on the console. Login using *root* and the password you created earlier. .&DISTRO; &DISTRO_VERSION; Console Prompt image::images/Console.png["&DISTRO; &DISTRO_VERSION; Console",align="center",scaledwidth="100%"] [NOTE] ====== From here on, we're going to be working exclusively from the terminal. ====== == Configure the OS == === Verify Networking === Ensure that the machine has the static IP address you configured earlier. ----- [root@pcmk-1 ~]# ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000 link/ether 52:54:00:d7:d6:08 brd ff:ff:ff:ff:ff:ff inet 192.168.122.101/24 brd 192.168.122.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::5054:ff:fed7:d608/64 scope link valid_lft forever preferred_lft forever ----- [NOTE] ===== If you ever need to change the node's IP address from the command line, follow these instructions, replacing *${device}* with the name of your network device: .... 
[root@pcmk-1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-${device} # manually edit as desired [root@pcmk-1 ~]# nmcli dev disconnect ${device} [root@pcmk-1 ~]# nmcli con reload ${device} [root@pcmk-1 ~]# nmcli con up ${device} .... This makes *NetworkManager* aware that a change was made on the config file. ===== Next, ensure that the routes are as expected: ----- [root@pcmk-1 ~]# ip route default via 192.168.122.1 dev eth0 proto static metric 100 192.168.122.0/24 dev eth0 proto kernel scope link src 192.168.122.101 metric 100 ----- If there is no line beginning with *default via*, then you may need to add a line such as [source,Bash] GATEWAY="192.168.122.1" to the device configuration using the same process as described above for changing the IP address. Now, check for connectivity to the outside world. Start small by testing whether we can reach the gateway we configured. ----- [root@pcmk-1 ~]# ping -c 1 192.168.122.1 PING 192.168.122.1 (192.168.122.1) 56(84) bytes of data. 64 bytes from 192.168.122.1: icmp_req=1 ttl=64 time=0.249 ms --- 192.168.122.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.249/0.249/0.249/0.000 ms ----- Now try something external; choose a location you know should be available. ----- [root@pcmk-1 ~]# ping -c 1 www.google.com PING www.l.google.com (173.194.72.106) 56(84) bytes of data. 64 bytes from tf-in-f106.1e100.net (173.194.72.106): icmp_req=1 ttl=41 time=167 ms --- www.l.google.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 167.618/167.618/167.618/0.000 ms ----- === Login Remotely === The console isn't a very friendly place to work from, so we will now switch to accessing the machine remotely via SSH where we can use copy and paste, etc. From another host, check whether we can see the new host at all: ----- beekhof@f16 ~ # ping -c 1 192.168.122.101 PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data. 
64 bytes from 192.168.122.101: icmp_req=1 ttl=64 time=1.01 ms --- 192.168.122.101 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 1.012/1.012/1.012/0.000 ms ----- Next, login as root via SSH. ----- beekhof@f16 ~ # ssh -l root 192.168.122.101 The authenticity of host '192.168.122.101 (192.168.122.101)' can't be established. ECDSA key fingerprint is 6e:b7:8f:e2:4c:94:43:54:a8:53:cc:20:0f:29:a4:e0. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added '192.168.122.101' (ECDSA) to the list of known hosts. root@192.168.122.101's password: Last login: Tue Aug 11 13:14:39 2015 [root@pcmk-1 ~]# ----- === Apply Updates === Apply any package updates released since your installation image was created: ---- [root@pcmk-1 ~]# yum update ---- === Use Short Node Names === During installation, we filled in the machine's fully qualified domain name (FQDN), which can be rather long when it appears in cluster logs and status output. See for yourself how the machine identifies itself: (((Nodes, short name))) ---- [root@pcmk-1 ~]# uname -n pcmk-1.localdomain ---- (((Nodes, Domain name (Query)))) We can use the `hostnamectl` tool to strip off the domain name: ---- [root@pcmk-1 ~]# hostnamectl set-hostname $(uname -n | sed s/\\..*//) ---- (((Nodes, Domain name (Remove from host name)))) Now, check that the machine is using the correct name: ---- [root@pcmk-1 ~]# uname -n pcmk-1 ---- == Repeat for Second Node == Repeat the Installation steps so far, so that you have two nodes ready to have the cluster software installed. For the purposes of this document, the additional node is called pcmk-2 with address 192.168.122.102. == Configure Communication Between Nodes == === Configure Host Name Resolution === Confirm that you can communicate between the two new nodes: ---- [root@pcmk-1 ~]# ping -c 3 192.168.122.102 PING 192.168.122.102 (192.168.122.102) 56(84) bytes of data. 
64 bytes from 192.168.122.102: icmp_seq=1 ttl=64 time=0.343 ms 64 bytes from 192.168.122.102: icmp_seq=2 ttl=64 time=0.402 ms 64 bytes from 192.168.122.102: icmp_seq=3 ttl=64 time=0.558 ms --- 192.168.122.102 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.343/0.434/0.558/0.092 ms ---- Now we need to make sure we can communicate with the machines by name. If you have a DNS server, add additional entries for the two machines. Otherwise, you'll need to add the machines to +/etc/hosts+ on both nodes. Below are the entries for my cluster nodes: ---- [root@pcmk-1 ~]# grep pcmk /etc/hosts 192.168.122.101 pcmk-1.clusterlabs.org pcmk-1 192.168.122.102 pcmk-2.clusterlabs.org pcmk-2 ---- We can now verify the setup by again using ping: ---- [root@pcmk-1 ~]# ping -c 3 pcmk-2 PING pcmk-2.clusterlabs.org (192.168.122.102) 56(84) bytes of data. 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=1 ttl=64 time=0.164 ms 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=2 ttl=64 time=0.475 ms 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=3 ttl=64 time=0.186 ms --- pcmk-2.clusterlabs.org ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2001ms rtt min/avg/max/mdev = 0.164/0.275/0.475/0.141 ms ---- === Configure SSH === SSH is a convenient and secure way to copy files and perform commands remotely. For the purposes of this guide, we will create a key without a password (using the -N option) so that we can perform remote actions without being prompted. (((SSH))) [WARNING] ========= Unprotected SSH keys (those without a password) are not recommended for servers exposed to the outside world. We use them here only to simplify the demo. ========= Create a new key and allow anyone with that key to log in: .Creating and Activating a new SSH Key ---- [root@pcmk-1 ~]# ssh-keygen -t dsa -f ~/.ssh/id_dsa -N "" Generating public/private dsa key pair.
Your identification has been saved in /root/.ssh/id_dsa. Your public key has been saved in /root/.ssh/id_dsa.pub. The key fingerprint is: 91:09:5c:82:5a:6a:50:08:4e:b2:0c:62:de:cc:74:44 root@pcmk-1.clusterlabs.org The key's randomart image is: +--[ DSA 1024]----+ |==.ooEo.. | |X O + .o o | | * A + | | + . | | . S | | | | | | | | | +-----------------+ [root@pcmk-1 ~]# cp ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys ---- (((Creating and Activating a new SSH Key))) Install the key on the other node: ---- [root@pcmk-1 ~]# scp -r ~/.ssh pcmk-2: The authenticity of host 'pcmk-2 (192.168.122.102)' can't be established. ECDSA key fingerprint is a4:f5:b2:34:9d:86:2b:34:a2:87:37:b9:ca:68:52:ec. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'pcmk-2,192.168.122.102' (ECDSA) to the list of known hosts. root@pcmk-2's password: id_dsa.pub 100% 616 0.6KB/s 00:00 id_dsa 100% 672 0.7KB/s 00:00 known_hosts 100% 400 0.4KB/s 00:00 authorized_keys 100% 616 0.6KB/s 00:00 ---- Test that you can now run commands remotely, without being prompted: ---- [root@pcmk-1 ~]# ssh pcmk-2 -- uname -n pcmk-2 ---- == Install the Cluster Software == Fire up a shell on both nodes and run the following to install pacemaker, and while we're at it, some command-line tools to make our lives easier: ---- # yum install -y pacemaker pcs psmisc policycoreutils-python ---- [IMPORTANT] =========== This document will show commands that need to be executed on both nodes with a simple `#` prompt. Be sure to run them on each node individually. =========== [NOTE] =========== This document uses `pcs` for cluster management. Other alternatives, such as `crmsh`, are available, but their syntax will differ from the examples used here. 
=========== == Configure the Cluster Software == === Allow cluster services through firewall === On each node, allow cluster-related services through the local firewall: ---- # firewall-cmd --permanent --add-service=high-availability success # firewall-cmd --reload success ---- [NOTE] ====== If you are using iptables directly, or some other firewall solution besides firewalld, simply open the following ports, which can be used by various clustering components: TCP ports 2224, 3121, and 21064, and UDP port 5405. If you run into any problems during testing, you might want to disable the firewall and SELinux entirely until you have everything working. This may create significant security issues and should not be performed on machines that will be exposed to the outside world, but may be appropriate during development and testing on a protected host. To disable security measures: ---- [root@pcmk-1 ~]# setenforce 0 [root@pcmk-1 ~]# sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config [root@pcmk-1 ~]# systemctl mask firewalld.service [root@pcmk-1 ~]# systemctl stop firewalld.service [root@pcmk-1 ~]# iptables --flush ---- ====== === Enable pcs Daemon === Before the cluster can be configured, the pcs daemon must be started and enabled to start at boot time on each node. This daemon works with the pcs command-line interface to manage synchronizing the corosync configuration across all nodes in the cluster. Start and enable the daemon by issuing the following commands on each node: ---- # systemctl start pcsd.service # systemctl enable pcsd.service ln -s '/usr/lib/systemd/system/pcsd.service' '/etc/systemd/system/multi-user.target.wants/pcsd.service' ---- The installed packages will create a *hacluster* user with a disabled password. While this is fine for running `pcs` commands locally, the account needs a login password in order to perform such tasks as syncing the corosync configuration, or starting and stopping the cluster on other nodes. 
This tutorial will make use of such commands, so now we will set a password for the *hacluster* user, using the same password on both nodes: ---- # passwd hacluster Changing password for user hacluster. New password: Retype new password: passwd: all authentication tokens updated successfully. ---- [NOTE] =========== Alternatively, to script this process or set the password on a different machine from the one you're logged into, you can use the `--stdin` option for `passwd`: ---- -[root@pcmk-1 ~]# ssh pcmk-2 -- 'echo redhat1 | passwd --stdin hacluster' +[root@pcmk-1 ~]# ssh pcmk-2 -- 'echo mysupersecretpassword | passwd --stdin hacluster' ---- =========== === Configure Corosync === On either node, use `pcs cluster auth` to authenticate as the *hacluster* user: ---- [root@pcmk-1 ~]# pcs cluster auth pcmk-1 pcmk-2 Username: hacluster Password: pcmk-1: Authorized pcmk-2: Authorized ---- Next, use `pcs cluster setup` on the same node to generate and synchronize the corosync configuration: ---- [root@pcmk-1 ~]# pcs cluster setup --name mycluster pcmk-1 pcmk-2 Shutting down pacemaker/corosync services... Redirecting to /bin/systemctl stop pacemaker.service Redirecting to /bin/systemctl stop corosync.service Killing any remaining services... Removing all cluster configuration files... pcmk-1: Succeeded pcmk-2: Succeeded ---- If you received an authorization error for either of those commands, make sure you configured the *hacluster* user account on each node with the same password. [NOTE] ====== -Early versions of pcs required that `--name` be omitted from the above command. - If you are not using `pcs` for cluster administration, follow whatever procedures are appropriate for your tools to create a corosync.conf and copy it to all nodes. The `pcs` command will configure corosync to use UDP unicast transport; if you choose to use multicast instead, choose a multicast address carefully. 
-footnote:[For some subtle issues, see the now-defunct http://web.archive.org/web/20101211210054/http://29west.com/docs/THPM/multicast-address-assignment.html or the more detailed treatment in -http://www.cisco.com/c/dam/en/us/support/docs/ip/ip-multicast/ipmlt_wp.pdf[Cisco's -Guidelines for Enterprise IP Multicast Address Allocation] paper.] +footnote:[For some subtle issues, see +http://web.archive.org/web/20101211210054/http://29west.com/docs/THPM/multicast-address-assignment.html[Topics +in High-Performance Messaging: Multicast Address Assignment] or the more detailed treatment in +https://www.cisco.com/c/dam/en/us/support/docs/ip/ip-multicast/ipmlt_wp.pdf[Cisco's +Guidelines for Enterprise IP Multicast Address Allocation].] ====== The final /etc/corosync.conf configuration on each node should look something like the sample in <>. diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt b/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt index 7ed4f808b7..d8582b77e6 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Intro.txt @@ -1,26 +1,27 @@ = Read-Me-First = == The Scope of this Document == Computer clusters can be used to provide highly available services or resources. The redundancy of multiple machines is used to guard against failures of many types. This document will walk through the installation and setup of simple clusters using the &DISTRO; distribution, version &DISTRO_VERSION;. The clusters described here will use Pacemaker and Corosync to provide resource management and messaging. Required packages and modifications to their configuration files are described along with the use of the Pacemaker command line tool for generating the XML used for cluster control. Pacemaker is a central component and provides the resource management required in these systems. This management includes detecting and recovering from the failure of various nodes, resources and services under its control. 
-When more in depth information is required and for real world usage, -please refer to the http://www.clusterlabs.org/doc/[Pacemaker Explained] manual. +When more in-depth information is required, and for real-world usage, +please refer to the +https://www.clusterlabs.org/pacemaker/doc/[Pacemaker Explained] manual. include::../../shared/en-US/pacemaker-intro.txt[] diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt b/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt index 270c7b30f2..d756fa2d63 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Shared-Storage.txt @@ -1,564 +1,529 @@ = Replicate Storage Using DRBD = Even if you're serving up static websites, having to manually synchronize the contents of that website to all the machines in the cluster is not ideal. For dynamic websites, such as a wiki, it's not even an option. Not everyone can afford network-attached storage, but somehow the data needs to be kept in sync. Enter DRBD, which can be thought of as network-based RAID-1. footnote:[See http://www.drbd.org/ for details.] == Install the DRBD Packages == DRBD itself is included in the upstream kernel,footnote:[Since version 2.6.33] but we do need some utilities to use it effectively. CentOS does not ship these utilities, so we need to enable a third-party repository to get them. Supported packages for many OSes are available from DRBD's maker http://www.linbit.com/[LINBIT], but here we'll use the free http://elrepo.org/[ELRepo] repository. 
On both nodes, import the ELRepo package signing key, and enable the repository: ---- # rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org -# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm +# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm ---- Now, we can install the DRBD kernel module and utilities: ---- # yum install -y kmod-drbd84 drbd84-utils ---- -[IMPORTANT] -=========== -The version of drbd84-utils shipped with CentOS 7.1 has a bug in the -Pacemaker integration script. Until a fix is packaged, download the -affected script directly from the upstream, on both nodes: ----- -# curl -o /usr/lib/ocf/resource.d/linbit/drbd 'http://git.linbit.com/gitweb.cgi?p=drbd-utils.git;a=blob_plain;f=scripts/drbd.ocf;h=cf6b966341377a993d1bf5f585a5b9fe72eaa5f2;hb=c11ba026bbbbc647b8112543df142f2185cb4b4b' ----- -This is a temporary fix that will be overwritten if the package -is upgraded. -=========== - DRBD will not be able to run under the default SELinux security policies. If you are familiar with SELinux, you can modify the policies in a more fine-grained manner, but here we will simply exempt DRBD processes from SELinux control: ---- # semanage permissive -a drbd_t ---- We will configure DRBD to use port 7789, so allow that port from each host to the other: ---- [root@pcmk-1 ~]# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.122.102" port port="7789" protocol="tcp" accept' success [root@pcmk-1 ~]# firewall-cmd --reload success ---- ---- [root@pcmk-2 ~]# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.122.101" port port="7789" protocol="tcp" accept' success [root@pcmk-2 ~]# firewall-cmd --reload success ---- [NOTE] ====== In this example, we have only two nodes, and all network traffic is on the same LAN. 
In production, it is recommended to use a dedicated, isolated network for cluster-related traffic, so the firewall configuration would likely be different; one approach would be to add the dedicated network interfaces to the trusted zone. ====== == Allocate a Disk Volume for DRBD == DRBD will need its own block device on each node. This can be a physical disk partition or logical volume, of whatever size you need for your data. For this document, we will use a 1GiB logical volume, which is more than sufficient for a single HTML file and (later) GFS2 metadata. ---- [root@pcmk-1 ~]# vgdisplay | grep -e Name -e Free VG Name centos_pcmk-1 Free PE / Size 382 / 1.49 GiB [root@pcmk-1 ~]# lvcreate --name drbd-demo --size 1G centos_pcmk-1 Logical volume "drbd-demo" created [root@pcmk-1 ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert drbd-demo centos_pcmk-1 -wi-a----- 1.00g root centos_pcmk-1 -wi-ao---- 5.00g swap centos_pcmk-1 -wi-ao---- 1.00g ---- Repeat for the second node, making sure to use the same size: ---- [root@pcmk-1 ~]# ssh pcmk-2 -- lvcreate --name drbd-demo --size 1G centos_pcmk-2 Logical volume "drbd-demo" created ---- == Configure DRBD == There is no series of commands for building a DRBD configuration, so simply run this on both nodes to use this sample configuration: ---- # cat <<END >/etc/drbd.d/wwwdata.res resource wwwdata { protocol C; meta-disk internal; device /dev/drbd1; syncer { verify-alg sha1; } net { allow-two-primaries; } on pcmk-1 { disk /dev/centos_pcmk-1/drbd-demo; address 192.168.122.101:7789; } on pcmk-2 { disk /dev/centos_pcmk-2/drbd-demo; address 192.168.122.102:7789; } } END ---- [IMPORTANT] ========= Edit the file to use the hostnames, IP addresses and logical volume paths of your nodes if they differ from the ones used in this guide. 
========= [NOTE] ======= Detailed information on the directives used in this configuration (and other alternatives) is available at http://www.drbd.org/users-guide/ch-configure.html The *allow-two-primaries* option would not normally be used in an active/passive cluster. We are adding it here for the convenience of changing to an active/active cluster later. ======= == Initialize DRBD == With the configuration in place, we can now get DRBD running. These commands create the local metadata for the DRBD resource, ensure the DRBD kernel module is loaded, and bring up the DRBD resource. Run them on one node: ---- [root@pcmk-1 ~]# drbdadm create-md wwwdata initializing activity log NOT initializing bitmap Writing meta data... New drbd meta data block successfully created. [root@pcmk-1 ~]# modprobe drbd [root@pcmk-1 ~]# drbdadm up wwwdata ---- We can confirm DRBD's status on this node: ---- [root@pcmk-1 ~]# cat /proc/drbd version: 8.4.6 (api:1/proto:86-101) GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52 1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----s ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1048508 ---- Because we have not yet initialized the data, this node's data is marked as *Inconsistent*. Because we have not yet initialized the second node, the local state is *WFConnection* (waiting for connection), and the partner node's status is marked as *Unknown*. Now, repeat the above commands on the second node. 
This time, when we check the status, it shows: ---- [root@pcmk-2 ~]# cat /proc/drbd version: 8.4.6 (api:1/proto:86-101) GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1048508 ---- You can see the state has changed to *Connected*, meaning the two DRBD nodes are communicating properly, and both nodes are in *Secondary* role with *Inconsistent* data. To make the data consistent, we need to tell DRBD which node should be considered to have the correct data. In this case, since we are creating a new resource, both have garbage, so we'll just pick pcmk-1 and run this command on it: ---- [root@pcmk-1 ~]# drbdadm primary --force wwwdata ---- [NOTE] ====== -If you are using an older version of DRBD, the required syntax may be different. +If you are using a different version of DRBD, the required syntax may be different. See the documentation for your version for how to perform these commands. ====== If we check the status immediately, we'll see something like this: ---- [root@pcmk-1 ~]# cat /proc/drbd version: 8.4.6 (api:1/proto:86-101) GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----- ns:2872 nr:0 dw:0 dr:3784 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1045636 [>....................] sync'ed: 0.4% (1045636/1048508)K finish: 0:10:53 speed: 1,436 (1,436) K/sec ---- We can see that this node has the *Primary* role, the partner node has the *Secondary* role, this node's data is now considered *UpToDate*, the partner node's data is still *Inconsistent*, and a progress bar shows how far along the partner node is in synchronizing the data. 
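Rather than re-reading `/proc/drbd` by hand, the wait for synchronization can be scripted. The following is a minimal sketch, assuming the DRBD 8.4 status layout shown in this guide, where every device line carries a `ds:` field. The function name `drbd_in_sync` is a hypothetical helper, not part of drbd-utils.

```shell
#!/bin/sh
# Sketch only: succeed when every DRBD device listed in a status file
# reports ds:UpToDate/UpToDate (local and peer disks both current).
# Assumption: the file uses the /proc/drbd layout of DRBD 8.4.
# drbd_in_sync is a hypothetical helper name.
drbd_in_sync() {
    status_file="${1:-/proc/drbd}"
    awk '
        / ds:/ {
            seen = 1                                  # found a device line
            if ($0 !~ /ds:UpToDate\/UpToDate/) bad = 1  # still syncing
        }
        END { exit !(seen && !bad) }
    ' "$status_file"
}

# Example polling loop for a real node (left commented out on purpose):
# until drbd_in_sync; do sleep 5; done
```

The check fails when no device line is present at all, so an unloaded module or an empty status file is not mistaken for a finished sync.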
After a while, the sync should finish, and you'll see something like: ---- [root@pcmk-1 ~]# cat /proc/drbd version: 8.4.6 (api:1/proto:86-101) GIT-hash: 833d830e0152d1e457fa7856e71e11248ccf3f70 build by phil@Build64R7, 2015-04-10 05:13:52 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----- ns:1048508 nr:0 dw:0 dr:1049420 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 ---- Both sets of data are now *UpToDate*, and we can proceed to creating and populating a filesystem for our WebSite resource's documents. == Populate the DRBD Disk == On the node with the primary role (pcmk-1 in this example), create a filesystem on the DRBD device: ---- [root@pcmk-1 ~]# mkfs.xfs /dev/drbd1 meta-data=/dev/drbd1 isize=256 agcount=4, agsize=65532 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 data = bsize=4096 blocks=262127, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=0 log =internal log bsize=4096 blocks=853, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 ---- [NOTE] ==== In this example, we create an xfs filesystem with no special options. In a production environment, you should choose a filesystem type and options that are suitable for your application. ==== Mount the newly created filesystem, populate it with our web document, give it the same SELinux policy as the web document root, then unmount it (the cluster will handle mounting and unmounting it later): ---- [root@pcmk-1 ~]# mount /dev/drbd1 /mnt [root@pcmk-1 ~]# cat <<-END >/mnt/index.html My Test Site - DRBD END [root@pcmk-1 ~]# chcon -R --reference=/var/www/html /mnt [root@pcmk-1 ~]# umount /dev/drbd1 ---- == Configure the Cluster for the DRBD device == One handy feature `pcs` has is the ability to queue up several changes into a file and commit those changes all at once. To do this, start by populating the file with the current raw XML config from the CIB. 
---- [root@pcmk-1 ~]# pcs cluster cib drbd_cfg ---- Using the `pcs -f` option, make changes to the configuration saved in the +drbd_cfg+ file. These changes will not be seen by the cluster until the +drbd_cfg+ file is pushed into the live cluster's CIB later. Here, we create a cluster resource for the DRBD device, and an additional _clone_ resource to allow the resource to run on both nodes at the same time. ---- [root@pcmk-1 ~]# pcs -f drbd_cfg resource create WebData ocf:linbit:drbd \ drbd_resource=wwwdata op monitor interval=60s [root@pcmk-1 ~]# pcs -f drbd_cfg resource master WebDataClone WebData \ master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \ notify=true [root@pcmk-1 ~]# pcs -f drbd_cfg resource show ClusterIP (ocf::heartbeat:IPaddr2): Started WebSite (ocf::heartbeat:apache): Started Master/Slave Set: WebDataClone [WebData] Stopped: [ pcmk-1 pcmk-2 ] ---- After you are satisfied with all the changes, you can commit them all at once by pushing the drbd_cfg file into the live CIB. ---- [root@pcmk-1 ~]# pcs cluster cib-push drbd_cfg CIB updated ---- -[NOTE] -==== -Early versions of `pcs` required `push cib` in place of `cib-push` above. 
-==== - Let's see what the cluster did with the new configuration: ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 09:29:41 2015 -Last change: Fri Aug 14 09:29:25 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -4 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 09:29:41 2018 +Last change: Fri Jan 12 09:29:25 2018 +2 nodes configured +4 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 WebSite (ocf::heartbeat:apache): Started pcmk-1 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 ] Slaves: [ pcmk-2 ] -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- We can see that *WebDataClone* (our DRBD device) is running as master (DRBD's primary role) on *pcmk-1* and slave (DRBD's secondary role) on *pcmk-2*. [IMPORTANT] ==== The resource agent should load the DRBD module when needed if it's not already loaded. If that does not happen, configure your operating system to load the module at boot time. For &DISTRO; &DISTRO_VERSION;, you would run this on both nodes: ---- # echo drbd >/etc/modules-load.d/drbd.conf ---- ==== == Configure the Cluster for the Filesystem == Now that we have a working DRBD device, we need to mount its filesystem. In addition to defining the filesystem, we also need to tell the cluster where it can be located (only on the DRBD Primary) and when it is allowed to start (after the Primary was promoted). We are going to take a shortcut when creating the resource this time. Instead of explicitly saying we want the *ocf:heartbeat:Filesystem* script, we are only going to ask for *Filesystem*. 
We can do this because we know there is only one resource script named *Filesystem* available to pacemaker, and that pcs is smart enough to fill in the *ocf:heartbeat:* portion for us correctly in the configuration. If there were multiple *Filesystem* scripts from different OCF providers, we would need to specify the exact one we wanted. Once again, we will queue our changes to a file and then push the new configuration to the cluster as the final step. ---- [root@pcmk-1 ~]# pcs cluster cib fs_cfg [root@pcmk-1 ~]# pcs -f fs_cfg resource create WebFS Filesystem \ device="/dev/drbd1" directory="/var/www/html" fstype="xfs" [root@pcmk-1 ~]# pcs -f fs_cfg constraint colocation add WebFS with WebDataClone INFINITY with-rsc-role=Master [root@pcmk-1 ~]# pcs -f fs_cfg constraint order promote WebDataClone then start WebFS Adding WebDataClone WebFS (kind: Mandatory) (Options: first-action=promote then-action=start) ---- We also need to tell the cluster that Apache needs to run on the same machine as the filesystem and that it must be active before Apache can start. ---- [root@pcmk-1 ~]# pcs -f fs_cfg constraint colocation add WebSite with WebFS INFINITY [root@pcmk-1 ~]# pcs -f fs_cfg constraint order WebFS then WebSite Adding WebFS WebSite (kind: Mandatory) (Options: first-action=start then-action=start) ---- Review the updated configuration. 
---- [root@pcmk-1 ~]# pcs -f fs_cfg constraint Location Constraints: Ordering Constraints: start ClusterIP then start WebSite (kind:Mandatory) promote WebDataClone then start WebFS (kind:Mandatory) start WebFS then start WebSite (kind:Mandatory) Colocation Constraints: WebSite with ClusterIP (score:INFINITY) WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master) WebSite with WebFS (score:INFINITY) +Ticket Constraints: ---- ---- [root@pcmk-1 ~]# pcs -f fs_cfg resource show ClusterIP (ocf::heartbeat:IPaddr2): Started WebSite (ocf::heartbeat:apache): Started Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 ] Slaves: [ pcmk-2 ] WebFS (ocf::heartbeat:Filesystem): Stopped ---- After reviewing the new configuration, upload it and watch the cluster put it into effect. ---- [root@pcmk-1 ~]# pcs cluster cib-push fs_cfg [root@pcmk-1 ~]# pcs status -Last updated: Fri Aug 14 09:34:11 2015 -Last change: Fri Aug 14 09:34:09 2015 +Cluster name: mycluster Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -5 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 09:34:11 2018 +Last change: Fri Jan 12 09:34:09 2018 +2 nodes configured +5 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 WebSite (ocf::heartbeat:apache): Started pcmk-1 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 ] Slaves: [ pcmk-2 ] WebFS (ocf::heartbeat:Filesystem): Started pcmk-1 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- == Test Cluster Failover == Previously, we used `pcs cluster stop pcmk-1` to stop all cluster services on *pcmk-1*, failing over the cluster resources, but there is another way to safely simulate node failure. We can put the node into _standby mode_. 
Nodes in this state continue to run corosync and pacemaker but are not allowed to run resources. Any resources found active there will be moved elsewhere. This feature can be particularly useful when performing system administration tasks such as updating packages used by cluster resources. Put the active node into standby mode, and observe the cluster move all the resources to the other node. The node's status will change to indicate that it can no longer host resources. ---- [root@pcmk-1 ~]# pcs cluster standby pcmk-1 [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 09:36:49 2015 -Last change: Fri Aug 14 09:36:43 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -5 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 09:36:49 2018 +Last change: Fri Jan 12 09:36:43 2018 +2 nodes configured +5 resources configured Node pcmk-1 (1): standby Online: [ pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-2 ] Stopped: [ pcmk-1 ] WebFS (ocf::heartbeat:Filesystem): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- Once we've done everything we needed to on pcmk-1 (in this case nothing, we just wanted to see the resources move), we can allow the node to be a full cluster member again. 
---- [root@pcmk-1 ~]# pcs cluster unstandby pcmk-1 [root@pcmk-1 ~]# pcs status Cluster name: mycluster -Last updated: Fri Aug 14 09:38:02 2015 -Last change: Fri Aug 14 09:37:56 2015 Stack: corosync -Current DC: pcmk-1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -5 Resources configured +Current DC: pcmk-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 09:38:02 2018 +Last change: Fri Jan 12 09:37:56 2018 +2 nodes configured +5 resources configured Online: [ pcmk-1 pcmk-2 ] Full list of resources: ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-2 WebSite (ocf::heartbeat:apache): Started pcmk-2 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-2 ] Slaves: [ pcmk-1 ] WebFS (ocf::heartbeat:Filesystem): Started pcmk-2 -PCSD Status: - pcmk-1: Online - pcmk-2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- Notice that *pcmk-1* is back to the *Online* state, and that the cluster resources stay where they are due to our resource stickiness settings configured earlier. diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt b/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt index 2f85501b14..3597cc4ffb 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Stonith.txt @@ -1,153 +1,165 @@ = Configure STONITH = == What is STONITH? == STONITH (Shoot The Other Node In The Head aka. fencing) protects your data from being corrupted by rogue nodes or unintended concurrent access. Just because a node is unresponsive doesn't mean it has stopped accessing your data. The only way to be 100% sure that your data is safe, is to use STONITH to ensure that the node is truly offline before allowing the data to be accessed from another node. STONITH also has a role to play in the event that a clustered service cannot be stopped. 
In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the service elsewhere. == Choose a STONITH Device == It is crucial that your STONITH device can allow the cluster to differentiate between a node failure and a network failure. A common mistake people make when choosing a STONITH device is to use a remote power switch (such as many on-board IPMI controllers) that shares power with the node it controls. If the power fails in such a case, the cluster cannot be sure whether the node is really offline, or active and suffering from a network fault, so the cluster will stop all resources to avoid a possible split-brain situation. Likewise, any device that relies on the machine being active (such as SSH-based "devices" sometimes used during testing) is inappropriate. == Configure the Cluster for STONITH == . Install the STONITH agent(s). To see what packages are available, run `yum search fence-`. Be sure to install the package(s) on all cluster nodes. . Configure the STONITH device itself to be able to fence your nodes and accept fencing requests. This includes any necessary configuration on the device and on the nodes, and any firewall or SELinux changes needed. Test the communication between the device and your nodes. . Find the correct STONITH agent script: `pcs stonith list` . Find the parameters associated with the device: +pcs stonith describe pass:[agent_name]+ . Create a local copy of the CIB: `pcs cluster cib stonith_cfg` . Create the fencing resource: +pcs -f stonith_cfg stonith create pass:[stonith_id stonith_device_type [stonith_device_options]]+ + Any flags that do not take arguments, such as +--ssl+, should be passed as +ssl=1+. . Enable STONITH in the cluster: `pcs -f stonith_cfg property set stonith-enabled=true` . If the device does not know how to fence nodes based on their uname, you may also need to set the special *pcmk_host_map* parameter. See `man stonithd` for details. . 
If the device does not support the *list* command, you may also need to set the special *pcmk_host_list* and/or *pcmk_host_check* parameters. See `man stonithd` for details. . If the device does not expect the victim to be specified with the *port* parameter, you may also need to set the special *pcmk_host_argument* parameter. See `man stonithd` for details. . Commit the new configuration: `pcs cluster cib-push stonith_cfg` . Once the STONITH resource is running, test it (you might want to stop the cluster on that machine first): +stonith_admin --reboot pass:[nodename]+ == Example == For this example, assume we have a chassis containing four nodes and an IPMI device active on 10.0.0.1. Following the steps above would go something like this: Step 1: Install the *fence-agents-ipmilan* package on both nodes. Step 2: Configure the IP address, authentication credentials, etc. in the IPMI device itself. Step 3: Choose the *fence_ipmilan* STONITH agent. Step 4: Obtain the agent's possible parameters: ---- [root@pcmk-1 ~]# pcs stonith describe fence_ipmilan -Stonith options for: fence_ipmilan +fence_ipmilan - Fence agent for IPMI + +fence_ipmilan is an I/O Fencing agentwhich can be used with machines controlled by IPMI.This agent calls support software ipmitool (http://ipmitool.sf.net/). WARNING! This fence agent might report success before the node is powered off. You should use -m/method onoff if your fence device works correctly with that option. + +Stonith options: ipport: TCP/UDP port to use for connection with device + port: IP address or hostname of fencing device (together with --port-as-ip) inet6_only: Forces agent to use IPv6 addresses only - ipaddr (required): IP Address or Hostname + ipaddr: IP Address or Hostname passwd_script: Script to retrieve password method: Method to fence (onoff|cycle) inet4_only: Forces agent to use IPv4 addresses only passwd: Login password or passphrase lanplus: Use Lanplus to improve security of connection auth: IPMI Lan Auth type. 
+ action: Fencing Action WARNING: specifying 'action' is deprecated and not necessary with current Pacemaker versions. cipher: Ciphersuite to use (same as ipmitool -C parameter) + target: Bridge IPMI requests to the remote target address privlvl: Privilege level on IPMI device - action (required): Fencing Action + timeout: Timeout (sec) for IPMI operation login: Login Name - verbose: Verbose mode - debug: Write debug information to given file - version: Display version information and exit - help: Display help and exit power_wait: Wait X seconds after issuing ON/OFF login_timeout: Wait X seconds for cmd prompt after login - power_timeout: Test X seconds for status change after ON/OFF delay: Wait X seconds before fencing is started + power_timeout: Test X seconds for status change after ON/OFF ipmitool_path: Path to ipmitool binary shell_timeout: Wait X seconds for cmd prompt after issuing command + port_as_ip: Make "port/plug" to be an alias to IP address retry_on: Count of attempts to retry power on sudo: Use sudo (without password) when calling 3rd party sotfware. - stonith-timeout: How long to wait for the STONITH action (reboot, on, off) to complete per a stonith device. priority: The priority of the stonith resource. Devices are tried in order of highest priority to lowest. - pcmk_host_map: A mapping of host names to ports numbers for devices that do not support host names. + pcmk_host_map: A mapping of host names to ports numbers for devices that do not support host names. Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports + 2 and 3 for node2 pcmk_host_list: A list of machines controlled by this device (Optional unless pcmk_host_check=static-list). - pcmk_host_check: How to determine which machines are controlled by the device. + pcmk_host_check: How to determine which machines are controlled by the device. 
Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), + none (assume every device can fence every machine) + pcmk_delay_max: Enable random delay for stonith actions and specify the maximum of random delay This prevents double fencing when using slow devices such as sbd. Use this to + enable random delay for stonith actions and specify the maximum of random delay. + pcmk_action_limit: The maximum number of actions can be performed in parallel on this device Pengine property concurrent-fencing=true needs to be configured first. Then use this + to specify the maximum number of actions can be performed in parallel on this device. -1 is unlimited. + +Default operations: + monitor: interval=60s ---- Step 5: `pcs cluster cib stonith_cfg` Step 6: Here are example parameters for creating our STONITH resource: ---- [root@pcmk-1 ~]# pcs -f stonith_cfg stonith create ipmi-fencing fence_ipmilan \ pcmk_host_list="pcmk-1 pcmk-2" ipaddr=10.0.0.1 login=testuser \ passwd=acd123 op monitor interval=60s [root@pcmk-1 ~]# pcs -f stonith_cfg stonith ipmi-fencing (stonith:fence_ipmilan): Stopped ---- Steps 7-10: Enable STONITH in the cluster: ---- [root@pcmk-1 ~]# pcs -f stonith_cfg property set stonith-enabled=true [root@pcmk-1 ~]# pcs -f stonith_cfg property Cluster Properties: cluster-infrastructure: corosync cluster-name: mycluster - dc-version: 1.1.12-a14efad + dc-version: 1.1.16-12.el7_4.5-94ff4df have-watchdog: false stonith-enabled: true ---- Step 11: `pcs cluster cib-push stonith_cfg` Step 12: Test: ---- [root@pcmk-1 ~]# pcs cluster stop pcmk-2 [root@pcmk-1 ~]# stonith_admin --reboot pcmk-2 ---- After a successful test, log in to any rebooted nodes, and start the cluster (with `pcs cluster start`). 
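Steps 5 through 11 of this example can be collected into one script. The following is only a sketch: it reuses the example device address, login and password from this chapter, and the names `run`, `configure_stonith` and `DRY_RUN` are hypothetical helpers introduced here for illustration. With `DRY_RUN` set, the commands are printed instead of executed, which allows a review before anything touches the cluster.

```shell
#!/bin/sh
# Sketch only: replays Steps 5-11 of the example with the pcs commands
# shown in this chapter. run, configure_stonith and DRY_RUN are
# hypothetical names; the device parameters are the example values.

run() {
    # With DRY_RUN set, print the command instead of executing it.
    # Note: the printed form loses the original quoting.
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

configure_stonith() {
    cfg="${1:-stonith_cfg}"
    # Step 5: copy the live CIB into a local file.
    run pcs cluster cib "$cfg" &&
    # Step 6: create the fencing resource in the local copy.
    run pcs -f "$cfg" stonith create ipmi-fencing fence_ipmilan \
        pcmk_host_list="pcmk-1 pcmk-2" ipaddr=10.0.0.1 login=testuser \
        passwd=acd123 op monitor interval=60s &&
    # Steps 7-10: enable STONITH in the local copy.
    run pcs -f "$cfg" property set stonith-enabled=true &&
    # Step 11: push the local copy into the live cluster.
    run pcs cluster cib-push "$cfg"
}

# Example: preview the commands without running them.
# DRY_RUN=1; export DRY_RUN; configure_stonith
```

Chaining the steps with `&&` stops the sequence at the first failure, so a partially built configuration file is never pushed to the live cluster.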
diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt b/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt index f16858d348..fda3476caa 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Tools.txt @@ -1,106 +1,131 @@ = Pacemaker Tools = == Simplify administration using a cluster shell == In the dark past, configuring Pacemaker required the administrator to read and write XML. In true UNIX style, there were also a number of different commands that specialized in different aspects of querying and updating the cluster. All of that has been greatly simplified with the creation of unified command-line shells (and GUIs) that hide all the messy XML scaffolding. These shells take all the individual aspects required for managing and configuring a cluster, and pack them into one simple-to-use command line tool. They even allow you to queue up several changes and commit them all at once. Two popular command-line shells are `pcs` and `crmsh`. This edition of Clusters from Scratch is based on `pcs`. [NOTE] =========== The two shells share many concepts, but their scope, layout and syntax differ, so make sure you read the version of this guide that corresponds to the software installed on your system. =========== == Explore pcs == Start by taking some time to familiarize yourself with what `pcs` can do. ---- [root@pcmk-1 ~]# pcs + Usage: pcs [-f file] [-h] [commands]... Control and configure pacemaker and corosync. Options: - -h, --help Display usage and exit - -f file Perform actions on file instead of active CIB - --debug Print all network traffic and external commands run - --version Print pcs version information + -h, --help Display usage and exit. + -f file Perform actions on file instead of active CIB. + --debug Print all network traffic and external commands run. + --version Print pcs version information. + --request-timeout Timeout for each outgoing request to another node in + seconds. Default is 60s. 
Commands: - cluster Configure cluster options and nodes - resource Manage cluster resources - stonith Configure fence devices - constraint Set resource constraints - property Set pacemaker properties - acl Set pacemaker access control lists - status View cluster status - config View and manage cluster configuration + cluster Configure cluster options and nodes. + resource Manage cluster resources. + stonith Manage fence devices. + constraint Manage resource constraints. + property Manage pacemaker properties. + acl Manage pacemaker access control lists. + qdevice Manage quorum device provider on the local host. + quorum Manage cluster quorum settings. + booth Manage booth (cluster ticket manager). + status View cluster status. + config View and manage cluster configuration. + pcsd Manage pcs daemon. + node Manage cluster nodes. + alert Manage pacemaker alerts. + ---- As you can see, the different aspects of cluster management are separated -into categories: resource, cluster, stonith, property, constraint, -and status. To discover the functionality available in each of these +into categories. To discover the functionality available in each of these categories, one can issue the command +pcs pass:[category] help+. Below is an example of all the options available under the status category. ---- [root@pcmk-1 ~]# pcs status help + Usage: pcs status [commands]... View current cluster and resource status Commands: - [status] [--full] + [status] [--full | --hide-inactive] View all information about the cluster and resources (--full provides - more details) + more details, --hide-inactive hides inactive resources). - resources - View current status of cluster resources + resources [ | --full | --groups | --hide-inactive] + Show all currently configured resources or if a resource is specified + show the options for the configured resource. If --full is specified, + all configured resource options will be displayed. 
If --groups is + specified, only show groups (and their resources). If --hide-inactive + is specified, only show active resources. groups - View currently configured groups and their resources + View currently configured groups and their resources. cluster - View current cluster status + View current cluster status. corosync - View current membership information as seen by corosync + View current membership information as seen by corosync. - nodes [corosync|both|config] + quorum + View current quorum status. + + qdevice [--full] [] + Show runtime status of specified model of quorum device provider. Using + --full will give more detailed output. If is specified, + only information about the specified cluster will be displayed. + + nodes [corosync | both | config] View current status of nodes from pacemaker. If 'corosync' is - specified, print nodes currently configured in corosync, if 'both' - is specified, print nodes from both corosync & pacemaker. If 'config' - is specified, print nodes from corosync & pacemaker configuration. + specified, view current status of nodes from corosync instead. If + 'both' is specified, view current status of nodes from both corosync & + pacemaker. If 'config' is specified, print nodes from corosync & + pacemaker configuration. - pcsd ... - Show the current status of pcsd on the specified nodes + pcsd []... + Show current status of pcsd on nodes specified, or on all nodes + configured in the local cluster if no nodes are specified. xml - View xml version of status (output from crm_mon -r -1 -X) + View xml version of status (output from crm_mon -r -1 -X). 
+ ---- Additionally, if you are interested in the version and supported cluster stack(s) available with your Pacemaker installation, run: ---- [root@pcmk-1 ~]# pacemakerd --features -Pacemaker 1.1.12 (Build: a14efad) - Supporting v3.0.9: generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc upstart systemd nagios corosync-native atomic-attrd acls +Pacemaker 1.1.16-12.el7_4.5 (Build: 94ff4df) + Supporting v3.0.12: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls ---- diff --git a/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt b/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt index 217a5181e3..784a3b2723 100644 --- a/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt +++ b/doc/Clusters_from_Scratch/en-US/Ch-Verification.txt @@ -1,153 +1,147 @@ = Start and Verify Cluster = == Start the Cluster == Now that corosync is configured, it is time to start the cluster. The command below will start corosync and pacemaker on both nodes in the cluster. If you are issuing the start command from a different node than the one you ran the `pcs cluster auth` command on earlier, you must authenticate on the current node you are logged into before you will be allowed to start the cluster. ---- [root@pcmk-1 ~]# pcs cluster start --all pcmk-1: Starting Cluster... pcmk-2: Starting Cluster... ---- [NOTE] ====== An alternative to using the `pcs cluster start --all` command is to issue either of the below command sequences on each node in the cluster separately: ---- # pcs cluster start Starting Cluster... ---- or ---- # systemctl start corosync.service # systemctl start pacemaker.service ---- ====== [IMPORTANT] ==== In this example, we are not enabling the corosync and pacemaker services to start at boot. If a cluster node fails or is rebooted, you will need to run +pcs cluster start pass:[nodename]+ (or `--all`) to start the cluster on it. 
While you could enable the services to start at boot, requiring a manual start of cluster services gives you the opportunity to do a post-mortem investigation of a node failure before returning it to the cluster. ==== == Verify Corosync Installation == First, use `corosync-cfgtool` to check whether cluster communication is happy: ---- [root@pcmk-1 ~]# corosync-cfgtool -s Printing ring status. Local node ID 1 RING ID 0 id = 192.168.122.101 status = ring 0 active with no faults ---- We can see here that everything appears normal with our fixed IP address (not a 127.0.0.x loopback address) listed as the *id*, and *no faults* for the status. If you see something different, you might want to start by checking the node's network, firewall and selinux configurations. Next, check the membership and quorum APIs: ---- [root@pcmk-1 ~]# corosync-cmapctl | grep members runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0 runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.122.101) runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1 runtime.totem.pg.mrp.srp.members.1.status (str) = joined runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0 runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.122.102) runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2 runtime.totem.pg.mrp.srp.members.2.status (str) = joined [root@pcmk-1 ~]# pcs status corosync Membership information -------------------------- Nodeid Votes Name 1 1 pcmk-1 (local) 2 1 pcmk-2 ---- You should see both nodes have joined the cluster. == Verify Pacemaker Installation == Now that we have confirmed that Corosync is functional, we can check the rest of the stack. Pacemaker has already been started, so verify the necessary processes are running: ---- [root@pcmk-1 ~]# ps axf PID TTY STAT TIME COMMAND 2 ? S 0:00 [kthreadd] ...lots of processes... 1362 ? Ssl 0:35 corosync 1379 ? Ss 0:00 /usr/sbin/pacemakerd -f 1380 ? Ss 0:00 \_ /usr/libexec/pacemaker/cib 1381 ? 
Ss 0:00 \_ /usr/libexec/pacemaker/stonithd 1382 ? Ss 0:00 \_ /usr/libexec/pacemaker/lrmd 1383 ? Ss 0:00 \_ /usr/libexec/pacemaker/attrd 1384 ? Ss 0:00 \_ /usr/libexec/pacemaker/pengine 1385 ? Ss 0:00 \_ /usr/libexec/pacemaker/crmd ---- If that looks OK, check the `pcs status` output: ---- [root@pcmk-1 ~]# pcs status Cluster name: mycluster WARNING: no stonith devices and stonith-enabled is not false -Last updated: Tue Dec 16 16:15:29 2014 -Last change: Tue Dec 16 15:49:47 2014 Stack: corosync -Current DC: pcmk-2 (2) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -0 Resources configured +Current DC: pcmk-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 16:15:29 2018 +Last change: Fri Jan 12 15:49:47 2018 +2 nodes configured +0 resources configured Online: [ pcmk-1 pcmk-2 ] -Full list of resources: - - -PCSD Status: - pcmk-1: Online - pcmk-2: Online +No active resources Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- Finally, ensure there are no startup errors (aside from messages relating to not having STONITH configured, which are OK at this point): ---- [root@pcmk-1 ~]# journalctl | grep -i error ---- [NOTE] ====== Other operating systems may report startup errors in other locations, for example +/var/log/messages+. ====== Repeat these checks on the other node. The results should be the same. 
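The membership check above can also be done mechanically. The sketch below parses text in the format printed by `corosync-cmapctl | grep members` in this chapter and reports whether every listed member is `joined`. The function names are hypothetical, and the regular expression assumes the key layout shown in the sample output; other Corosync versions may print different keys.

```python
import re

# Matches entries such as
#   runtime.totem.pg.mrp.srp.members.1.status (str) = joined
MEMBER_STATUS = re.compile(r"members\.(\d+)\.status \(str\) = (\w+)")


def member_status(cmapctl_output):
    """Map each corosync node ID to its reported membership status."""
    return dict((int(node_id), status)
                for node_id, status in MEMBER_STATUS.findall(cmapctl_output))


def all_joined(cmapctl_output, expected_nodes):
    """True only if expected_nodes members are listed and all have joined."""
    statuses = member_status(cmapctl_output)
    return (len(statuses) == expected_nodes
            and all(status == "joined" for status in statuses.values()))


# Sample text copied from the output shown in this chapter.
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.122.101)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.122.102)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
"""

if __name__ == "__main__":
    print(member_status(SAMPLE))
    print(all_joined(SAMPLE, expected_nodes=2))
```

Passing the expected node count catches the case where a node is missing from the list entirely, which a plain search for `joined` would not reveal.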
diff --git a/doc/Clusters_from_Scratch/en-US/Clusters_from_Scratch.ent b/doc/Clusters_from_Scratch/en-US/Clusters_from_Scratch.ent index 626f4582c7..85bc1dba45 100644 --- a/doc/Clusters_from_Scratch/en-US/Clusters_from_Scratch.ent +++ b/doc/Clusters_from_Scratch/en-US/Clusters_from_Scratch.ent @@ -1,6 +1,6 @@ - + - + diff --git a/doc/Clusters_from_Scratch/en-US/Revision_History.xml b/doc/Clusters_from_Scratch/en-US/Revision_History.xml index 5dae3c67ba..d0eb49728e 100644 --- a/doc/Clusters_from_Scratch/en-US/Revision_History.xml +++ b/doc/Clusters_from_Scratch/en-US/Revision_History.xml @@ -1,74 +1,79 @@ %BOOK_ENTITIES; ]> Revision History 1-0 Mon May 17 2010 AndrewBeekhofandrew@beekhof.net Import from Pages.app 2-0 Wed Sep 22 2010 RaoulScarazzinirasca@miamammausalinux.org Italian translation 3-0 Wed Feb 9 2011 AndrewBeekhofandrew@beekhof.net Updated for Fedora 13 4-0 Wed Oct 5 2011 AndrewBeekhofandrew@beekhof.net Update the GFS2 section to use CMAN 5-0 Fri Feb 10 2012 AndrewBeekhofandrew@beekhof.net Generate docbook content from asciidoc sources 6-0 Tues July 3 2012 AndrewBeekhofandrew@beekhof.net Updated for Fedora 17 7-0 Fri Sept 14 2012 DavidVosseldavidvossel@gmail.com Updated for pcs 8-0 Mon Jan 05 2015 KenGaillotkgaillot@redhat.com Updated for Fedora 21 8-1 Thu Jan 08 2015 KenGaillotkgaillot@redhat.com Minor corrections, plus use include file for intro 9-0 Fri Aug 14 2015 KenGaillotkgaillot@redhat.com Update for CentOS 7.1 and leaving firewalld/SELinux enabled + + 10-0 + Fri Jan 12 2018 + KenGaillotkgaillot@redhat.com + Update banner for Pacemaker 2.0 and content for CentOS 7.4 with Pacemaker 1.1.16 + - diff --git a/doc/Pacemaker_Development/en-US/Ch-Python.txt b/doc/Pacemaker_Development/en-US/Ch-Python.txt index dd8c72fabd..ed2a06dfb5 100644 --- a/doc/Pacemaker_Development/en-US/Ch-Python.txt +++ b/doc/Pacemaker_Development/en-US/Ch-Python.txt @@ -1,141 +1,141 @@ = Python Coding Guidelines = //// We prefer [[ch-NAME]], but older versions of asciidoc don't 
deal well with that construct for chapter headings //// anchor:ch-python-coding[Chapter 3, Python Coding Guidelines] [[s-python-boilerplate]] == Python Boilerplate == indexterm:[Python,boilerplate] indexterm:[licensing,Python boilerplate] Every Python file should start like this: ==== [source,Python] ---- [] """ """ # Pacemaker targets compatibility with Python 2.6+ and 3.2+ from __future__ import print_function, unicode_literals, absolute_import, division __copyright__ = "Copyright (C) Andrew Beekhof " __license__ = " WITHOUT ANY WARRANTY" ---- ==== If the file is meant to be directly executed, the first line (++) should be +#!/usr/bin/python+. If it is meant to be imported, omit this line. ++ is obviously a brief description of the file's purpose. The string may contain any other information typically used in a Python file https://www.python.org/dev/peps/pep-0257/[docstring]. The +import+ statement is discussed further in <>. ++ is the year the code was 'originally' created (it is the most important date for copyright purposes, as it establishes priority and the point from which expiration is calculated). If the code is modified in later years, add +-YYYY+ with the most recent year of modification. ++ should follow the policy set forth in the https://github.com/ClusterLabs/pacemaker/blob/master/COPYING[+COPYING+] file, generally one of "GNU General Public License version 2 or later (GPLv2+)" or "GNU Lesser General Public License version 2.1 or later (LGPLv2.1+)". == Python Compatibility == indexterm:[Python,2] indexterm:[Python,3] indexterm:[Python,versions] Pacemaker targets compatibility with Python 2.6 and later, and Python 3.2 and later. These versions have added features to be more compatible with each other, allowing us to support both the 2 and 3 series with the same code. It is a good idea to test any changes with both Python 2 and 3. 
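As a small illustration of what the `from __future__` line in the boilerplate buys, the sketch below should behave the same under the Python 2 and Python 3 series named above. The function names are hypothetical and exist only for this example; it is not taken from the Pacemaker code base.

```python
from __future__ import print_function, unicode_literals, absolute_import, division


def average(values):
    """Mean of a non-empty sequence; "/" is true division under the import."""
    return sum(values) / len(values)


def whole_chunks(total, size):
    """Number of complete chunks; "//" gives the integer floor."""
    return total // size


if __name__ == "__main__":
    # print() is a function here, and end=" " suppresses the newline.
    print("average:", end=" ")
    print(average([1, 2]))
    print("chunks:", whole_chunks(7, 2))
```

Without the `division` import, `average([1, 2])` would yield `1` on Python 2 and `1.5` on Python 3, which is the kind of silent difference the shared import line is meant to remove.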
[[s-python-future-imports]]
=== Python Future Imports ===

The future imports used in <<s-python-boilerplate>> mean:

* All print statements must use parentheses, and printing without a newline
  is accomplished with the +end=' '+ parameter rather than a trailing comma.
* All string literals will be treated as Unicode (the +u+ prefix is
  unnecessary, and must not be used, because it is not available in
  Python 3.2).
* Local modules must be imported using +from . import+ (rather than just
  +import+). To import one item from a local module, use
  +from .modulename import+ (rather than +from modulename import+).
* Division using +/+ will always return a floating-point result (use +//+
  if you want the integer floor instead).

=== Other Python Compatibility Requirements ===

* When specifying an exception variable, always use +as+ instead of a comma
  (e.g. +except Exception as e+ or +except (TypeError, IOError) as e+).
  Use +e.args+ to access the error arguments (instead of iterating over or
  subscripting +e+).
* Use +in+ (not +has_key()+) to determine if a dictionary has a particular
  key.
* Always use the I/O functions from the +io+ module rather than the native
  I/O functions (e.g. +io.open()+ rather than +open()+).
* When opening a file, always use the +t+ (text) or +b+ (binary) mode flag.
* When creating classes, always specify a parent class to ensure that it is
  a "new-style" class (e.g. +class Foo(object):+ rather than +class Foo:+).
* Be aware of the bytes type added in Python 3. Many places where strings are
  used in Python 2 use bytes or bytearrays in Python 3 (for example, the pipes
  used with +subprocess.Popen()+). Code should handle both possibilities.
* Be aware that the +items()+, +keys()+, and +values()+ methods of
  dictionaries return lists in Python 2 and views in Python 3. In many cases,
  no special handling is required, but if the code needs to use list methods
  on the result, cast the result to list first.
* Do not name variables +with+ or +as+.
* Do not raise or catch strings as exceptions (e.g. +raise "Bad thing"+). * Do not use the +cmp+ parameter of sorting functions (use +key+ instead, if needed) or the +$$__cmp__()$$+ method of classes (implement rich comparison methods such as +$$__lt__()$$+ instead, if needed). * Do not use the +buffer+ type. * Do not use features not available in all targeted Python versions. Common examples include: ** The +argparse+, +html+, +ipaddress+, +sysconfig+, and +UserDict+ modules ** The +collections.OrderedDict+ class ** The +subprocess.run()+ function ** The +subprocess.DEVNULL+ constant ** +subprocess+ module-specific exceptions ** Set literals (+{1, 2, 3}+) === Python Usages to Avoid === Avoid the following if possible, otherwise research the compatibility issues involved (hacky workarounds are often available): * long integers * octal integer literals * mixed binary and string data in one data file or variable * metaclasses * +locale.strcoll+ and +locale.strxfrm+ * the +configparser+ and +ConfigParser+ modules * importing compatibility modules such as +six+ (so we don't have to add them to Pacemaker's dependencies) == Formatting Python Code == indexterm:[Python,formatting] * Indentation must be 4 spaces, no tabs. * Do not leave trailing whitespace. * Lines should be no longer than 80 characters unless limiting line length significantly impacts readability. For Python, this limitation is flexible since breaking a line often impacts readability, but definitely keep it under 120 characters. * Where not conflicting with this style guide, it is recommended (but not - required) to follow https://www.python.org/dev/peps/pep-0008/:[PEP 8]. + required) to follow https://www.python.org/dev/peps/pep-0008/[PEP 8]. 
* It is recommended (but not required) to format Python code such that `pylint --disable=line-too-long,too-many-lines,too-many-instance-attributes,too-many-arguments,too-many-statements` produces minimal complaints (even better if you don't need to disable all those checks). diff --git a/doc/Pacemaker_Development/en-US/Pacemaker_Development.ent b/doc/Pacemaker_Development/en-US/Pacemaker_Development.ent index 8679f6ffb1..1a9959792b 100644 --- a/doc/Pacemaker_Development/en-US/Pacemaker_Development.ent +++ b/doc/Pacemaker_Development/en-US/Pacemaker_Development.ent @@ -1,4 +1,4 @@ - + diff --git a/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt b/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt index b1254e45de..1a6beb9af7 100644 --- a/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt +++ b/doc/Pacemaker_Explained/en-US/Ap-FAQ.txt @@ -1,72 +1,59 @@ [appendix] [[ap-faq]] == FAQ == [qanda] Why is the Project Called Pacemaker?:: indexterm:[Pacemaker] First of all, the reason it's not called the CRM is because of the abundance of terms footnote:[http://en.wikipedia.org/wiki/CRM] that are commonly abbreviated to those three letters. The Pacemaker name came from Kham, footnote:[http://khamsouk.souvanlasy.com/] a good friend of Pacemaker developer Andrew Beekhof's, and was originally used by a Java GUI that Beekhof was prototyping in early 2007. Alas, other commitments prevented the GUI from progressing much and, when it came time to choose a name for this project, Lars Marowsky-Bree suggested it was an even better fit for an independent CRM. The idea stems from the analogy between the role of this software and that of the little device that keeps the human heart pumping. Pacemaker monitors the cluster and intervenes when necessary to ensure the smooth operation of the services it provides. There were a number of other names (and acronyms) tossed around, but suffice to say "Pacemaker" was the best. 
Why was the Pacemaker Project Created?:: Pacemaker was spun off from an earlier project called http://linux-ha.org/[Heartbeat], which combined a cluster layer and a cluster resource manager. The CRM was made into its own project, Pacemaker, in order to: * support both the Corosync and Heartbeat cluster stacks equally (Heartbeat support was dropped in Pacemaker 2.0, as the project had faded out by then) * decouple the release cycles of the cluster layer and the cluster resource manager at very different stages of their life-cycles * foster clearer package boundaries, thus leading to better and more stable interfaces What Messaging Layers are Supported?:: indexterm:[Messaging Layers] * http://www.corosync.org/[Corosync] version 2 - * Historically, Pacemaker also supported Corosync version 1 (with either CMAN - or a pacemaker plugin) and Heartbeat. Support for these legacy stacks was - dropped with Pacemaker 2.0. + * Historically, Pacemaker 1 also supported Corosync version 1 (with either + CMAN or a pacemaker plugin) and Heartbeat. Support for these legacy stacks + was dropped with Pacemaker 2.0. - -Can I Choose Which Messaging Layer to Use at Run Time?:: - - Yes. The CRM will automatically detect which started it and behave accordingly. - -[[q-messaging-layer]] Which Messaging Layer Should I Choose?:: - indexterm:[Cluster Stack,Corosync] indexterm:[Corosync] - Corosync version 2 is the current state of the art due to its - more advanced features and better support for Pacemaker, - but often the best choice is to use whatever comes with - your Linux distribution, and follow the distribution's - setup instructions. - Where Can I Get Pre-built Packages?:: Most major Linux distributions have pacemaker packages in their standard package repositories. See the http://clusterlabs.org/wiki/Install[Install wiki page] for details. 
What Versions of Pacemaker Are Supported?:: Some Linux distributions (such as Red Hat Enterprise Linux and SUSE Linux Enterprise) offer technical support for their customers; contact them for details of such support. For help within the community (mailing lists, IRC, etc.) from Pacemaker developers and users, refer to the http://clusterlabs.org/wiki/Releases[Releases wiki page] for an up-to-date list of versions considered to be supported by the project. When seeking assistance, please try to ensure you have one of these versions. diff --git a/doc/Pacemaker_Explained/en-US/Ap-Install.txt b/doc/Pacemaker_Explained/en-US/Ap-Install.txt index c0ac777f7d..25cb889319 100644 --- a/doc/Pacemaker_Explained/en-US/Ap-Install.txt +++ b/doc/Pacemaker_Explained/en-US/Ap-Install.txt @@ -1,110 +1,107 @@ [appendix] == Installing == === Installing the Software === Most major Linux distributions have pacemaker packages in their standard package repositories, or the software can be built from source code. See the http://clusterlabs.org/wiki/Install[Install wiki page] for details. -See <> -for information about choosing a messaging layer. - === Enabling Pacemaker === ==== Enabling Pacemaker For Corosync 2._x_ ==== High-level cluster management tools are available that can configure corosync for you. This document focuses on the lower-level details if you want to configure corosync yourself. Corosync configuration is normally located in +/etc/corosync/corosync.conf+. 
.Corosync 2._x_ configuration file for two nodes *myhost1* and *myhost2* ==== ---- totem { version: 2 secauth: off cluster_name: mycluster transport: udpu } nodelist { node { ring0_addr: myhost1 nodeid: 1 } node { ring0_addr: myhost2 nodeid: 2 } } quorum { provider: corosync_votequorum two_node: 1 } logging { to_syslog: yes } ---- ==== .Corosync 2._x_ configuration file for three nodes *myhost1*, *myhost2* and *myhost3* ==== ---- totem { version: 2 secauth: off cluster_name: mycluster transport: udpu } nodelist { node { ring0_addr: myhost1 nodeid: 1 } node { ring0_addr: myhost2 nodeid: 2 } node { ring0_addr: myhost3 nodeid: 3 } } quorum { provider: corosync_votequorum } logging { to_syslog: yes } ---- ==== In the above examples, the +totem+ section defines what protocol version and options (including encryption) to use, footnote:[ Please consult the Corosync website (http://www.corosync.org/) and documentation for details on enabling encryption and peer authentication for the cluster. ] and gives the cluster a unique name (+mycluster+ in these examples). The +node+ section lists the nodes in this cluster. (See <> for how this affects pacemaker.) The +quorum+ section defines how the cluster uses quorum. The important thing is that two-node clusters must be handled specially, so +two_node: 1+ must be defined for two-node clusters (and only for two-node clusters). The +logging+ section should be self-explanatory. diff --git a/doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt b/doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt index 7be43e924d..367cf49789 100644 --- a/doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt +++ b/doc/Pacemaker_Explained/en-US/Ap-Upgrade.txt @@ -1,399 +1,443 @@ [appendix] == Upgrading == [[ap-upgrade]] +=== Pacemaker Versioning === + +Pacemaker has an overall release version, plus separate version numbers for +certain internal components. + +* *Pacemaker release version:* This version consists of three numbers + (_x.y.z_). 
++ +The major version number (the _x_ in _x.y.z_) increases when at least some +rolling upgrades are not possible from the previous major version. For example, +a rolling upgrade from 1.0.8 to 1.1.15 should always be supported, but a +rolling upgrade from 1.0.8 to 2.0.0 may not be possible. ++ +The minor version (the _y_ in _x.y.z_) increases when there are significant +changes in cluster default behavior, tool behavior, and/or the API interface +(for software that utilizes Pacemaker libraries). The main benefit is to alert +you to pay closer attention to the release notes, to see if you might be +affected. ++ +The release counter (the _z_ in _x.y.z_) is increased with all public releases +of Pacemaker, which typically include both bug fixes and new features. + +* *CRM feature set:* This version number applies to the communication between + full cluster nodes. ++ +It increases when a cluster node running the older version would have +problems if the cluster's Designated Controller (DC) has the newer version. +To avoid these problems, Pacemaker ensures that the longest-running node is the +DC, and that nodes with an older feature set cannot join the cluster. + +* *LRMD protocol version:* This version applies to communication between a + Pacemaker Remote node and the cluster. It increases when an older cluster + node would have problems hosting the connection to a newer Pacemaker Remote + node. To avoid these problems, Pacemaker Remote nodes will accept connections + only from cluster nodes with the same or newer LRMD protocol version. ++ +Unlike with CRM feature set differences between full cluster nodes, +mixed LRMD protocol versions between Pacemaker Remote nodes and full cluster +nodes are fine, as long as the Pacemaker Remote nodes have the older version. +This can be useful, for example, to host a legacy application in an +older operating system version used as a Pacemaker Remote node. 
+ +* *XML schema version:* Pacemaker’s configuration syntax — what's allowed in + the Configuration Information Base (CIB) — has its own version. This allows + the configuration syntax to evolve over time while still allowing clusters + with older configurations to work without change. + === Upgrading Cluster Software === There are three approaches to upgrading a cluster, each with advantages and disadvantages. .Upgrade Methods [width="95%",cols="s,6*",options="header",align="center"] |========================================================= |Method |Available between all versions |Can be used with Pacemaker Remote nodes |Service outage during upgrade |Service recovery during upgrade |Exercises failover logic |Allows change of messaging layer indexterm:[Cluster,switching between stacks] indexterm:[Changing cluster stack] -footnote:[Currently, Corosync 2 is the only supported cluster stack, but that -is likely to change in the future.] +footnote:[Currently, Corosync 2 is the only supported cluster stack, but other +stacks have been supported by past versions, and may be supported by future +versions.] |Complete cluster shutdown indexterm:[upgrade,shutdown] indexterm:[shutdown upgrade] |yes |yes |always |N/A |no |yes |Rolling (node by node) indexterm:[upgrade,rolling] indexterm:[rolling upgrade] |no |yes |always footnote:[Any active resources will be moved off the node being upgraded, so there will be at least a brief outage unless all resources can be migrated "live".] |yes |yes |no |Detach and reattach indexterm:[upgrade,reattach] indexterm:[reattach upgrade] |yes |no |only due to failure |no |no |yes |========================================================= ==== Complete Cluster Shutdown ==== In this scenario, one shuts down all cluster nodes and resources, then upgrades all the nodes before restarting the cluster. . On each node: .. Shutdown the cluster software (pacemaker and the messaging layer). .. Upgrade the Pacemaker software. 
This may also include upgrading the messaging layer and/or the underlying operating system. .. Check the configuration with the `crm_verify` tool. . On each node: .. Start the cluster software. Currently, only Corosync version 2 is supported as the cluster layer, but if another stack is supported in the future, the stack does not need to be the same one before the upgrade. One variation of this approach is to build a new cluster on new hosts. This allows the new version to be tested beforehand, and minimizes downtime by having the new nodes ready to be placed in production as soon as the old nodes are shut down. ==== Rolling (node by node) ==== In this scenario, each node is removed from the cluster, upgraded, and then brought back online, until all nodes are running the newest version. -If you plan to upgrade other cluster software -- such as the messaging layer -- -at the same time, consult that software's documentation for its compatibility -with a rolling upgrade. +Special considerations when planning a rolling upgrade: -Pacemaker has three version numbers that affect rolling upgrades: +* If you plan to upgrade other cluster software -- such as the messaging layer -- + at the same time, consult that software's documentation for its compatibility + with a rolling upgrade. -* *Pacemaker release version:* Rolling upgrades are possible as long as the - major version number (the _x_ in _x.y.z_) stays the same. For example, - a rolling upgrade may be done from 1.0.8 to 1.1.15, but not from - 0.6.7 to 1.0.0. +* If the major version number is changing in the Pacemaker version you are + upgrading to, a rolling upgrade may not be possible. Read the new version's + release notes (as well as the information here) for what limitations may exist. -* *CRM feature set:* This version number applies to the communication between - full cluster nodes.
-+ -It increases when a cluster node running the older version would have -problems if the cluster's Designated Controller (DC) has the newer version. -To avoid these problems, Pacemaker ensures that the longest-running node is the -DC, and that nodes with an older feature set cannot join the cluster. -+ -Therefore, if the CRM feature set is changing in the Pacemaker version you -are upgrading to, you should run a mixed-version cluster only during a small -rolling upgrade window. If one of the older nodes drops out of the -cluster for any reason, it will not be able to rejoin until it is upgraded. +* If the CRM feature set is changing in the Pacemaker version you are upgrading + to, you should run a mixed-version cluster only during a small rolling + upgrade window. If one of the older nodes drops out of the cluster for any + reason, it will not be able to rejoin until it is upgraded. -* *LRMD protocol version:* This version number applies to communication between a - Pacemaker Remote node and the cluster. It increases when an older cluster - node would have problems hosting the connection to a newer Pacemaker Remote - node. To avoid these problems, Pacemaker Remote nodes will accept connections - only from cluster nodes with the same or newer LRMD protocol version. -+ -For rolling upgrades, this means that all cluster nodes should be upgraded -before upgrading any Pacemaker Remote nodes. -+ -Unlike with CRM feature set differences between full cluster nodes, -mixed LRMD protocol versions between Pacemaker Remote nodes and full cluster -nodes are fine, as long as the Pacemaker Remote nodes have the older version. -This can be useful, for example, to host a legacy application in an -older operating system version used as a Pacemaker Remote node. +* If the LRMD protocol version is changing, all cluster nodes should be + upgraded before upgrading any Pacemaker Remote nodes. 
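The considerations above lend themselves to a quick pre-flight check. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, version strings are assumed to start with _x.y.z_ (as in `1.1.16-12.el7_4.5`), and a changed major number is treated merely as a flag to read the release notes, not as proof that a rolling upgrade is impossible.

```python
import re

# Leading x.y.z of a release string; any packaging suffix is ignored.
VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def parse_release(version):
    """Return (x, y, z) from a release string such as "1.1.16-12.el7_4.5"."""
    match = VERSION.match(version)
    if match is None:
        raise ValueError("not an x.y.z version: {0!r}".format(version))
    return tuple(int(part) for part in match.groups())


def major_version_changes(current, target):
    """True if the major number differs, so release notes need a close read."""
    return parse_release(current)[0] != parse_release(target)[0]


def upgrade_order(cluster_nodes, remote_nodes):
    """List full cluster nodes first, then Pacemaker Remote nodes.

    This is the order to follow when the LRMD protocol version changes.
    """
    return list(cluster_nodes) + list(remote_nodes)


if __name__ == "__main__":
    print(major_version_changes("1.0.8", "1.1.15"))
    print(major_version_changes("1.1.16-12.el7_4.5", "2.0.0"))
    print(upgrade_order(["pcmk-1", "pcmk-2"], ["remote-1"]))
```

The check says nothing about the CRM feature set or LRMD protocol version themselves; those are not derivable from the release number and still have to be looked up.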
See the ClusterLabs wiki's http://clusterlabs.org/wiki/ReleaseCalendar[Release Calendar] to figure out whether the CRM feature set and/or LRMD protocol version changed between the Pacemaker release versions in your rolling upgrade. -[WARNING] -==== -The interpretation of the LRMD protocol version changed in Pacemaker 1.1.15. -If you are planning a rolling upgrade from an earlier Pacemaker version to -Pacemaker 1.1.15 or later involving Pacemaker Remote nodes, you will need to -take special precautions to avoid problems. See -http://clusterlabs.org/wiki/Upgrading_to_Pacemaker_1.1.15_or_later_from_an_earlier_version[Upgrading -to Pacemaker 1.1.15 or later from an earlier version] on the ClusterLabs wiki. -==== - To perform a rolling upgrade, on each node in turn: . Put the node into standby mode, and wait for any active resources to be moved cleanly to another node. (This step is optional, but allows you to deal with any resource issues before the upgrade.) . Shut down the cluster software (pacemaker and the messaging layer) on the node. . Upgrade the Pacemaker software. This may also include upgrading the messaging layer and/or the underlying operating system. . If this is the first node to be upgraded, check the configuration with the `crm_verify` tool. . Start the messaging layer. This must be the same messaging layer (currently only Corosync 2 is supported) that the rest of the cluster is using. [NOTE] ==== -Rolling upgrades were not always possible with older -pacemaker versions. Rolling upgrades that cross compatibility -boundaries listed in the following table must be performed in multiple steps. +Even if a rolling upgrade from the current version of the cluster to the newest +version is not directly possible, it may be possible to perform a rolling +upgrade in multiple steps, by upgrading to an intermediate version first.
.Version Compatibility Table [width="95%",cols="2*",options="header",align="center"] |========================================================= |Version being Installed |Oldest Compatible Version -|Pacemaker 1.x.y +|Pacemaker 2.y.z +|Pacemaker 1.1.11 +footnote:[Rolling upgrades from Pacemaker 1.1.z to 2.y.z are possible only if +the cluster uses corosync 2 as its messaging layer, and the Cluster Information +Base (CIB) uses schema 1.0 or higher in its validate-with property.] + +|Pacemaker 1.y.z |Pacemaker 1.0.0 -|Pacemaker 0.7.x -|Pacemaker 0.6.0 +|Pacemaker 0.7.z +|Pacemaker 0.6.z |========================================================= ==== ==== Detach and Reattach ==== The reattach method is a variant of a complete cluster shutdown, where the resources are left active and get re-detected when the cluster is restarted. This method may not be used if the cluster contains any Pacemaker Remote nodes. . Tell the cluster to stop managing services. This is required to allow the services to remain active after the cluster shuts down. + ---- # crm_attribute --name maintenance-mode --update true ---- . On each node, shutdown the cluster software (pacemaker and the messaging layer), and upgrade the Pacemaker software. This may also include upgrading the messaging layer. While the underlying operating system may be upgraded at the same time, that will be more likely to cause outages in the detached services (certainly, if a reboot is required). . Check the configuration with the `crm_verify` tool. . On each node, start the cluster software. Currently, only Corosync version 2 is supported as the cluster layer, but if another stack is supported in the future, the stack does not need to be the same one before the upgrade. . Verify that the cluster re-detected all resources correctly. . 
Allow the cluster to resume managing resources again: + ---- # crm_attribute --name maintenance-mode --delete ---- -[NOTE] -=========== -Support for maintenance mode was added in Pacemaker 1.0.0. If you are upgrading -from an earlier version, you can detach by setting +is-managed+ to +false+ for -all resources. -=========== - === Upgrading the Configuration === indexterm:[upgrade,Configuration] indexterm:[Configuration,upgrading] -Pacemaker's configuration -- the Configuration Information Base (CIB) -- has -its own XML schema version, independent of the Pacemaker software version. +The CIB schema version can change from one Pacemaker version to another. After cluster software is upgraded, the cluster will continue to use the older schema version that it was previously using. This can be useful, for example, when administrators have written tools that modify the configuration, and are based on the older syntax. +footnote:[As of Pacemaker 2.0.0, only schema versions pacemaker-1.0 and higher +are supported (excluding pacemaker-1.1, which was an experimental schema +now known as pacemaker-next).] However, when using an older syntax, new features may be unavailable, and there is a performance impact, since the cluster must do a non-persistent configuration upgrade before each transition. So while using the old syntax is possible, it is not advisable to continue using it indefinitely. Even if you wish to continue using the old syntax, it is a good idea to follow the upgrade procedure outlined below, except for the last step, to ensure that the new software has no problems with your existing configuration (since it will perform much the same task internally). If you are brave, it is sufficient simply to run `cibadmin --upgrade`. A more cautious approach would proceed like this: . Create a shadow copy of the configuration. The later commands will automatically operate on this copy, rather than the live configuration. + ----- # crm_shadow --create shadow ----- . 
Verify the configuration is valid with the new software (which may be stricter about syntax mistakes, or may have dropped support for deprecated features): indexterm:[Configuration,verify] indexterm:[verify,Configuration] + ----- # crm_verify --live-check ----- . Fix any errors or warnings. . Perform the upgrade: + ----- # cibadmin --upgrade ----- . If this step fails, there are three main possibilities: .. The configuration was not valid to start with (did you do steps 2 and 3?). .. The transformation failed - http://bugs.clusterlabs.org/[report a bug] or mailto:users@clusterlabs.org?subject=Transformation%20failed%20during%20upgrade[email the project]. .. The transformation was successful but produced an invalid result. + If the result of the transformation is invalid, you may see a number of errors from the validation library. If these are not helpful, visit the http://clusterlabs.org/wiki/Validation_FAQ[Validation FAQ wiki page] and/or try the manual upgrade procedure described below. + . Check the changes: + ----- # crm_shadow --diff ----- + If at this point there is anything about the upgrade that you wish to fine-tune (for example, to change some of the automatic IDs), now is the time to do so: + ----- # crm_shadow --edit ----- + This will open the configuration in your favorite editor (whichever is specified by the standard *$EDITOR* environment variable). + . Preview how the cluster will react: + ------ # crm_simulate --live-check --save-dotfile shadow.dot -S # graphviz shadow.dot ------ + Verify that either no resource actions will occur or that you are happy with any that are scheduled. If the output contains actions you do not expect (possibly due to changes to the score calculations), you may need to make further manual changes. See <> for further details on how to interpret the output of `crm_simulate` and `graphviz`. + . 
Upload the changes: + ----- # crm_shadow --commit shadow --force ----- + In the unlikely event this step fails, please report a bug. [NOTE] ==== indexterm:[Configuration,upgrade manually] It is also possible to perform the configuration upgrade steps manually: . Locate the +upgrade*.xsl+ conversion scripts provided with the source code. These will often be installed in a location such as +/usr/share/pacemaker+, or may be obtained from the https://github.com/ClusterLabs/pacemaker/tree/master/xml[source repository]. . Run the conversion scripts that apply to your older version, for example: indexterm:[XML,convert] + ----- # xsltproc /path/to/upgrade06.xsl config06.xml > config10.xml ----- + . Locate the +pacemaker.rng+ script (from the same location as the xsl files). . Check the XML validity: indexterm:[validate configuration]indexterm:[Configuration,validate XML] + ---- # xmllint --relaxng /path/to/pacemaker.rng config10.xml ---- The advantage of this method is that it can be performed without the cluster running, and any validation errors are often more informative. ==== +=== What Changed in 2.0 === + +The main goal of the 2.0 release was to remove support for deprecated syntax, +along with some small changes in default configuration behavior and tool +behavior. Highlights: + +* Only Corosync version 2 is now supported as the underlying cluster + layer. Support for Heartbeat and Corosync 1 (including CMAN) is removed. + +* The Pacemaker detail log file is now stored in + /var/log/pacemaker/pacemaker.log by default. + +* The record-pending cluster property now defaults to true, which + allows status tools such as crm_mon to show operations that are in + progress. + +* Support for a number of deprecated build options, environment variables, + and configuration settings has been removed. + +* The public API for Pacemaker libraries that software applications can use + has changed significantly. 
+ +For a detailed list of changes, see the release notes and the +https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes[Pacemaker 2.0 Changes] +page on the ClusterLabs wiki. + === What Changed in 1.0 === ==== New ==== * Failure timeouts. See <> * New section for resource and operation defaults. See <> and <> * Tool for making offline configuration changes. See <> * +Rules, instance_attributes, meta_attributes+ and sets of operations can be defined once and referenced in multiple places. See <> * The CIB now accepts XPath-based create/modify/delete operations. See the pass:[cibadmin] help text. * Multi-dimensional colocation and ordering constraints. See <> and <> * The ability to connect to the CIB from non-cluster machines. See <> * Allow recurring actions to be triggered at known times. See <> ==== Changed ==== * Syntax ** All resource and cluster options now use dashes (-) instead of underscores (_) ** +master_slave+ was renamed to +master+ ** The +attributes+ container tag was removed ** The operation field +pre-req+ has been renamed +requires+ ** All operations must have an +interval+, +start+/+stop+ must have it set to zero * The +stonith-enabled+ option now defaults to true. * The cluster will refuse to start resources if +stonith-enabled+ is true (or unset) and no STONITH resources have been defined * The attributes of colocation and ordering constraints were renamed for clarity. See <> and <> * +resource-failure-stickiness+ has been replaced by +migration-threshold+. See <> * The parameters for command-line tools have been made consistent * Switched to 'RelaxNG' schema validation and 'libxml2' parser ** id fields are now XML IDs which have the following limitations: *** id's cannot contain colons (:) *** id's cannot begin with a number *** id's must be globally unique (not just unique for that tag) ** Some fields (such as those in constraints that refer to resources) are IDREFs. 
+ This means that they must reference existing resources or objects in order for the configuration to be valid. Removing an object which is referenced elsewhere will therefore fail. + ** The CIB representation, from which a MD5 digest is calculated to verify CIBs on the nodes, has changed. + This means that every CIB update will require a full refresh on any upgraded nodes until the cluster is fully upgraded to 1.0. This will result in significant performance degradation and it is therefore highly inadvisable to run a mixed 1.0/0.6 cluster for any longer than absolutely necessary. + * Ping node information no longer needs to be added to _ha.cf_. + Simply include the lists of hosts in your ping resource(s). ==== Removed ==== * Syntax ** It is no longer possible to set resource meta options as top-level attributes. Use meta attributes instead. ** Resource and operation defaults are no longer read from +crm_config+. See <> and <> instead. diff --git a/doc/Pacemaker_Explained/en-US/Book_Info.xml b/doc/Pacemaker_Explained/en-US/Book_Info.xml index 0d9d73c5ca..da196e34ae 100644 --- a/doc/Pacemaker_Explained/en-US/Book_Info.xml +++ b/doc/Pacemaker_Explained/en-US/Book_Info.xml @@ -1,35 +1,35 @@ Configuration Explained An A-Z guide to Pacemaker's Configuration Options Pacemaker - 1.1 + 2.0 - 10 + 11 0 The purpose of this document is to definitively explain the concepts used to configure Pacemaker. To achieve this, it will focus exclusively on the XML syntax used to configure Pacemaker's Cluster Information Base (CIB). 
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt index c06ef81316..2c39af1799 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt @@ -1,817 +1,813 @@ = Advanced Configuration = [[s-remote-connection]] == Connecting from a Remote Machine == indexterm:[Cluster,Remote connection] indexterm:[Cluster,Remote administration] Provided Pacemaker is installed on a machine, it is possible to connect to the cluster even if the machine itself is not in the same cluster. To do this, one simply sets up a number of environment variables and runs the same commands as when working on a cluster node. .Environment Variables Used to Connect to Remote Instances of the CIB [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Environment Variable |Default |Description |CIB_user |$USER |The user to connect as. Needs to be part of the +haclient+ group on the target host. indexterm:[Environment Variable,CIB_user] |CIB_passwd | |The user's password. Read from the command line if unset. indexterm:[Environment Variable,CIB_passwd] |CIB_server |localhost |The host to contact indexterm:[Environment Variable,CIB_server] |CIB_port | |The port on which to contact the server; required. 
indexterm:[Environment Variable,CIB_port] |CIB_encrypted |TRUE |Whether to encrypt network traffic indexterm:[Environment Variable,CIB_encrypted] |========================================================= So, if *c001n01* is an active cluster node and is listening on port 1234 for connections, and *someuser* is a member of the *haclient* group, then the following would prompt for *someuser*'s password and return the cluster's current configuration: ---- # export CIB_port=1234; export CIB_server=c001n01; export CIB_user=someuser; # cibadmin -Q ---- For security reasons, the cluster does not listen for remote connections by default. If you wish to allow remote access, you need to set the +remote-tls-port+ (encrypted) or +remote-clear-port+ (unencrypted) CIB properties (i.e., those kept in the +cib+ tag, like +num_updates+ and +epoch+). .Extra top-level CIB properties for remote access [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |remote-tls-port |_none_ |Listen for encrypted remote connections on this port. indexterm:[remote-tls-port,Remote Connection Option] indexterm:[Remote Connection,Option,remote-tls-port] |remote-clear-port |_none_ |Listen for plaintext remote connections on this port. indexterm:[remote-clear-port,Remote Connection Option] indexterm:[Remote Connection,Option,remote-clear-port] |========================================================= [[s-recurring-start]] == Specifying When Recurring Actions are Performed == By default, recurring actions are scheduled relative to when the resource started. So if your resource was last started at 14:32 and you have a backup set to be performed every 24 hours, then the backup will always run in the middle of the business day -- hardly desirable. To specify a date and time that the operation should be relative to, set the operation's +interval-origin+. 
The cluster uses this point to calculate the correct +start-delay+ such that the operation will occur at _origin + (interval * N)_. So, if the operation's interval is 24h, its interval-origin is set to 02:00 and it is currently 14:32, then the cluster would initiate the operation with a start delay of 11 hours and 28 minutes. If the resource is moved to another node before 2am, then the operation is cancelled. The value specified for +interval+ and +interval-origin+ can be any date/time conforming to the http://en.wikipedia.org/wiki/ISO_8601[ISO8601 standard]. By way of example, to specify an operation that would run on the first Monday of 2009 and every Monday after that, you would add: .Specifying a Base for Recurring Action Intervals ===== [source,XML] ===== [[s-failure-handling]] == Handling Resource Failure == By default, Pacemaker will attempt to recover failed resources by restarting them. However, failure recovery is highly configurable. === Failure Counts === Pacemaker tracks resource failures for each combination of node, resource, and operation (start, stop, monitor, etc.). You can query the fail count for a particular node, resource, and/or operation using the `crm_failcount` command. For example, to see how many times the 10-second monitor for +myrsc+ has failed on +node1+, run: ---- # crm_failcount --query -r myrsc -N node1 -n monitor -I 10s ---- If you omit the node, `crm_failcount` will use the local node. If you omit the operation and interval, `crm_failcount` will display the sum of the fail counts for all operations on the resource. You can use `crm_resource --cleanup` or `crm_failcount --delete` to clear fail counts. For example, to clear the above monitor failures, run: ---- # crm_resource --cleanup -r myrsc -N node1 -n monitor -I 10s ---- If you omit the resource, `crm_resource --cleanup` will clear failures for all resources. If you omit the node, it will clear failures on all nodes. 
If you omit the operation and interval, it will clear the failures for all operations on the resource. [NOTE] ==== Even when cleaning up only a single operation, all failed operations will disappear from the status display. This allows us to trigger a re-check of the resource's current status. ==== Higher-level tools may provide other commands for querying and clearing fail counts. The `crm_mon` tool shows the current cluster status, including any failed operations. To see the current fail counts for any failed resources, call `crm_mon` with the `--failcounts` option. This shows the fail counts per resource (that is, the sum of any operation fail counts for the resource). === Failure Response === Normally, if a running resource fails, pacemaker will try to stop it and start it again. Pacemaker will choose the best location to start it each time, which may be the same node that it failed on. However, if a resource fails repeatedly, it is possible that there is an underlying problem on that node, and you might desire trying a different node in such a case. Pacemaker allows you to set your preference via the +migration-threshold+ resource meta-attribute. footnote:[ The naming of this option was perhaps unfortunate as it is easily confused with live migration, the process of moving a resource from one node to another without stopping it. Xen virtual guests are the most common example of resources that can be migrated in this manner. ] If you define +migration-threshold=pass:[N]+ for a resource, it will be banned from the original node after 'N' failures. [NOTE] ==== The +migration-threshold+ is per 'resource', even though fail counts are tracked per 'operation'. The operation fail counts are added together to compare against the +migration-threshold+. ==== By default, fail counts remain until manually cleared by an administrator using `crm_resource --cleanup` or `crm_failcount --delete` (hopefully after first fixing the failure's cause). 
It is possible to have fail counts expire automatically by setting the +failure-timeout+ resource meta-attribute. [IMPORTANT] ==== A successful operation does not clear past failures. If a recurring monitor operation fails once, succeeds many times, then fails again days later, its fail count is 2. Fail counts are cleared only by manual intervention or failure timeout. ==== For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+ would cause the resource to move to a new node after 2 failures, and allow it to move back (depending on stickiness and constraint scores) after one minute. [NOTE] ==== +failure-timeout+ is measured since the most recent failure. That is, older failures do not individually time out and lower the fail count. Instead, all failures are timed out simultaneously (and the fail count is reset to 0) if there is no new failure for the timeout period. ==== There are two exceptions to the migration threshold concept: when a resource either fails to start or fails to stop. If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the default), start failures cause the fail count to be set to +INFINITY+ and thus always cause the resource to move immediately. Stop failures are slightly different and crucial. If a resource fails to stop and STONITH is enabled, then the cluster will fence the node in order to be able to start the resource elsewhere. If STONITH is not enabled, then the cluster has no way to continue and will not try to start the resource elsewhere, but will try to stop it again after the failure timeout. [IMPORTANT] Please read <> to understand how timeouts work before configuring a +failure-timeout+. == Moving Resources == indexterm:[Moving,Resources] indexterm:[Resource,Moving] === Moving Resources Manually === There are primarily two occasions when you would want to move a resource from its current location: when the whole node is under maintenance, and when a single resource needs to be moved. 
==== Standby Mode ==== Since everything eventually comes down to a score, you could create constraints for every resource to prevent them from running on one node. While pacemaker configuration can seem convoluted at times, not even we would require this of administrators. Instead, one can set a special node attribute which tells the cluster "don't let anything run here". There is even a helpful tool to help query and set it, called `crm_standby`. To check the standby status of the current machine, run: ---- # crm_standby -G ---- A value of +on+ indicates that the node is _not_ able to host any resources, while a value of +off+ says that it _can_. You can also check the status of other nodes in the cluster by specifying the `--node` option: ---- # crm_standby -G --node sles-2 ---- To change the current node's standby status, use `-v` instead of `-G`: ---- # crm_standby -v on ---- Again, you can change another host's value by supplying a hostname with `--node`. ==== Moving One Resource ==== When only one resource is required to move, we could do this by creating location constraints. However, once again we provide a user-friendly shortcut as part of the `crm_resource` command, which creates and modifies the extra constraints for you. If +Email+ were running on +sles-1+ and you wanted it moved to a specific location, the command would look something like: ---- # crm_resource -M -r Email -H sles-2 ---- Behind the scenes, the tool will create the following location constraint: [source,XML] It is important to note that subsequent invocations of `crm_resource -M` are not cumulative. So, if you ran these commands ---- # crm_resource -M -r Email -H sles-2 # crm_resource -M -r Email -H sles-3 ---- then it is as if you had never performed the first command. To allow the resource to move back again, use: ---- # crm_resource -U -r Email ---- Note the use of the word _allow_. 
The resource can move back to its original location but, depending on +resource-stickiness+, it might stay where it is. To be absolutely certain that it moves back to +sles-1+, move it there before issuing the call to `crm_resource -U`: ---- # crm_resource -M -r Email -H sles-1 # crm_resource -U -r Email ---- Alternatively, if you only care that the resource should be moved from its current location, try: ---- # crm_resource -B -r Email ---- Which will instead create a negative constraint, like [source,XML] This will achieve the desired effect, but will also have long-term consequences. As the tool will warn you, the creation of a +-INFINITY+ constraint will prevent the resource from running on that node until `crm_resource -U` is used. This includes the situation where every other cluster node is no longer available! In some cases, such as when +resource-stickiness+ is set to +INFINITY+, it is possible that you will end up with the problem described in <>. The tool can detect some of these cases and deal with them by creating both positive and negative constraints. E.g. +Email+ prefers +sles-1+ with a score of +-INFINITY+ +Email+ prefers +sles-2+ with a score of +INFINITY+ which has the same long-term consequences as discussed earlier. === Moving Resources Due to Connectivity Changes === You can configure the cluster to move resources when external connectivity is lost in two steps. ==== Tell Pacemaker to Monitor Connectivity ==== First, add an *ocf:pacemaker:ping* resource to the cluster. The *ping* resource uses the system utility of the same name to test whether a list of machines (specified by DNS hostname or IPv4/IPv6 address) is reachable and uses the results to maintain a node attribute called +pingd+ by default. footnote:[ The attribute name is customizable, in order to allow multiple ping groups to be defined. ] [NOTE] =========== Older versions of Pacemaker used a different agent *ocf:pacemaker:pingd* which is now deprecated in favor of *ping*. 
If your version of Pacemaker does not contain the *ping* resource agent, download the latest version from https://github.com/ClusterLabs/pacemaker/tree/master/extra/resources/ping =========== Normally, the ping resource should run on all cluster nodes, which means that you'll need to create a clone. A template for this can be found below along with a description of the most interesting parameters. .Common Options for a 'ping' Resource [width="95%",cols="1m,4<",options="header",align="center"] |========================================================= |Field |Description |dampen |The time to wait (dampening) for further changes to occur. Use this to prevent a resource from bouncing around the cluster when cluster nodes notice the loss of connectivity at slightly different times. indexterm:[dampen,Ping Resource Option] indexterm:[Ping Resource,Option,dampen] |multiplier |The number of connected ping nodes gets multiplied by this value to get a score. Useful when there are multiple ping nodes configured. indexterm:[multiplier,Ping Resource Option] indexterm:[Ping Resource,Option,multiplier] |host_list |The machines to contact in order to determine the current connectivity status. Allowed values include resolvable DNS host names, IPv4 and IPv6 addresses. indexterm:[host_list,Ping Resource Option] indexterm:[Ping Resource,Option,host_list] |========================================================= .An example ping cluster resource that checks node connectivity once every minute ===== [source,XML] ------------ ------------ ===== [IMPORTANT] =========== You're only half done. The next section deals with telling Pacemaker how to deal with the connectivity status that +ocf:pacemaker:ping+ is recording. =========== ==== Tell Pacemaker How to Interpret the Connectivity Data ==== [IMPORTANT] ====== Before attempting the following, make sure you understand <>. ====== There are a number of ways to use the connectivity data. 
The most common setup is for people to have a single ping target (e.g. the service network's default gateway), to prevent the cluster from running a resource on any unconnected node. .Don't run a resource on unconnected nodes ===== [source,XML] ------- ------- ===== A more complex setup is to have a number of ping targets configured. You can require the cluster to only run resources on nodes that can connect to all (or a minimum subset) of them. .Run only on nodes connected to three or more ping targets. ===== [source,XML] ------- ... ... ... ------- ===== Alternatively, you can tell the cluster only to _prefer_ nodes with the best connectivity. Just be sure to set +multiplier+ to a value higher than that of +resource-stickiness+ (and don't set either of them to +INFINITY+). .Prefer the node with the most connected ping nodes ===== [source,XML] ------- ------- ===== It is perhaps easier to think of this in terms of the simple constraints that the cluster translates it into. For example, if *sles-1* is connected to all five ping nodes but *sles-2* is only connected to two, then it would be as if you instead had the following constraints in your configuration: .How the cluster translates the above location constraint ===== [source,XML] ------- ------- ===== The advantage is that you don't have to manually update any constraints whenever your network connectivity changes. You can also combine the concepts above into something even more complex. The example below shows how you can prefer the node with the most connected ping nodes provided they have connectivity to at least three (again assuming that +multiplier+ is set to 1000). .A more complex example of choosing a location based on connectivity ===== [source,XML] ------- ------- ===== [[s-migrating-resources]] === Migrating Resources === Normally, when the cluster needs to move a resource, it fully restarts the resource (i.e. stops the resource on the current node and starts it on the new node). 
However, some types of resources, such as Xen virtual guests, are able to move to another location without loss of state (often referred to as live migration or hot migration). In pacemaker, this is called resource migration. Pacemaker can be configured to migrate a resource when moving it, rather than restarting it. Not all resources are able to migrate (see the Migration Checklist below), and those that can won't do so in all situations. Conceptually, there are two requirements from which the other prerequisites follow: * The resource must be active and healthy at the old location; and * everything required for the resource to run must be available on both the old and new locations. The cluster is able to accommodate both 'push' and 'pull' migration models by requiring the resource agent to support two special actions: +migrate_to+ (performed on the current location) and +migrate_from+ (performed on the destination). In push migration, the process on the current location transfers the resource to the new location where it is later activated. In this scenario, most of the work would be done in the +migrate_to+ action and, if anything, the activation would occur during +migrate_from+. Conversely for pull, the +migrate_to+ action is practically empty and +migrate_from+ does most of the work, extracting the relevant resource state from the old location and activating it. There is no wrong or right way for a resource agent to implement migration, as long as it works. .Migration Checklist * The resource may not be a clone. * The resource must use an OCF style agent. * The resource must not be in a failed or degraded state. * The resource agent must support +migrate_to+ and +migrate_from+ actions, and advertise them in its metadata. * The resource must have the +allow-migrate+ meta-attribute set to +true+ (which is not the default). 
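The metadata requirement in the checklist above can be spot-checked from the shell. The sketch below is only illustrative: the helper name `advertises_migration` is hypothetical, and it assumes that the agent's metadata XML is supplied on standard input (for example, from `crm_resource --show-metadata`) and that each action is declared as an `action` element with a `name` attribute.

```shell
# Hypothetical helper (not part of Pacemaker): succeed only if the agent
# metadata read from standard input declares both the migrate_to and the
# migrate_from actions.
advertises_migration() {
    metadata=$(cat)
    printf '%s\n' "$metadata" | grep -q '<action[^>]* name="migrate_to"' &&
    printf '%s\n' "$metadata" | grep -q '<action[^>]* name="migrate_from"'
}

# Example (the agent name is only an illustration):
#   crm_resource --show-metadata ocf:heartbeat:VirtualDomain | advertises_migration \
#       && echo "agent advertises migration support"
```

Passing this check covers only one item of the checklist. The resource still needs +allow-migrate+ set to +true+, which can typically be done with `crm_resource --resource <id> --meta --set-parameter allow-migrate --parameter-value true`.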
If an otherwise migratable resource depends on another resource via an ordering constraint, there are special situations in which it will be restarted rather than migrated. For example, if the resource depends on a clone, and at the time the resource needs to be moved, the clone has instances that are stopping and instances that are starting, then the resource will be restarted. The Policy Engine is not yet able to model this situation correctly and so takes the safer (if less optimal) path. -In pacemaker 1.1.11 and earlier, a migratable resource will be restarted -when moving if it directly or indirectly depends on 'any' primitive or group -resources. - -Even in newer versions, if a migratable resource depends on a non-migratable -resource, and both need to be moved, the migratable resource will be restarted. +Also, if a migratable resource depends on a non-migratable resource, and both +need to be moved, the migratable resource will be restarted. [[s-node-health]] == Tracking Node Health == A node may be functioning adequately as far as cluster membership is concerned, and yet be "unhealthy" in some respect that makes it an undesirable location for resources. For example, a disk drive may be reporting SMART errors, or the CPU may be highly loaded. Pacemaker offers a way to automatically move resources off unhealthy nodes. === Node Health Attributes === Pacemaker will treat any node attribute whose name starts with +#health+ as an indicator of node health. 
Node health attributes may have one of the following values: .Allowed Values for Node Health Attributes [width="95%",cols="1,3<",options="header",align="center"] |========================================================= |Value |Intended significance |+red+ |This indicator is unhealthy indexterm:[Node health,red] |+yellow+ |This indicator is becoming unhealthy indexterm:[Node health,yellow] |+green+ |This indicator is healthy indexterm:[Node health,green] |'integer' |A numeric score to apply to all resources on this node (0 or positive is healthy, negative is unhealthy) indexterm:[Node health,score] |========================================================= === Node Health Strategy === Pacemaker assigns a node health score to each node, as the sum of the values of all its node health attributes. This score will be used as a location constraint applied to this node for all resources. The +node-health-strategy+ cluster option controls how Pacemaker responds to changes in node health attributes, and how it translates +red+, +yellow+, and +green+ to scores. Allowed values are: .Node Health Strategies [width="95%",cols="1m,3<",options="header",align="center"] |========================================================= |Value |Effect |none |Do not track node health attributes at all. indexterm:[Node health,none] |migrate-on-red |Assign the value of +-INFINITY+ to +red+, and 0 to +yellow+ and +green+. This will cause all resources to move off the node if any attribute is +red+. indexterm:[Node health,migrate-on-red] |only-green |Assign the value of +-INFINITY+ to +red+ and +yellow+, and 0 to +green+. This will cause all resources to move off the node if any attribute is +red+ or +yellow+. indexterm:[Node health,only-green] |progressive |Assign the value of the +node-health-red+ cluster option to +red+, the value of +node-health-yellow+ to +yellow+, and the value of +node-health-green+ to +green+. 
Each node is additionally assigned a score of +node-health-base+ (this allows resources to start even if some attributes are +yellow+). This strategy gives the administrator finer control over how important each value is. indexterm:[Node health,progressive] |custom |Track node health attributes using the same values as +progressive+ for +red+, +yellow+, and +green+, but do not take them into account. The administrator is expected to implement a policy by defining rules (see <>) referencing node health attributes. indexterm:[Node health,custom] |========================================================= === Measuring Node Health === Since Pacemaker calculates node health based on node attributes, any method that sets node attributes may be used to measure node health. The most common ways are resource agents or separate daemons. Pacemaker provides examples that can be used directly or as a basis for custom code. The +ocf:pacemaker:HealthCPU+ and +ocf:pacemaker:HealthSMART+ resource agents set node health attributes based on CPU and disk parameters. The +ipmiservicelogd+ daemon sets node health attributes based on IPMI values (the +ocf:pacemaker:SystemHealth+ resource agent can be used to manage the daemon as a cluster resource). == Reloading Services After a Definition Change == The cluster automatically detects changes to the definition of services it manages. The normal response is to stop the service (using the old definition) and start it again (with the new definition). This works well, but some services are smarter and can be told to use a new set of options without restarting. To take advantage of this capability, the resource agent must: . Accept the +reload+ operation and perform any required actions. 
_The actions here depend completely on your application!_ + .The DRBD agent's logic for supporting +reload+ ===== [source,Bash] ------- case $1 in start) drbd_start ;; stop) drbd_stop ;; reload) drbd_reload ;; monitor) drbd_monitor ;; *) drbd_usage exit $OCF_ERR_UNIMPLEMENTED ;; esac exit $? ------- ===== . Advertise the +reload+ operation in the +actions+ section of its metadata + .The DRBD Agent Advertising Support for the +reload+ Operation ===== [source,XML] ------- 1.1 Master/Slave OCF Resource Agent for DRBD ... ------- ===== . Advertise one or more parameters that can take effect using +reload+. + Any parameter with the +unique+ set to 0 is eligible to be used in this way. + .Parameter that can be changed using reload ===== [source,XML] ------- Full path to the drbd.conf file. Path to drbd.conf ------- ===== Once these requirements are satisfied, the cluster will automatically know to reload the resource (instead of restarting) when a non-unique field changes. [NOTE] ====== Metadata will not be re-read unless the resource needs to be started. This may mean that the resource will be restarted the first time, even though you changed a parameter with +unique=0+. ====== [NOTE] ====== If both a unique and non-unique field are changed simultaneously, the resource will still be restarted. ====== diff --git a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt index 68cedb5b11..0830af4e3d 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Resources.txt @@ -1,1520 +1,1519 @@ = Advanced Resource Types = [[group-resources]] == Groups - A Syntactic Shortcut == indexterm:[Group Resources] indexterm:[Resource,Groups] One of the most common elements of a cluster is a set of resources that need to be located together, start sequentially, and stop in the reverse order. To simplify this configuration, we support the concept of groups. 
.A group of two primitive resources
======
[source,XML]
-------
-------
======

Although the example above contains only two resources, there is no limit to
the number of resources a group can contain. The example is also sufficient to
explain the fundamental properties of a group:

* Resources are started in the order in which they appear (+Public-IP+ first,
  then +Email+)
* Resources are stopped in the reverse of the order in which they appear
  (+Email+ first, then +Public-IP+)

If a resource in the group can't run anywhere, then nothing after it is
allowed to run either.

* If +Public-IP+ can't run anywhere, neither can +Email+;
* but if +Email+ can't run anywhere, this does not affect +Public-IP+ in any way

The group above is logically equivalent to writing:

.How the cluster sees a group resource
======
[source,XML]
-------
-------
======

As the group grows bigger, the reduced configuration effort can become
significant.

Another (typical) example of a group is a DRBD volume, the filesystem mount, an
IP address, and an application that uses them.

=== Group Properties ===

.Properties of a Group Resource
[width="95%",cols="3m,5<",options="header",align="center"]
|=========================================================

|Field
|Description

|id
|A unique name for the group
 indexterm:[id,Group Resource Property]
 indexterm:[Resource,Group Property,id]

|=========================================================

=== Group Options ===

Groups inherit the +priority+, +target-role+, and +is-managed+ properties
from primitive resources. See <> for information about those properties.

=== Group Instance Attributes ===

Groups have no instance attributes. However, any that are set for the group
object will be inherited by the group's children.

=== Group Contents ===

Groups may only contain a collection of cluster resources (see <>). To refer
to a child of a group resource, just use the child's +id+ instead of the
group's.
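The start and stop ordering of a group, described earlier in this section, can
be illustrated with a small shell sketch. This is not how Pacemaker implements
groups; the function names and the space-separated member list are assumptions
made purely for illustration, reusing the +Public-IP+ and +Email+ members named
above.

.Illustrative sketch of group ordering (hypothetical helper functions)
======
[source,Bash]
-------
# Members of the group, in configuration order (illustrative list only).
GROUP_MEMBERS="Public-IP Email"

# Print members in start order: the order in which they are listed.
group_start_order() {
    for member in $1; do
        printf '%s\n' "$member"
    done
}

# Print members in stop order: the reverse of the listed order.
group_stop_order() {
    reversed=""
    for member in $1; do
        reversed="$member $reversed"
    done
    for member in $reversed; do
        printf '%s\n' "$member"
    done
}

# Print the members that may still run when member $2 cannot run anywhere:
# everything listed before it, and nothing from it onward.
group_runnable() {
    for member in $1; do
        if [ "$member" = "$2" ]; then
            break
        fi
        printf '%s\n' "$member"
    done
}
-------
======

For example, +group_runnable "$GROUP_MEMBERS" Email+ prints only +Public-IP+,
while +group_runnable "$GROUP_MEMBERS" Public-IP+ prints nothing.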
=== Group Constraints ===

Although it is possible to reference a group's children in
constraints, it is usually preferable to reference the group itself.

.Some constraints involving groups
======
[source,XML]
-------
-------
======

=== Group Stickiness ===
indexterm:[resource-stickiness,Groups]

Stickiness, the measure of how much a resource wants to stay where it
is, is additive in groups. Every active resource of the group will
contribute its stickiness value to the group's total. So if the
default +resource-stickiness+ is 100, and a group has seven members,
five of which are active, then the group as a whole will prefer its
current location with a score of 500.

[[s-resource-clone]]
== Clones - Resources That Get Active on Multiple Hosts ==
indexterm:[Clone Resources]
indexterm:[Resource,Clones]

Clones were initially conceived as a convenient way to start multiple
instances of an IP address resource and have them distributed throughout the
cluster for load balancing. They have turned out to be quite useful for
a number of purposes including integrating with the Distributed Lock Manager
(used by many cluster filesystems), the fencing subsystem, and OCFS2.

You can clone any resource, provided the resource agent supports it.

Three types of cloned resources exist:

* Anonymous
* Globally unique
* Stateful

'Anonymous' clones are the simplest. These behave completely identically
everywhere they are running. Because of this, there can be only one copy of an
anonymous clone active per machine.

'Globally unique' clones are distinct entities. A copy of the clone running on
one machine is not equivalent to another instance on another node, nor would any
two copies on the same node be equivalent.

'Stateful' clones are covered later in <>.
.A clone of an LSB resource ====== [source,XML] ------- ------- ====== === Clone Properties === .Properties of a Clone Resource [width="95%",cols="3m,5<",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the clone indexterm:[id,Clone Property] indexterm:[Clone,Property,id] |========================================================= === Clone Options === Options inherited from <> resources: +priority, target-role, is-managed+ .Clone-specific configuration options [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |clone-max |number of nodes in cluster |How many copies of the resource to start indexterm:[clone-max,Clone Option] indexterm:[Clone,Option,clone-max] |clone-node-max |1 |How many copies of the resource can be started on a single node indexterm:[clone-node-max,Clone Option] indexterm:[Clone,Option,clone-node-max] |clone-min |1 |Require at least this number of clone instances to be runnable before allowing -resources depending on the clone to be runnable '(since 1.1.14)' +resources depending on the clone to be runnable indexterm:[clone-min,Clone Option] indexterm:[Clone,Option,clone-min] |notify |true |When stopping or starting a copy of the clone, tell all the other copies beforehand and again when the action was successful. Allowed values: +false+, +true+ indexterm:[notify,Clone Option] indexterm:[Clone,Option,notify] |globally-unique |false |Does each copy of the clone perform a different function? Allowed values: +false+, +true+ indexterm:[globally-unique,Clone Option] indexterm:[Clone,Option,globally-unique] |ordered |false |Should the copies be started in series (instead of in parallel)? 
Allowed values: +false+, +true+ indexterm:[ordered,Clone Option] indexterm:[Clone,Option,ordered] |interleave |false |If this clone depends on another clone via an ordering constraint, is it allowed to start after the local instance of the other clone starts, rather than wait for all instances of the other clone to start? Allowed values: +false+, +true+ indexterm:[interleave,Clone Option] indexterm:[Clone,Option,interleave] |========================================================= === Clone Instance Attributes === Clones have no instance attributes; however, any that are set here will be inherited by the clone's children. === Clone Contents === Clones must contain exactly one primitive or group resource. [WARNING] You should never reference the name of a clone's child. If you think you need to do this, you probably need to re-evaluate your design. === Clone Constraints === In most cases, a clone will have a single copy on each active cluster node. If this is not the case, you can indicate which nodes the cluster should preferentially assign copies to with resource location constraints. These constraints are written no differently from those for primitive resources except that the clone's +id+ is used. .Some constraints involving clones ====== [source,XML] ------- ------- ====== Ordering constraints behave slightly differently for clones. In the example above, +apache-stats+ will wait until all copies of +apache-clone+ that need to be started have done so before being started itself. Only if _no_ copies can be started will +apache-stats+ be prevented from being active. Additionally, the clone will wait for +apache-stats+ to be stopped before stopping itself. Colocation of a primitive or group resource with a clone means that the resource can run on any machine with an active copy of the clone. The cluster will choose a copy based on where the clone is running and the resource's own location preferences. Colocation between clones is also possible. 
If one clone +A+ is colocated with another clone +B+, the set of allowed
locations for +A+ is limited to nodes on which +B+ is (or will be) active.
Placement is then performed normally.

[[s-clone-stickiness]]
=== Clone Stickiness ===
indexterm:[resource-stickiness,Clones]

To achieve a stable allocation pattern, clones are slightly sticky by
default. If no value for +resource-stickiness+ is provided, the clone
will use a value of 1. Being a small value, it causes minimal
disturbance to the score calculations of other resources but is enough
to prevent Pacemaker from needlessly moving copies around the cluster.

[NOTE]
====
For globally unique clones, this may result in multiple instances of the
clone staying on a single node, even after another eligible node becomes
active (for example, after being put into standby mode then made active again).
If you do not want this behavior, specify a +resource-stickiness+ of 0
for the clone temporarily and let the cluster adjust, then set it back
to 1 if you want the default behavior to apply again.
====

=== Clone Resource Agent Requirements ===

Any resource can be used as an anonymous clone, as it requires no
additional support from the resource agent. Whether it makes sense to
do so depends on your resource and its resource agent.

Globally unique clones do require some additional support in the
resource agent. In particular, it must only respond with
+$\{OCF_SUCCESS}+ if the node has that exact instance active. All
other probes for instances of the clone should result in
+$\{OCF_NOT_RUNNING}+ (or one of the other OCF error codes if
they have failed).

Individual instances of a clone are identified by appending a colon and a
numerical offset, e.g. +apache:2+.

Resource agents can find out how many copies there are by examining
the +OCF_RESKEY_CRM_meta_clone_max+ environment variable and which
copy it is by examining +OCF_RESKEY_CRM_meta_clone+.
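As a minimal sketch, a resource agent written in shell might read these two
variables as follows. The function name +clone_instance_info+ and its messages
are hypothetical; only the two environment variable names come from the text
above.

.Reading the clone instance variables (hypothetical helper)
======
[source,Bash]
-------
# Report which clone instance this invocation is acting on.
# The cluster sets both variables when it runs the agent for a clone.
clone_instance_info() {
    instance="${OCF_RESKEY_CRM_meta_clone:-}"
    max="${OCF_RESKEY_CRM_meta_clone_max:-}"

    # Outside a clone, neither variable is expected to be set.
    if [ -z "$instance" ] || [ -z "$max" ]; then
        echo "not running as a clone instance"
        return 1
    fi

    # The instance number identifies this copy only; it says nothing
    # about which other instance numbers are currently active.
    echo "instance $instance of at most $max copies"
}
-------
======

The sketch deliberately reports only the instance's own number and the
configured maximum, and derives nothing else from them.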
The resource agent must not make any assumptions (based on +OCF_RESKEY_CRM_meta_clone+) about which numerical instances are active. In particular, the list of active copies will not always be an unbroken sequence, nor always start at 0. ==== Clone Notifications ==== Supporting notifications requires the +notify+ action to be implemented. If supported, the notify action will be passed a number of extra variables which, when combined with additional context, can be used to calculate the current state of the cluster and what is about to happen to it. .Environment variables supplied with Clone notify actions [width="95%",cols="5,3<",options="header",align="center"] |========================================================= |Variable |Description |OCF_RESKEY_CRM_meta_notify_type |Allowed values: +pre+, +post+ indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,type] indexterm:[type,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_operation |Allowed values: +start+, +stop+ indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,operation] indexterm:[operation,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_start_resource |Resources to be started indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_resource] indexterm:[start_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_stop_resource |Resources to be stopped indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_resource] indexterm:[stop_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_active_resource |Resources that are running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_resource] indexterm:[active_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_inactive_resource |Resources that are not running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,inactive_resource] indexterm:[inactive_resource,Notification Environment Variable] 
|OCF_RESKEY_CRM_meta_notify_start_uname |Nodes on which resources will be started indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_uname] indexterm:[start_uname,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_stop_uname |Nodes on which resources will be stopped indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_uname] indexterm:[stop_uname,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_active_uname |Nodes on which resources are running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_uname] indexterm:[active_uname,Notification Environment Variable] |========================================================= The variables come in pairs, such as +OCF_RESKEY_CRM_meta_notify_start_resource+ and +OCF_RESKEY_CRM_meta_notify_start_uname+ and should be treated as an array of whitespace-separated elements. +OCF_RESKEY_CRM_meta_notify_inactive_resource+ is an exception as the matching +uname+ variable does not exist since inactive resources are not running on any node. 
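One way to treat a pair of these variables as parallel lists in a shell agent is
sketched below. The function name +notify_start_pairs+, the sample resource and
node names, and the +unknown+ placeholder (printed if the node list is shorter
than the resource list) are assumptions for illustration, not part of
Pacemaker.

.Pairing resources with nodes in a notify action (hypothetical helper)
======
[source,Bash]
-------
# Print one "resource node" line per resource that is about to be started.
# $1: value of OCF_RESKEY_CRM_meta_notify_start_resource
# $2: value of OCF_RESKEY_CRM_meta_notify_start_uname
notify_start_pairs() {
    resources="$1"
    # Split the node list on whitespace into the positional parameters.
    set -- $2
    for resource in $resources; do
        # The Nth resource is paired with the Nth node name.
        printf '%s %s\n' "$resource" "${1:-unknown}"
        if [ "$#" -gt 0 ]; then
            shift
        fi
    done
}

# Sample invocation with made-up values:
notify_start_pairs "rsc:0 rsc:1" "node-a node-b"
-------
======

In a real agent the two arguments would be
+"$OCF_RESKEY_CRM_meta_notify_start_resource"+ and
+"$OCF_RESKEY_CRM_meta_notify_start_uname"+; the same pairing applies to the
+stop+ and +active+ variables.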
Thus in order to indicate that +clone:0+ will be started on +sles-1+, +clone:2+ will be started on +sles-3+, and +clone:3+ will be started on +sles-2+, the cluster would set .Notification variables ====== [source,Bash] ------- OCF_RESKEY_CRM_meta_notify_start_resource="clone:0 clone:2 clone:3" OCF_RESKEY_CRM_meta_notify_start_uname="sles-1 sles-3 sles-2" ------- ====== ==== Proper Interpretation of Notification Environment Variables ==== .Pre-notification (stop): * Active resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+ * Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ .Post-notification (stop) / Pre-notification (start): * Active resources ** +$OCF_RESKEY_CRM_meta_notify_active_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Inactive resources ** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ .Post-notification (start): * Active resources: ** +$OCF_RESKEY_CRM_meta_notify_active_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Inactive resources: ** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ [[s-resource-multistate]] == Multi-state - Resources That Have Multiple Modes == indexterm:[Multi-state Resources] indexterm:[Resource,Multi-state] Multi-state resources are a specialization of clone resources; please ensure you understand <> before continuing! 
Multi-state resources allow the instances to be in one of two operating modes (called 'roles'). The roles are called 'master' and 'slave', but can mean whatever you wish them to mean. The only limitation is that when an instance is started, it must come up in the slave role. === Multi-state Properties === .Properties of a Multi-State Resource [width="95%",cols="3m,5<",options="header",align="center"] |========================================================= |Field |Description |id |Your name for the multi-state resource indexterm:[id,Multi-State Property] indexterm:[Multi-State,Property,id] |========================================================= === Multi-state Options === Options inherited from <> resources: +priority+, +target-role+, +is-managed+ Options inherited from <> resources: +clone-max+, +clone-node-max+, +notify+, +globally-unique+, +ordered+, +interleave+ .Multi-state-specific resource configuration options [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |master-max |1 |How many copies of the resource can be promoted to the +master+ role indexterm:[master-max,Multi-State Option] indexterm:[Multi-State,Option,master-max] |master-node-max |1 |How many copies of the resource can be promoted to the +master+ role on a single node indexterm:[master-node-max,Multi-State Option] indexterm:[Multi-State,Option,master-node-max] |========================================================= === Multi-state Instance Attributes === Multi-state resources have no instance attributes; however, any that are set here will be inherited by a master's children. === Multi-state Contents === Masters must contain exactly one primitive or group resource. [WARNING] You should never reference the name of a master's child. If you think you need to do this, you probably need to re-evaluate your design. 
=== Monitoring Multi-State Resources === The usual monitor actions are insufficient to monitor a multi-state resource, because pacemaker needs to verify not only that the resource is active, but also that its actual role matches its intended one. Define two monitoring actions: the usual one will cover the slave role, and an additional one with +role="master"+ will cover the master role. .Monitoring both states of a multi-state resource ====== [source,XML] ------- ------- ====== [IMPORTANT] =========== It is crucial that _every_ monitor operation has a different interval! Pacemaker currently differentiates between operations only by resource and interval; so if (for example) a master/slave resource had the same monitor interval for both roles, Pacemaker would ignore the role when checking the status -- which would cause unexpected return codes, and therefore unnecessary complications. =========== === Multi-state Constraints === In most cases, multi-state resources will have a single copy on each active cluster node. If this is not the case, you can indicate which nodes the cluster should preferentially assign copies to with resource location constraints. These constraints are written no differently from those for primitive resources except that the master's +id+ is used. When considering multi-state resources in constraints, for most purposes it is sufficient to treat them as clones. The exception is that the +first-action+ and/or +then-action+ fields for ordering constraints may be set to +promote+ or +demote+ to constrain the master role, and colocation constraints may contain +rsc-role+ and/or +with-rsc-role+ fields. .Additional colocation constraint options for multi-state resources [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |rsc-role |Started |An additional attribute of colocation constraints that specifies the role that +rsc+ must be in. 
Allowed values: +Started+, +Master+, +Slave+. indexterm:[rsc-role,Ordering Constraints] indexterm:[Constraints,Ordering,rsc-role] |with-rsc-role |Started |An additional attribute of colocation constraints that specifies the role that +with-rsc+ must be in. Allowed values: +Started+, +Master+, +Slave+. indexterm:[with-rsc-role,Ordering Constraints] indexterm:[Constraints,Ordering,with-rsc-role] |========================================================= .Constraints involving multi-state resources ====== [source,XML] ------- ------- ====== In the example above, +myApp+ will wait until one of the database copies has been started and promoted to master before being started itself on the same node. Only if no copies can be promoted will +myApp+ be prevented from being active. Additionally, the cluster will wait for +myApp+ to be stopped before demoting the database. Colocation of a primitive or group resource with a multi-state resource means that it can run on any machine with an active copy of the multi-state resource that has the specified role (+master+ or +slave+). In the example above, the cluster will choose a location based on where database is running as a +master+, and if there are multiple +master+ instances it will also factor in +myApp+'s own location preferences when deciding which location to choose. Colocation with regular clones and other multi-state resources is also possible. In such cases, the set of allowed locations for the +rsc+ clone is (after role filtering) limited to nodes on which the +with-rsc+ multi-state resource is (or will be) in the specified role. Placement is then performed as normal. ==== Using Multi-state Resources in Colocation Sets ==== .Additional colocation set options relevant to multi-state resources [width="95%",cols="1m,1,6<",options="header",align="center"] |========================================================= |Field |Default |Description |role |Started |The role that 'all members' of the set must be in. 
Allowed values: +Started+, +Master+, +Slave+. indexterm:[role,Ordering Constraints] indexterm:[Constraints,Ordering,role] |========================================================= In the following example +B+'s master must be located on the same node as +A+'s master. Additionally resources +C+ and +D+ must be located on the same node as +A+'s and +B+'s masters. .Colocate C and D with A's and B's master instances ====== [source,XML] ------- ------- ====== ==== Using Multi-state Resources in Ordering Sets ==== .Additional ordered set options relevant to multi-state resources [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |action |value of +first-action+ |An additional attribute of ordering constraint sets that specifies the action that applies to 'all members' of the set. Allowed values: +start+, +stop+, +promote+, +demote+. indexterm:[action,Ordering Constraints] indexterm:[Constraints,Ordering,action] |========================================================= .Start C and D after first promoting A and B ====== [source,XML] ------- ------- ====== In the above example, +B+ cannot be promoted to a master role until +A+ has been promoted. Additionally, resources +C+ and +D+ must wait until +A+ and +B+ have been promoted before they can start. === Multi-state Stickiness === indexterm:[resource-stickiness,Multi-State] As with regular clones, multi-state resources are slightly sticky by default. See <> for details. [[s-master-scores]] === Which Resource Instance is Promoted === During the start operation, most resource agents should call the `crm_master` utility. This tool automatically detects both the resource and host and should be used to set a preference for being promoted. Based on this, +master-max+, and +master-node-max+, the instance(s) with the highest preference will be promoted. 
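A hedged sketch of how a shell resource agent might record and clear its
promotion preference follows. The helper names, the +CRM_MASTER_CMD+ override
variable, and any score passed in are assumptions for illustration; +-l reboot+,
+-v+, and +-D+ are `crm_master` options for the attribute lifetime, the value to
set, and deletion.

.Setting a promotion preference from an agent (hypothetical helpers)
======
[source,Bash]
-------
# Command used to record the preference. Defaults to crm_master; the
# CRM_MASTER_CMD override is a hypothetical hook for dry runs.
: "${CRM_MASTER_CMD:=crm_master}"

# Record this node's preference for hosting the master instance.
# $1: score (a higher value means a stronger preference)
set_promotion_score() {
    # -l reboot: keep the preference only until the node restarts
    # -v:        the preference value to store
    $CRM_MASTER_CMD -l reboot -v "$1"
}

# Remove this node's preference, for example when the instance stops.
clear_promotion_score() {
    # -D: delete the stored preference
    $CRM_MASTER_CMD -l reboot -D
}
-------
======

An agent could call +set_promotion_score+ near the end of a successful +start+
and +clear_promotion_score+ during +stop+. Setting +CRM_MASTER_CMD=echo+ prints
the arguments instead of contacting the cluster.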
An alternative is to create a location constraint that indicates which nodes are most preferred as masters. .Explicitly preferring node1 to be promoted to master ====== [source,XML] ------- ------- ====== === Requirements for Multi-state Resource Agents === Since multi-state resources are an extension of cloned resources, all the requirements for resource agents that support clones are also requirements for resource agents that support multi-state resources. Additionally, multi-state resources require two extra actions, +demote+ and +promote+, which are responsible for changing the state of the resource. Like +start+ and +stop+, they should return +$\{OCF_SUCCESS}+ if they completed successfully or a relevant error code if they did not. The states can mean whatever you wish, but when the resource is started, it must come up in the mode called +slave+. From there the cluster will decide which instances to promote to +master+. In addition to the clone requirements for monitor actions, agents must also _accurately_ report which state they are in. The cluster relies on the agent to report its status (including role) accurately and does not indicate to the agent what role it currently believes it to be in. .Role implications of OCF return codes [width="95%",cols="1,1<",options="header",align="center"] |========================================================= |Monitor Return Code |Description |OCF_NOT_RUNNING |Stopped indexterm:[Return Code,OCF_NOT_RUNNING] |OCF_SUCCESS |Running (Slave) indexterm:[Return Code,OCF_SUCCESS] |OCF_RUNNING_MASTER |Running (Master) indexterm:[Return Code,OCF_RUNNING_MASTER] |OCF_FAILED_MASTER |Failed (Master) indexterm:[Return Code,OCF_FAILED_MASTER] |Other |Failed (Slave) |========================================================= ==== Multi-state Notifications ==== Like clones, supporting notifications requires the +notify+ action to be implemented. 
If supported, the notify action will be passed a number of extra variables which, when combined with additional context, can be used to calculate the current state of the cluster and what is about to happen to it. .Environment variables supplied with multi-state notify actions footnote:[Emphasized variables are specific to +Master+ resources, and all behave in the same manner as described for Clone resources.] [width="95%",cols="5,3<",options="header",align="center"] |========================================================= |Variable |Description |OCF_RESKEY_CRM_meta_notify_type |Allowed values: +pre+, +post+ indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,type] indexterm:[type,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_operation |Allowed values: +start+, +stop+ indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,operation] indexterm:[operation,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_active_resource |Resources that are running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_resource] indexterm:[active_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_inactive_resource |Resources that are not running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,inactive_resource] indexterm:[inactive_resource,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_master_resource_ |Resources that are running in +Master+ mode indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,master_resource] indexterm:[master_resource,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_slave_resource_ |Resources that are running in +Slave+ mode indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,slave_resource] indexterm:[slave_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_start_resource |Resources to be started indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_resource] 
indexterm:[start_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_stop_resource |Resources to be stopped indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_resource] indexterm:[stop_resource,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_promote_resource_ |Resources to be promoted indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,promote_resource] indexterm:[promote_resource,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_demote_resource_ |Resources to be demoted indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,demote_resource] indexterm:[demote_resource,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_start_uname |Nodes on which resources will be started indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,start_uname] indexterm:[start_uname,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_stop_uname |Nodes on which resources will be stopped indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,stop_uname] indexterm:[stop_uname,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_promote_uname_ |Nodes on which resources will be promoted indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,promote_uname] indexterm:[promote_uname,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_demote_uname_ |Nodes on which resources will be demoted indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,demote_uname] indexterm:[demote_uname,Notification Environment Variable] |OCF_RESKEY_CRM_meta_notify_active_uname |Nodes on which resources are running indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,active_uname] indexterm:[active_uname,Notification Environment Variable] |_OCF_RESKEY_CRM_meta_notify_master_uname_ |Nodes on which resources are running in +Master+ mode indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,master_uname] indexterm:[master_uname,Notification Environment Variable] 
|_OCF_RESKEY_CRM_meta_notify_slave_uname_ |Nodes on which resources are running in +Slave+ mode indexterm:[Environment Variable,OCF_RESKEY_CRM_meta_notify_,slave_uname] indexterm:[slave_uname,Notification Environment Variable] |========================================================= ==== Proper Interpretation of Multi-state Notification Environment Variables ==== .Pre-notification (demote): * +Active+ resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+ * +Master+ resources: +$OCF_RESKEY_CRM_meta_notify_master_resource+ * +Slave+ resources: +$OCF_RESKEY_CRM_meta_notify_slave_resource+ * Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ .Post-notification (demote) / Pre-notification (stop): * +Active+ resources: +$OCF_RESKEY_CRM_meta_notify_active_resource+ * +Master+ resources: ** +$OCF_RESKEY_CRM_meta_notify_master_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * +Slave+ resources: +$OCF_RESKEY_CRM_meta_notify_slave_resource+ * Inactive resources: +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ .Post-notification (stop) / Pre-notification (start) * +Active+ resources: ** +$OCF_RESKEY_CRM_meta_notify_active_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * +Master+ resources: ** +$OCF_RESKEY_CRM_meta_notify_master_resource+ ** minus 
+$OCF_RESKEY_CRM_meta_notify_demote_resource+ * +Slave+ resources: ** +$OCF_RESKEY_CRM_meta_notify_slave_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Inactive resources: ** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ .Post-notification (start) / Pre-notification (promote) * +Active+ resources: ** +$OCF_RESKEY_CRM_meta_notify_active_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * +Master+ resources: ** +$OCF_RESKEY_CRM_meta_notify_master_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * +Slave+ resources: ** +$OCF_RESKEY_CRM_meta_notify_slave_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Inactive resources: ** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ .Post-notification (promote) * 
+Active+ resources: ** +$OCF_RESKEY_CRM_meta_notify_active_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * +Master+ resources: ** +$OCF_RESKEY_CRM_meta_notify_master_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_demote_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * +Slave+ resources: ** +$OCF_RESKEY_CRM_meta_notify_slave_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_start_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Inactive resources: ** +$OCF_RESKEY_CRM_meta_notify_inactive_resource+ ** plus +$OCF_RESKEY_CRM_meta_notify_stop_resource+ ** minus +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources to be promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources to be demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources to be stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ * Resources that were started: +$OCF_RESKEY_CRM_meta_notify_start_resource+ * Resources that were promoted: +$OCF_RESKEY_CRM_meta_notify_promote_resource+ * Resources that were demoted: +$OCF_RESKEY_CRM_meta_notify_demote_resource+ * Resources that were stopped: +$OCF_RESKEY_CRM_meta_notify_stop_resource+ [[s-resource-bundle]] == Bundles - Isolated Environments == indexterm:[bundle] indexterm:[Resource,bundle] indexterm:[Docker,bundle] indexterm:[rkt,bundle] Pacemaker supports a special syntax for launching a https://en.wikipedia.org/wiki/Operating-system-level_virtualization[container] with any infrastructure it requires: the 'bundle'. -Pacemaker bundles support https://www.docker.com/[Docker] (since version -1.1.17) and https://coreos.com/rkt/[rkt] (since version 1.1.18) container -technologies. +Pacemaker bundles support https://www.docker.com/[Docker] and +https://coreos.com/rkt/[rkt] container technologies. 
footnote:[Docker is a trademark of Docker, Inc. No endorsement by or association with Docker, Inc. is implied.] .A bundle for a containerized web server ==== [source,XML] ----
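<!-- A minimal sketch of a bundle for a containerized web server. -->
<!-- Assumptions: the image name, IP address, interface, ids and paths are placeholders. -->
<bundle id="httpd-bundle">
   <docker image="pcmk:http" replicas="3"/>
   <network ip-range-start="192.168.122.131"
            host-netmask="24"
            host-interface="eth0">
      <port-mapping id="httpd-port" port="80"/>
   </network>
   <storage>
      <storage-mapping id="httpd-root"
                       source-dir="/srv/html"
                       target-dir="/var/www/html"
                       options="rw"/>
      <storage-mapping id="httpd-logs"
                       source-dir-root="/var/log/pacemaker/bundles"
                       target-dir="/etc/httpd/logs"
                       options="rw"/>
   </storage>
   <primitive class="ocf" id="httpd" provider="heartbeat" type="apache"/>
</bundle>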

---- ==== === Bundle Properties === .Properties of a Bundle [width="95%",cols="3m,5<",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the bundle (required) indexterm:[id,bundle] indexterm:[bundle,Property,id] |description |Arbitrary text (not used by Pacemaker) indexterm:[description,bundle] indexterm:[bundle,Property,description] |========================================================= A bundle must contain exactly one +<docker>+ or +<rkt>+ element. === Docker Properties === Before configuring a Docker bundle in Pacemaker, the user must install Docker and supply a fully configured Docker image on every node allowed to run the bundle. Pacemaker will create an implicit +ocf:heartbeat:docker+ resource to manage a bundle's Docker container. The user must ensure that resource agent is installed on every node allowed to run the bundle. .Properties of a Bundle's Docker Element [width="95%",cols="3m,4,5<",options="header",align="center"] |========================================================= |Field |Default |Description |image | |Docker image tag (required) indexterm:[image,Docker] indexterm:[Docker,Property,image] |replicas |Value of +masters+ if that is positive, else 1 |A positive integer specifying the number of container instances to launch indexterm:[replicas,Docker] indexterm:[Docker,Property,replicas] |replicas-per-host |1 |A positive integer specifying the number of container instances allowed to run on a single node indexterm:[replicas-per-host,Docker] indexterm:[Docker,Property,replicas-per-host] |masters |0 |A non-negative integer that, if positive, indicates that the containerized service should be treated as a multistate service, with this many replicas allowed to run the service in the master role indexterm:[masters,Docker] indexterm:[Docker,Property,masters] |network | |If specified, this will be passed to +docker run+ as the 
https://docs.docker.com/engine/reference/run/#network-settings[network setting] for the Docker container. indexterm:[network,Docker] indexterm:[Docker,Property,network] |run-command |`/usr/sbin/pacemaker_remoted` if bundle contains a +primitive+, otherwise none |This command will be run inside the container when launching it ("PID 1"). If the bundle contains a +primitive+, this command 'must' start pacemaker_remoted (but could, for example, be a script that does other stuff, too). indexterm:[run-command,Docker] indexterm:[Docker,Property,run-command] |options | |Extra command-line options to pass to `docker run` indexterm:[options,Docker] indexterm:[Docker,Property,options] |========================================================= === rkt Properties === Before configuring a rkt bundle in Pacemaker, the user must install rkt and supply a fully configured container image on every node allowed to run the bundle. Pacemaker will create an implicit +ocf:heartbeat:rkt+ resource to manage a bundle's rkt container. The user must ensure that resource agent is installed on every node allowed to run the bundle. 
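A rkt-based bundle might be declared roughly as sketched below. This is only an illustration: the bundle id, image name and replica counts are assumed placeholder values, and the attributes used are those listed in the rkt properties table.

.A minimal rkt bundle (illustrative sketch)
====
[source,XML]
----
<bundle id="httpd-bundle">
   <!-- image name and counts are assumptions for illustration -->
   <rkt image="pcmk:http" replicas="2" replicas-per-host="1"/>
</bundle>
----
====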
.Properties of a Bundle's rkt Element [width="95%",cols="3m,4,5<",options="header",align="center"] |========================================================= |Field |Default |Description |image | |Container image tag (required) indexterm:[image,rkt] indexterm:[rkt,Property,image] |replicas |Value of +masters+ if that is positive, else 1 |A positive integer specifying the number of container instances to launch indexterm:[replicas,rkt] indexterm:[rkt,Property,replicas] |replicas-per-host |1 |A positive integer specifying the number of container instances allowed to run on a single node indexterm:[replicas-per-host,rkt] indexterm:[rkt,Property,replicas-per-host] |masters |0 |A non-negative integer that, if positive, indicates that the containerized service should be treated as a multistate service, with this many replicas allowed to run the service in the master role indexterm:[masters,rkt] indexterm:[rkt,Property,masters] |network | |If specified, this will be passed to +rkt run+ as the network setting for the rkt container. indexterm:[network,rkt] indexterm:[rkt,Property,network] |run-command |`/usr/sbin/pacemaker_remoted` if bundle contains a +primitive+, otherwise none |This command will be run inside the container when launching it ("PID 1"). If the bundle contains a +primitive+, this command 'must' start pacemaker_remoted (but could, for example, be a script that does other stuff, too). indexterm:[run-command,rkt] indexterm:[rkt,Property,run-command] |options | |Extra command-line options to pass to `rkt run` indexterm:[options,rkt] indexterm:[rkt,Property,options] |========================================================= === Bundle Network Properties === A bundle may optionally contain one +<network>+ element. 
indexterm:[bundle,network] .Properties of a Bundle's Network Element [width="95%",cols="2m,1,4<",options="header",align="center"] |========================================================= |Field |Default |Description |ip-range-start | |If specified, Pacemaker will create an implicit +ocf:heartbeat:IPaddr2+ resource for each container instance, starting with this IP address, using up to +replicas+ sequential addresses. These addresses can be used from the host's network to reach the service inside the container, though it is not visible within the container itself. Only IPv4 addresses are currently supported. indexterm:[ip-range-start,network] indexterm:[network,Property,ip-range-start] |host-netmask |32 |If +ip-range-start+ is specified, the IP addresses are created with this CIDR netmask (as a number of bits). indexterm:[host-netmask,network] indexterm:[network,Property,host-netmask] |host-interface | |If +ip-range-start+ is specified, the IP addresses are created on this host interface (by default, it will be determined from the IP address). indexterm:[host-interface,network] indexterm:[network,Property,host-interface] |control-port |3121 |If the bundle contains a +primitive+, the cluster will use this integer TCP port for communication with Pacemaker Remote inside the container. Changing this is useful when the container is unable to listen on the default port, for example, when the container uses the host's network rather than +ip-range-start+ (in which case +replicas-per-host+ must be 1), or when the bundle may run on a Pacemaker Remote node that is already listening on the default port. Any PCMK_remote_port environment variable set on the host or in the container is ignored for bundle connections. 
indexterm:[control-port,network] indexterm:[network,Property,control-port] |========================================================= [[s-resource-bundle-note-replica-names]] [NOTE] ==== If +ip-range-start+ is used, Pacemaker will automatically ensure that +/etc/hosts+ inside the containers has entries for each replica and its assigned IP. Replicas are named by the bundle id plus a dash and an integer counter starting with zero. For example, if a bundle named +httpd-bundle+ has +replicas=2+, its containers will be named +httpd-bundle-0+ and +httpd-bundle-1+. ==== Additionally, a +<network>+ element may optionally contain one or more +<port-mapping>+ elements. indexterm:[bundle,network,port-mapping] .Properties of a Bundle's Port-Mapping Element [width="95%",cols="2m,1,4<",options="header",align="center"] |========================================================= |Field |Default |Description |id | |A unique name for the port mapping (required) indexterm:[id,port-mapping] indexterm:[port-mapping,Property,id] |port | |If this is specified, connections to this TCP port number on the host network (on the container's assigned IP address, if +ip-range-start+ is specified) will be forwarded to the container network. Exactly one of +port+ or +range+ must be specified in a +port-mapping+. indexterm:[port,port-mapping] indexterm:[port-mapping,Property,port] |internal-port |value of +port+ |If +port+ and this are specified, connections to +port+ on the host's network will be forwarded to this port on the container network. indexterm:[internal-port,port-mapping] indexterm:[port-mapping,Property,internal-port] |range | |If this is specified, connections to these TCP port numbers (expressed as 'first_port'-'last_port') on the host network (on the container's assigned IP address, if +ip-range-start+ is specified) will be forwarded to the same ports in the container network. Exactly one of +port+ or +range+ must be specified in a +port-mapping+. 
indexterm:[range,port-mapping] indexterm:[port-mapping,Property,range] |========================================================= [NOTE] ==== If the bundle contains a +primitive+, Pacemaker will automatically map the +control-port+, so it is not necessary to specify that port in a +port-mapping+. ==== === Bundle Storage Properties === A bundle may optionally contain one +<storage>+ element. A +<storage>+ element has no properties of its own, but may contain one or more +<storage-mapping>+ elements. indexterm:[bundle,storage,storage-mapping] .Properties of a Bundle's Storage-Mapping Element [width="95%",cols="2m,1,4<",options="header",align="center"] |========================================================= |Field |Default |Description |id | |A unique name for the storage mapping (required) indexterm:[id,storage-mapping] indexterm:[storage-mapping,Property,id] |source-dir | |The absolute path on the host's filesystem that will be mapped into the container. Exactly one of +source-dir+ and +source-dir-root+ must be specified in a +storage-mapping+. indexterm:[source-dir,storage-mapping] indexterm:[storage-mapping,Property,source-dir] |source-dir-root | |The start of a path on the host's filesystem that will be mapped into the container, using a different subdirectory on the host for each container instance. The subdirectory will be named the same as the bundle host name, as described in <<s-resource-bundle-note-replica-names>>. Exactly one of +source-dir+ and +source-dir-root+ must be specified in a +storage-mapping+. 
indexterm:[source-dir-root,storage-mapping] indexterm:[storage-mapping,Property,source-dir-root] |target-dir | |The path name within the container where the host storage will be mapped (required) indexterm:[target-dir,storage-mapping] indexterm:[storage-mapping,Property,target-dir] |options | |File system mount options to use when mapping the storage indexterm:[options,storage-mapping] indexterm:[storage-mapping,Property,options] |========================================================= [NOTE] ==== Pacemaker does not define the behavior if the source directory does not already exist on the host. However, it is expected that the container technology and/or its resource agent will create the source directory in that case. ==== [NOTE] ==== If the bundle contains a +primitive+, Pacemaker will automatically map the equivalent of +source-dir=/etc/pacemaker/authkey target-dir=/etc/pacemaker/authkey+ and +source-dir-root=/var/log/pacemaker/bundles target-dir=/var/log+ into the container, so it is not necessary to specify those paths in a +storage-mapping+. ==== [IMPORTANT] ==== The +PCMK_authkey_location+ environment variable must not be set to anything other than the default of `/etc/pacemaker/authkey` on any node in the cluster. ==== === Bundle Primitive === A bundle may optionally contain one +<primitive>+ resource (see <>). The primitive may have operations, instance attributes and meta-attributes defined, as usual. If a bundle contains a primitive resource, the container image must include the Pacemaker Remote daemon, and at least one of +ip-range-start+ or +control-port+ must be configured in the bundle. Pacemaker will create an implicit +ocf:pacemaker:remote+ resource for the connection, launch Pacemaker Remote within the container, and monitor and manage the primitive resource via Pacemaker Remote. 
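The requirement above can be illustrated with a bundle that relies on +control-port+ rather than +ip-range-start+. The following is a sketch under stated assumptions, not a recommended configuration: the image name, ids, port number and monitor interval are placeholders, and the container is assumed to share the host's network, which is why +replicas-per-host+ is 1.

.A bundle with a primitive reached via a non-default control port (illustrative sketch)
====
[source,XML]
----
<bundle id="httpd-bundle">
   <!-- network="host" is passed to docker run; one replica per host is then required -->
   <docker image="pcmk:http" replicas="2" replicas-per-host="1" network="host"/>
   <!-- assumed free TCP port for Pacemaker Remote inside the container -->
   <network control-port="3122"/>
   <primitive class="ocf" id="httpd" provider="heartbeat" type="apache">
      <operations>
         <op id="httpd-monitor" name="monitor" interval="30s"/>
      </operations>
   </primitive>
</bundle>
----
====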
If the bundle has more than one container instance (replica), the primitive resource will function as an implicit clone (see <>) -- a multistate clone if the bundle has +masters+ greater than zero (see <>). [IMPORTANT] ==== Containers in bundles with a +primitive+ must have an accessible networking environment, so that Pacemaker on the cluster nodes can contact Pacemaker Remote inside the container. For example, the Docker option `--net=none` should not be used with a +primitive+. The default (using a distinct network space inside the container) works in combination with +ip-range-start+. If the Docker option `--net=host` is used (making the container share the host's network space), a unique +control-port+ should be specified for each bundle. Any firewall must allow access to the +control-port+. ==== [[s-bundle-attributes]] === Bundle Node Attributes === If the bundle has a +primitive+, the primitive's resource agent may want to set node attributes such as <>. However, with containers, it is not apparent which node should get the attribute. If the container uses shared storage that is the same no matter which node the container is hosted on, then it is appropriate to use the master score on the bundle node itself. On the other hand, if the container uses storage exported from the underlying host, then it may be more appropriate to use the master score on the underlying host. Since this depends on the particular situation, the +container-attribute-target+ resource meta-attribute allows the user to specify which approach to use. If it is set to +host+, then user-defined node attributes will be checked on the underlying host. If it is anything else, the local node (in this case the bundle node) is used as usual. This only applies to user-defined attributes; the cluster will always check the local node for cluster-defined attributes such as +#uname+. 
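As a sketch of the meta-attribute described above, +container-attribute-target+ could be set on the bundle's primitive as follows. The ids and the choice of the +apache+ agent are assumptions made only for illustration.

.Directing user-defined node attributes to the underlying host (illustrative sketch)
====
[source,XML]
----
<primitive class="ocf" id="httpd" provider="heartbeat" type="apache">
   <meta_attributes id="httpd-meta_attributes">
      <!-- "host": user-defined node attributes are checked on the underlying host -->
      <nvpair id="httpd-container-attribute-target"
              name="container-attribute-target" value="host"/>
   </meta_attributes>
</primitive>
----
====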
If +container-attribute-target+ is +host+, the cluster will pass additional environment variables to the primitive's resource agent that allow it to set node attributes appropriately: +container_attribute_target+ (identical to the meta-attribute value) and +physical_host+ (the name of the underlying host). [NOTE] ==== It is up to the resource agent to check for the additional variables and use them when setting node attributes. ==== === Bundle Meta-Attributes === Any meta-attribute set on a bundle will be inherited by the bundle's primitive and any resources implicitly created by Pacemaker for the bundle. This includes options such as +priority+, +target-role+, and +is-managed+. See <> for more information. === Limitations of Bundles === Restarting pacemaker while a bundle is unmanaged or the cluster is in maintenance mode may cause the bundle to fail. Bundles may not be cloned or included in groups. This includes the bundle's primitive and any resources implicitly created by Pacemaker for the bundle. Bundles do not have instance attributes, utilization attributes, or operations, though a bundle's primitive may have them. A bundle with a primitive can run on a Pacemaker Remote node only if the bundle uses a distinct +control-port+. diff --git a/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt b/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt index d3d51deff6..9a722c3f7b 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Alerts.txt @@ -1,406 +1,404 @@ = Alerts = //// We prefer [[ch-alerts]], but older versions of asciidoc don't deal well with that construct for chapter headings //// anchor:ch-alerts[Chapter 7, Alerts] indexterm:[Resource,Alerts] -'Alerts' (available since Pacemaker 1.1.15) may be configured to take some -external action when a cluster event occurs (node failure, resource starting or -stopping, etc.). 
+'Alerts' may be configured to take some external action when a cluster event +occurs (node failure, resource starting or stopping, etc.). == Alert Agents == As with resource agents, the cluster calls an external program (an 'alert agent') to handle alerts. The cluster passes information about the event to the agent via environment variables. Agents can do anything desired with this information (send an e-mail, log to a file, update a monitoring system, etc.). .Simple alert configuration ===== [source,XML] ----- ----- ===== In the example above, the cluster will call +my-script.sh+ for each event. Multiple alert agents may be configured; the cluster will call all of them for each event. Alert agents will be called only on cluster nodes. They will be called for events involving Pacemaker Remote nodes, but they will never be called _on_ those nodes. == Alert Recipients == Usually alerts are directed towards a recipient. Thus each alert may be additionally configured with one or more recipients. The cluster will call the agent separately for each recipient. .Alert configuration with recipient ===== [source,XML] ----- ----- ===== In the above example, the cluster will call +my-script.sh+ for each event, passing the recipient +some-address+ as an environment variable. The recipient may be anything the alert agent can recognize -- an IP address, an e-mail address, a file name, whatever the particular agent supports. == Alert Meta-Attributes == As with resource agents, meta-attributes can be configured for alert agents to affect how Pacemaker calls them. .Meta-Attributes of an Alert [width="95%",cols="m,1,2 ----- ===== In the above example, the +my-script.sh+ will get called twice for each event, with each call using a 15-second timeout. One call will be passed the recipient +someuser@example.com+ and a timestamp in the format +%D %H:%M+, while the other call will be passed the recipient +otheruser@example.com+ and a timestamp in the format +%c+. 
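A configuration matching that description could be sketched as follows. The 15-second timeout, the two recipients and the two timestamp formats come from the description above; the ids and the script path are assumed placeholders.

.Alert with per-alert and per-recipient meta-attributes (illustrative sketch)
=====
[source,XML]
-----
<configuration>
    <alerts>
        <alert id="my-alert" path="/path/to/my-script.sh">
            <meta_attributes id="my-alert-attributes">
                <nvpair id="my-alert-attributes-timeout" name="timeout"
                    value="15s"/>
            </meta_attributes>
            <recipient id="my-alert-recipient1" value="someuser@example.com">
                <meta_attributes id="my-alert-recipient1-attributes">
                    <nvpair id="my-alert-recipient1-timestamp-format"
                        name="timestamp-format" value="%D %H:%M"/>
                </meta_attributes>
            </recipient>
            <recipient id="my-alert-recipient2" value="otheruser@example.com">
                <meta_attributes id="my-alert-recipient2-attributes">
                    <nvpair id="my-alert-recipient2-timestamp-format"
                        name="timestamp-format" value="%c"/>
                </meta_attributes>
            </recipient>
        </alert>
    </alerts>
</configuration>
-----
=====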
== Alert Instance Attributes == As with resource agents, agent-specific configuration values may be configured as instance attributes. These will be passed to the agent as additional environment variables. The number, names and allowed values of these instance attributes are completely up to the particular agent. .Alert configuration with instance attributes ===== [source,XML] ----- ----- ===== == Alert Filters == By default, an alert agent will be called for node events, fencing events, and resource events. An agent may choose to ignore certain types of events, but there is still the overhead of calling it for those events. To eliminate that -overhead, you may select which types of events the agent should receive -(since version 1.1.18). +overhead, you may select which types of events the agent should receive. .Alert configuration to receive only node events and fencing events ===== [source,XML] ----- ----- ===== The possible options within + ----- ===== Node attribute alerts are currently considered experimental. Alerts may be limited to attributes set via attrd_updater, and agents may be called multiple times with the same attribute value. == Using the Sample Alert Agents == Pacemaker provides several sample alert agents, installed in +/usr/share/pacemaker/alerts+ by default. While these sample scripts may be copied and used as-is, they are provided mainly as templates to be edited to suit your purposes. See their source code for the full set of instance attributes they support. .Sending cluster events as SNMP traps ===== [source,XML] ----- ----- ===== .Sending cluster events as e-mails ===== [source,XML] ----- ----- ===== == Writing an Alert Agent == .Environment variables passed to alert agents [width="95%",cols="m,2 ------- ====== The empty configuration above contains the major sections that make up a CIB: * +cib+: The entire CIB is enclosed with a +cib+ tag. Certain fundamental settings are defined as attributes of this tag. 
** +configuration+: This section -- the primary focus of this document -- contains traditional configuration information such as what resources the cluster serves and the relationships among them. *** +crm_config+: cluster-wide configuration options *** +nodes+: the machines that host the cluster *** +resources+: the services run by the cluster *** +constraints+: indications of how resources should be placed ** +status+: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local resource manager (lrmd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way. In this document, configuration settings will be described as 'properties' or 'options' based on how they are defined in the CIB: * Properties are XML attributes of an XML element. * Options are name-value pairs expressed as +nvpair+ child elements of an XML element. Normally you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak. == The Current State of the Cluster == Before one starts to configure a cluster, it is worth explaining how to view the finished product. For this purpose we have created the `crm_mon` utility, which will display the current state of an active cluster. It can show the cluster status by node or by resource and can be used in either single-shot or dynamically-updating mode. There are also modes for displaying a list of the operations performed (grouped by node and resource) as well as information about failures. Using this tool, you can examine the state of the cluster for irregularities and see how it responds when you cause or simulate failures. 
Details on all the available options can be obtained using the `crm_mon --help` command. .Sample output from crm_mon ====== ------- ============ Last updated: Fri Nov 23 15:26:13 2007 Current DC: sles-3 (2298606a-6a8c-499a-9d25-76242f7006ec) 3 Nodes configured. 5 Resources configured. ============ Node: sles-1 (1186dc9a-324d-425a-966e-d757e693dc86): online 192.168.100.181 (heartbeat::ocf:IPaddr): Started sles-1 192.168.100.182 (heartbeat:IPaddr): Started sles-1 192.168.100.183 (heartbeat::ocf:IPaddr): Started sles-1 rsc_sles-1 (heartbeat::ocf:IPaddr): Started sles-1 child_DoFencing:2 (stonith:external/vmware): Started sles-1 Node: sles-2 (02fb99a8-e30e-482f-b3ad-0fb3ce27d088): standby Node: sles-3 (2298606a-6a8c-499a-9d25-76242f7006ec): online rsc_sles-2 (heartbeat::ocf:IPaddr): Started sles-3 rsc_sles-3 (heartbeat::ocf:IPaddr): Started sles-3 child_DoFencing:0 (stonith:external/vmware): Started sles-3 ------- ====== .Sample output from crm_mon -n ====== ------- ============ Last updated: Fri Nov 23 15:26:13 2007 Current DC: sles-3 (2298606a-6a8c-499a-9d25-76242f7006ec) 3 Nodes configured. 5 Resources configured. 
============ Node: sles-1 (1186dc9a-324d-425a-966e-d757e693dc86): online Node: sles-2 (02fb99a8-e30e-482f-b3ad-0fb3ce27d088): standby Node: sles-3 (2298606a-6a8c-499a-9d25-76242f7006ec): online Resource Group: group-1 192.168.100.181 (heartbeat::ocf:IPaddr): Started sles-1 192.168.100.182 (heartbeat:IPaddr): Started sles-1 192.168.100.183 (heartbeat::ocf:IPaddr): Started sles-1 rsc_sles-1 (heartbeat::ocf:IPaddr): Started sles-1 rsc_sles-2 (heartbeat::ocf:IPaddr): Started sles-3 rsc_sles-3 (heartbeat::ocf:IPaddr): Started sles-3 Clone Set: DoFencing child_DoFencing:0 (stonith:external/vmware): Started sles-3 child_DoFencing:1 (stonith:external/vmware): Stopped child_DoFencing:2 (stonith:external/vmware): Started sles-1 ------- ====== The DC (Designated Controller) node is where all the decisions are made, and if the current DC fails a new one is elected from the remaining cluster nodes. The choice of DC is of no significance to an administrator beyond the fact that its logs will generally be more interesting. == How Should the Configuration be Updated? == There are three basic rules for updating the cluster configuration: * Rule 1 - Never edit the +cib.xml+ file manually. Ever. I'm not making this up. * Rule 2 - Read Rule 1 again. * Rule 3 - The cluster will notice if you ignored rules 1 & 2 and refuse to use the configuration. Now that it is clear how 'not' to update the configuration, we can begin to explain how you 'should'. === Editing the CIB Using XML === The most powerful tool for modifying the configuration is the +cibadmin+ command. With +cibadmin+, you can query, add, remove, update or replace any part of the configuration. All changes take effect immediately, so there is no need to perform a reload-like operation. The simplest way of using `cibadmin` is to use it to save the current configuration to a temporary file, edit that file with your favorite text or XML editor, and then upload the revised configuration. 
footnote:[This process might appear to risk overwriting changes that happen after the initial cibadmin call, but pacemaker will reject any update that is "too old". If the CIB is updated in some other fashion after the initial cibadmin, the second cibadmin will be rejected because the version number will be too low.] .Safely using an editor to modify the cluster configuration ====== -------- # cibadmin --query > tmp.xml # vi tmp.xml # cibadmin --replace --xml-file tmp.xml -------- ====== Some of the better XML editors can make use of a Relax NG schema to help make sure any changes you make are valid. The schema describing the configuration can be found in +pacemaker.rng+, which may be deployed in a location such as +/usr/share/pacemaker+ or +/usr/lib/heartbeat+ depending on your operating system and how you installed the software. If you want to modify just one section of the configuration, you can query and replace just that section to avoid modifying any others. .Safely using an editor to modify only the resources section ====== -------- # cibadmin --query --scope resources > tmp.xml # vi tmp.xml # cibadmin --replace --scope resources --xml-file tmp.xml -------- ====== === Quickly Deleting Part of the Configuration === Identify the object you wish to delete by XML tag and id. For example, you might search the CIB for all STONITH-related configuration: .Searching for STONITH-related configuration items ====== ---- # cibadmin -Q | grep stonith ---- ====== If you wanted to delete the +primitive+ tag with id +child_DoFencing+, you would run: ---- # cibadmin --delete --xml-text '<primitive id="child_DoFencing"/>' ---- === Updating the Configuration Without Using XML === Most tasks can be performed with one of the other command-line tools provided with pacemaker, avoiding the need to read or edit XML. 
To enable STONITH for example, one could run: ---- # crm_attribute --name stonith-enabled --update 1 ---- Or, to check whether *somenode* is allowed to run resources, there is: ---- -# crm_standby --get-value --node somenode +# crm_standby --query --node somenode ---- Or, to find the current location of *my-test-rsc*, one can use: ---- # crm_resource --locate --resource my-test-rsc ---- Examples of using these tools for specific cases will be given throughout this document where appropriate. -[NOTE] -==== -Old versions of pacemaker (1.0.3 and earlier) had different -command-line tool syntax. If you are using an older version, -check your installed manual pages for the proper syntax to use. -==== - [[s-config-sandboxes]] == Making Configuration Changes in a Sandbox == Often it is desirable to preview the effects of a series of changes before updating the configuration all at once. For this purpose, we have created `crm_shadow` which creates a "shadow" copy of the configuration and arranges for all the command line tools to use it. To begin, simply invoke `crm_shadow --create` with the name of a configuration to create footnote:[Shadow copies are identified with a name, making it possible to have more than one.], and follow the simple on-screen instructions. [WARNING] ==== Read this section and the on-screen instructions carefully; failure to do so could result in destroying the cluster's active configuration! ==== .Creating and displaying the active sandbox ====== ---- # crm_shadow --create test Setting up shadow instance Type Ctrl-D to exit the crm_shadow shell shadow[test]: shadow[test] # crm_shadow --which test ---- ====== From this point on, all cluster commands will automatically use the shadow copy instead of talking to the cluster's active configuration. Once you have finished experimenting, you can either make the changes active via the `--commit` option, or discard them using the `--delete` option. 
Again, be sure to follow the on-screen instructions carefully! For a full list of `crm_shadow` options and commands, invoke it with the `--help` option. .Use sandbox to make multiple changes all at once, discard them, and verify real configuration is untouched ====== ---- shadow[test] # crm_failcount -r rsc_c001n01 -G scope=status name=fail-count-rsc_c001n01 value=0 shadow[test] # crm_standby --node c001n02 -v on shadow[test] # crm_standby --node c001n02 -G scope=nodes name=standby value=on shadow[test] # cibadmin --erase --force shadow[test] # cibadmin --query - - - - - - - - + + + + + + + + shadow[test] # crm_shadow --delete test --force Now type Ctrl-D to exit the crm_shadow shell shadow[test] # exit # crm_shadow --which No active shadow configuration defined # cibadmin -Q - + ---- ====== [[s-config-testing-changes]] == Testing Your Configuration Changes == We saw previously how to make a series of changes to a "shadow" copy of the configuration. Before loading the changes back into the cluster (e.g. `crm_shadow --commit mytest --force`), it is often advisable to simulate the effect of the changes with +crm_simulate+. For example: ---- # crm_simulate --live-check -VVVVV --save-graph tmp.graph --save-dotfile tmp.dot ---- This tool uses the same library as the live cluster to show what it would have done given the supplied input. Its output, in addition to a significant amount of logging, is stored in two files +tmp.graph+ and +tmp.dot+. Both files are representations of the same thing: the cluster's response to your changes. The graph file stores the complete transition from the existing cluster state to your desired new state, containing a list of all the actions, their parameters and their pre-requisites. Because the transition graph is not terribly easy to read, the tool also generates a Graphviz footnote:[Graph visualization software. See http://www.graphviz.org/ for details.] dot-file representing the same information. 
For information on the options supported by `crm_simulate`, use its `--help` option. .Interpreting the Graphviz output * Arrows indicate ordering dependencies * Dashed arrows indicate dependencies that are not present in the transition graph * Actions with a dashed border of any color do not form part of the transition graph * Actions with a green border form part of the transition graph * Actions with a red border are ones the cluster would like to execute but cannot run * Actions with a blue border are ones the cluster does not feel need to be executed * Actions with orange text are pseudo/pretend actions that the cluster uses to simplify the graph * Actions with black text are sent to the LRM * Resource actions have text of the form pass:[rsc]_pass:[action]_pass:[interval] pass:[node] * Any action depending on an action with a red border will not be able to execute. * Loops are _really_ bad. Please report them to the development team. === Small Cluster Transition === image::images/Policy-Engine-small.png["An example transition graph as represented by Graphviz",width="16cm",height="6cm",align="center"] In the above example, it appears that a new node, *pcmk-2*, has come online and that the cluster is checking to make sure *rsc1*, *rsc2* and *rsc3* are not already running there (Indicated by the *rscN_monitor_0* entries). Once it did that, and assuming the resources were not active there, it would have liked to stop *rsc1* and *rsc2* on *pcmk-1* and move them to *pcmk-2*. However, there appears to be some problem and the cluster cannot or is not permitted to perform the stop actions which implies it also cannot perform the start actions. For some reason the cluster does not want to start *rsc3* anywhere. 
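As a rough illustration of the label format described in the list above (resource, action, interval, then node), the following Python sketch splits such a label into its parts. The names `ResourceAction` and `parse_action_label` are hypothetical and not part of Pacemaker. The sketch assumes the interval is an integer and the action name contains no underscore, so it would mis-split actions such as `migrate_to`.

```python
from typing import NamedTuple


class ResourceAction(NamedTuple):
    """Hypothetical container for the parts of a resource action label."""
    rsc: str
    action: str
    interval: int
    node: str


def parse_action_label(label: str) -> ResourceAction:
    """Split a label of the form 'rsc_action_interval node'."""
    # The node name follows the last space in the label.
    name, node = label.rsplit(" ", 1)
    # Split from the right so resource names may contain underscores.
    # Assumption: the action name itself contains no underscore.
    rsc, action, interval = name.rsplit("_", 2)
    return ResourceAction(rsc, action, int(interval), node)


print(parse_action_label("rsc1_monitor_0 pcmk-2"))
```

With the example from this section, `rsc1_monitor_0 pcmk-2` is read as a one-time (interval 0) monitor of *rsc1* on *pcmk-2*, which is the probe the text describes.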
=== Complex Cluster Transition === image::images/Policy-Engine-big.png["Another, slightly more complex, transition graph that you're not expected to be able to read",width="16cm",height="20cm",align="center"] == Do I Need to Update the Configuration on All Cluster Nodes? == No. Any changes are immediately synchronized to the other active members of the cluster. To reduce bandwidth, the cluster only broadcasts the incremental updates that result from your changes and uses MD5 checksums to ensure that each copy is completely consistent. diff --git a/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt b/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt index 8296577ec3..c7ed30e59f 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt @@ -1,878 +1,878 @@ = Resource Constraints = indexterm:[Resource,Constraints] == Scores == Scores of all kinds are integral to how the cluster works. Practically everything from moving a resource to deciding which resource to stop in a degraded cluster is achieved by manipulating scores in some way. Scores are calculated per resource and node. Any node with a negative score for a resource can't run that resource. The cluster places a resource on the node with the highest score for it. === Infinity Math === Pacemaker implements +INFINITY+ (or equivalently, ++INFINITY+) internally as a score of 1,000,000. Addition and subtraction with it follow these three basic rules: * Any value + +INFINITY+ = +INFINITY+ * Any value - +INFINITY+ = +-INFINITY+ * +INFINITY+ - +INFINITY+ = +-INFINITY+ [NOTE] ====== What if you want to use a score higher than 1,000,000? Typically this possibility arises when someone wants to base the score on some external metric that might go above 1,000,000. The short answer is you can't. The long answer is that it is sometimes possible to work around this limitation creatively. 
You may be able to set the score to some computed value based on the external metric rather than use the metric directly. For nodes, you can store the metric as a node attribute, and query the attribute when computing the score (possibly as part of a custom resource agent). ====== == Deciding Which Nodes a Resource Can Run On == indexterm:[Location Constraints] indexterm:[Resource,Constraints,Location] 'Location constraints' tell the cluster which nodes a resource can run on. There are two alternative strategies. One way is to say that, by default, resources can run anywhere, and then the location constraints specify nodes that are not allowed (an 'opt-out' cluster). The other way is to start with nothing able to run anywhere, and use location constraints to selectively enable allowed nodes (an 'opt-in' cluster). Whether you should choose opt-in or opt-out depends on your personal preference and the make-up of your cluster. If most of your resources can run on most of the nodes, then an opt-out arrangement is likely to result in a simpler configuration. On the other-hand, if most resources can only run on a small subset of nodes, an opt-in configuration might be simpler. === Location Properties === .Properties of a rsc_location Constraint [width="95%",cols="2m,1,5>), the submatches can be referenced as +%0+ through +%9+ in the rule's - +score-attribute+ or a rule expression's +attribute+ '(since 1.1.16)' + +score-attribute+ or a rule expression's +attribute+ indexterm:[rsc-pattern,Location Constraints] indexterm:[Constraints,Location,rsc-pattern] |node | |A node's name indexterm:[node,Location Constraints] indexterm:[Constraints,Location,node] |score | |Positive values indicate a preference for running the affected resource(s) on this node -- the higher the value, the stronger the preference. Negative values indicate the resource(s) should avoid this node (a value of +-INFINITY+ changes "should" to "must"). 
indexterm:[score,Location Constraints] indexterm:[Constraints,Location,score] |resource-discovery |always |Whether Pacemaker should perform resource discovery (that is, check whether the resource is already running) for this resource on this node. This should normally be left as the default, so that rogue instances of a service can be stopped when they are running where they are not supposed to be. However, there are two situations where disabling resource discovery is a good idea: when a service is not installed on a node, discovery might return an error (properly written OCF agents will not, so this is usually only seen with other agent types); and when Pacemaker Remote is used to scale a cluster to hundreds of nodes, limiting resource discovery to allowed nodes can significantly boost - performance. '(since 1.1.13)' + performance. * +always:+ Always perform resource discovery for the specified resource on this node. * +never:+ Never perform resource discovery for the specified resource on this node. This option should generally be used with a -INFINITY score, although that is not strictly required. * +exclusive:+ Perform resource discovery for the specified resource only on this node (and other nodes similarly marked as +exclusive+). Multiple location constraints using +exclusive+ discovery for the same resource across different nodes creates a subset of nodes resource-discovery is exclusive to. If a resource is marked for +exclusive+ discovery on one or more nodes, that resource is only allowed to be placed within that subset of nodes. indexterm:[Resource Discovery,Location Constraints] indexterm:[Constraints,Location,Resource Discovery] |========================================================= [WARNING] ========= Setting resource-discovery to +never+ or +exclusive+ removes Pacemaker's ability to detect and stop unwanted instances of a service running where it's not supposed to be. It is up to the system administrator (you!) 
to make sure that the service can 'never' be active on nodes without resource-discovery (such as by leaving the relevant software uninstalled). ========= === Asymmetrical "Opt-In" Clusters === indexterm:[Asymmetrical Opt-In Clusters] indexterm:[Cluster Type,Asymmetrical Opt-In] To create an opt-in cluster, start by preventing resources from running anywhere by default: ---- # crm_attribute --name symmetric-cluster --update false ---- Then start enabling nodes. The following fragment says that the web server prefers *sles-1*, the database prefers *sles-2* and both can fail over to *sles-3* if their most preferred node fails. .Opt-in location constraints for two resources ====== [source,XML] ------- ------- ====== === Symmetrical "Opt-Out" Clusters === indexterm:[Symmetrical Opt-Out Clusters] indexterm:[Cluster Type,Symmetrical Opt-Out] To create an opt-out cluster, start by allowing resources to run anywhere by default: ---- # crm_attribute --name symmetric-cluster --update true ---- Then start disabling nodes. The following fragment is the equivalent of the above opt-in configuration. .Opt-out location constraints for two resources ====== [source,XML] ------- ------- ====== [[node-score-equal]] === What if Two Nodes Have the Same Score === If two nodes have the same score, then the cluster will choose one. This choice may seem random and may not be what was intended, however the cluster was not given enough information to know any better. .Constraints where a resource prefers two nodes equally ====== [source,XML] ------- ------- ====== In the example above, assuming no other constraints and an inactive cluster, +Webserver+ would probably be placed on +sles-1+ and +Database+ on +sles-2+. It would likely have placed +Webserver+ based on the node's uname and +Database+ based on the desire to spread the resource load evenly across the cluster. However other factors can also be involved in more complex configurations. 
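The scoring rules from earlier in this chapter (the three Infinity Math rules, negative scores excluding a node, and the highest score winning) can be sketched in a few lines of Python. This is only an illustration, not Pacemaker's implementation: the function names and the sample scores are hypothetical, and the clamping of finite sums is an assumption. Because the text only says the cluster "will choose one" when scores are equal, the sketch returns every tied node rather than guessing at the tie-break.

```python
INFINITY = 1_000_000  # the internal value the text gives for INFINITY


def add_scores(a: int, b: int) -> int:
    """Add two scores following the three Infinity Math rules."""
    # -INFINITY dominates, which also gives INFINITY - INFINITY = -INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # Any other value + INFINITY = INFINITY.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Assumption: finite sums are clamped to the +/-INFINITY range.
    return max(-INFINITY, min(INFINITY, a + b))


def best_nodes(scores: dict) -> list:
    """Return every node tied for the highest non-negative score."""
    # A node with a negative score cannot run the resource.
    allowed = {node: s for node, s in scores.items() if s >= 0}
    if not allowed:
        return []
    top = max(allowed.values())
    return sorted(node for node, s in allowed.items() if s == top)


# Hypothetical scores: two nodes preferred equally, one fallback node.
print(best_nodes({"sles-1": 200, "sles-2": 200, "sles-3": 0}))
print(add_scores(INFINITY, -INFINITY) == -INFINITY)
```

When `best_nodes` returns more than one name, the configuration has not given the cluster enough information to prefer one node, which is the situation this section describes.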
[[s-resource-ordering]] == Specifying the Order in which Resources Should Start/Stop == indexterm:[Resource,Constraints,Ordering] indexterm:[Resource,Start Order] indexterm:[Ordering Constraints] 'Ordering constraints' tell the cluster the order in which resources should start. [IMPORTANT] ==== Ordering constraints affect 'only' the ordering of resources; they do 'not' require that the resources be placed on the same node. If you want resources to be started on the same node 'and' in a specific order, you need both an ordering constraint 'and' a colocation constraint (see <>), or alternatively, a group (see <>). ==== === Ordering Properties === .Properties of a rsc_order Constraint [width="95%",cols="1m,1,4> resources. === Optional and mandatory ordering === Here is an example of ordering constraints where +Database+ 'must' start before +Webserver+, and +IP+ 'should' start before +Webserver+ if they both need to be started: .Optional and mandatory ordering constraints ====== [source,XML] ------- ------- ====== Because the above example lets +symmetrical+ default to TRUE, +Webserver+ must be stopped before +Database+ can be stopped, and +Webserver+ should be stopped before +IP+ if they both need to be stopped. [[s-resource-colocation]] == Placing Resources Relative to other Resources == indexterm:[Resource,Constraints,Colocation] indexterm:[Resource,Location Relative to other Resources] 'Colocation constraints' tell the cluster that the location of one resource depends on the location of another one. Colocation has an important side-effect: it affects the order in which resources are assigned to a node. Think about it: You can't place A relative to B unless you know where B is. footnote:[ While the human brain is sophisticated enough to read the constraint in any order and choose the correct one depending on the situation, the cluster is not quite so smart. Yet. 
] So when you are creating colocation constraints, it is important to consider whether you should colocate A with B, or B with A. Another thing to keep in mind is that, assuming A is colocated with B, the cluster will take into account A's preferences when deciding which node to choose for B. For a detailed look at exactly how this occurs, see http://clusterlabs.org/doc/Colocation_Explained.pdf[Colocation Explained]. [IMPORTANT] ==== Colocation constraints affect 'only' the placement of resources; they do 'not' require that the resources be started in a particular order. If you want resources to be started on the same node 'and' in a specific order, you need both an ordering constraint (see <>) 'and' a colocation constraint, or alternatively, a group (see <>). ==== === Colocation Properties === .Properties of a rsc_colocation Constraint [width="95%",cols="1m,1,4<",options="header",align="center"] |========================================================= |Field |Default |Description |id | |A unique name for the constraint (required). indexterm:[id,Colocation Constraints] indexterm:[Constraints,Colocation,id] |rsc | |The name of a resource that should be located relative to +with-rsc+ (required). indexterm:[rsc,Colocation Constraints] indexterm:[Constraints,Colocation,rsc] |with-rsc | |The name of the resource used as the colocation target. The cluster will decide where to put this resource first and then decide where to put +rsc+ (required). indexterm:[with-rsc,Colocation Constraints] indexterm:[Constraints,Colocation,with-rsc] |node-attribute |#uname |The node attribute that must be the same on the node running +rsc+ and the node running +with-rsc+ for the constraint to be satisfied. (For details, see <>.) indexterm:[node-attribute,Colocation Constraints] indexterm:[Constraints,Colocation,node-attribute] |score | |Positive values indicate the resources should run on the same node. Negative values indicate the resources should run on different nodes. 
Values of \+/- +INFINITY+ change "should" to "must". indexterm:[score,Colocation Constraints] indexterm:[Constraints,Colocation,score] |========================================================= === Mandatory Placement === Mandatory placement occurs when the constraint's score is ++INFINITY+ or +-INFINITY+. In such cases, if the constraint can't be satisfied, then the +rsc+ resource is not permitted to run. For +score=INFINITY+, this includes cases where the +with-rsc+ resource is not active. If you need resource +A+ to always run on the same machine as resource +B+, you would add the following constraint: .Mandatory colocation constraint for two resources ==== [source,XML] ==== Remember, because +INFINITY+ was used, if +B+ can't run on any of the cluster nodes (for whatever reason) then +A+ will not be allowed to run. Whether +A+ is running or not has no effect on +B+. Alternatively, you may want the opposite -- that +A+ 'cannot' run on the same machine as +B+. In this case, use +score="-INFINITY"+. .Mandatory anti-colocation constraint for two resources ==== [source,XML] ==== Again, by specifying +-INFINITY+, the constraint is binding. So if the only place left to run is where +B+ already is, then +A+ may not run anywhere. As with +INFINITY+, +B+ can run even if +A+ is stopped. However, in this case +A+ also can run if +B+ is stopped, because it still meets the constraint of +A+ and +B+ not running on the same node. === Advisory Placement === If mandatory placement is about "must" and "must not", then advisory placement is the "I'd prefer if" alternative. For constraints with scores greater than +-INFINITY+ and less than +INFINITY+, the cluster will try to accommodate your wishes but may ignore them if the alternative is to stop some of the cluster resources. As in life, where if enough people prefer something it effectively becomes mandatory, advisory colocation constraints can combine with other elements of the configuration to behave as if they were mandatory. 
.Advisory colocation constraint for two resources ==== [source,XML] ==== [[s-coloc-attribute]] === Colocation by Node Attribute === -The +node+attribute+ property of a colocation constraints allows you to express +The +node-attribute+ property of a colocation constraints allows you to express the requirement, "these resources must be on similar nodes". As an example, imagine that you have two Storage Area Networks (SANs) that are not controlled by the cluster, and each node is connected to one or the other. You may have two resources +r1+ and +r2+ such that +r2+ needs to use the same SAN as +r1+, but doesn't necessarily have to be on the same exact node. In such a case, you could define a <> named +san+, with the value +san1+ or +san2+ on each node as appropriate. Then, you could colocate +r2+ with +r1+ using +node-attribute+ set to +san+. [[s-resource-sets]] == Resource Sets == 'Resource sets' allow multiple resources to be affected by a single constraint. .A set of 3 resources ==== [source,XML] ---- ---- ==== Resource sets are valid inside +rsc_location+, +rsc_order+ (see <>), +rsc_colocation+ (see <>), and +rsc_ticket+ (see <>) constraints. A resource set has a number of properties that can be set, though not all have an effect in all contexts. .Properties of a resource_set [width="95%",cols="2m,1,5 ------- ====== .Visual representation of the four resources' start order for the above constraints image::images/resource-set.png["Ordered set",width="16cm",height="2.5cm",align="center"] === Ordered Set === To simplify this situation, resource sets (see <>) can be used within ordering constraints: .A chain of ordered resources expressed as a set ====== [source,XML] ------- ------- ====== While the set-based format is not less verbose, it is significantly easier to get right and maintain. [IMPORTANT] ========= If you use a higher-level tool, pay attention to how it exposes this functionality. 
Depending on the tool, creating a set +A B+ may be equivalent to +A then B+, or +B then A+. ========= === Ordering Multiple Sets === The syntax can be expanded to allow sets of resources to be ordered relative to each other, where the members of each individual set may be ordered or unordered (controlled by the +sequential+ property). In the example below, +A+ and +B+ can both start in parallel, as can +C+ and +D+, however +C+ and +D+ can only start once _both_ +A+ _and_ +B+ are active. .Ordered sets of unordered resources ====== [source,XML] ------- ------- ====== .Visual representation of the start order for two ordered sets of unordered resources image::images/two-sets.png["Two ordered sets",width="13cm",height="7.5cm",align="center"] Of course either set -- or both sets -- of resources can also be internally ordered (by setting +sequential="true"+) and there is no limit to the number of sets that can be specified. .Advanced use of set ordering - Three ordered sets, two of which are internally unordered ====== [source,XML] ------- ------- ====== .Visual representation of the start order for the three sets defined above image::images/three-sets.png["Three ordered sets",width="16cm",height="7.5cm",align="center"] [IMPORTANT] ==== An ordered set with +sequential=false+ makes sense only if there is another set in the constraint. Otherwise, the constraint has no effect. ==== === Resource Set OR Logic === The unordered set logic discussed so far has all been "AND" logic. To illustrate this take the 3 resource set figure in the previous section. Those sets can be expressed, +(A and B) then \(C) then (D) then (E and F)+. Say for example we want to change the first set, +(A and B)+, to use "OR" logic so the sets look like this: +(A or B) then \(C) then (D) then (E and F)+. This functionality can be achieved through the use of the +require-all+ option. This option defaults to TRUE which is why the "AND" logic is used by default. 
Setting +require-all=false+ means only one resource in the set needs to be started before continuing on to the next set. .Resource Set "OR" logic: Three ordered sets, where the first set is internally unordered with "OR" logic ====== [source,XML] ------- ------- ====== [IMPORTANT] ==== An ordered set with +require-all=false+ makes sense only in conjunction with +sequential=false+. Think of it like this: +sequential=false+ modifies the set to be an unordered set using "AND" logic by default, and adding +require-all=false+ flips the unordered set's "AND" logic to "OR" logic. ==== [[s-resource-sets-colocation]] == Colocating Sets of Resources == Another common situation is for an administrator to create a set of colocated resources. One way to do this would be to define a resource group (see <>), but that cannot always accurately express the desired state. Another way would be to define each relationship as an individual constraint, but that causes a constraint explosion as the number of resources and combinations grow. An example of this approach: .Chain of colocated resources ====== [source,XML] ------- ------- ====== To make things easier, resource sets (see <>) can be used within colocation constraints. As with the chained version, a resource that can't be active prevents any resource that must be colocated with it from being active. For example, if +B+ is not able to run, then both +C+ and by inference +D+ must also remain stopped. Here is an example +resource_set+: .Equivalent colocation chain expressed using +resource_set+ ====== [source,XML] ------- ------- ====== [IMPORTANT] ========= If you use a higher-level tool, pay attention to how it exposes this functionality. Depending on the tool, creating a set +A B+ may be equivalent to +A with B+, or +B with A+. 
========= This notation can also be used to tell the cluster that sets of resources must be colocated relative to each other, where the individual members of each set may or may not depend on each other being active (controlled by the +sequential+ property). In this example, +A+, +B+, and +C+ will each be colocated with +D+. +D+ must be active, but any of +A+, +B+, or +C+ may be inactive without affecting any other resources. .Using colocated sets to specify a common peer ====== [source,XML] ------- ------- ====== [IMPORTANT] ==== A colocated set with +sequential=false+ makes sense only if there is another set in the constraint. Otherwise, the constraint has no effect. ==== There is no inherent limit to the number and size of the sets used. The only thing that matters is that in order for any member of one set in the constraint to be active, all members of sets listed after it must also be active (and naturally on the same node); and if a set has +sequential="true"+, then in order for one member of that set to be active, all members listed before it must also be active. If desired, you can restrict the dependency to instances of multistate resources that are in a specific role, using the set's +role+ property. .Colocation chain in which the members of the middle set have no interdependencies, and the last listed set (which the cluster places first) is restricted to instances in master status. ====== [source,XML] ------- ------- ====== .Visual representation the above example (resources to the left are placed first) image::images/three-sets-complex.png["Colocation chain",width="16cm",height="9cm",align="center"] [NOTE] ==== Pay close attention to the order in which resources and sets are listed. While the colocation dependency for members of any one set is last-to-first, the colocation dependency for multiple sets is first-to-last. In the above example, +B+ is colocated with +A+, but +colocated-set-1+ is colocated with +colocated-set-2+. 
Unlike ordered sets, colocated sets do not use the +require-all+ option. ==== diff --git a/doc/Pacemaker_Explained/en-US/Ch-Nodes.txt b/doc/Pacemaker_Explained/en-US/Ch-Nodes.txt index 376e6a59a7..b55ff83911 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Nodes.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Nodes.txt @@ -1,139 +1,134 @@ = Cluster Nodes = == Defining a Cluster Node == Each node in the cluster will have an entry in the nodes section containing its UUID, uname, and type. .Example Corosync cluster node entry ====== [source,XML] - + ====== In normal circumstances, the admin should let the cluster populate this information automatically from the communications and membership data. [[s-node-name]] == Where Pacemaker Gets the Node Name == Traditionally, Pacemaker required nodes to be referred to by the value returned by `uname -n`. This can be problematic for services that require the `uname -n` to be a specific value (e.g. for a licence file). This requirement has been relaxed for clusters using Corosync 2.0 or later. The name Pacemaker uses is: . The value stored in +corosync.conf+ under *ring0_addr* in the *nodelist*, if it does not contain an IP address; otherwise . The value stored in +corosync.conf+ under *name* in the *nodelist*; otherwise . The value of `uname -n` Pacemaker provides the `crm_node -n` command which displays the name used by a running cluster. If a Corosync *nodelist* is used, `crm_node --name-for-id` pass:[number] is also available to display the name used by the node with the corosync *nodeid* of pass:[number], for example: `crm_node --name-for-id 2`. [[s-node-attributes]] == Node Attributes == indexterm:[Node,attribute] 'Node attributes' are a special type of option (name-value pair) that applies to a node object. Beyond the basic definition of a node, the administrator can describe the node's attributes, such as how much RAM, disk, what OS or kernel version it has, perhaps even its physical location. 
This information can then be used by the cluster when deciding where to place resources. For more information on the use of node attributes, see <>. Node attributes can be specified ahead of time or populated later, when the cluster is running, using `crm_attribute`. Below is what the node's definition would look like if the admin ran the command: .Result of using crm_attribute to specify which kernel pcmk-1 is running ====== ------- # crm_attribute --type nodes --node pcmk-1 --name kernel --update $(uname -r) ------- [source,XML] ------- ------- ====== Rather than having to read the XML, a simpler way to determine the current value of an attribute is to use `crm_attribute` again: ---- # crm_attribute --type nodes --node pcmk-1 --name kernel --query scope=nodes name=kernel value=3.10.0-123.13.2.el7.x86_64 ---- By specifying `--type nodes` the admin tells the cluster that this attribute is persistent. There are also transient attributes which are kept in the status section which are "forgotten" whenever the node rejoins the cluster. The cluster uses this area to store a record of how many times a resource has failed on that node, but administrators can also read and write to this section by specifying `--type status`. == Managing Nodes in a Corosync-Based Cluster == === Adding a New Corosync Node === indexterm:[Corosync,Add Cluster Node] indexterm:[Add Cluster Node,Corosync] To add a new node: . Install Corosync and Pacemaker on the new host. . Copy +/etc/corosync/corosync.conf+ and +/etc/corosync/authkey+ (if it exists) from an existing node. You may need to modify the *mcastaddr* option to match the new node's IP address. . Start the cluster software on the new host. If a log message containing "Invalid digest" appears from Corosync, the keys are not consistent between the machines. 
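When scripting around node attributes, the `crm_attribute --query` output shown under Node Attributes above can be split into fields. The following Python sketch is an assumption-laden illustration: `parse_attribute_line` is a hypothetical helper, and it assumes the output is a single line of whitespace-separated `key=value` pairs whose values contain no spaces, as in the sample shown earlier.

```python
def parse_attribute_line(line: str) -> dict:
    """Turn 'scope=nodes name=kernel value=...' into a dict."""
    fields = {}
    for token in line.split():
        # Split on the first '=' only, so values may contain '='.
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


# Sample line copied from the crm_attribute --query example above.
sample = "scope=nodes name=kernel value=3.10.0-123.13.2.el7.x86_64"
print(parse_attribute_line(sample)["value"])
```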
=== Removing a Corosync Node === indexterm:[Corosync,Remove Cluster Node] indexterm:[Remove Cluster Node,Corosync] Because the messaging and membership layers are the authoritative source for cluster nodes, deleting them from the CIB is not a complete solution. First, one must arrange for corosync to forget about the node (*pcmk-1* in the example below). . Stop the cluster on the host to be removed. How to do this will vary with your operating system and installed versions of cluster software, for example, `pcs cluster stop` if you are using pcs for cluster management. . From one of the remaining active cluster nodes, tell Pacemaker to forget about the removed host, which will also delete the node from the CIB: + ---- # crm_node -R pcmk-1 ---- -[NOTE] -====== -This procedure only works for pacemaker 1.1.8 and later. -====== - === Replacing a Corosync Node === indexterm:[Corosync,Replace Cluster Node] indexterm:[Replace Cluster Node,Corosync] To replace an existing cluster node: . Make sure the old node is completely stopped. . Give the new machine the same hostname and IP address as the old one. . Follow the procedure above for adding a node. diff --git a/doc/Pacemaker_Explained/en-US/Ch-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Options.txt index dde0149325..e12591a486 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Options.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Options.txt @@ -1,450 +1,448 @@ = Cluster-Wide Configuration = == CIB Properties == Certain settings are defined by CIB properties (that is, attributes of the +cib+ tag) rather than with the rest of the cluster configuration in the +configuration+ section. The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location. 
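The version fields described in the table below (+admin_epoch+, +epoch+, +num_updates+) are compared as an ordered tuple when the cluster decides which node has the best configuration. As a minimal sketch, assuming the three values have already been read from each node's +cib+ element, Python's tuple ordering mirrors that comparison. The `CibVersion` type, the `newest` function, and the sample numbers are hypothetical.

```python
from typing import NamedTuple


class CibVersion(NamedTuple):
    """Hypothetical holder for the three CIB version fields."""
    admin_epoch: int
    epoch: int
    num_updates: int


def newest(versions: dict) -> str:
    """Return the node whose (admin_epoch, epoch, num_updates) is highest."""
    # Tuples compare field by field, left to right, so admin_epoch
    # outranks epoch, which in turn outranks num_updates.
    return max(versions, key=lambda node: versions[node])


# Hypothetical version numbers for two nodes.
nodes = {
    "pcmk-1": CibVersion(1, 12, 3),
    "pcmk-2": CibVersion(1, 12, 7),
}
print(newest(nodes))
```

The ordering also shows why raising +admin_epoch+ makes the configurations on inactive nodes obsolete: a higher first field wins regardless of the other two.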
.CIB Properties [width="95%",cols="2m,5<",options="header",align="center"] |========================================================= |Field |Description | admin_epoch | indexterm:[Configuration Version,Cluster] indexterm:[Cluster,Option,Configuration Version] indexterm:[admin_epoch,Cluster Option] indexterm:[Cluster,Option,admin_epoch] When a node joins the cluster, the cluster performs a check to see which node has the best configuration. It asks the node with the highest (+admin_epoch+, +epoch+, +num_updates+) tuple to replace the configuration on all the nodes -- which makes setting them, and setting them correctly, very important. +admin_epoch+ is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete. _Never set this value to zero_. In such cases, the cluster cannot tell the difference between your configuration and the "empty" one used when nothing is found on disk. | epoch | indexterm:[epoch,Cluster Option] indexterm:[Cluster,Option,epoch] The cluster increments this every time the configuration is updated (usually by the administrator). | num_updates | indexterm:[num_updates,Cluster Option] indexterm:[Cluster,Option,num_updates] The cluster increments this every time the configuration or status is updated (usually by the cluster) and resets it to 0 when epoch changes. | validate-with | indexterm:[validate-with,Cluster Option] indexterm:[Cluster,Option,validate-with] Determines the type of XML validation that will be done on the configuration. If set to +none+, the cluster will not verify that updates conform to the DTD (nor reject ones that don't). This option can be useful when operating a mixed-version cluster during an upgrade. |cib-last-written | indexterm:[cib-last-written,Cluster Property] indexterm:[Cluster,Property,cib-last-written] Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only. 
|have-quorum | indexterm:[have-quorum,Cluster Property] indexterm:[Cluster,Property,have-quorum] Indicates if the cluster has quorum. If false, this may mean that the cluster cannot start resources or fence other nodes (see +no-quorum-policy+ below). Maintained by the cluster. |dc-uuid | indexterm:[dc-uuid,Cluster Property] indexterm:[Cluster,Property,dc-uuid] Indicates which cluster node is the current leader. Used by the cluster when placing resources and determining the order of some events. Maintained by the cluster. |========================================================= === Working with CIB Properties === Although these fields can be written to by the user, in most cases the cluster will overwrite any values specified by the user with the "correct" ones. To change the ones that can be specified by the user, for example +admin_epoch+, one should use: ---- # cibadmin --modify --xml-text '' ---- A complete set of CIB properties will look something like this: .Attributes set for a cib object ====== [source,XML] ------- ------- ====== [[s-cluster-options]] == Cluster Options == Cluster options, as you might expect, control how the cluster behaves when confronted with certain situations. They are grouped into sets within the +crm_config+ section, and, in advanced configurations, there may be more than one set. (This will be described later in the section on <> where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once. You can obtain an up-to-date list of cluster options, including their default values, by running the `man pengine` and `man crmd` commands. .Cluster Options [width="95%",cols="5m,2,11>). | enable-startup-probes | TRUE | indexterm:[enable-startup-probes,Cluster Option] indexterm:[Cluster,Option,enable-startup-probes] Should the cluster check for active resources during startup? 
| maintenance-mode | FALSE | indexterm:[maintenance-mode,Cluster Option] indexterm:[Cluster,Option,maintenance-mode] Should the cluster refrain from monitoring, starting and stopping resources? | stonith-enabled | TRUE | indexterm:[stonith-enabled,Cluster Option] indexterm:[Cluster,Option,stonith-enabled] Should failed nodes and nodes with resources that can't be stopped be shot? If you value your data, set up a STONITH device and enable this. If true, or unset, the cluster will refuse to start resources unless one or more STONITH resources have been configured. If false, unresponsive nodes are immediately assumed to be running no resources, and resource takeover to online nodes starts without any further protection (which means _data loss_ if the unresponsive node still accesses shared storage, for example). See also the +requires+ meta-attribute in <>. | stonith-action | reboot | indexterm:[stonith-action,Cluster Option] indexterm:[Cluster,Option,stonith-action] Action to send to STONITH device. Allowed values are +reboot+ and +off+. The value +poweroff+ is also allowed, but is only used for legacy devices. | stonith-timeout | 60s | indexterm:[stonith-timeout,Cluster Option] indexterm:[Cluster,Option,stonith-timeout] How long to wait for STONITH actions (reboot, on, off) to complete | stonith-max-attempts | 10 | indexterm:[stonith-max-attempts,Cluster Option] indexterm:[Cluster,Option,stonith-max-attempts] How many times fencing can fail for a target before the cluster will no longer -immediately re-attempt it. '(since 1.1.17)' +immediately re-attempt it. | stonith-watchdog-timeout | 0 | indexterm:[stonith-watchdog-timeout,Cluster Option] indexterm:[Cluster,Option,stonith-watchdog-timeout] If nonzero, rely on hardware watchdog self-fencing. If positive, assume unseen nodes self-fence within this much time. If negative, and the SBD_WATCHDOG_TIMEOUT environment variable is set, use twice that value. 
| concurrent-fencing | FALSE | indexterm:[concurrent-fencing,Cluster Option] indexterm:[Cluster,Option,concurrent-fencing] Is the cluster allowed to initiate multiple fence actions concurrently? -'(since 1.1.15)' | cluster-delay | 60s | indexterm:[cluster-delay,Cluster Option] indexterm:[Cluster,Option,cluster-delay] Estimated maximum round-trip delay over the network (excluding action execution). If the TE requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node in this time (after considering the action's own timeout). The "correct" value will depend on the speed and load of your network and cluster nodes. | dc-deadtime | 20s | indexterm:[dc-deadtime,Cluster Option] indexterm:[Cluster,Option,dc-deadtime] How long to wait for a response from other nodes during startup. The "correct" value will depend on the speed/load of your network and the type of switches used. | cluster-recheck-interval | 15min | indexterm:[cluster-recheck-interval,Cluster Option] indexterm:[Cluster,Option,cluster-recheck-interval] Polling interval for time-based changes to options, resource parameters and constraints. The Cluster is primarily event-driven, but your configuration can have elements that take effect based on the time of day. To ensure these changes take effect, we can optionally poll the cluster's status for changes. A value of 0 disables polling. Positive values are an interval (in seconds unless other SI units are specified, e.g. 5min). | cluster-ipc-limit | 500 | indexterm:[cluster-ipc-limit,Cluster Option] indexterm:[Cluster,Option,cluster-ipc-limit] The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see "Evicting client" messages for cluster daemon PIDs in the logs. 
| pe-error-series-max | -1 | indexterm:[pe-error-series-max,Cluster Option] indexterm:[Cluster,Option,pe-error-series-max] The number of PE inputs resulting in ERRORs to save. Used when reporting problems. A value of -1 means unlimited (report all). | pe-warn-series-max | -1 | indexterm:[pe-warn-series-max,Cluster Option] indexterm:[Cluster,Option,pe-warn-series-max] The number of PE inputs resulting in WARNINGs to save. Used when reporting problems. A value of -1 means unlimited (report all). | pe-input-series-max | -1 | indexterm:[pe-input-series-max,Cluster Option] indexterm:[Cluster,Option,pe-input-series-max] The number of "normal" PE inputs to save. Used when reporting problems. A value of -1 means unlimited (report all). | placement-strategy | default | indexterm:[placement-strategy,Cluster Option] indexterm:[Cluster,Option,placement-strategy] How the cluster should allocate resources to nodes (see <>). Allowed values are +default+, +utilization+, +balanced+, and +minimal+. - '(since 1.1.0)' | node-health-strategy | none | indexterm:[node-health-strategy,Cluster Option] indexterm:[Cluster,Option,node-health-strategy] How the cluster should react to node health attributes (see <>). Allowed values are +none+, +migrate-on-red+, +only-green+, +progressive+, and +custom+. | node-health-base | 0 | indexterm:[node-health-base,Cluster Option] indexterm:[Cluster,Option,node-health-base] The base health score assigned to a node. Only used when - +node-health-strategy+ is +progressive+. '(since 1.1.16)' + +node-health-strategy+ is +progressive+. | node-health-green | 0 | indexterm:[node-health-green,Cluster Option] indexterm:[Cluster,Option,node-health-green] The score to use for a node health attribute whose value is +green+. Only used when +node-health-strategy+ is +progressive+ or +custom+. 
| node-health-yellow | 0 | indexterm:[node-health-yellow,Cluster Option] indexterm:[Cluster,Option,node-health-yellow] The score to use for a node health attribute whose value is +yellow+. Only used when +node-health-strategy+ is +progressive+ or +custom+. | node-health-red | 0 | indexterm:[node-health-red,Cluster Option] indexterm:[Cluster,Option,node-health-red] The score to use for a node health attribute whose value is +red+. Only used when +node-health-strategy+ is +progressive+ or +custom+. | remove-after-stop | FALSE | indexterm:[remove-after-stop,Cluster Option] indexterm:[Cluster,Option,remove-after-stop] _Advanced Use Only:_ Should the cluster remove resources from the LRM after they are stopped? Values other than the default are, at best, poorly tested and potentially dangerous. | startup-fencing | TRUE | indexterm:[startup-fencing,Cluster Option] indexterm:[Cluster,Option,startup-fencing] _Advanced Use Only:_ Should the cluster shoot unseen nodes? Not using the default is very unsafe! | election-timeout | 2min | indexterm:[election-timeout,Cluster Option] indexterm:[Cluster,Option,election-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | shutdown-escalation | 20min | indexterm:[shutdown-escalation,Cluster Option] indexterm:[Cluster,Option,shutdown-escalation] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | crmd-integration-timeout | 3min | indexterm:[crmd-integration-timeout,Cluster Option] indexterm:[Cluster,Option,crmd-integration-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | crmd-finalization-timeout | 30min | indexterm:[crmd-finalization-timeout,Cluster Option] indexterm:[Cluster,Option,crmd-finalization-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. 
| crmd-transition-delay | 0s | indexterm:[crmd-transition-delay,Cluster Option] indexterm:[Cluster,Option,crmd-transition-delay] _Advanced Use Only:_ Delay cluster recovery for the configured interval to allow for additional/related events to occur. Useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions. |========================================================= === Querying and Setting Cluster Options === indexterm:[Querying,Cluster Option] indexterm:[Setting,Cluster Option] indexterm:[Cluster,Querying Options] indexterm:[Cluster,Setting Options] Cluster options can be queried and modified using the `crm_attribute` tool. To get the current value of +cluster-delay+, you can run: ---- # crm_attribute --query --name cluster-delay ---- which is more simply written as ---- # crm_attribute -G -n cluster-delay ---- If a value is found, you'll see a result like this: ---- # crm_attribute -G -n cluster-delay scope=crm_config name=cluster-delay value=60s ---- If no value is found, the tool will display an error: ---- # crm_attribute -G -n clusta-deway scope=crm_config name=clusta-deway value=(null) Error performing operation: No such device or address ---- To use a different value (for example, 30 seconds), simply run: ---- # crm_attribute --name cluster-delay --update 30s ---- To go back to the cluster's default value, you can delete the value, for example: ---- # crm_attribute --name cluster-delay --delete Deleted crm_config option: id=cib-bootstrap-options-cluster-delay name=cluster-delay ---- === When Options are Listed More Than Once === If you ever see something like the following, it means that the option you're modifying is present more than once. 
.Deleting an option that is listed twice ======= ------ # crm_attribute --name batch-limit --delete Multiple attributes match name=batch-limit in crm_config: Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit) Value: 100 (set=custom, id=custom-batch-limit) Please choose from one of the matches above and supply the 'id' with --id ------- ======= In such cases, follow the on-screen instructions to perform the requested action. To determine which value is currently being used by the cluster, refer to <>. diff --git a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt index fc5bc0ee1b..b4ce3b1c22 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt @@ -1,854 +1,854 @@ = Cluster Resources = [[s-resource-primitive]] == What is a Cluster Resource? == indexterm:[Resource] A resource is a service made highly available by a cluster. The simplest type of resource, a 'primitive' resource, is described in this chapter. More complex forms, such as groups and clones, are described in later chapters. Every primitive resource has a 'resource agent'. A resource agent is an external program that abstracts the service it provides and presents a consistent view to the cluster. This allows the cluster to be agnostic about the resources it manages. The cluster doesn't need to understand how the resource works because it relies on the resource agent to do the right thing when given a `start`, `stop` or `monitor` command. For this reason, it is crucial that resource agents are well-tested. Typically, resource agents come in the form of shell scripts. However, they can be written using any technology (such as C, Python or Perl) that the author is comfortable with.
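To make the idea of a resource agent concrete, here is a minimal sketch of a toy agent that answers `start`, `stop` and `monitor`. It is an illustration only, not a real agent: the function name `toy_agent` and the `STATE_FILE` location are hypothetical, and the "service" is nothing more than a state file. The numeric return codes (7 for "not running", 3 for "unimplemented") follow the OCF convention.

```shell
#!/bin/sh
# Toy agent: the "service" is considered running while a state file exists.
# STATE_FILE and toy_agent are hypothetical names used for illustration only.
STATE_FILE="${TMPDIR:-/tmp}/toy-agent.$$.state"

toy_agent() {
    case "$1" in
        start)
            # Starting an already started resource must still succeed.
            touch "$STATE_FILE"
            ;;
        stop)
            # Stopping an already stopped resource must still succeed.
            rm -f "$STATE_FILE"
            ;;
        monitor)
            # 0 = running, 7 = cleanly stopped (OCF "not running").
            [ -f "$STATE_FILE" ] || return 7
            ;;
        *)
            # 3 = OCF "unimplemented action".
            return 3
            ;;
    esac
}
```

Calling `toy_agent start` twice in a row and then `toy_agent monitor` should succeed each time, which is the kind of repeatable behavior the cluster relies on.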
[[s-resource-supported]] == Resource Classes == indexterm:[Resource,class] Pacemaker supports several classes of agents: * OCF * LSB * Upstart * Systemd * Service * Fencing * Nagios Plugins === Open Cluster Framework === indexterm:[Resource,OCF] indexterm:[OCF,Resources] indexterm:[Open Cluster Framework,Resources] The OCF standard footnote:[See http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD -- at least as it relates to resource agents. The Pacemaker implementation has been somewhat extended from the OCF specs, but none of those changes are incompatible with the original OCF specification.] is basically an extension of the Linux Standard Base conventions for init scripts to: * support parameters, * make them self-describing, and * make them extensible OCF specs have strict definitions of the exit codes that actions must return. footnote:[ The resource-agents source code includes the `ocf-tester` script, which can be useful in this regard. ] The cluster follows these specifications exactly, and giving the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying. In particular, the cluster needs to distinguish a completely stopped resource from one which is in some erroneous and indeterminate state. Parameters are passed to the resource agent as environment variables, with the special prefix +OCF_RESKEY_+. So, a parameter which the user thinks of as +ip+ will be passed to the resource agent as +OCF_RESKEY_ip+. The number and purpose of the parameters is left to the resource agent; however, the resource agent should use the `meta-data` command to advertise any that it supports. The OCF class is the most preferred as it is an industry standard, highly flexible (allowing parameters to be passed to agents in a non-positional manner) and self-describing. For more information, see the http://www.linux-ha.org/wiki/OCF_Resource_Agents[reference] and <>. 
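The parameter-passing convention above can be sketched in a few lines of shell. This is a simplified illustration, assuming a single parameter called +ip+; the function name `agent_describe` is hypothetical, and the first two lines merely stand in for what the cluster does before it invokes an agent. Return code 6 is the OCF code for "not configured".

```shell
#!/bin/sh
# Stand-in for the cluster: the user-visible parameter "ip" reaches the
# agent as the environment variable OCF_RESKEY_ip.
OCF_RESKEY_ip=192.0.2.2
export OCF_RESKEY_ip

# Hypothetical agent fragment that reads the parameter.
agent_describe() {
    ip="${OCF_RESKEY_ip:-}"
    if [ -z "$ip" ]; then
        echo "missing required parameter: ip" >&2
        return 6   # 6 = OCF "not configured"
    fi
    echo "managing address $ip"
}
```

A real agent would also advertise +ip+ in its `meta-data` output so that tools can discover the parameter.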
=== Linux Standard Base === indexterm:[Resource,LSB] indexterm:[LSB,Resources] indexterm:[Linux Standard Base,Resources] LSB resource agents are those found in +/etc/init.d+. Generally, they are provided by the OS distribution and, in order to be used with the cluster, they must conform to the LSB Spec. footnote:[ See http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html for the LSB Spec as it relates to init scripts. ] [WARNING] ==== Many distributions claim LSB compliance but ship with broken init scripts. For details on how to check whether your init script is LSB-compatible, see <>. Common problematic violations of the LSB standard include: * Not implementing the status operation at all * Not observing the correct exit status codes for `start/stop/status` actions * Starting a started resource returns an error * Stopping a stopped resource returns an error ==== [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster. ==== === Systemd === indexterm:[Resource,Systemd] indexterm:[Systemd,Resources] Some newer distributions have replaced the old http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of initialization daemons and scripts with an alternative called http://www.freedesktop.org/wiki/Software/systemd[Systemd]. Pacemaker is able to manage these services _if they are present_. Instead of init scripts, systemd has 'unit files'. Generally, the services (unit files) are provided by the OS distribution, but there are online guides for converting from init scripts. footnote:[For example, http://0pointer.de/blog/projects/systemd-for-admins-3.html] [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster. 
==== === Upstart === indexterm:[Resource,Upstart] indexterm:[Upstart,Resources] Some newer distributions have replaced the old http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of initialization daemons (and scripts) with an alternative called http://upstart.ubuntu.com/[Upstart]. Pacemaker is able to manage these services _if they are present_. Instead of init scripts, upstart has 'jobs'. Generally, the services (jobs) are provided by the OS distribution. [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster. ==== === System Services === indexterm:[Resource,System Services] indexterm:[System Service,Resources] Since there are various types of system services (+systemd+, +upstart+, and +lsb+), Pacemaker supports a special +service+ alias which intelligently figures out which one applies to a given cluster node. This is particularly useful when the cluster contains a mix of +systemd+, +upstart+, and +lsb+. In order, Pacemaker will try to find the named service as: . an LSB init script . a Systemd unit file . an Upstart job === STONITH === indexterm:[Resource,STONITH] indexterm:[STONITH,Resources] The STONITH class is used exclusively for fencing-related resources. This is discussed later in <>. === Nagios Plugins === indexterm:[Resource,Nagios Plugins] indexterm:[Nagios Plugins,Resources] Nagios Plugins footnote:[The project has two independent forks, hosted at https://www.nagios-plugins.org/ and https://www.monitoring-plugins.org/. Output from both projects' plugins is similar, so plugins from either project can be used with pacemaker.] allow us to monitor services on remote hosts. Pacemaker is able to do remote monitoring with the plugins _if they are present_. A common use case is to configure them as resources belonging to a resource container (usually a virtual machine), and the container will be restarted if any of them has failed. 
Another use is to configure them as ordinary resources to be used for monitoring hosts or services via the network. The supported parameters are the same as the long options of the plugin. [[primitive-resource]] == Resource Properties == These values tell the cluster which resource agent to use for the resource, where to find that resource agent and what standards it conforms to. .Properties of a Primitive Resource [width="95%",cols="1m,6<",options="header",align="center"] |========================================================= |Field |Description |id |Your name for the resource indexterm:[id,Resource] indexterm:[Resource,Property,id] |class |The standard the resource agent conforms to. Allowed values: +lsb+, +nagios+, +ocf+, +service+, +stonith+, +systemd+, +upstart+ indexterm:[class,Resource] indexterm:[Resource,Property,class] |type |The name of the Resource Agent you wish to use. E.g. +IPaddr+ or +Filesystem+ indexterm:[type,Resource] indexterm:[Resource,Property,type] |provider |The OCF spec allows multiple vendors to supply the same resource agent. To use the OCF resource agents supplied by the Heartbeat project, you would specify +heartbeat+ here. indexterm:[provider,Resource] indexterm:[Resource,Property,provider] |========================================================= The XML definition of a resource can be queried with the `crm_resource` tool. For example: ---- # crm_resource --resource Email --query-xml ---- might produce: .A system resource definition ===== [source,XML] ===== [NOTE] ===== One of the main drawbacks to system services (LSB, systemd or Upstart) resources is that they do not allow any parameters! ===== //// See https://tools.ietf.org/html/rfc5737 for choice of example IP address //// .An OCF resource definition ===== [source,XML] ------- ------- ===== [[s-resource-options]] == Resource Options == Resources have two types of options: 'meta-attributes' and 'instance attributes'.
Meta-attributes apply to any type of resource, while instance attributes are specific to each resource agent. === Resource Meta-Attributes === Meta-attributes are used by the cluster to decide how a resource should behave and can be easily set using the `--meta` option of the `crm_resource` command. .Meta-attributes of a Primitive Resource [width="95%",cols="2m,2,5> resources, promoted to master if appropriate) * +Slave:+ Allow the resource to be started, but only in Slave mode if the resource is <> * +Master:+ Equivalent to +Started+ indexterm:[target-role,Resource Option] indexterm:[Resource,Option,target-role] |is-managed |TRUE |Is the cluster allowed to start and stop the resource? Allowed values: +true+, +false+ indexterm:[is-managed,Resource Option] indexterm:[Resource,Option,is-managed] |resource-stickiness |value of +resource-stickiness+ in the +rsc_defaults+ section |How much does the resource prefer to stay where it is? indexterm:[resource-stickiness,Resource Option] indexterm:[Resource,Option,resource-stickiness] |requires |+quorum+ for resources with a +class+ of +stonith+, otherwise +unfencing+ if unfencing is active in the cluster, otherwise +fencing+ if +stonith-enabled+ is true, otherwise +quorum+ -|Conditions under which the resource can be started '(since 1.1.8)' +|Conditions under which the resource can be started Allowed values: * +nothing:+ can always be started * +quorum:+ The cluster can only start this resource if a majority of the configured nodes are active * +fencing:+ The cluster can only start this resource if a majority of the configured nodes are active _and_ any failed or unknown nodes have been <> * +unfencing:+ The cluster can only start this resource if a majority of the configured nodes are active _and_ any failed or unknown nodes have been fenced _and_ only on nodes that have been - <> '(since 1.1.9)' + <> indexterm:[requires,Resource Option] indexterm:[Resource,Option,requires] |migration-threshold |INFINITY |How many failures 
may occur for this resource on a node, before this node is marked ineligible to host this resource. A value of 0 indicates that this feature is disabled (the node will never be marked ineligible); by contrast, the cluster treats INFINITY (the default) as a very large but finite number. This option has an effect only if the failed operation has on-fail=restart (the default), and additionally for failed start operations, if the cluster property start-failure-is-fatal is false. indexterm:[migration-threshold,Resource Option] indexterm:[Resource,Option,migration-threshold] |failure-timeout |0 |How many seconds to wait before acting as if the failure had not occurred, and potentially allowing the resource back to the node on which it failed. A value of 0 indicates that this feature is disabled. As with any time-based actions, this is not guaranteed to be checked more frequently than the value of +cluster-recheck-interval+ (see <>). indexterm:[failure-timeout,Resource Option] indexterm:[Resource,Option,failure-timeout] |multiple-active |stop_start |What should the cluster do if it ever finds the resource active on more than one node? Allowed values: * +block:+ mark the resource as unmanaged * +stop_only:+ stop all active instances and leave them that way * +stop_start:+ stop all active instances and start the resource in one location only indexterm:[multiple-active,Resource Option] indexterm:[Resource,Option,multiple-active] |allow-migrate |TRUE for ocf:pacemaker:remote resources, FALSE otherwise |Whether the cluster should try to "live migrate" this resource when it needs to be moved (see <>) |container-attribute-target | |Specific to bundle resources; see <> |remote-node | |The name of the Pacemaker Remote guest node this resource is associated with, if any. If specified, this both enables the resource as a guest node and defines the unique name used to identify the guest node. The guest must be configured to run the Pacemaker Remote daemon when it is started.
+WARNING:+ - This value cannot overlap with any resource or node IDs. '(since 1.1.9)' + This value cannot overlap with any resource or node IDs. |remote-port |3121 |If +remote-node+ is specified, the port on the guest used for its Pacemaker Remote connection. The Pacemaker Remote daemon on the guest must be - configured to listen on this port. '(since 1.1.9)' + configured to listen on this port. |remote-addr |value of +remote-node+ |If +remote-node+ is specified, the IP address or hostname used to connect to the guest via Pacemaker Remote. The Pacemaker Remote daemon on the guest - must be configured to accept connections on this address. '(since 1.1.9)' + must be configured to accept connections on this address. |remote-connect-timeout |60s |If +remote-node+ is specified, how long before a pending guest connection will - time out. '(since 1.1.10)' + time out. |========================================================= As an example of setting resource options, if you performed the following commands on an LSB Email resource: ------- # crm_resource --meta --resource Email --set-parameter priority --parameter-value 100 # crm_resource -m -r Email -p multiple-active -v block ------- the resulting resource definition might be: .An LSB resource with cluster options ===== [source,XML] ------- ------- ===== [[s-resource-defaults]] === Setting Global Defaults for Resource Meta-Attributes === To set a default value for a resource option, add it to the +rsc_defaults+ section with `crm_attribute`. For example, ---- # crm_attribute --type rsc_defaults --name is-managed --update false ---- would prevent the cluster from starting or stopping any of the resources in the configuration (unless of course the individual resources were specifically enabled by having their +is-managed+ set to +true+). 
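The precedence described above (a value set on the resource itself wins over the +rsc_defaults+ value, which in turn wins over the built-in default) can be modeled with a small helper. This is only a sketch of the lookup order, not Pacemaker's actual implementation, and `effective_meta` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical model of how an effective meta-attribute value is chosen.
# $1 = value set on the resource (empty if unset)
# $2 = value in rsc_defaults (empty if unset)
# $3 = built-in default
effective_meta() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

# is-managed with rsc_defaults set to false, as in the example above:
effective_meta ""   false true   # resource without its own setting
effective_meta true false true   # resource explicitly re-enabled
```

With +is-managed+ set to +false+ in +rsc_defaults+, only resources that carry their own +is-managed=true+ remain under cluster control.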
=== Resource Instance Attributes === The resource agents of some resource classes (lsb, systemd and upstart 'not' among them) can be given parameters which determine how they behave and which instance of a service they control. If your resource agent supports parameters, you can add them with the `crm_resource` command. For example, ---- # crm_resource --resource Public-IP --set-parameter ip --parameter-value 192.0.2.2 ---- would create an entry in the resource like this: .An example OCF resource with instance attributes ===== [source,XML] ------- ------- ===== For an OCF resource, the result would be an environment variable called +OCF_RESKEY_ip+ with a value of +192.0.2.2+. The list of instance attributes supported by an OCF resource agent can be found by calling the resource agent with the `meta-data` command. The output contains an XML description of all the supported attributes, their purpose and default values. .Displaying the metadata for the Dummy resource agent template ===== ---- # export OCF_ROOT=/usr/lib/ocf # $OCF_ROOT/resource.d/pacemaker/Dummy meta-data ---- [source,XML] ------- 1.0 This is a Dummy Resource Agent. It does absolutely nothing except keep track of whether its running or not. Its purpose in life is for testing and to serve as a template for RA writers. NB: Please pay attention to the timeouts specified in the actions section below. They should be meaningful for the kind of resource the agent manages. They should be the minimum advised timeouts, but they shouldn't/cannot cover _all_ possible resource instances. So, try to be neither overly generous nor too stingy, but moderate. The minimum timeouts should never be below 10 seconds. Example stateless resource agent Location to store the resource state in. State file Fake attribute that can be changed to cause a reload Fake attribute that can be changed to cause a reload Number of seconds to sleep during operations. This can be used to test how the cluster reacts to operation timeouts. 
Operation sleep duration in seconds. ------- ===== == Resource Operations == indexterm:[Resource,Action] 'Operations' are actions the cluster can perform on a resource by calling the resource agent. Resource agents must support certain common operations such as start, stop and monitor, and may implement any others. Some operations are generated by the cluster itself, for example, stopping and starting resources as needed. You can configure operations in the cluster configuration. As an example, by default the cluster will 'not' ensure your resources stay healthy once they are started. footnote:[Currently, anyway. Automatic monitoring operations may be added in a future version of Pacemaker.] To instruct the cluster to do this, you need to add a +monitor+ operation to the resource's definition. .An OCF resource with a recurring health check ===== [source,XML] ------- ------- ===== .Properties of an Operation [width="95%",cols="2m,3,6>. indexterm:[interval,Action Property] indexterm:[Action,Property,interval] |timeout | |How long to wait before declaring the action has failed indexterm:[timeout,Action Property] indexterm:[Action,Property,timeout] |on-fail |restart '(except for stop operations, which default to' fence 'when STONITH is enabled and' block 'otherwise)' |The action to take if this action ever fails. Allowed values: * +ignore:+ Pretend the resource did not fail. * +block:+ Don't perform any further operations on the resource. * +stop:+ Stop the resource and do not start it elsewhere. * +restart:+ Stop the resource and start it again (possibly on a different node). * +fence:+ STONITH the node on which the resource failed. * +standby:+ Move _all_ resources away from the node on which the resource failed. indexterm:[on-fail,Action Property] indexterm:[Action,Property,on-fail] |enabled |TRUE |If +false+, ignore this operation definition. 
This is typically used to pause a particular recurring monitor operation; for instance, it can complement the respective resource being unmanaged (+is-managed=false+), as this alone will <>. Disabling the operation does not suppress all actions of the given type. Allowed values: +true+, +false+. indexterm:[enabled,Action Property] indexterm:[Action,Property,enabled] |record-pending |FALSE |If +true+, the intention to perform the operation is recorded so that GUIs and CLI tools can indicate that an operation is in progress. This is best set as an _operation default_ (see next section). Allowed values: +true+, +false+. indexterm:[record-pending,Action Property] indexterm:[Action,Property,record-pending] |role | |Run the operation only on node(s) that the cluster thinks should be in the specified role. This only makes sense for recurring monitor operations. Allowed (case-sensitive) values: +Stopped+, +Started+, and in the case of <> resources, +Slave+ and +Master+. indexterm:[role,Action Property] indexterm:[Action,Property,role] |========================================================= [[s-resource-monitoring]] === Monitoring Resources for Failure === When Pacemaker first starts a resource, it runs one-time monitor operations (referred to as 'probes') to ensure the resource is running where it's supposed to be, and not running where it's not supposed to be. (This behavior can be affected by the +resource-discovery+ location constraint property.) Other than those initial probes, Pacemaker will not (by default) check that the resource continues to stay healthy. As in the example above, you must configure monitor operations explicitly to perform these checks. By default, a monitor operation will ensure that the resource is running where it is supposed to. The +role+ property of the operation can be used for further checking.
For example, if a resource has one monitor operation with +interval=10 role=Started+ and a second monitor operation with +interval=11 role=Stopped+, the cluster will run the first monitor on any nodes it thinks 'should' be running the resource, and the second monitor on any nodes that it thinks 'should not' be running the resource (for the truly paranoid, who want to know when an administrator manually starts a service by mistake). [[s-monitoring-unmanaged]] === Monitoring Resources When Administration is Disabled === Recurring monitor operations behave differently under various administrative settings: * When a resource is unmanaged (by setting +is-managed=false+): No monitors will be stopped. + If the unmanaged resource is stopped on a node where the cluster thinks it should be running, the cluster will detect and report that it is not, but it will not consider the monitor failed, and will not try to start the resource until it is managed again. + Starting the unmanaged resource on a different node is strongly discouraged and will at least cause the cluster to consider the resource failed, and may require the resource's +target-role+ to be set to +Stopped+ then +Started+ to be recovered. * When a node is put into standby: All resources will be moved away from the node, and all monitor operations will be stopped on the node, except those with +role=Stopped+. Monitor operations with +role=Stopped+ will be started on the node if appropriate. * When the cluster is put into maintenance mode: All resources will be marked as unmanaged. All monitor operations will be stopped, except those with +role=Stopped+. As with single unmanaged resources, starting a resource on a node other than where the cluster expects it to be will cause problems. [[s-operation-defaults]] === Setting Global Defaults for Operations === You can change the global default values for operation properties in a given cluster. 
These are defined in an +op_defaults+ section of the CIB's +configuration+ section, and can be set with `crm_attribute`. For example, ---- # crm_attribute --type op_defaults --name timeout --update 20s ---- would default each operation's +timeout+ to 20 seconds. If an operation's definition also includes a value for +timeout+, then that value would be used for that operation instead. === When Implicit Operations Take a Long Time === The cluster will always perform a number of implicit operations: +start+, +stop+ and a non-recurring +monitor+ operation used at startup to check whether the resource is already active. If one of these is taking too long, then you can create an entry for them and specify a longer timeout. .An OCF resource with custom timeouts for its implicit actions ===== [source,XML] ------- ------- ===== === Multiple Monitor Operations === Provided no two operations (for a single resource) have the same name and interval, you can have as many monitor operations as you like. In this way, you can do a superficial health check every minute and progressively more intense ones at higher intervals. To tell the resource agent what kind of check to perform, you need to provide each monitor with a different value for a common parameter. The OCF standard creates a special parameter called +OCF_CHECK_LEVEL+ for this purpose and dictates that it is "made available to the resource agent without the normal +OCF_RESKEY+ prefix". Whatever name you choose, you can specify it by adding an +instance_attributes+ block to the +op+ tag. It is up to each resource agent to look for the parameter and decide how to use it. .An OCF resource with two recurring health checks, performing different levels of checks specified via +OCF_CHECK_LEVEL+. ===== [source,XML] ------- ------- ===== === Disabling a Monitor Operation === The easiest way to stop a recurring monitor is to just delete it. However, there can be times when you only want to disable it temporarily. 
In such cases, simply add +enabled="false"+ to the operation's definition. .Example of an OCF resource with a disabled health check ===== [source,XML] ------- ------- ===== This can be achieved from the command line by executing: ---- # cibadmin --modify --xml-text '' ---- Once you've done whatever you needed to do, you can then re-enable it with ---- # cibadmin --modify --xml-text '' ---- diff --git a/doc/Pacemaker_Explained/en-US/Ch-Rules.txt b/doc/Pacemaker_Explained/en-US/Ch-Rules.txt index 6951e1c0fb..9faf0735c4 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Rules.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Rules.txt @@ -1,643 +1,642 @@ = Rules = //// We prefer [[ch-rules]], but older versions of asciidoc don't deal well with that construct for chapter headings //// anchor:ch-rules[Chapter 8, Rules] indexterm:[Resource,Constraint,Rule] Rules can be used to make your configuration more dynamic. One common example is to set one value for +resource-stickiness+ during working hours, to prevent resources from being moved back to their most preferred location, and another on weekends when no-one is around to notice an outage. Another use of rules might be to assign machines to different processing groups (using a node attribute) based on time and to then use that attribute when creating location constraints. Each rule can contain a number of expressions, date-expressions and even other rules. The results of the expressions are combined based on the rule's +boolean-op+ field to determine if the rule ultimately evaluates to +true+ or +false+. What happens next depends on the context in which the rule is being used. 
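As a minimal sketch of how a rule combines its parts, the fragment below joins a node attribute expression and a date expression with +boolean-op="or"+. The ids, the score of 100 and the node name +node1+ are illustrative assumptions, not taken from any configuration in this document.

```xml
<!-- Hypothetical rule: every id, the score and the node name are illustrative. -->
<rule id="example-rule" score="100" boolean-op="or">
  <!-- True on the node whose name is node1 -->
  <expression id="example-rule-uname" attribute="#uname"
              operation="eq" value="node1"/>
  <!-- True on Saturdays and Sundays (weekdays 6-7) -->
  <date_expression id="example-rule-weekend" operation="date_spec">
    <date_spec id="example-rule-weekend-spec" weekdays="6-7"/>
  </date_expression>
</rule>
```

With +boolean-op="or"+, the rule is +true+ if either child is true; with the default +and+, both would have to be true.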
== Rule Properties == .Properties of a Rule [width="95%",cols="2m,1,5<",options="header",align="center"] |========================================================= |Field |Default |Description |id | |A unique name for the rule (required) indexterm:[id,Constraint Rule] indexterm:[Constraint,Rule,id] |role |+Started+ |Limits the rule to apply only when the resource is in the specified role. Allowed values are +Started+, +Slave+, and +Master+. A rule with +role="Master"+ cannot determine the initial location of a clone instance and will only affect which of the active instances will be promoted. indexterm:[role,Constraint Rule] indexterm:[Constraint,Rule,role] |score | |The score to apply if the rule evaluates to +true+. Limited to use in rules that are part of location constraints. indexterm:[score,Constraint Rule] indexterm:[Constraint,Rule,score] |score-attribute | |The node attribute to look up and use as a score if the rule evaluates to +true+. Limited to use in rules that are part of location constraints. indexterm:[score-attribute,Constraint Rule] indexterm:[Constraint,Rule,score-attribute] |boolean-op |+and+ |How to combine the result of multiple expression objects. Allowed values are +and+ and +or+. indexterm:[boolean-op,Constraint Rule] indexterm:[Constraint,Rule,boolean-op] |========================================================= == Node Attribute Expressions == indexterm:[Resource,Constraint,Attribute Expression] Expression objects are used to control a resource based on the attributes defined by a node or nodes. .Properties of an Expression [width="95%",cols="2m,1,5> |#id |Node ID |#kind |Node type. Possible values are +cluster+, +remote+, and +container+. 
Kind is +remote+ for Pacemaker Remote nodes created with the +ocf:pacemaker:remote+ resource, and +container+ for Pacemaker Remote guest nodes and bundle nodes - '(since 1.1.13)' |#is_dc |"true" if this node is a Designated Controller (DC), "false" otherwise |#cluster-name |The value of the +cluster-name+ cluster property, if set |#site-name |The value of the +site-name+ cluster property, if set, otherwise identical to +#cluster-name+ |#role |The role the relevant multistate resource has on this node. Valid only within a rule for a location constraint for a multistate resource. //// // if uncommenting, put a pipe in front of first two lines #ra-version The installed version of the resource agent on the node, as defined by the +version+ attribute of the +resource-agent+ tag in the agent's metadata. Valid only within rules controlling resource options. This can be useful during rolling upgrades of a backward-incompatible resource agent. '(coming in x.x.x)' //// |========================================================= == Time- and Date-Based Expressions == indexterm:[Time Based Expressions] indexterm:[Resource,Constraint,Date/Time Expression] As the name suggests, +date_expressions+ are used to control a resource or cluster option based on the current date/time. They may contain an optional +date_spec+ and/or +duration+ object depending on the context. .Properties of a Date Expression [width="95%",cols="2m,5 ---- ==== .Equivalent expression ==== [source,XML] ---- ---- ==== .9am-5pm Monday-Friday ==== [source,XML] ------- ------- ==== Please note that the +16+ matches up to +16:59:59+, as the numeric value (hour) still matches! .9am-6pm Monday through Friday or anytime Saturday ==== [source,XML] ------- ------- ==== .9am-5pm or 9pm-12am Monday through Friday ==== [source,XML] ------- ------- ==== .Mondays in March 2005 ==== [source,XML] ------- ------- ==== [NOTE] ====== Because no time is specified with the above dates, 00:00:00 is implied. 
This means that the range includes all of 2005-03-01 but none of 2005-04-01. You may wish to write +end="2005-03-31T23:59:59"+ to avoid confusion. ====== .A full moon on Friday the 13th ===== [source,XML] ------- ------- ===== == Using Rules to Determine Resource Location == indexterm:[Rule,Determine Resource Location] indexterm:[Resource,Location,Determine by Rules] A location constraint may contain rules. When the constraint's outermost rule evaluates to +false+, the cluster treats the constraint as if it were not there. When the rule evaluates to +true+, the node's preference for running the resource is updated with the score associated with the rule. If this sounds familiar, it is because you have been using a simplified syntax for location constraint rules already. Consider the following location constraint: .Prevent myApacheRsc from running on c001n03 ===== [source,XML] ------- ------- ===== This constraint can be more verbosely written as: .Prevent myApacheRsc from running on c001n03 - expanded version ===== [source,XML] ------- ------- ===== The advantage of using the expanded form is that one can then add extra clauses to the rule, such as limiting the rule such that it only applies during certain times of the day or days of the week. === Location Rules Based on Other Node Properties === The expanded form allows us to match on node properties other than its name. If we rated each machine's CPU power such that the cluster had the following nodes section: .A sample nodes section for use with score-attribute ===== [source,XML] ------- ------- ===== then we could prevent resources from running on underpowered machines with this rule: [source,XML] ------- ------- === Using +score-attribute+ Instead of +score+ === When using +score-attribute+ instead of +score+, each node matched by the rule has its score adjusted differently, according to its value for the named node attribute. 
Thus, in the previous example, if a rule used +score-attribute="cpu_mips"+, +c001n01+ would have its preference to run the resource increased by +1234+ whereas +c001n02+ would have its preference increased by +5678+. == Using Rules to Control Resource Options == Often some cluster nodes will be different from their peers. Sometimes, these differences -- e.g. the location of a binary or the names of network interfaces -- require resources to be configured differently depending on the machine they're hosted on. By defining multiple +instance_attributes+ objects for the resource and adding a rule to each, we can easily handle these special cases. In the example below, +mySpecialRsc+ will use eth1 and port 9999 when run on +node1+, eth2 and port 8888 on +node2+ and default to eth0 and port 9999 for all other nodes. .Defining different resource options based on the node name ===== [source,XML] ------- ------- ===== The order in which +instance_attributes+ objects are evaluated is determined by their score (highest to lowest). If not supplied, score defaults to zero, and objects with an equal score are processed in listed order. If the +instance_attributes+ object has no rule or a +rule+ that evaluates to +true+, then for any parameter the resource does not yet have a value for, the resource will use the parameter values defined by the +instance_attributes+. For example, given the configuration above, if the resource is placed on node1: . +special-node1+ has the highest score (3) and so is evaluated first; its rule evaluates to +true+, so +interface+ is set to +eth1+. . +special-node2+ is evaluated next with score 2, but its rule evaluates to +false+, so it is ignored. . +defaults+ is evaluated last with score 1, and has no rule, so its values are examined; +interface+ is already defined, so the value here is not used, but +port+ is not yet defined, so +port+ is set to +9999+. 
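The configuration walked through above could look like the following sketch. It is reconstructed from the names and values mentioned in the text (+special-node1+, +special-node2+, +defaults+, the scores 3, 2 and 1, and the +interface+ and +port+ values); the rule, expression and nvpair ids, as well as the agent's +type+ and +provider+, are placeholders chosen for illustration.

```xml
<!-- Sketch only: type/provider and all rule, expression and nvpair ids are placeholders. -->
<primitive id="mySpecialRsc" class="ocf" type="Special" provider="me">
  <!-- Highest score, evaluated first; applies only on node1 -->
  <instance_attributes id="special-node1" score="3">
    <rule id="node1-special-case" score="INFINITY">
      <expression id="node1-special-case-expr" attribute="#uname"
                  operation="eq" value="node1"/>
    </rule>
    <nvpair id="node1-interface" name="interface" value="eth1"/>
  </instance_attributes>
  <!-- Evaluated second; applies only on node2 -->
  <instance_attributes id="special-node2" score="2">
    <rule id="node2-special-case" score="INFINITY">
      <expression id="node2-special-case-expr" attribute="#uname"
                  operation="eq" value="node2"/>
    </rule>
    <nvpair id="node2-interface" name="interface" value="eth2"/>
    <nvpair id="node2-port" name="port" value="8888"/>
  </instance_attributes>
  <!-- No rule, lowest score: fills in whatever is still unset -->
  <instance_attributes id="defaults" score="1">
    <nvpair id="default-interface" name="interface" value="eth0"/>
    <nvpair id="default-port" name="port" value="9999"/>
  </instance_attributes>
</primitive>
```

Because +special-node1+ sets only +interface+, a resource on node1 still picks up +port+ from +defaults+, which matches the walk-through above.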
== Using Rules to Control Cluster Options == indexterm:[Rule,Controlling Cluster Options] indexterm:[Cluster,Setting Options with Rules] Controlling cluster options is achieved in much the same manner as specifying different resource options on different nodes. The difference is that because they are cluster options, one cannot (or should not, because they won't work) use attribute-based expressions. The following example illustrates how to set a different +resource-stickiness+ value during and outside work hours. This allows resources to automatically move back to their most preferred hosts, but at a time that (in theory) does not interfere with business activities. .Change +resource-stickiness+ during working hours ===== [source,XML] ------- ------- ===== [[s-rules-recheck]] == Ensuring Time-Based Rules Take Effect == A Pacemaker cluster is an event-driven system. As such, it won't recalculate the best place for resources to run unless something (like a resource failure or configuration change) happens. This can mean that a location constraint that only allows resource X to run between 9am and 5pm is not enforced. If you rely on time-based rules, the +cluster-recheck-interval+ cluster option (which defaults to 15 minutes) is essential. This tells the cluster to periodically recalculate the ideal state of the cluster. For example, if you set +cluster-recheck-interval="5m"+, then sometime between 09:00 and 09:05 the cluster would notice that it needs to start resource X, and between 17:00 and 17:05 it would realize that X needed to be stopped. The timing of the actual start and stop actions depends on what other actions the cluster may need to perform first. 
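As a sketch of the +cluster-recheck-interval="5m"+ setting used in the example above, the option is an ordinary cluster property and so lives in a +cluster_property_set+ under +crm_config+. The set id +cib-bootstrap-options+ and the nvpair id are assumptions here; an existing cluster may already use different ids.

```xml
<!-- Sketch only: the ids are assumed, and your cluster's property set may be named differently. -->
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-cluster-recheck-interval"
            name="cluster-recheck-interval" value="5m"/>
  </cluster_property_set>
</crm_config>
```

A shorter interval makes time-based rules take effect sooner, at the cost of more frequent recalculation.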
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Status.txt b/doc/Pacemaker_Explained/en-US/Ch-Status.txt index a25326da6a..b46f0167e4 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Status.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Status.txt @@ -1,373 +1,371 @@ = Status -- Here be dragons = Most users never need to understand the contents of the status section and can be happy with the output from `crm_mon`. However for those with a curious inclination, this section attempts to provide an overview of its contents. == Node Status == indexterm:[Node,Status] indexterm:[Status of a Node] In addition to the cluster's configuration, the CIB holds an up-to-date representation of each cluster node in the +status+ section. .A bare-bones status entry for a healthy node *cl-virt-1* ====== [source,XML] ----- - - - + + + ----- ====== Users are highly recommended _not_ to modify any part of a node's state _directly_. The cluster will periodically regenerate the entire section from authoritative sources, so any changes should be done with the tools appropriate to those sources. .Authoritative Sources for State Information [width="95%",cols="1m,1<",options="header",align="center"] |========================================================= | CIB Object | Authoritative Source |node_state|crmd |transient_attributes|attrd |lrm|lrmd |========================================================= The fields used in the +node_state+ objects are named as they are largely for historical reasons and are rooted in Pacemaker's origins as the resource manager for the older Heartbeat project. They have remained unchanged to preserve compatibility with older versions. .Node Status Fields [width="95%",cols="1m,4<",options="header",align="center"] |========================================================= |Field |Description | id | indexterm:[id,Node Status] indexterm:[Node,Status,id] Unique identifier for the node. Corosync-based clusters use a numeric counter. 
| uname | indexterm:[uname,Node Status] indexterm:[Node,Status,uname] -The node's machine name (output from `uname -n`). - -| ha | -indexterm:[ha,Node Status] -indexterm:[Node,Status,ha] -Is the cluster software active on this node? Allowed values: +active+, +dead+. +The node's name as known by the cluster | in_ccm | indexterm:[in_ccm,Node Status] indexterm:[Node,Status,in_ccm] -Is the node a member of the cluster? Allowed values: +true+, +false+. +Is the node a member at the cluster communication layer? Allowed values: ++true+, +false+. | crmd | indexterm:[crmd,Node Status] indexterm:[Node,Status,crmd] -Is the crmd process active on the node? Allowed values: +online+, +offline+. +Is the node a member at the pacemaker layer? Allowed values: +online+, ++offline+. + +| crm-debug-origin | +indexterm:[crm-debug-origin,Node Status] +indexterm:[Node,Status,crm-debug-origin] +The name of the source function that made the most recent change (for debugging +purposes). | join | indexterm:[join,Node Status] indexterm:[Node,Status,join] Does the node participate in hosting resources? Allowed values: +down+, +pending+, +member+, +banned+. | expected | indexterm:[expected,Node Status] indexterm:[Node,Status,expected] Expected value for +join+. -| crm-debug-origin | -indexterm:[crm-debug-origin,Node Status] -indexterm:[Node,Status,crm-debug-origin] -The origin of the most recent change(s). For diagnostic purposes. - |========================================================= The cluster uses these fields to determine whether, at the node level, the node is healthy or is in a failed state and needs to be fenced. == Transient Node Attributes == Like regular <>, the name/value pairs listed in the +transient_attributes+ section help to describe the node. However they are forgotten by the cluster when the node goes offline. This can be useful, for instance, when you want a node to be in standby mode (not able to run resources) just until the next reboot. 
In addition to any values the administrator sets, the cluster will also store information about failed resources here. .A set of transient node attributes for node *cl-virt-1* ====== [source,XML] ----- ----- ====== In the above example, we can see that a monitor on the +pingd:0+ resource has failed once, at 09:22:22 UTC 6 April 2009. footnote:[ You can use the standard `date` command to print a human-readable version of any seconds-since-epoch value, for example `date -d @1239009742`. ] We also see that the node is connected to three *pingd* peers and that all known resources have been checked for on this machine (+probe_complete+). == Operation History == indexterm:[Operation History] A node's resource history is held in the +lrm_resources+ tag (a child of the +lrm+ tag). The information stored here is enough for the cluster to stop the resource safely if it is removed from the +configuration+ section. Specifically, the resource's +id+, +class+, +type+ and +provider+ are stored. .A record of the +apcstonith+ resource ====== [source,XML] ====== Additionally, we store the last job for every combination of +resource+, +action+ and +interval+. The concatenation of the values in this tuple is used to create the id of the +lrm_rsc_op+ object. .Contents of an +lrm_rsc_op+ job [width="95%",cols="2m,5<",options="header",align="center"] |========================================================= |Field |Description | id | indexterm:[id,Action Status] indexterm:[Action,Status,id] Identifier for the job constructed from the resource's +id+, +operation+ and +interval+. | call-id | indexterm:[call-id,Action Status] indexterm:[Action,Status,call-id] The job's ticket number. Used as a sort key to determine the order in which the jobs were executed. | operation | indexterm:[operation,Action Status] indexterm:[Action,Status,operation] The action the resource agent was invoked with.
| interval | indexterm:[interval,Action Status] indexterm:[Action,Status,interval] The frequency, in milliseconds, at which the operation will be repeated. A one-off job is indicated by 0. | op-status | indexterm:[op-status,Action Status] indexterm:[Action,Status,op-status] The job's status. Generally this will be either 0 (done) or -1 (pending). Rarely used in favor of +rc-code+. | rc-code | indexterm:[rc-code,Action Status] indexterm:[Action,Status,rc-code] The job's result. Refer to <> for details on what the values here mean and how they are interpreted. | last-run | indexterm:[last-run,Action Status] indexterm:[Action,Status,last-run] Machine-local date/time, in seconds since epoch, at which the job was executed. For diagnostic purposes. | last-rc-change | indexterm:[last-rc-change,Action Status] indexterm:[Action,Status,last-rc-change] Machine-local date/time, in seconds since epoch, at which the job first returned the current value of +rc-code+. For diagnostic purposes. | exec-time | indexterm:[exec-time,Action Status] indexterm:[Action,Status,exec-time] Time, in milliseconds, that the job was running for. For diagnostic purposes. | queue-time | indexterm:[queue-time,Action Status] indexterm:[Action,Status,queue-time] Time, in seconds, that the job was queued for in the LRMd. For diagnostic purposes. | crm_feature_set | indexterm:[crm_feature_set,Action Status] indexterm:[Action,Status,crm_feature_set] The version which this job description conforms to. Used when processing +op-digest+. | transition-key | indexterm:[transition-key,Action Status] indexterm:[Action,Status,transition-key] A concatenation of the job's graph action number, the graph number, the expected result and the UUID of the crmd instance that scheduled it. This is used to construct +transition-magic+ (below). | transition-magic | indexterm:[transition-magic,Action Status] indexterm:[Action,Status,transition-magic] A concatenation of the job's +op-status+, +rc-code+ and +transition-key+. 
Guaranteed to be unique for the life of the cluster (which ensures it is part of CIB update notifications) and contains all the information needed for the crmd to correctly analyze and process the completed job. Most importantly, the decomposed elements tell the crmd if the job entry was expected and whether it failed. | op-digest | indexterm:[op-digest,Action Status] indexterm:[Action,Status,op-digest] An MD5 sum representing the parameters passed to the job. Used to detect changes to the configuration, to restart resources if necessary. | crm-debug-origin | indexterm:[crm-debug-origin,Action Status] indexterm:[Action,Status,crm-debug-origin] The origin of the current values. For diagnostic purposes. |========================================================= === Simple Operation History Example === .A monitor operation (determines current state of the +apcstonith+ resource) ====== [source,XML] ----- ----- ====== In the above example, the job is a non-recurring monitor operation often referred to as a "probe" for the +apcstonith+ resource. The cluster schedules probes for every configured resource on a node when the node first starts, in order to determine the resource's current state before it takes any further action. From the +transition-key+, we can see that this was the 22nd action of the 2nd graph produced by this instance of the crmd (2668bbeb-06d5-40f9-936d-24cb7f87006a). The third field of the +transition-key+ contains a 7, which indicates that the job expects to find the resource inactive. By looking at the +rc-code+ property, we see that this was the case. As that is the only job recorded for this node, we can conclude that the cluster started the resource elsewhere. === Complex Operation History Example === .Resource history of a +pingd+ clone with multiple jobs ====== [source,XML] ----- ----- ====== When more than one job record exists, it is important to first sort them by +call-id+ before interpreting them. 
Once sorted, the above example can be summarized as: . A non-recurring monitor operation returning 7 (not running), with a +call-id+ of 3 . A stop operation returning 0 (success), with a +call-id+ of 32 . A start operation returning 0 (success), with a +call-id+ of 33 . A recurring monitor returning 0 (success), with a +call-id+ of 34 The cluster processes each job record to build up a picture of the resource's state. After the first and second entries, it is considered stopped, and after the third it is considered active. Based on the last operation, we can tell that the resource is currently active. Additionally, from the presence of a +stop+ operation with a lower +call-id+ than that of the +start+ operation, we can conclude that the resource has been restarted. Specifically this occurred as part of actions 11 and 31 of transition 11 from the crmd instance with the key +2668bbeb...+. This information can be helpful for locating the relevant section of the logs when looking for the source of a failure. diff --git a/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt b/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt index fe8996aab5..52883bebc7 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt @@ -1,953 +1,936 @@ = STONITH = //// We prefer [[ch-stonith]], but older versions of asciidoc don't deal well with that construct for chapter headings //// anchor:ch-stonith[Chapter 13, STONITH] indexterm:[STONITH, Configuration] == What Is STONITH? == STONITH (an acronym for "Shoot The Other Node In The Head"), also called 'fencing', protects your data from being corrupted by rogue nodes or concurrent access. Just because a node is unresponsive, this doesn't mean it isn't accessing your data. The only way to be 100% sure that your data is safe is to use STONITH, so we can be certain that the node is truly offline before allowing the data to be accessed from another node.
STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the service elsewhere. == What STONITH Device Should You Use? == It is crucial that the STONITH device allows the cluster to differentiate between a node failure and a network one. The biggest mistake people make in choosing a STONITH device is to use a remote power switch (such as many on-board IPMI controllers) that shares power with the node it controls. In such cases, the cluster cannot be sure if the node is really offline, or active and suffering from a network fault. Likewise, any device that relies on the machine being active (such as SSH-based "devices" used during testing) is inappropriate. == Special Treatment of STONITH Resources == STONITH resources are somewhat special in Pacemaker. STONITH may be initiated by pacemaker or by other parts of the cluster (such as resources like DRBD or DLM). To accommodate this, pacemaker does not require the STONITH resource to be in the 'started' state in order to be used, thus allowing reliable use of STONITH devices in such a case. -[NOTE] -==== -In pacemaker versions 1.1.9 and earlier, this feature either did not exist or -did not work well. Only "running" STONITH resources could be used by Pacemaker -for fencing, and if another component tried to fence a node while Pacemaker was -moving STONITH resources, the fencing could fail. -==== - All nodes have access to STONITH devices' definitions and instantiate them on-the-fly when needed, but preference is given to 'verified' instances, which are the ones that are 'started' according to the cluster's knowledge. In the case of a cluster split, the partition with a verified instance will have a slight advantage, because the STONITH daemon in the other partition will have to hear from all its current peers before choosing a node to perform the fencing.
Fencing resources do work the same as regular resources in some respects: * +target-role+ can be used to enable or disable the resource * Location constraints can be used to prevent a specific node from using the resource [IMPORTANT] =========== Currently there is a limitation that fencing resources may only have one set of meta-attributes and one set of instance attributes. This can be revisited if it becomes a significant limitation for people. =========== See the table below or run `man stonithd` to see special instance attributes that may be set for any fencing resource, regardless of fence agent. .Additional Properties of Fencing Resources [width="95%",cols="5m,2,3,10>). indexterm:[priority,Fencing] indexterm:[Fencing,Property,priority] |pcmk_host_map |string | |A mapping of host names to port numbers for devices that do not support host names. Example: +node1:1;node2:2,3+ tells the cluster to use port 1 for *node1* and ports 2 and 3 for *node2*. indexterm:[pcmk_host_map,Fencing] indexterm:[Fencing,Property,pcmk_host_map] |pcmk_host_list |string | |A list of machines controlled by this device (optional unless +pcmk_host_check+ is +static-list+). indexterm:[pcmk_host_list,Fencing] indexterm:[Fencing,Property,pcmk_host_list] |pcmk_host_check |string |dynamic-list |How to determine which machines are controlled by the device. Allowed values: * +dynamic-list:+ query the device * +static-list:+ check the +pcmk_host_list+ attribute * +none:+ assume every device can fence every machine indexterm:[pcmk_host_check,Fencing] indexterm:[Fencing,Property,pcmk_host_check] |pcmk_delay_max |time |0s |Enable a random delay of up to the time specified before executing stonith actions. This is sometimes used in two-node clusters to ensure that the nodes don't fence each other at the same time. The overall delay introduced by pacemaker is derived from this random delay plus a static delay, such that the sum is kept below the maximum delay.
indexterm:[pcmk_delay_max,Fencing] indexterm:[Fencing,Property,pcmk_delay_max] |pcmk_delay_base |time |0s |Enable a static delay before executing stonith actions. This can be used e.g. in two-node clusters to ensure that the nodes don't fence each other, by having separate fencing resources with different values. The node that is fenced with the shorter delay will lose a fencing race. The overall delay introduced by pacemaker is derived from this value plus a random delay such that the sum is kept below the maximum delay. indexterm:[pcmk_delay_base,Fencing] indexterm:[Fencing,Property,pcmk_delay_base] |pcmk_action_limit |integer |1 |The maximum number of actions that can be performed in parallel on this device, if the cluster option +concurrent-fencing+ is +true+. -1 is unlimited. - '(since 1.1.15)' indexterm:[pcmk_action_limit,Fencing] indexterm:[Fencing,Property,pcmk_action_limit] |pcmk_host_argument |string |port |'Advanced use only.' Which parameter should be supplied to the resource agent to identify the node to be fenced. Some devices do not support the standard +port+ parameter or may provide additional ones. Use this to specify an alternate, device-specific parameter. A value of +none+ tells the cluster not to supply any additional parameters. indexterm:[pcmk_host_argument,Fencing] indexterm:[Fencing,Property,pcmk_host_argument] |pcmk_reboot_action |string |reboot |'Advanced use only.' The command to send to the resource agent in order to reboot a node. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_reboot_action,Fencing] indexterm:[Fencing,Property,pcmk_reboot_action] |pcmk_reboot_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `reboot` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. 
indexterm:[pcmk_reboot_timeout,Fencing] indexterm:[Fencing,Property,pcmk_reboot_timeout] indexterm:[stonith-timeout,Fencing] indexterm:[Fencing,Property,stonith-timeout] |pcmk_reboot_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `reboot` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_reboot_retries,Fencing] indexterm:[Fencing,Property,pcmk_reboot_retries] |pcmk_off_action |string |off |'Advanced use only.' The command to send to the resource agent in order to shut down a node. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_off_action,Fencing] indexterm:[Fencing,Property,pcmk_off_action] |pcmk_off_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `off` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_off_timeout,Fencing] indexterm:[Fencing,Property,pcmk_off_timeout] indexterm:[stonith-timeout,Fencing] indexterm:[Fencing,Property,stonith-timeout] |pcmk_off_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `off` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. 
indexterm:[pcmk_off_retries,Fencing] indexterm:[Fencing,Property,pcmk_off_retries] |pcmk_list_action |string |list |'Advanced use only.' The command to send to the resource agent in order to list nodes. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_list_action,Fencing] indexterm:[Fencing,Property,pcmk_list_action] |pcmk_list_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `list` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_list_timeout,Fencing] indexterm:[Fencing,Property,pcmk_list_timeout] |pcmk_list_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `list` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_list_retries,Fencing] indexterm:[Fencing,Property,pcmk_list_retries] |pcmk_monitor_action |string |monitor |'Advanced use only.' The command to send to the resource agent in order to report extended status. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_monitor_action,Fencing] indexterm:[Fencing,Property,pcmk_monitor_action] |pcmk_monitor_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `monitor` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. 
indexterm:[pcmk_monitor_timeout,Fencing] indexterm:[Fencing,Property,pcmk_monitor_timeout] |pcmk_monitor_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `monitor` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_monitor_retries,Fencing] indexterm:[Fencing,Property,pcmk_monitor_retries] |pcmk_status_action |string |status |'Advanced use only.' The command to send to the resource agent in order to report status. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_status_action,Fencing] indexterm:[Fencing,Property,pcmk_status_action] |pcmk_status_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `status` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_status_timeout,Fencing] indexterm:[Fencing,Property,pcmk_status_timeout] |pcmk_status_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `status` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_status_retries,Fencing] indexterm:[Fencing,Property,pcmk_status_retries] |========================================================= [[s-unfencing]] == Unfencing == Most fence devices cut the power to the target. 
By contrast, fence devices that perform 'fabric fencing' cut off a node's
access to some critical resource, such as a shared disk or a network switch.

With fabric fencing, it is expected that the cluster will fence the node, and
then a system administrator must manually investigate what went wrong, correct
any issues found, then reboot (or restart the cluster services on) the node.

Once the node reboots and rejoins the cluster, some fabric fencing devices
require an explicit command to restore the node's access to the critical
resource. This capability is called 'unfencing' and is typically implemented
as the fence agent's +on+ command.

If any cluster resource has +requires+ set to +unfencing+, then that resource
will not be probed or started on a node until that node has been unfenced.

== Configuring STONITH ==

[NOTE]
===========
Higher-level configuration shells include functionality to simplify the
process below, particularly the step for deciding which parameters are
required. However, since this document deals only with core
components, you should refer to the STONITH chapter of the
http://www.clusterlabs.org/doc/[Clusters from Scratch] guide for those details.
===========

. Find the correct driver:
+
----
# stonith_admin --list-installed
----

. Find the required parameters associated with the device
(replacing $AGENT_NAME with the name obtained from the previous step):
+
----
# stonith_admin --metadata --agent $AGENT_NAME
----

. Create a file called +stonith.xml+ containing a primitive resource
with a class of +stonith+, a type equal to the agent name obtained earlier,
and a parameter for each of the values returned in the previous step.

. If the device does not know how to fence nodes based on their uname,
you may also need to set the special +pcmk_host_map+ parameter. See
`man stonithd` for details.

. If the device does not support the `list` command, you may also need
to set the special +pcmk_host_list+ and/or +pcmk_host_check+
parameters.
See `man stonithd` for details.

. If the device does not expect the victim to be specified with the
`port` parameter, you may also need to set the special
+pcmk_host_argument+ parameter. See `man stonithd` for details.

. Upload it into the CIB using cibadmin:
+
----
# cibadmin -C -o resources --xml-file stonith.xml
----

. Set +stonith-enabled+ to true:
+
----
# crm_attribute -t crm_config -n stonith-enabled -v true
----

. Once the stonith resource is running, you can test it by
executing the following (although you might want to stop the cluster
on that machine first):
+
----
# stonith_admin --reboot nodename
----

=== Example STONITH Configuration ===

Assume we have a chassis containing four nodes and an IPMI device
active on 192.0.2.1. We would choose the `fence_ipmilan` driver,
and obtain the following list of parameters:

.Obtaining a list of STONITH Parameters
====
----
# stonith_admin --metadata -a fence_ipmilan
----

[source,XML]
----

---- ==== Based on that, we would create a STONITH resource fragment that might look like this: .An IPMI-based STONITH Resource ==== [source,XML] ---- ---- ==== Finally, we need to enable STONITH: ---- # crm_attribute -t crm_config -n stonith-enabled -v true ---- == Advanced STONITH Configurations == Some people consider that having one fencing device is a single point of failure footnote:[Not true, since a node or resource must fail before fencing even has a chance to]; others prefer removing the node from the storage and network instead of turning it off. Whatever the reason, Pacemaker supports fencing nodes with multiple devices through a feature called 'fencing topologies'. Simply create the individual devices as you normally would, then define one or more +fencing-level+ entries in the +fencing-topology+ section of the configuration. * Each fencing level is attempted in order of ascending +index+. Allowed values are 1 through 9. * If a device fails, processing terminates for the current level. No further devices in that level are exercised, and the next level is attempted instead. * If the operation succeeds for all the listed devices in a level, the level is deemed to have passed. * The operation is finished when a level has passed (success), or all levels have been attempted (failed). * If the operation failed, the next step is determined by the Policy Engine and/or `crmd`. 
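
As a minimal sketch of the rules above, a two-level topology for a single node
might look like the following fragment. The node name +pcmk-1+ and the device
IDs +ipmi-pcmk-1+, +pdu-a-pcmk-1+ and +pdu-b-pcmk-1+ are hypothetical; they
stand for fencing resources that are assumed to be configured already.

[source,XML]
----
<fencing-topology>
  <!-- Level 1 (assumed device ID): a single IPMI device is tried first -->
  <fencing-level id="fl-pcmk-1-1" target="pcmk-1" index="1"
                 devices="ipmi-pcmk-1"/>
  <!-- Level 2 (assumed device IDs): both PDUs must succeed for the level to pass -->
  <fencing-level id="fl-pcmk-1-2" target="pcmk-1" index="2"
                 devices="pdu-a-pcmk-1,pdu-b-pcmk-1"/>
</fencing-topology>
----

With a fragment like this, the cluster would attempt level 1 first and move on
to level 2 only if the IPMI device fails; level 2 passes only if both listed
devices succeed.
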
Some possible uses of topologies include: * Try poison-pill and fail back to power * Try disk and network, and fall back to power if either fails * Initiate a kdump and then poweroff the node .Properties of Fencing Levels [width="95%",cols="1m,3<",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the level indexterm:[id,fencing-level] indexterm:[Fencing,fencing-level,id] |target |The name of a single node to which this level applies indexterm:[target,fencing-level] indexterm:[Fencing,fencing-level,target] |target-pattern |A regular expression matching the names of nodes to which this level applies - '(since 1.1.14)' indexterm:[target-pattern,fencing-level] indexterm:[Fencing,fencing-level,target-pattern] |target-attribute |The name of a node attribute that is set (to +target-value+) for nodes to - which this level applies '(since 1.1.14)' + which this level applies indexterm:[target-attribute,fencing-level] indexterm:[Fencing,fencing-level,target-attribute] |target-value |The node attribute value (of +target-attribute+) that is set for nodes to - which this level applies '(since 1.1.14)' + which this level applies indexterm:[target-attribute,fencing-level] indexterm:[Fencing,fencing-level,target-attribute] |index |The order in which to attempt the levels. Levels are attempted in ascending order 'until one succeeds'. Valid values are 1 through 9. indexterm:[index,fencing-level] indexterm:[Fencing,fencing-level,index] |devices |A comma-separated list of devices that must all be tried for this level indexterm:[devices,fencing-level] indexterm:[Fencing,fencing-level,devices] |========================================================= .Fencing topology with different devices for different nodes ==== [source,XML] ---- ...

...
----
====

=== Example Dual-Layer, Dual-Device Fencing Topologies ===

The following example illustrates an advanced use of +fencing-topology+ in a
cluster with the following properties:

* 3 nodes (2 active prod-mysql nodes, 1 prod_mysql-rep in standby for quorum purposes)
* the active nodes have an IPMI-controlled power board reached at 192.0.2.1 and 192.0.2.2
* the active nodes also have two independent PSUs (Power Supply Units)
  connected to two independent PDUs (Power Distribution Units) reached at
  198.51.100.1 (port 10 and port 11) and 203.0.113.1 (port 10 and port 11)
* the first fencing method uses the `fence_ipmi` agent
* the second fencing method uses the `fence_apc_snmp` agent targeting 2 fencing devices (one per PSU, either port 10 or 11)
* fencing is only implemented for the active nodes and has location constraints
* fencing topology is set to try IPMI fencing first then default to a "sure-kill" dual PDU fencing

In a normal failure scenario, STONITH will first select +fence_ipmi+ to try to
kill the faulty node. Using a fencing topology, if that first method fails,
STONITH will then move on to selecting +fence_apc_snmp+ twice:

* once for the first PDU
* again for the second PDU

The fence action is considered successful only if both PDUs report the required
status. If any of them fails, STONITH loops back to the first fencing method,
+fence_ipmi+, and so on until the node is fenced or the fencing action is
cancelled.

.First fencing method: single IPMI device

Each cluster node has its own dedicated IPMI channel that can be called for
fencing using the following primitives:
[source,XML]
----
----

.Second fencing method: dual PDU devices

Each cluster node also has two distinct power channels controlled by two
distinct PDUs.
That means a total of 4 fencing devices configured as follows: - Node 1, PDU 1, PSU 1 @ port 10 - Node 1, PDU 2, PSU 2 @ port 10 - Node 2, PDU 1, PSU 1 @ port 11 - Node 2, PDU 2, PSU 2 @ port 11 The matching fencing agents are configured as follows: [source,XML] ---- ---- .Location Constraints To prevent STONITH from trying to run a fencing agent on the same node it is supposed to fence, constraints are placed on all the fencing primitives: [source,XML] ---- ---- .Fencing topology Now that all the fencing resources are defined, it's time to create the right topology. We want to first fence using IPMI and if that does not work, fence both PDUs to effectively and surely kill the node. [source,XML] ----

---- Please note, in +fencing-topology+, the lowest +index+ value determines the priority of the first fencing method. .Final configuration Put together, the configuration looks like this: [source,XML] ---- ...

... ---- == Remapping Reboots == When the cluster needs to reboot a node, whether because +stonith-action+ is +reboot+ or because a reboot was manually requested (such as by `stonith_admin --reboot`), it will remap that to other commands in two cases: . If the chosen fencing device does not support the +reboot+ command, the cluster will ask it to perform +off+ instead. . If a fencing topology level with multiple devices must be executed, the cluster will ask all the devices to perform +off+, then ask the devices to perform +on+. To understand the second case, consider the example of a node with redundant power supplies connected to intelligent power switches. Rebooting one switch and then the other would have no effect on the node. Turning both switches off, and then on, actually reboots the node. In such a case, the fencing operation will be treated as successful as long as the +off+ commands succeed, because then it is safe for the cluster to recover any resources that were on the node. Timeouts and errors in the +on+ phase will be logged but ignored. When a reboot operation is remapped, any action-specific timeout for the remapped action will be used (for example, +pcmk_off_timeout+ will be used when executing the +off+ command, not +pcmk_reboot_timeout+). - -[NOTE] -==== -In Pacemaker versions 1.1.13 and earlier, reboots will not be remapped in the -second case. To achieve the same effect, separate fencing devices for off and -on actions must be configured. 
-==== diff --git a/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.ent b/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.ent index 2611c5aa89..a767f5ffc2 100644 --- a/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.ent +++ b/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.ent @@ -1,4 +1,4 @@ - + diff --git a/doc/Pacemaker_Explained/en-US/Revision_History.xml b/doc/Pacemaker_Explained/en-US/Revision_History.xml index 837a23a737..839f62c750 100644 --- a/doc/Pacemaker_Explained/en-US/Revision_History.xml +++ b/doc/Pacemaker_Explained/en-US/Revision_History.xml @@ -1,132 +1,144 @@ Revision History 1-0 19 Oct 2009 AndrewBeekhofandrew@beekhof.net Import from Pages.app 2-0 26 Oct 2009 AndrewBeekhofandrew@beekhof.net Cleanup and reformatting of docbook xml complete 3-0 Tue Nov 12 2009 AndrewBeekhofandrew@beekhof.net Split book into chapters and pass validation Re-organize book for use with Publican 4-0 Mon Oct 8 2012 AndrewBeekhofandrew@beekhof.net Converted to asciidoc (which is converted to docbook for use with Publican) 5-0 Mon Feb 23 2015 KenGaillotkgaillot@redhat.com Update for clarity, stylistic consistency and current command-line syntax 6-0 Tue Dec 8 2015 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.14 7-0 Tue May 3 2016 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.15 7-1 Fri Oct 28 2016 KenGaillotkgaillot@redhat.com Overhaul upgrade documentation, and document node health strategies 8-0 Tue Oct 25 2016 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.16 9-0 Tue Jul 11 2017 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.17 10-0 Fri Oct 6 2017 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.18 + + 11-0 + Fri Jan 12 2018 + KenGaillotkgaillot@redhat.com + + + + Update for Pacemaker 2.0.0 + + + + diff --git a/doc/Pacemaker_Remote/en-US/Book_Info.xml b/doc/Pacemaker_Remote/en-US/Book_Info.xml index 64e6b32237..bf11f23ff9 100644 --- a/doc/Pacemaker_Remote/en-US/Book_Info.xml +++ b/doc/Pacemaker_Remote/en-US/Book_Info.xml 
@@ -1,75 +1,78 @@ %BOOK_ENTITIES; ]> Pacemaker Remote Scaling High Availablity Clusters 7 - 0 + 1 The document exists as both a reference and deployment guide for the Pacemaker Remote service. The example commands in this document will use: &DISTRO; &DISTRO_VERSION; as the host operating system Pacemaker Remote to perform resource management within guest nodes and remote nodes KVM for virtualization libvirt to manage guest nodes Corosync to provide messaging and membership services on cluster nodes - Pacemaker to perform resource management on cluster nodes + Pacemaker 1.1.16 + While this guide is part of the document set for + Pacemaker 2.0, it demonstrates the version available in + the standard &DISTRO; repositories + to perform resource management on cluster nodes pcs as the cluster configuration toolset The concepts are the same for other distributions, virtualization platforms, toolsets, and messaging layers, and should be easily adaptable. - diff --git a/doc/Pacemaker_Remote/en-US/Ch-Alternatives.txt b/doc/Pacemaker_Remote/en-US/Ch-Alternatives.txt index 0d4238da14..d6543f9fa2 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-Alternatives.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-Alternatives.txt @@ -1,76 +1,76 @@ = Alternative Configurations = These alternative configurations may be appropriate in limited cases, such as a test cluster, but are not the best method in most situations. They are presented here for completeness and as an example of Pacemaker's flexibility to suit your needs. == Virtual Machines as Cluster Nodes == The preferred use of virtual machines in a Pacemaker cluster is as a cluster resource, whether opaque or as a guest node. However, it is possible to run the full cluster stack on a virtual node instead. This is commonly used to set up test environments; a single physical host (that does not participate in the cluster) runs two or more virtual machines, all running the full cluster stack. 
This can be used to simulate a larger cluster for testing purposes. In a production environment, fencing becomes more complicated, especially if the underlying hosts run any services besides the clustered VMs. If the VMs are not guaranteed a minimum amount of host resources, CPU and I/O contention can cause timing issues for cluster components. Another situation where this approach is sometimes used is when the cluster owner leases the VMs from a provider and does not have direct access to the underlying host. The main concerns in this case are proper fencing (usually via a custom resource agent that communicates with the provider's APIs) and maintaining a static IP address between reboots, as well as resource contention issues. == Virtual Machines as Remote Nodes == Virtual machines may be configured following the process for remote nodes rather than guest nodes (i.e., using an *ocf:pacemaker:remote* resource rather than letting the cluster manage the VM directly). This is mainly useful in testing, to use a single physical host to simulate a larger cluster involving remote nodes. Pacemaker's Cluster Test Suite (CTS) uses this approach to test remote node functionality. == Containers as Guest Nodes == Containers,footnote:[https://en.wikipedia.org/wiki/Operating-system-level_virtualization] and in particular Linux containers (LXC) and Docker, have become a popular method of isolating services in a resource-efficient manner. The preferred means of integrating containers into Pacemaker is as a cluster resource, whether opaque or using Pacemaker's 'bundle' resource type. However, it is possible to run `pacemaker_remote` inside a container, following the process for guest nodes. This is not recommended but can be useful, for example, in testing scenarios, to simulate a large number of guest nodes. The configuration process is very similar to that described for guest nodes using virtual machines. 
Key differences: * The underlying host must install the libvirt driver for the desired container technology -- for example, the +libvirt-daemon-lxc+ package to get the - http://libvirt.org/drvlxc.html:[libvirt-lxc] driver for LXC containers. + http://libvirt.org/drvlxc.html[libvirt-lxc] driver for LXC containers. * Libvirt XML definitions must be generated for the containers. The +pacemaker-cts+ package includes a script for this purpose, +/usr/share/pacemaker/tests/cts/lxc_autogen.sh+. Run it with the `--help` option for details on how to use it. It is intended for testing purposes only, and hardcodes various parameters that would need to be set appropriately in real usage. Of course, you can create XML definitions manually, following the appropriate libvirt driver documentation. * To share the authentication key, either share the host's +/etc/pacemaker+ directory with the container, or copy the key into the container's filesystem. * The *VirtualDomain* resource for a container will need *force_stop="true"* and an appropriate hypervisor option, for example *hypervisor="lxc:///"* for LXC containers. diff --git a/doc/Pacemaker_Remote/en-US/Ch-Baremetal-Tutorial.txt b/doc/Pacemaker_Remote/en-US/Ch-Baremetal-Tutorial.txt index 75f8d2216f..27261863ad 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-Baremetal-Tutorial.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-Baremetal-Tutorial.txt @@ -1,310 +1,305 @@ = Remote Node Walk-through = *What this tutorial is:* An in-depth walk-through of how to get Pacemaker to integrate a remote node into the cluster as a node capable of running cluster resources. *What this tutorial is not:* A realistic deployment scenario. The steps shown here are meant to get users familiar with the concept of remote nodes as quickly as possible. This tutorial requires three machines: two to act as cluster nodes, and a third to act as the remote node. 
== Configure Remote Node == === Configure Firewall on Remote Node === Allow cluster-related services through the local firewall: ---- # firewall-cmd --permanent --add-service=high-availability success # firewall-cmd --reload success ---- [NOTE] ====== If you are using iptables directly, or some other firewall solution besides firewalld, simply open the following ports, which can be used by various clustering components: TCP ports 2224, 3121, and 21064, and UDP port 5405. If you run into any problems during testing, you might want to disable the firewall and SELinux entirely until you have everything working. This may create significant security issues and should not be performed on machines that will be exposed to the outside world, but may be appropriate during development and testing on a protected host. To disable security measures: ---- # setenforce 0 # sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config # systemctl mask firewalld.service # systemctl stop firewalld.service # iptables --flush ---- ====== === Configure pacemaker_remote on Remote Node === Install the pacemaker_remote daemon on the remote node. ---- # yum install -y pacemaker-remote resource-agents pcs ---- Create a location for the shared authentication key: ---- # mkdir -p --mode=0750 /etc/pacemaker # chgrp haclient /etc/pacemaker ---- All nodes (both cluster nodes and remote nodes) must have the same authentication key installed for the communication to work correctly. If you already have a key on an existing node, copy it to the new remote node. Otherwise, create a new key, for example: ---- # dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1 ---- Now start and enable the pacemaker_remote daemon on the remote node. ---- # systemctl enable pacemaker_remote.service # systemctl start pacemaker_remote.service ---- Verify the start is successful. 
---- # systemctl status pacemaker_remote pacemaker_remote.service - Pacemaker Remote Service Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled) - Active: active (running) since Fri 2015-08-21 15:21:20 CDT; 20s ago + Active: active (running) since Fri 2018-01-12 15:21:20 CDT; 20s ago Main PID: 21273 (pacemaker_remot) CGroup: /system.slice/pacemaker_remote.service └─21273 /usr/sbin/pacemaker_remoted -Aug 21 15:21:20 remote1 systemd[1]: Starting Pacemaker Remote Service... -Aug 21 15:21:20 remote1 systemd[1]: Started Pacemaker Remote Service. -Aug 21 15:21:20 remote1 pacemaker_remoted[21273]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log -Aug 21 15:21:20 remote1 pacemaker_remoted[21273]: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121. -Aug 21 15:21:20 remote1 pacemaker_remoted[21273]: notice: bind_and_listen: Listening on address :: +Jan 12 15:21:20 remote1 systemd[1]: Starting Pacemaker Remote Service... +Jan 12 15:21:20 remote1 systemd[1]: Started Pacemaker Remote Service. +Jan 12 15:21:20 remote1 pacemaker_remoted[21273]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log +Jan 12 15:21:20 remote1 pacemaker_remoted[21273]: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121. +Jan 12 15:21:20 remote1 pacemaker_remoted[21273]: notice: bind_and_listen: Listening on address :: ---- == Verify Connection to Remote Node == Before moving forward, it's worth verifying that the cluster nodes can contact the remote node on port 3121. Here's a trick you can use. Connect using ssh from each of the cluster nodes. The connection will get destroyed, but how it is destroyed tells you whether it worked or not. First, add the remote node's hostname (we're using *remote1* in this tutorial) to the cluster nodes' +/etc/hosts+ files if you haven't already. This is required unless you have DNS set up in a way where remote1's address can be discovered. 
Execute the following on each cluster node, replacing the IP address with the
actual IP address of the remote node.
----
# cat << END >> /etc/hosts
192.168.122.10 remote1
END
----

If running the ssh command on one of the cluster nodes results in this
output before disconnecting, the connection works:
----
# ssh -p 3121 remote1
ssh_exchange_identification: read: Connection reset by peer
----

If you see one of these, the connection is not working:
----
# ssh -p 3121 remote1
ssh: connect to host remote1 port 3121: No route to host
----
----
# ssh -p 3121 remote1
ssh: connect to host remote1 port 3121: Connection refused
----

Once you can successfully connect to the remote node from both
cluster nodes, move on to setting up Pacemaker on the cluster nodes.

== Configure Cluster Nodes ==

=== Configure Firewall on Cluster Nodes ===

On each cluster node, allow cluster-related services through the local
firewall, following the same procedure as in
<<_configure_firewall_on_remote_node>>.

=== Install Pacemaker on Cluster Nodes ===

On the two cluster nodes, install the following packages.

----
# yum install -y pacemaker corosync pcs resource-agents
----

=== Copy Authentication Key to Cluster Nodes ===

Create a location for the shared authentication key,
and copy it from any existing node:
----
# mkdir -p --mode=0750 /etc/pacemaker
# chgrp haclient /etc/pacemaker
# scp remote1:/etc/pacemaker/authkey /etc/pacemaker/authkey
----

=== Configure Corosync on Cluster Nodes ===

Corosync handles Pacemaker's cluster membership and messaging. The corosync
config file is located in +/etc/corosync/corosync.conf+. That config file must be
initialized with information about the two cluster nodes before pacemaker can
start.

To initialize the corosync config file, execute the following pcs command on
both nodes, filling in the information in <> with your nodes' information.
---- # pcs cluster setup --force --local --name mycluster ---- === Start Pacemaker on Cluster Nodes === Start the cluster stack on both cluster nodes using the following command. ---- # pcs cluster start ---- Verify corosync membership .... # pcs status corosync Membership information ---------------------- Nodeid Votes Name 1 1 node1 (local) .... Verify Pacemaker status. At first, the `pcs cluster status` output will look like this. ---- # pcs status Cluster name: mycluster -Last updated: Fri Aug 21 16:14:05 2015 -Last change: Fri Aug 21 14:02:14 2015 Stack: corosync Current DC: NONE -Version: 1.1.12-a14efad -1 Nodes configured, unknown expected votes -0 Resources configured +Last updated: Fri Jan 12 16:14:05 2018 +Last change: Fri Jan 12 14:02:14 2018 + +1 node configured +0 resources configured ---- After about a minute, you should see your two cluster nodes come online. ---- # pcs status Cluster name: mycluster -Last updated: Fri Aug 21 16:16:32 2015 -Last change: Fri Aug 21 14:02:14 2015 Stack: corosync -Current DC: node1 (1) - partition with quorum -Version: 1.1.12-a14efad -2 Nodes configured -0 Resources configured +Current DC: node1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 16:16:32 2018 +Last change: Fri Jan 12 14:02:14 2018 + +2 nodes configured +0 resources configured Online: [ node1 node2 ] ---- For the sake of this tutorial, we are going to disable stonith to avoid having to cover fencing device configuration. ---- # pcs property set stonith-enabled=false ---- == Integrate Remote Node into Cluster == Integrating a remote node into the cluster is achieved through the creation of a remote node connection resource. The remote node connection resource both establishes the connection to the remote node and defines that the remote node exists. Note that this resource is actually internal to Pacemaker's crmd component. 
A metadata file for this resource can be found in the +/usr/lib/ocf/resource.d/pacemaker/remote+ file that describes what options are available, but there is no actual *ocf:pacemaker:remote* resource agent script that performs any work. Define the remote node connection resource to our remote node, *remote1*, using the following command on any cluster node. ---- # pcs resource create remote1 ocf:pacemaker:remote ---- That's it. After a moment you should see the remote node come online. ---- Cluster name: mycluster -Last updated: Fri Aug 21 17:13:09 2015 -Last change: Fri Aug 21 17:02:02 2015 Stack: corosync -Current DC: node1 (1) - partition with quorum -Version: 1.1.12-a14efad -3 Nodes configured -1 Resources configured +Current DC: node1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 17:13:09 2018 +Last change: Fri Jan 12 17:02:02 2018 +3 nodes configured +1 resources configured Online: [ node1 node2 ] RemoteOnline: [ remote1 ] Full list of resources: remote1 (ocf::pacemaker:remote): Started node1 -PCSD Status: - node1: Online - node2: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- == Starting Resources on Remote Node == Once the remote node is integrated into the cluster, starting resources on a remote node is the exact same as on cluster nodes. Refer to the http://clusterlabs.org/doc/['Clusters from Scratch'] document for examples of resource creation. [WARNING] ========= Never involve a remote node connection resource in a resource group, colocation constraint, or order constraint. ========= == Fencing Remote Nodes == Remote nodes are fenced the same way as cluster nodes. No special considerations are required. Configure fencing resources for use with remote nodes the same as you would with cluster nodes. Note, however, that remote nodes can never 'initiate' a fencing action. 
Only cluster nodes are capable of actually executing a fencing operation against another node. == Accessing Cluster Tools from a Remote Node == Besides allowing the cluster to manage resources on a remote node, pacemaker_remote has one other trick. The pacemaker_remote daemon allows nearly all the pacemaker tools (`crm_resource`, `crm_mon`, `crm_attribute`, `crm_master`, etc.) to work on remote nodes natively. Try it: Run `crm_mon` on the remote node after pacemaker has integrated it into the cluster. These tools just work. This means resource agents such as master/slave resources which need access to tools like `crm_master` work seamlessly on the remote nodes. Higher-level command shells such as `pcs` may have partial support on remote nodes, but it is recommended to run them from a cluster node. diff --git a/doc/Pacemaker_Remote/en-US/Ch-Example.txt b/doc/Pacemaker_Remote/en-US/Ch-Example.txt index cdc1823dd7..7583ed0e77 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-Example.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-Example.txt @@ -1,130 +1,130 @@ = Guest Node Quick Example = If you already know how to use Pacemaker, you'll likely be able to grasp this new concept of guest nodes by reading through this quick example without having to sort through all the detailed walk-through steps. Here are the key configuration ingredients that make this possible using libvirt and KVM virtual guests. These steps strip everything down to the very basics. (((guest node))) (((node,guest node))) == Mile-High View of Configuration Steps == * Give each virtual machine that will be used as a guest node a static network address and unique hostname. * Put the same authentication key with the path +/etc/pacemaker/authkey+ on every cluster node and virtual machine. This secures remote communication.
+ Run this command if you want to make a somewhat random key: + ---- dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1 ---- * Install pacemaker_remote on every virtual machine, enabling it to start at boot, and if a local firewall is used, allow the node to accept connections on TCP port 3121. + ---- yum install pacemaker-remote resource-agents systemctl enable pacemaker_remote firewall-cmd --add-port 3121/tcp --permanent ---- + [NOTE] ====== If you just want to see this work, you may want to simply disable the local firewall and put SELinux in permissive mode while testing. This creates security risks and should not be done on a production machine exposed to the Internet, but can be appropriate for a protected test machine. ====== * Create a Pacemaker resource to launch each virtual machine, using the *remote-node* meta-attribute to let Pacemaker know this will be a guest node capable of running resources. + ---- # pcs resource create vm-guest1 VirtualDomain hypervisor="qemu:///system" config="vm-guest1.xml" meta remote-node="guest1" ---- + The above command will create CIB XML similar to the following: + [source,XML] ---- ---- In the example above, the meta-attribute *remote-node="guest1"* tells Pacemaker that this resource is a guest node with the hostname *guest1*. The cluster will attempt to contact the virtual machine's pacemaker_remote service at the hostname *guest1* after it launches. [NOTE] ====== The ID of the resource creating the virtual machine (*vm-guest1* in the above example) 'must' be different from the virtual machine's uname (*guest1* in the above example). Pacemaker will create an implicit internal resource for the pacemaker_remote connection to the guest, named with the value of *remote-node*, so that value cannot be used as the name of any other resource. 
====== == Using a Guest Node == Guest nodes will show up in `crm_mon` output as normal: .Example `crm_mon` output after *guest1* is integrated into cluster ---- -Last updated: Wed Mar 13 13:52:39 2013 -Last change: Wed Mar 13 13:25:17 2013 via crmd on node1 Stack: corosync -Current DC: node1 (24815808) - partition with quorum -Version: 1.1.10 -2 Nodes configured, unknown expected votes -2 Resources configured. +Current DC: node1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 13:52:39 2018 +Last change: Fri Jan 12 13:25:17 2018 via crmd on node1 + +2 nodes configured +2 resources configured Online: [ node1 guest1] vm-guest1 (ocf::heartbeat:VirtualDomain): Started node1 ---- Now, you could place a resource, such as a webserver, on *guest1*: ---- # pcs resource create webserver apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s # pcs constraint location webserver prefers guest1 ---- Now, the crm_mon output would show: ---- -Last updated: Wed Mar 13 13:52:39 2013 -Last change: Wed Mar 13 13:25:17 2013 via crmd on node1 Stack: corosync -Current DC: node1 (24815808) - partition with quorum -Version: 1.1.10 -2 Nodes configured, unknown expected votes -2 Resources configured. +Current DC: node1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 13:52:39 2018 +Last change: Fri Jan 12 13:25:17 2018 via crmd on node1 + +2 nodes configured +2 resources configured Online: [ node1 guest1] vm-guest1 (ocf::heartbeat:VirtualDomain): Started node1 webserver (ocf::heartbeat::apache): Started guest1 ---- It is worth noting that after *guest1* is integrated into the cluster, nearly all the Pacemaker command-line tools immediately become available to the guest node. This means things like `crm_mon`, `crm_resource`, and `crm_attribute` will work natively on the guest node, as long as the connection between the guest node and a cluster node exists. 
This is particularly important for any master/slave resources executing on the guest node that need access to `crm_master` to set transient attributes. diff --git a/doc/Pacemaker_Remote/en-US/Ch-Intro.txt b/doc/Pacemaker_Remote/en-US/Ch-Intro.txt index 3b6b0c2da0..e20a359967 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-Intro.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-Intro.txt @@ -1,206 +1,159 @@ = Scaling a Pacemaker Cluster = == Overview == In a basic Pacemaker high-availability cluster,footnote:[See the http://www.clusterlabs.org/doc/[Pacemaker documentation], especially 'Clusters From Scratch' and 'Pacemaker Explained', for basic information about high-availability using Pacemaker] each node runs the full cluster stack of corosync and all Pacemaker components. This allows great flexibility but limits scalability to around 16 nodes. To allow for scalability to dozens or even hundreds of nodes, Pacemaker allows nodes not running the full cluster stack to integrate into the cluster and have the cluster manage their resources as if they were a cluster node. == Terms == cluster node:: A node running the full high-availability stack of corosync and all Pacemaker components. Cluster nodes may run cluster resources, run all Pacemaker command-line tools (`crm_mon`, `crm_resource` and so on), execute fencing actions, count toward cluster quorum, and serve as the cluster's Designated Controller (DC). (((cluster node))) (((node,cluster node))) pacemaker_remote:: A small service daemon that allows a host to be used as a Pacemaker node without running the full cluster stack. Nodes running pacemaker_remote may run cluster resources and most command-line tools, but cannot perform other functions of full cluster nodes such as fencing execution, quorum voting or DC eligibility. The pacemaker_remote daemon is an enhanced version of Pacemaker's local resource management daemon (LRMD). (((pacemaker_remote))) remote node:: A physical host running pacemaker_remote. 
Remote nodes have a special resource that manages communication with the cluster. This is sometimes referred to as the 'baremetal' case. (((remote node))) (((node,remote node))) guest node:: A virtual host running pacemaker_remote. Guest nodes differ from remote nodes mainly in that the guest node is itself a resource that the cluster manages. (((guest node))) (((node,guest node))) [NOTE] ====== 'Remote' in this document refers to the node not being a part of the underlying corosync cluster. It has nothing to do with physical proximity. Remote nodes and guest nodes are subject to the same latency requirements as cluster nodes, which means they are typically in the same data center. ====== [NOTE] ====== It is important to distinguish the various roles a virtual machine can serve in Pacemaker clusters: * A virtual machine can run the full cluster stack, in which case it is a cluster node and is not itself managed by the cluster. * A virtual machine can be managed by the cluster as a resource, without the cluster having any awareness of the services running inside the virtual machine. The virtual machine is 'opaque' to the cluster. * A virtual machine can be a cluster resource, and run pacemaker_remote to make it a guest node, allowing the cluster to manage services inside it. The virtual machine is 'transparent' to the cluster. ====== -== Support in Pacemaker Versions == - -It is recommended to run Pacemaker 1.1.12 or later when using pacemaker_remote -due to important bug fixes. 
An overview of changes in pacemaker_remote -capability by version (aside from bug fixes, which are included in every -version): - -.1.1.18 -* Support for unfencing remote nodes (useful with "fabric fencing" agents) -* Guest nodes are now probed for resource status before starting resources - -.1.1.16 -* Support for watchdog-based fencing (sbd) on remote nodes - -.1.1.15 -* If pacemaker_remote is stopped on an active node, it will wait for the - cluster to migrate all resources off before exiting, rather than exit - immediately and get fenced. - -.1.1.14 -* Resources that create guest nodes can be included in groups -* reconnect_interval option for remote nodes - -.1.1.13 -* Support for maintenance mode -* Remote nodes can recover without being fenced when the cluster node - hosting their connection fails -* +#kind+ built-in node attribute for use with rules - -.1.1.12 -* Support for permanent node attributes -* Support for migration - -.1.1.11 -* Support for IPv6 -* Support for remote nodes -* Support for transient node attributes -* Support for clusters with mixed endian architectures - -.1.1.10 -* remote-connect-timeout for guest nodes - -.1.1.9 -* Initial version to include pacemaker_remote -* Limited to guest nodes in KVM/LXC environments using only IPv4; - all nodes' architectures must have same endianness - == Guest Nodes == (((guest node))) (((node,guest node))) *"I want a Pacemaker cluster to manage virtual machine resources, but I also want Pacemaker to be able to manage the resources that live within those virtual machines."* Without pacemaker_remote, the possibilities for implementing the above use case have significant limitations: * The cluster stack could be run on the physical hosts only, which loses the ability to monitor resources within the guests. * A separate cluster could be on the virtual guests, which quickly hits scalability issues. 
* The cluster stack could be run on the guests using the same cluster as the physical hosts, which also hits scalability issues and complicates fencing. With pacemaker_remote: * The physical hosts are cluster nodes (running the full cluster stack). * The virtual machines are guest nodes (running the pacemaker_remote service). Nearly zero configuration is required on the virtual machine. * The cluster stack on the cluster nodes launches the virtual machines and immediately connects to the pacemaker_remote service on them, allowing the virtual machines to integrate into the cluster. The key difference here between the guest nodes and the cluster nodes is that the guest nodes do not run the cluster stack. This means they will never become the DC, initiate fencing actions or participate in quorum voting. On the other hand, this also means that they are not bound to the scalability limits associated with the cluster stack (no 16-node corosync member limits to deal with). That isn't to say that guest nodes can scale indefinitely, but it is known that guest nodes scale horizontally much further than cluster nodes. Other than the quorum limitation, these guest nodes behave just like cluster nodes with respect to resource management. The cluster is fully capable of managing and monitoring resources on each guest node. You can build constraints against guest nodes, put them in standby, or do whatever else you'd expect to be able to do with cluster nodes. They even show up in `crm_mon` output as nodes. 
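As a minimal sketch of the point above, the commands below treat a guest node exactly like a cluster node. The node name *guest1* and the resource name *webserver* are borrowed from the quick example in this document and are assumptions about your setup. The `run` helper and its `APPLY` variable are hypothetical conveniences added here, not part of Pacemaker or pcs: by default the commands are only printed, and they are executed only when `APPLY=1` is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Pacemaker or pcs): print the
# command unless APPLY=1 is set, so the sketch is safe to run anywhere.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Assumed names, taken from the examples in this document.
GUEST=guest1
RESOURCE=webserver

# Prefer the guest node for a resource, exactly as for a cluster node.
place_on_guest() {
    run pcs constraint location "$RESOURCE" prefers "$GUEST"
}

# One-shot cluster status; the guest node is listed alongside cluster nodes.
show_nodes() {
    run crm_mon -1
}

place_on_guest
show_nodes
```

Run it as-is to review the commands, then rerun on a cluster node with `APPLY=1` once the names match your configuration.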
To solidify the concept, below is an example that is very similar to an actual deployment we test in our developer environment to verify guest node scalability: * 16 cluster nodes running the full corosync + pacemaker stack * 64 Pacemaker-managed virtual machine resources running pacemaker_remote configured as guest nodes * 64 Pacemaker-managed webserver and database resources configured to run on the 64 guest nodes With this deployment, you would have 64 webservers and databases running on 64 virtual machines on 16 hardware nodes, all of which are managed and monitored by the same Pacemaker deployment. It is known that pacemaker_remote can scale to these lengths and possibly much further depending on the specific scenario. == Remote Nodes == (((remote node))) (((node,remote node))) *"I want my traditional high-availability cluster to scale beyond the limits imposed by the corosync messaging layer."* Ultimately, the primary advantage of remote nodes over cluster nodes is scalability. There are likely some other use cases related to geographically distributed HA clusters that remote nodes may serve a purpose in, but those use cases are not well understood at this point. Like guest nodes, remote nodes will never become the DC, initiate fencing actions or participate in quorum voting. That is not to say, however, that fencing of a remote node works any differently than that of a cluster node. The Pacemaker policy engine understands how to fence remote nodes. As long as a fencing device exists, the cluster is capable of ensuring remote nodes are fenced in the exact same way as cluster nodes. 
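To illustrate, a manual fencing test against a remote node can use the same command as for a cluster node. This is only a sketch under assumptions: the node name *remote1* comes from the remote node walk-through, a working fencing device is assumed to be configured already, and the `run` helper with its `APPLY` variable is a hypothetical dry-run wrapper added here. With `APPLY=1` the command really does fence the node, so use it only on a test system.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper (not part of Pacemaker or pcs): print the
# command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Assumed remote node name, taken from the walk-through in this document.
REMOTE=remote1

# Must be issued from a cluster node: remote nodes can be fenced,
# but they can never initiate a fencing action themselves.
fence_remote() {
    run pcs stonith fence "$REMOTE"
}

fence_remote
```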
== Expanding the Cluster Stack == With pacemaker_remote, the traditional view of the high-availability stack can be expanded to include a new layer: .Traditional HA Stack image::images/pcmk-ha-cluster-stack.png["Traditional Pacemaker+Corosync Stack",width="17cm",height="9cm",align="center"] .HA Stack With Guest Nodes image::images/pcmk-ha-remote-stack.png["Pacemaker+Corosync Stack With pacemaker_remote",width="20cm",height="10cm",align="center"] diff --git a/doc/Pacemaker_Remote/en-US/Ch-KVM-Tutorial.txt b/doc/Pacemaker_Remote/en-US/Ch-KVM-Tutorial.txt index 2c38d5ed10..cf54d49655 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-KVM-Tutorial.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-KVM-Tutorial.txt @@ -1,583 +1,578 @@ = Guest Node Walk-through = *What this tutorial is:* An in-depth walk-through of how to get Pacemaker to manage a KVM guest instance and integrate that guest into the cluster as a guest node. *What this tutorial is not:* A realistic deployment scenario. The steps shown here are meant to get users familiar with the concept of guest nodes as quickly as possible. == Configure the Physical Host == [NOTE] ====== For this example, we will use a single physical host named *example-host*. A production cluster would likely have multiple physical hosts, in which case you would run the commands here on each one, unless noted otherwise. ====== === Configure Firewall on Host === On the physical host, allow cluster-related services through the local firewall: ---- # firewall-cmd --permanent --add-service=high-availability success # firewall-cmd --reload success ---- [NOTE] ====== If you are using iptables directly, or some other firewall solution besides firewalld, simply open the following ports, which can be used by various clustering components: TCP ports 2224, 3121, and 21064, and UDP port 5405. If you run into any problems during testing, you might want to disable the firewall and SELinux entirely until you have everything working. 
This may create significant security issues and should not be performed on machines that will be exposed to the outside world, but may be appropriate during development and testing on a protected host. To disable security measures: ---- [root@pcmk-1 ~]# setenforce 0 [root@pcmk-1 ~]# sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config [root@pcmk-1 ~]# systemctl mask firewalld.service [root@pcmk-1 ~]# systemctl stop firewalld.service [root@pcmk-1 ~]# iptables --flush ---- ====== === Install Cluster Software === ---- # yum install -y pacemaker corosync pcs resource-agents ---- === Configure Corosync === Corosync handles pacemaker's cluster membership and messaging. The corosync config file is located in +/etc/corosync/corosync.conf+. That config file must be initialized with information about the cluster nodes before pacemaker can start. To initialize the corosync config file, execute the following `pcs` command, replacing the cluster name and hostname as desired: ---- # pcs cluster setup --force --local --name mycluster example-host ---- [NOTE] ====== If you have multiple physical hosts, you would execute the setup command on only one host, but list all of them at the end of the command. ====== === Configure Pacemaker for Remote Node Communication === Create a place to hold an authentication key for use with pacemaker_remote: ---- # mkdir -p --mode=0750 /etc/pacemaker # chgrp haclient /etc/pacemaker ---- Generate a key: ---- # dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1 ---- [NOTE] ====== If you have multiple physical hosts, you would generate the key on only one host, and copy it to the same location on all hosts. ====== === Verify Cluster Software === Start the cluster ---- # pcs cluster start ---- Verify corosync membership .... # pcs status corosync Membership information ---------------------- Nodeid Votes Name 1 1 example-host (local) .... Verify pacemaker status. 
At first, the output will look like this: ---- # pcs status Cluster name: mycluster WARNING: no stonith devices and stonith-enabled is not false -Last updated: Fri Oct 9 15:18:32 2015 Last change: Fri Oct 9 12:42:21 2015 by root via cibadmin on example-host Stack: corosync Current DC: NONE -1 node and 0 resources configured +Last updated: Fri Jan 12 15:18:32 2018 +Last change: Fri Jan 12 12:42:21 2018 by root via cibadmin on example-host -Node example-host: UNCLEAN (offline) - -Full list of resources: +1 node configured +0 resources configured +Node example-host: UNCLEAN (offline) -PCSD Status: - example-host: Online +No active resources Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- After a short amount of time, you should see your host as a single node in the cluster: ---- # pcs status Cluster name: mycluster WARNING: no stonith devices and stonith-enabled is not false -Last updated: Fri Oct 9 15:20:05 2015 Last change: Fri Oct 9 12:42:21 2015 by root via cibadmin on example-host Stack: corosync -Current DC: example-host (version 1.1.13-a14efad) - partition WITHOUT quorum -1 node and 0 resources configured - -Online: [ example-host ] +Current DC: example-host (version 1.1.16-12.el7_4.5-94ff4df) - partition WITHOUT quorum +Last updated: Fri Jan 12 15:20:05 2018 +Last change: Fri Jan 12 12:42:21 2018 by root via cibadmin on example-host -Full list of resources: +1 node configured +0 resources configured +Online: [ example-host ] -PCSD Status: - example-host: Online +No active resources Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- === Disable STONITH and Quorum === Now, enable the cluster to work without quorum or stonith. This is required for the sake of getting this tutorial to work with a single cluster node. 
---- # pcs property set stonith-enabled=false # pcs property set no-quorum-policy=ignore ---- [WARNING] ========= The use of `stonith-enabled=false` is completely inappropriate for a production cluster. It tells the cluster to simply pretend that failed nodes are safely powered off. Some vendors will refuse to support clusters that have STONITH disabled. We disable STONITH here only to focus the discussion on pacemaker_remote, and to be able to use a single physical host in the example. ========= Now, the status output should look similar to this: ---- # pcs status Cluster name: mycluster -Last updated: Fri Oct 9 15:22:49 2015 Last change: Fri Oct 9 15:22:46 2015 by root via cibadmin on example-host Stack: corosync -Current DC: example-host (version 1.1.13-a14efad) - partition with quorum -1 node and 0 resources configured - -Online: [ example-host ] +Current DC: example-host (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 15:22:49 2018 +Last change: Fri Jan 12 15:22:46 2018 by root via cibadmin on example-host -Full list of resources: +1 node configured +0 resources configured +Online: [ example-host ] -PCSD Status: - example-host: Online +No active resources Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- Go ahead and stop the cluster for now after verifying everything is in order. ---- # pcs cluster stop --force ---- === Install Virtualization Software === ---- # yum install -y kvm libvirt qemu-system qemu-kvm bridge-utils virt-manager # systemctl enable libvirtd.service ---- Reboot the host. [NOTE] ====== While KVM is used in this example, any virtualization platform with a Pacemaker resource agent can be used to create a guest node. The resource agent needs only to support usual commands (start, stop, etc.); Pacemaker implements the *remote-node* meta-attribute, independent of the agent. 
====== == Configure the KVM guest == === Create Guest === We will not outline here the installation steps required to create a KVM guest. There are plenty of tutorials available elsewhere that do that. Just be sure to configure the guest with a hostname and a static IP address (as an example here, we will use guest1 and 192.168.122.10). === Configure Firewall on Guest === On each guest, allow cluster-related services through the local firewall, following the same procedure as in <<_configure_firewall_on_host>>. === Verify Connectivity === At this point, you should be able to ping and ssh into guests from hosts, and vice versa. === Configure pacemaker_remote === Install pacemaker_remote, and enable it to run at start-up. Here, we also install the pacemaker package; it is not required, but it contains the dummy resource agent that we will use later for testing. ---- # yum install -y pacemaker pacemaker-remote resource-agents # systemctl enable pacemaker_remote.service ---- Copy the authentication key from a host: ---- # mkdir -p --mode=0750 /etc/pacemaker # chgrp haclient /etc/pacemaker # scp root@example-host:/etc/pacemaker/authkey /etc/pacemaker ---- Start pacemaker_remote, and verify the start was successful: ---- # systemctl start pacemaker_remote # systemctl status pacemaker_remote pacemaker_remote.service - Pacemaker Remote Service Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled) Active: active (running) since Thu 2013-03-14 18:24:04 EDT; 2min 8s ago Main PID: 1233 (pacemaker_remot) CGroup: name=systemd:/system/pacemaker_remote.service └─1233 /usr/sbin/pacemaker_remoted Mar 14 18:24:04 guest1 systemd[1]: Starting Pacemaker Remote Service... Mar 14 18:24:04 guest1 systemd[1]: Started Pacemaker Remote Service. Mar 14 18:24:04 guest1 pacemaker_remoted[1233]: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121. 
---- === Verify Host Connection to Guest === Before moving forward, it's worth verifying that the host can contact the guest on port 3121. Here's a trick you can use. Connect using ssh from the host. The connection will get destroyed, but how it is destroyed tells you whether it worked or not. First add guest1 to the host machine's +/etc/hosts+ file if you haven't already. This is required unless you have DNS set up so that guest1's address can be resolved. ---- # cat << END >> /etc/hosts 192.168.122.10 guest1 END ---- If running the ssh command on one of the cluster nodes results in this output before disconnecting, the connection works: ---- # ssh -p 3121 guest1 ssh_exchange_identification: read: Connection reset by peer ---- If you see one of these, the connection is not working: ---- # ssh -p 3121 guest1 ssh: connect to host guest1 port 3121: No route to host ---- ---- # ssh -p 3121 guest1 ssh: connect to host guest1 port 3121: Connection refused ---- Once you can successfully connect to the guest from the host, shut down the guest. Pacemaker will be managing the virtual machine from this point forward. == Integrate Guest into Cluster == Now the fun part: integrating the virtual machine you've just created into the cluster. It is incredibly simple. === Start the Cluster === On the host, start pacemaker. ---- # pcs cluster start ---- Wait for the host to become the DC. The output of `pcs status` should look as it did in <<_disable_stonith_and_quorum>>. === Integrate as Guest Node === If you didn't already do this earlier in the verify host to guest connection section, add the KVM guest's IP address to the host's +/etc/hosts+ file so we can connect by hostname. For this example: ---- # cat << END >> /etc/hosts 192.168.122.10 guest1 END ---- We will use the *VirtualDomain* resource agent for the management of the virtual machine. This agent requires the virtual machine's XML config to be dumped to a file on disk.
To do this, pick out the name of the virtual machine you just created from the output of this list. .... # virsh list --all Id Name State ---------------------------------------------------- - guest1 shut off .... In my case I named it guest1. Dump the xml to a file somewhere on the host using the following command. ---- # virsh dumpxml guest1 > /etc/pacemaker/guest1.xml ---- Now just register the resource with pacemaker and you're set! ---- # pcs resource create vm-guest1 VirtualDomain hypervisor="qemu:///system" \ config="/etc/pacemaker/guest1.xml" meta remote-node=guest1 ---- [NOTE] ====== This example puts the guest XML under /etc/pacemaker because the permissions and SELinux labeling should not need any changes. If you run into trouble with this or any step, try disabling SELinux with `setenforce 0`. If it works after that, see SELinux documentation for how to troubleshoot, if you wish to reenable SELinux. ====== [NOTE] ====== Pacemaker will automatically monitor pacemaker_remote connections for failure, so it is not necessary to create a recurring monitor on the VirtualDomain resource. ====== Once the *vm-guest1* resource is started you will see *guest1* appear in the `pcs status` output as a node. The final `pcs status` output should look something like this. 
---- # pcs status Cluster name: mycluster -Last updated: Fri Oct 9 18:00:45 2015 Last change: Fri Oct 9 17:53:44 2015 by root via crm_resource on example-host Stack: corosync -Current DC: example-host (version 1.1.13-a14efad) - partition with quorum -2 nodes and 2 resources configured +Current DC: example-host (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 18:00:45 2018 +Last change: Fri Jan 12 17:53:44 2018 by root via crm_resource on example-host + +2 nodes configured +2 resources configured Online: [ example-host ] GuestOnline: [ guest1@example-host ] Full list of resources: vm-guest1 (ocf::heartbeat:VirtualDomain): Started example-host -PCSD Status: - example-host: Online - Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- === Starting Resources on KVM Guest === The commands below demonstrate how resources can be executed on both the guest node and the cluster node. Create a few Dummy resources. Dummy resources are real resource agents used just for testing purposes. They actually execute on the host they are assigned to just like an apache server or database would, except their execution just means a file was created. When the resource is stopped, the file it created is removed. ---- # pcs resource create FAKE1 ocf:pacemaker:Dummy # pcs resource create FAKE2 ocf:pacemaker:Dummy # pcs resource create FAKE3 ocf:pacemaker:Dummy # pcs resource create FAKE4 ocf:pacemaker:Dummy # pcs resource create FAKE5 ocf:pacemaker:Dummy ---- Now check your `pcs status` output. In the resource section, you should see something like the following, where some of the resources started on the cluster node, and some started on the guest node.
---- Full list of resources: vm-guest1 (ocf::heartbeat:VirtualDomain): Started example-host FAKE1 (ocf::pacemaker:Dummy): Started guest1 FAKE2 (ocf::pacemaker:Dummy): Started guest1 FAKE3 (ocf::pacemaker:Dummy): Started example-host FAKE4 (ocf::pacemaker:Dummy): Started guest1 FAKE5 (ocf::pacemaker:Dummy): Started example-host ---- The guest node, *guest1*, reacts just like any other node in the cluster. For example, pick out a resource that is running on your cluster node. For my purposes, I am picking FAKE3 from the output above. We can force FAKE3 to run on *guest1* in the exact same way we would any other node. ---- # pcs constraint location FAKE3 prefers guest1 ---- Now, looking at the bottom of the `pcs status` output you'll see FAKE3 is on *guest1*. ---- Full list of resources: vm-guest1 (ocf::heartbeat:VirtualDomain): Started example-host FAKE1 (ocf::pacemaker:Dummy): Started guest1 FAKE2 (ocf::pacemaker:Dummy): Started guest1 FAKE3 (ocf::pacemaker:Dummy): Started guest1 FAKE4 (ocf::pacemaker:Dummy): Started example-host FAKE5 (ocf::pacemaker:Dummy): Started example-host ---- === Testing Recovery and Fencing === Pacemaker's policy engine is smart enough to know fencing guest nodes associated with a virtual machine means shutting off/rebooting the virtual machine. No special configuration is necessary to make this happen. If you are interested in testing this functionality out, try stopping the guest's pacemaker_remote daemon. This would be the equivalent of abruptly terminating a cluster node's corosync membership without properly shutting it down. Use ssh to log in to the guest and run this command. ---- # kill -9 `pidof pacemaker_remoted` ---- Within a few seconds, your `pcs status` output will show a monitor failure, and the *guest1* node will not be shown while it is being recovered.
---- # pcs status Cluster name: mycluster -Last updated: Fri Oct 9 18:08:35 2015 Last change: Fri Oct 9 18:07:00 2015 by root via cibadmin on example-host Stack: corosync -Current DC: example-host (version 1.1.13-a14efad) - partition with quorum -2 nodes and 7 resources configured +Current DC: example-host (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 18:08:35 2018 +Last change: Fri Jan 12 18:07:00 2018 by root via cibadmin on example-host + +2 nodes configured +7 resources configured Online: [ example-host ] Full list of resources: vm-guest1 (ocf::heartbeat:VirtualDomain): Started example-host FAKE1 (ocf::pacemaker:Dummy): Stopped FAKE2 (ocf::pacemaker:Dummy): Stopped FAKE3 (ocf::pacemaker:Dummy): Stopped FAKE4 (ocf::pacemaker:Dummy): Started example-host FAKE5 (ocf::pacemaker:Dummy): Started example-host Failed Actions: * guest1_monitor_30000 on example-host 'unknown error' (1): call=8, status=Error, exitreason='none', - last-rc-change='Fri Oct 9 18:08:29 2015', queued=0ms, exec=0ms - - -PCSD Status: - example-host: Online + last-rc-change='Fri Jan 12 18:08:29 2018', queued=0ms, exec=0ms Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- [NOTE] ====== A guest node involves two resources: the one you explicitly configured creates the guest, and Pacemaker creates an implicit resource for the pacemaker_remote connection, which will be named the same as the value of the *remote-node* attribute of the explicit resource. When we killed pacemaker_remote, it is the implicit resource that failed, which is why the failed action starts with *guest1* and not *vm-guest1*. ====== Once recovery of the guest is complete, you'll see it automatically get re-integrated into the cluster. The final `pcs status` output should look something like this. 
---- Cluster name: mycluster -Last updated: Fri Oct 9 18:18:30 2015 Last change: Fri Oct 9 18:07:00 2015 by root via cibadmin on example-host Stack: corosync -Current DC: example-host (version 1.1.13-a14efad) - partition with quorum -2 nodes and 7 resources configured +Current DC: example-host (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum +Last updated: Fri Jan 12 18:18:30 2018 +Last change: Fri Jan 12 18:07:00 2018 by root via cibadmin on example-host + +2 nodes configured +7 resources configured Online: [ example-host ] GuestOnline: [ guest1@example-host ] Full list of resources: vm-guest1 (ocf::heartbeat:VirtualDomain): Started example-host FAKE1 (ocf::pacemaker:Dummy): Started guest1 FAKE2 (ocf::pacemaker:Dummy): Started guest1 FAKE3 (ocf::pacemaker:Dummy): Started guest1 FAKE4 (ocf::pacemaker:Dummy): Started example-host FAKE5 (ocf::pacemaker:Dummy): Started example-host Failed Actions: * guest1_monitor_30000 on example-host 'unknown error' (1): call=8, status=Error, exitreason='none', - last-rc-change='Fri Oct 9 18:08:29 2015', queued=0ms, exec=0ms - - -PCSD Status: - example-host: Online + last-rc-change='Fri Jan 12 18:08:29 2018', queued=0ms, exec=0ms Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled ---- Normally, once you've investigated and addressed a failed action, you can clear the failure. However, Pacemaker does not yet support cleanup for the implicitly created connection resource while the explicit resource is active. If you want to clear the failed action from the status output, stop the guest resource before clearing it. For example: ---- # pcs resource disable vm-guest1 --wait # pcs resource cleanup guest1 # pcs resource enable vm-guest1 ---- === Accessing Cluster Tools from Guest Node === Besides allowing the cluster to manage resources on a guest node, pacemaker_remote has one other trick. 
The pacemaker_remote daemon allows nearly all the pacemaker tools (`crm_resource`, `crm_mon`, `crm_attribute`, `crm_master`, etc.) to work on guest nodes natively. Try it: Run `crm_mon` on the guest after pacemaker has integrated the guest node into the cluster. These tools just work. This means resource agents such as master/slave resources which need access to tools like `crm_master` work seamlessly on the guest nodes. Higher-level command shells such as `pcs` may have partial support on guest nodes, but it is recommended to run them from a cluster node. diff --git a/doc/Pacemaker_Remote/en-US/Pacemaker_Remote.ent b/doc/Pacemaker_Remote/en-US/Pacemaker_Remote.ent index 48c48c28fc..be2a282131 100644 --- a/doc/Pacemaker_Remote/en-US/Pacemaker_Remote.ent +++ b/doc/Pacemaker_Remote/en-US/Pacemaker_Remote.ent @@ -1,6 +1,6 @@ - + - + diff --git a/doc/Pacemaker_Remote/en-US/Revision_History.xml b/doc/Pacemaker_Remote/en-US/Revision_History.xml index d0ad93af96..b636049110 100644 --- a/doc/Pacemaker_Remote/en-US/Revision_History.xml +++ b/doc/Pacemaker_Remote/en-US/Revision_History.xml @@ -1,55 +1,61 @@ %BOOK_ENTITIES; ]> Revision History 1-0 Tue Mar 19 2013 DavidVosseldavidvossel@gmail.com Import from Pages.app 2-0 Tue May 13 2013 DavidVosseldavidvossel@gmail.com Added Future Features Section 3-0 Fri Oct 18 2013 DavidVosseldavidvossel@gmail.com Added Baremetal remote-node feature documentation 4-0 Tue Aug 25 2015 KenGaillotkgaillot@redhat.com Targeted CentOS 7.1 and Pacemaker 1.1.12+, updated for current terminology and practice 5-0 Tue Dec 8 2015 KenGaillotkgaillot@redhat.com Updated for Pacemaker 1.1.14 6-0 Tue May 3 2016 KenGaillotkgaillot@redhat.com Updated for Pacemaker 1.1.15 7-0 Mon Oct 31 2016 KenGaillotkgaillot@redhat.com Updated for Pacemaker 1.1.16 + + 7-1 + Fri Jan 12 2018 + KenGaillotkgaillot@redhat.com + Update banner for Pacemaker 2.0 and content for CentOS 7.4 with Pacemaker 1.1.16 + diff --git a/doc/shared/en-US/pacemaker-intro.txt 
b/doc/shared/en-US/pacemaker-intro.txt index c55ff9a108..bfa10f5ee5 100644 --- a/doc/shared/en-US/pacemaker-intro.txt +++ b/doc/shared/en-US/pacemaker-intro.txt @@ -1,158 +1,162 @@ - == What Is 'Pacemaker'? == -Pacemaker is a 'cluster resource manager', that is, a logic responsible -for a life-cycle of deployed software -- indirectly perhaps even whole -systems or their interconnections -- under its control within a set of -computers (a.k.a. 'nodes') and driven by -prescribed rules. +*Pacemaker* is a high-availability 'cluster resource manager' -- software that +runs on a set of hosts (a 'cluster' of 'nodes') in order to minimize downtime of +desired services ('resources'). +footnote:[ +'Cluster' is sometimes used in other contexts to refer to hosts grouped +together for other purposes, such as high-performance computing (HPC), but +Pacemaker is not intended for those purposes. +] + +Pacemaker's key features include: -It achieves maximum availability for your cluster services -(a.k.a. 'resources') by detecting and recovering from node- and -resource-level failures by making use of the messaging and membership -capabilities provided by an underlying cluster infrastructure layer -(currently http://www.corosync.org/[Corosync]), and possibly by -utilizing other parts of the overall cluster stack. + * Detection of and recovery from node- and service-level failures + * Ability to ensure data integrity by fencing faulty nodes + * Support for one or more nodes per cluster + * Support for multiple resource interface standards (anything that can be + scripted can be clustered) + * Support (but no requirement) for shared storage + * Support for practically any redundancy configuration (active/passive, N+1, + etc.) 
+ * Automatically replicated configuration that can be updated from any node + * Ability to specify cluster-wide relationships between services, + such as ordering, colocation and anti-colocation + * Support for advanced service types, such as 'clones' (services that need to + be active on multiple nodes), 'stateful resources' (clones that can run in + one of two modes), and containerized services + * Unified, scriptable cluster management tools -.High Availability Clusters +.Fencing [NOTE] -For *the goal of minimal downtime* a term 'high availability' was coined -and together with its acronym, 'HA', is well-established in the sector. -To differentiate this sort of clusters from high performance computing -('HPC') ones, should a context require it (apparently, not the case in -this document), using 'HA cluster' is an option. +==== +'Fencing', also known as 'STONITH' (an acronym for Shoot The Other Node In The +Head), is the ability to ensure that it is not possible for a node to be +running a service. This is accomplished via 'fence devices' such as +intelligent power switches that cut power to the target, or intelligent +network switches that cut the target's access to the local network. -Pacemaker's key features include: +Pacemaker represents fence devices as a special class of resource. 
- * Detection and recovery of node and service-level failures - * Storage agnostic, no requirement for shared storage - * Resource agnostic, anything that can be scripted can be clustered - * Supports 'fencing' (also referred to as the 'STONITH' acronym, - <> later on) for ensuring data integrity - * Supports large and small clusters - * Supports both quorate and resource-driven clusters - * Supports practically any redundancy configuration - * Automatically replicated configuration that can be updated - from any node - * Ability to specify cluster-wide service ordering, - colocation and anti-colocation - * Support for advanced service types - ** Clones: for services which need to be active on multiple nodes - ** Multi-state: for services with multiple modes - (e.g. master/slave, primary/secondary) - * Unified, scriptable cluster management tools +A cluster cannot safely recover from certain failure conditions, such as an +unresponsive node, without fencing. +==== -== Pacemaker Architecture == +== Cluster Architecture == -At the highest level, the cluster is made up of three pieces: - - * *Non-cluster-aware components*. These pieces - include the resources themselves; scripts that start, stop and - monitor them; and a local daemon that masks the differences - between the different standards these scripts implement. - Even though interactions of these resources when run as multiple - instances can resemble a distributed system, they still lack - the proper HA mechanisms and/or autonomous cluster-wide governance - as subsumed in the following item. - - * *Resource management*. Pacemaker provides the brain that processes - and reacts to events regarding the cluster. These events include - nodes joining or leaving the cluster; resource events caused by - failures, maintenance and scheduled activities; and other - administrative actions. Pacemaker will compute the ideal state of - the cluster and plot a path to achieve it after any of these - events. 
This may include moving resources, stopping nodes and even - forcing them offline with remote power switches. - - * *Cluster membership layer:* The Corosync project provides reliable +At a high level, a cluster can be viewed as having these parts (which together are +often referred to as the 'cluster stack'): + + * *Resources:* These are the reason for the cluster's being -- the services + that need to be kept highly available. + + * *Resource agents:* These are scripts or operating system components that + start, stop, and monitor resources, given a set of resource parameters. + These provide a uniform interface between Pacemaker and the managed + services. + + * *Fence agents:* These are scripts that execute node fencing actions, + given a target and fence device parameters. + + * *Cluster membership layer:* This component provides reliable messaging, membership, and quorum information about the cluster. + Currently, Pacemaker supports http://www.corosync.org/[Corosync] + as this layer. + + * *Cluster resource manager:* Pacemaker provides the brain that processes + and reacts to events that occur in the cluster. These events may include + nodes joining or leaving the cluster; resource events caused by failures, + maintenance, or scheduled activities; and other administrative actions. + To achieve the desired availability, Pacemaker may start and stop resources + and fence nodes. + + * *Cluster tools:* These provide an interface for users to interact with the + cluster. Various command-line and graphical (GUI) interfaces are available. -Most managed services are not, themselves, cluster-aware. 
However, many popular +open-source cluster filesystems make use of a common 'Distributed Lock +Manager' (DLM), which makes direct use of Corosync for its messaging and +membership capabilities and Pacemaker for the ability to fence nodes. -.The Pacemaker Stack -image::images/pcmk-stack.png["The Pacemaker stack",width="10cm",height="7.5cm",align="center"] +.Example Cluster Stack +image::images/pcmk-stack.png["Example cluster stack",width="10cm",height="7.5cm",align="center"] -=== Internal Components === +== Pacemaker Architecture == -Pacemaker itself is composed of five key components: +Pacemaker itself is composed of multiple daemons that work together: - * 'Cluster Information Base' ('CIB') - * 'Cluster Resource Management daemon' ('CRMd') - * 'Local Resource Management daemon' ('LRMd') - * 'Policy Engine' ('PEngine' or 'PE') - * Fencing daemon ('STONITHd') + * attrd + * cib + * crmd + * lrmd + * pacemakerd + * pengine + * stonithd .Internal Components -image::images/pcmk-internals.png["Subsystems of a Pacemaker cluster",align="center",scaledwidth="65%"] +image::images/pcmk-internals.png["Pacemaker software components",align="center",scaledwidth="65%"] -The CIB uses XML to represent both the cluster's configuration and -current state of all resources in the cluster. The contents of the CIB -are automatically kept in sync across the entire cluster and are used by -the PEngine to compute the ideal state of the cluster and how it should -be achieved. +The Pacemaker daemon (pacemakerd) is the master process that spawns all the +other daemons, and respawns them if they unexpectedly exit. -This list of instructions is then fed to the 'Designated Controller' -('DC'). Pacemaker centralizes all cluster decision making by electing -one of the CRMd instances to act as a master. Should the elected CRMd -process (or the node it is on) fail, a new one is quickly established. 
+The 'Cluster Information Base' (CIB) is an +https://en.wikipedia.org/wiki/XML[XML] representation of the cluster's +configuration and the state of all nodes and resources. The CIB daemon (cib) +keeps the CIB synchronized across the cluster, and handles requests to modify it. -The DC carries out the PEngine's instructions in the required order by -passing them to either the Local Resource Management daemon (LRMd) or -CRMd peers on other nodes via the cluster messaging infrastructure -(which in turn passes them on to their LRMd process). +The 'attribute daemon' (attrd) maintains a database of attributes for all +nodes, keeps it synchronized across the cluster, and handles requests to modify +them. These attributes are usually recorded in the CIB. -The peer nodes all report the results of their operations back to the DC -and, based on the expected and actual results, will either execute any -actions that needed to wait for the previous one to complete, or abort -processing and ask the PEngine to recalculate the ideal cluster state -based on the unexpected results. +Given a snapshot of the CIB as input, the 'policy engine' (pengine) determines +what actions are necessary to achieve the desired state of the cluster. -In some cases, it may be necessary to power off nodes in order to -protect shared data or complete resource recovery. For this, Pacemaker -comes with STONITHd. +The 'local resource management daemon' (lrmd) handles requests to execute +resource agents on the local node, and returns the result. -[[s-intro-stonith]] -.STONITH -[NOTE] -*STONITH* is an acronym for 'Shoot-The-Other-Node-In-The-Head', -a recommended practice that misbehaving node is best to be promptly -'fenced' (shut off, cut from shared resources or otherwise immobilized), -and is usually implemented with a remote power switch. 
- -In Pacemaker, STONITH devices are modeled as resources (and configured -in the CIB) to enable them to be easily monitored for failure, however -STONITHd takes care of understanding the STONITH topology such that its -clients simply request a node be fenced, and it does the rest. - -== Types of Pacemaker Clusters == - -Pacemaker makes no assumptions about your environment. This allows it -to support practically any -http://en.wikipedia.org/wiki/High-availability_cluster#Node_configurations[redundancy -configuration] including 'Active/Active', 'Active/Passive', 'N+1', +The 'STONITH daemon' (stonithd) handles requests to fence nodes. Given a target +node, stonithd decides which cluster node(s) should execute which fencing +device(s), and calls the necessary fencing agents (either directly, or via +requests to stonithd peers on other nodes), and returns the result. + +The 'cluster resource management daemon' ('CRMd') is Pacemaker's coordinator, +maintaining a consistent view of the cluster membership and orchestrating all +the other components. + +Pacemaker centralizes cluster decision-making by electing one of the CRMd +instances as the 'Designated Controller' ('DC'). Should the elected CRMd +process (or the node it is on) fail, a new one is quickly established. +The DC responds to cluster events by taking a current snapshot of the CIB, +feeding it to the policy engine, then asking the lrmd (either directly on the +local node, or via requests to crmd peers on other nodes) and stonithd to +execute any necessary actions. + +== Node Redundancy Designs == + +Pacemaker supports practically any +https://en.wikipedia.org/wiki/High-availability_cluster#Node_configurations[node +redundancy configuration] including 'Active/Active', 'Active/Passive', 'N+1', 'N+M', 'N-to-1' and 'N-to-N'. 
+Active/passive clusters with two (or more) nodes using Pacemaker and +https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device[DRBD] are +a cost-effective high-availability solution for many situations. One of the +nodes provides the desired services, and if it fails, the other node takes +over. + .Active/Passive Redundancy image::images/pcmk-active-passive.png["Active/Passive Redundancy",width="10cm",height="7.5cm",align="center"] -Two-node Active/Passive clusters using Pacemaker and 'DRBD' are -a cost-effective solution for many High Availability situations. +Pacemaker also supports multiple nodes in a shared-failover design, +reducing hardware costs by allowing several active/passive clusters to be +combined and share a common backup node. .Shared Failover image::images/pcmk-shared-failover.png["Shared Failover",width="10cm",height="7.5cm",align="center"] -By supporting many nodes, Pacemaker can dramatically reduce hardware -costs by allowing several active/passive clusters to be combined and -share a common backup node. +When shared storage is available, every node can potentially be used for +failover. Pacemaker can even run multiple copies of services to spread out the +workload. .N to N Redundancy image::images/pcmk-active-active.png["N to N Redundancy",width="10cm",height="7.5cm",align="center"] - -When shared storage is available, every node can potentially be used for -failover. Pacemaker can even run multiple copies of services to spread -out the workload. - diff --git a/lib/common/digest.c b/lib/common/digest.c index fb2b8517bf..573116f8a6 100644 --- a/lib/common/digest.c +++ b/lib/common/digest.c @@ -1,243 +1,246 @@ /* * Copyright (C) 2015 Andrew Beekhof * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public * License as published by the Free Software Foundation; either * version 2.1 of the License, or (at your option) any later version. 
* * This library is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for more details. * * You should have received a copy of the GNU Lesser General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #define BEST_EFFORT_STATUS 0 /*! * \brief Dump XML in a format used with v1 digests * * \param[in] an_xml_node Root of XML to dump * * \return Newly allocated buffer containing dumped XML */ static char * dump_xml_for_digest(xmlNode * an_xml_node) { char *buffer = NULL; int offset = 0, max = 0; /* for compatibility with the old result which is used for v1 digests */ crm_buffer_add_char(&buffer, &offset, &max, ' '); crm_xml_dump(an_xml_node, 0, &buffer, &offset, &max, 0); crm_buffer_add_char(&buffer, &offset, &max, '\n'); return buffer; } /*! * \brief Calculate and return v1 digest of XML tree * * \param[in] input Root of XML to digest * \param[in] sort Whether to sort the XML before calculating digest * \param[in] ignored Not used * * \return Newly allocated string containing digest * \note Example return value: "c048eae664dba840e1d2060f00299e9d" */ static char * calculate_xml_digest_v1(xmlNode * input, gboolean sort, gboolean ignored) { char *digest = NULL; char *buffer = NULL; xmlNode *copy = NULL; if (sort) { crm_trace("Sorting xml..."); copy = sorted_xml(input, NULL, TRUE); crm_trace("Done"); input = copy; } buffer = dump_xml_for_digest(input); CRM_CHECK(buffer != NULL && strlen(buffer) > 0, free_xml(copy); free(buffer); return NULL); digest = crm_md5sum(buffer); crm_log_xml_trace(input, "digest:source"); free(buffer); free_xml(copy); return digest; } /*! 
* \brief Calculate and return v2 digest of XML tree * * \param[in] source Root of XML to digest * \param[in] do_filter Whether to filter certain XML attributes * * \return Newly allocated string containing digest */ static char * calculate_xml_digest_v2(xmlNode * source, gboolean do_filter) { char *digest = NULL; char *buffer = NULL; int offset, max; static struct qb_log_callsite *digest_cs = NULL; crm_trace("Begin digest %s", do_filter?"filtered":""); if (do_filter && BEST_EFFORT_STATUS) { /* Exclude the status calculation from the digest * * This doesn't mean it won't be sync'd, we just won't be paranoid * about it being an _exact_ copy * * We don't need it to be exact, since we throw it away and regenerate * from our peers whenever a new DC is elected anyway * * Importantly, this reduces the amount of XML to copy+export as * well as the amount of data for MD5 needs to operate on */ } else { crm_xml_dump(source, do_filter ? xml_log_option_filtered : 0, &buffer, &offset, &max, 0); } CRM_ASSERT(buffer != NULL); digest = crm_md5sum(buffer); if (digest_cs == NULL) { digest_cs = qb_log_callsite_get(__func__, __FILE__, "cib-digest", LOG_TRACE, __LINE__, crm_trace_nonlog); } if (digest_cs && digest_cs->targets) { char *trace_file = crm_concat("/tmp/digest", digest, '-'); crm_trace("Saving %s.%s.%s to %s", crm_element_value(source, XML_ATTR_GENERATION_ADMIN), crm_element_value(source, XML_ATTR_GENERATION), crm_element_value(source, XML_ATTR_NUMUPDATES), trace_file); save_xml_to_file(source, "digest input", trace_file); free(trace_file); } free(buffer); crm_trace("End digest"); return digest; } /*! 
* \brief Calculate and return digest of XML tree, suitable for storing on disk * * \param[in] input Root of XML to digest * * \return Newly allocated string containing digest */ char * calculate_on_disk_digest(xmlNode * input) { /* Always use the v1 format for on-disk digests * a) it's a compatibility nightmare * b) we only use this once at startup, all other * invocations are in a separate child process */ return calculate_xml_digest_v1(input, FALSE, FALSE); } /*! * \brief Calculate and return digest of XML operation * * \param[in] input Root of XML to digest * \param[in] version Not used * * \return Newly allocated string containing digest */ char * calculate_operation_digest(xmlNode *input, const char *version) { /* We still need the sorting for operation digests */ return calculate_xml_digest_v1(input, TRUE, FALSE); } /*! * \brief Calculate and return digest of XML tree * * \param[in] input Root of XML to digest * \param[in] sort Whether to sort XML before calculating digest * \param[in] do_filter Whether to filter certain XML attributes * \param[in] version CRM feature set version (used to select v1/v2 digest) * * \return Newly allocated string containing digest */ char * calculate_xml_versioned_digest(xmlNode * input, gboolean sort, gboolean do_filter, const char *version) { /* + * @COMPAT digests (on-disk or in diffs/patchsets) created <1.1.4; + * removing this affects even full-restart upgrades from old versions + * * The sorting associated with v1 digest creation accounted for 23% of * the CIB's CPU usage on the server. v2 drops this. * * The filtering accounts for an additional 2.5% and we may want to * remove it in future. 
* * v2 also uses the xmlBuffer contents directly to avoid additional copying */ if (version == NULL || compare_version("3.0.5", version) > 0) { crm_trace("Using v1 digest algorithm for %s", crm_str(version)); return calculate_xml_digest_v1(input, sort, do_filter); } crm_trace("Using v2 digest algorithm for %s", crm_str(version)); return calculate_xml_digest_v2(input, do_filter); } /*! * \internal * \brief Return whether calculated digest of XML tree matches expected digest * * \param[in] input Root of XML to digest * \param[in] expected Expected digest in on-disk format * * \return TRUE if digests match, FALSE otherwise or on error */ gboolean crm_digest_verify(xmlNode *input, const char *expected) { char *calculated = NULL; gboolean passed; if (input != NULL) { calculated = calculate_on_disk_digest(input); if (calculated == NULL) { crm_perror(LOG_ERR, "Could not calculate digest for comparison"); return FALSE; } } passed = safe_str_eq(expected, calculated); if (passed) { crm_trace("Digest comparison passed: %s", calculated); } else { crm_err("Digest comparison failed: expected %s, calculated %s", expected, calculated); } free(calculated); return passed; } diff --git a/mcp/pacemaker.c b/mcp/pacemaker.c index e943d5d36d..723f704a2c 100644 --- a/mcp/pacemaker.c +++ b/mcp/pacemaker.c @@ -1,1121 +1,1121 @@ /* * Copyright (C) 2010 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
* * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include gboolean pcmk_quorate = FALSE; gboolean fatal_error = FALSE; GMainLoop *mainloop = NULL; #define PCMK_PROCESS_CHECK_INTERVAL 5 const char *local_name = NULL; uint32_t local_nodeid = 0; crm_trigger_t *shutdown_trigger = NULL; const char *pid_file = "/var/run/pacemaker.pid"; typedef struct pcmk_child_s { int pid; long flag; int start_seq; int respawn_count; gboolean respawn; const char *name; const char *uid; const char *command; gboolean active_before_startup; } pcmk_child_t; /* Index into the array below */ #define pcmk_child_crmd 3 /* *INDENT-OFF* */ static pcmk_child_t pcmk_children[] = { { 0, crm_proc_none, 0, 0, FALSE, "none", NULL, NULL }, { 0, crm_proc_lrmd, 3, 0, TRUE, "lrmd", NULL, CRM_DAEMON_DIR"/lrmd" }, { 0, crm_proc_cib, 1, 0, TRUE, "cib", CRM_DAEMON_USER, CRM_DAEMON_DIR"/cib" }, { 0, crm_proc_crmd, 6, 0, TRUE, "crmd", CRM_DAEMON_USER, CRM_DAEMON_DIR"/crmd" }, { 0, crm_proc_attrd, 4, 0, TRUE, "attrd", CRM_DAEMON_USER, CRM_DAEMON_DIR"/attrd" }, { 0, crm_proc_stonithd, 0, 0, TRUE, "stonithd", NULL, NULL }, { 0, crm_proc_pe, 5, 0, TRUE, "pengine", CRM_DAEMON_USER, CRM_DAEMON_DIR"/pengine" }, { 0, crm_proc_stonith_ng, 2, 0, TRUE, "stonith-ng", NULL, CRM_DAEMON_DIR"/stonithd" }, }; /* *INDENT-ON* */ static gboolean start_child(pcmk_child_t * child); static gboolean check_active_before_startup_processes(gpointer user_data); void update_process_clients(crm_client_t *client); void update_process_peers(void); void enable_crmd_as_root(gboolean enable) { if (enable) { pcmk_children[pcmk_child_crmd].uid = NULL; } else { pcmk_children[pcmk_child_crmd].uid = CRM_DAEMON_USER; } } static uint32_t get_process_list(void) { 
int lpc = 0; uint32_t procs = crm_get_cluster_proc(); for (lpc = 0; lpc < SIZEOF(pcmk_children); lpc++) { if (pcmk_children[lpc].pid != 0) { procs |= pcmk_children[lpc].flag; } } return procs; } static void pcmk_process_exit(pcmk_child_t * child) { child->pid = 0; child->active_before_startup = FALSE; /* Broadcast the fact that one of our processes died ASAP * * Try to get some logging of the cause out first though * because we're probably about to get fenced * * Potentially do this only if respawn_count > N * to allow for local recovery */ update_node_processes(local_nodeid, NULL, get_process_list()); child->respawn_count += 1; if (child->respawn_count > MAX_RESPAWN) { crm_err("Child respawn count exceeded by %s", child->name); child->respawn = FALSE; } if (shutdown_trigger) { mainloop_set_trigger(shutdown_trigger); update_node_processes(local_nodeid, NULL, get_process_list()); } else if (child->respawn && crm_is_true(getenv("PCMK_fail_fast"))) { crm_err("Rebooting system because of %s", child->name); pcmk_panic(__FUNCTION__); } else if (child->respawn) { crm_notice("Respawning failed child process: %s", child->name); start_child(child); } } static void pcmk_child_exit(mainloop_child_t * p, pid_t pid, int core, int signo, int exitcode) { pcmk_child_t *child = mainloop_child_userdata(p); const char *name = mainloop_child_name(p); if (signo) { do_crm_log(((signo == SIGKILL)? 
LOG_WARNING : LOG_ERR), "%s[%d] terminated with signal %d (core=%d)", name, pid, signo, core); } else { switch(exitcode) { case CRM_EX_OK: crm_info("%s[%d] exited with status %d (%s)", name, pid, exitcode, crm_exit_str(exitcode)); break; case CRM_EX_FATAL: crm_warn("Shutting cluster down because %s[%d] had fatal failure", name, pid); child->respawn = FALSE; fatal_error = TRUE; pcmk_shutdown(SIGTERM); break; case CRM_EX_PANIC: do_crm_log_always(LOG_EMERG, "%s[%d] instructed the machine to reset", name, pid); child->respawn = FALSE; fatal_error = TRUE; pcmk_panic(__FUNCTION__); pcmk_shutdown(SIGTERM); break; default: crm_err("%s[%d] exited with status %d (%s)", name, pid, exitcode, crm_exit_str(exitcode)); break; } } pcmk_process_exit(child); } static gboolean stop_child(pcmk_child_t * child, int signal) { if (signal == 0) { signal = SIGTERM; } if (child->command == NULL) { crm_debug("Nothing to do for child \"%s\"", child->name); return TRUE; } if (child->pid <= 0) { crm_trace("Client %s not running", child->name); return TRUE; } errno = 0; if (kill(child->pid, signal) == 0) { crm_notice("Stopping %s "CRM_XS" sent signal %d to process %d", child->name, signal, child->pid); } else { crm_perror(LOG_ERR, "Could not stop %s (process %d) with signal %d", child->name, child->pid, signal); } return TRUE; } static char *opts_default[] = { NULL, NULL }; static char *opts_vgrind[] = { NULL, NULL, NULL, NULL, NULL }; static gboolean start_child(pcmk_child_t * child) { int lpc = 0; uid_t uid = 0; gid_t gid = 0; struct rlimit oflimits; gboolean use_valgrind = FALSE; gboolean use_callgrind = FALSE; const char *devnull = "/dev/null"; const char *env_valgrind = getenv("PCMK_valgrind_enabled"); const char *env_callgrind = getenv("PCMK_callgrind_enabled"); child->active_before_startup = FALSE; if (child->command == NULL) { crm_info("Nothing to do for child \"%s\"", child->name); return TRUE; } if (env_callgrind != NULL && crm_is_true(env_callgrind)) { use_callgrind = TRUE; 
use_valgrind = TRUE; } else if (env_callgrind != NULL && strstr(env_callgrind, child->name)) { use_callgrind = TRUE; use_valgrind = TRUE; } else if (env_valgrind != NULL && crm_is_true(env_valgrind)) { use_valgrind = TRUE; } else if (env_valgrind != NULL && strstr(env_valgrind, child->name)) { use_valgrind = TRUE; } if (use_valgrind && strlen(VALGRIND_BIN) == 0) { crm_warn("Cannot enable valgrind for %s:" " The location of the valgrind binary is unknown", child->name); use_valgrind = FALSE; } if (child->uid) { if (crm_user_lookup(child->uid, &uid, &gid) < 0) { crm_err("Invalid user (%s) for %s: not found", child->uid, child->name); return FALSE; } crm_info("Using uid=%u and group=%u for process %s", uid, gid, child->name); } child->pid = fork(); CRM_ASSERT(child->pid != -1); if (child->pid > 0) { /* parent */ mainloop_child_add(child->pid, 0, child->name, child, pcmk_child_exit); crm_info("Forked child %d for process %s%s", child->pid, child->name, use_valgrind ? " (valgrind enabled: " VALGRIND_BIN ")" : ""); update_node_processes(local_nodeid, NULL, get_process_list()); return TRUE; } else { /* Start a new session */ (void)setsid(); /* Setup the two alternate arg arrays */ opts_vgrind[0] = strdup(VALGRIND_BIN); if (use_callgrind) { opts_vgrind[1] = strdup("--tool=callgrind"); opts_vgrind[2] = strdup("--callgrind-out-file=" CRM_STATE_DIR "/callgrind.out.%p"); opts_vgrind[3] = strdup(child->command); opts_vgrind[4] = NULL; } else { opts_vgrind[1] = strdup(child->command); opts_vgrind[2] = NULL; opts_vgrind[3] = NULL; opts_vgrind[4] = NULL; } opts_default[0] = strdup(child->command); if(gid) { if (is_corosync_cluster()) { /* Drop root privileges completely * * We can do this because we set uidgid.gid.${gid}=1 * via CMAP which allows these processes to connect to * corosync */ if (setgid(gid) < 0) { crm_perror(LOG_ERR, "Could not set group to %d", gid); } // Keep root group, but add haclient group so we can access ipc } else if (initgroups(child->uid, gid) < 0) { 
crm_err("Cannot initialize groups for %s: %s (%d)", child->uid, pcmk_strerror(errno), errno); } } if (uid && setuid(uid) < 0) { crm_perror(LOG_ERR, "Could not set user to %d (%s)", uid, child->uid); } /* Close all open file descriptors */ getrlimit(RLIMIT_NOFILE, &oflimits); for (lpc = 0; lpc < oflimits.rlim_cur; lpc++) { close(lpc); } (void)open(devnull, O_RDONLY); /* Stdin: fd 0 */ (void)open(devnull, O_WRONLY); /* Stdout: fd 1 */ (void)open(devnull, O_WRONLY); /* Stderr: fd 2 */ if (use_valgrind) { (void)execvp(VALGRIND_BIN, opts_vgrind); } else { (void)execvp(child->command, opts_default); } crm_perror(LOG_ERR, "FATAL: Cannot exec %s", child->command); crm_exit(CRM_EX_FATAL); } return TRUE; /* never reached */ } static gboolean escalate_shutdown(gpointer data) { pcmk_child_t *child = data; if (child->pid) { /* Use SIGSEGV instead of SIGKILL to create a core so we can see what it was up to */ crm_err("Child %s not terminating in a timely manner, forcing", child->name); stop_child(child, SIGSEGV); } return FALSE; } static gboolean pcmk_shutdown_worker(gpointer user_data) { static int phase = 0; static time_t next_log = 0; static int max = SIZEOF(pcmk_children); int lpc = 0; if (phase == 0) { crm_notice("Shutting down Pacemaker"); phase = max; /* Add a second, more frequent, check to speed up shutdown */ g_timeout_add_seconds(5, check_active_before_startup_processes, NULL); } for (; phase > 0; phase--) { /* Don't stop anything with start_seq < 1 */ for (lpc = max - 1; lpc >= 0; lpc--) { pcmk_child_t *child = &(pcmk_children[lpc]); if (phase != child->start_seq) { continue; } if (child->pid) { time_t now = time(NULL); if (child->respawn) { next_log = now + 30; child->respawn = FALSE; stop_child(child, SIGTERM); if (phase < pcmk_children[pcmk_child_crmd].start_seq) { g_timeout_add(180000 /* 3m */ , escalate_shutdown, child); } } else if (now >= next_log) { next_log = now + 30; crm_notice("Still waiting for %s to terminate " CRM_XS " pid=%d seq=%d", child->name, 
child->pid, child->start_seq); } return TRUE; } /* cleanup */ crm_debug("%s confirmed stopped", child->name); child->pid = 0; } } /* send_cluster_id(); */ crm_notice("Shutdown complete"); { const char *delay = daemon_option("shutdown_delay"); if(delay) { sync(); sleep(crm_get_msec(delay) / 1000); } } g_main_loop_quit(mainloop); if (fatal_error) { crm_notice("Shutting down and staying down after fatal error"); crm_exit(CRM_EX_FATAL); } return TRUE; } static void pcmk_ignore(int nsig) { crm_info("Ignoring signal %s (%d)", strsignal(nsig), nsig); } static void pcmk_sigquit(int nsig) { pcmk_panic(__FUNCTION__); } void pcmk_shutdown(int nsig) { if (shutdown_trigger == NULL) { shutdown_trigger = mainloop_add_trigger(G_PRIORITY_HIGH, pcmk_shutdown_worker, NULL); } mainloop_set_trigger(shutdown_trigger); } static int32_t pcmk_ipc_accept(qb_ipcs_connection_t * c, uid_t uid, gid_t gid) { crm_trace("Connection %p", c); if (crm_client_new(c, uid, gid) == NULL) { return -EIO; } return 0; } static void pcmk_ipc_created(qb_ipcs_connection_t * c) { crm_trace("Connection %p", c); } /* Exit code means? 
*/ static int32_t pcmk_ipc_dispatch(qb_ipcs_connection_t * qbc, void *data, size_t size) { uint32_t id = 0; uint32_t flags = 0; const char *task = NULL; crm_client_t *c = crm_client_get(qbc); xmlNode *msg = crm_ipcs_recv(c, data, size, &id, &flags); crm_ipcs_send_ack(c, id, flags, "ack", __FUNCTION__, __LINE__); if (msg == NULL) { return 0; } task = crm_element_value(msg, F_CRM_TASK); if (crm_str_eq(task, CRM_OP_QUIT, TRUE)) { /* Time to quit */ crm_notice("Shutting down in response to ticket %s (%s)", crm_element_value(msg, F_CRM_REFERENCE), crm_element_value(msg, F_CRM_ORIGIN)); pcmk_shutdown(15); } else if (crm_str_eq(task, CRM_OP_RM_NODE_CACHE, TRUE)) { /* Send to everyone */ struct iovec *iov; int id = 0; const char *name = NULL; crm_element_value_int(msg, XML_ATTR_ID, &id); name = crm_element_value(msg, XML_ATTR_UNAME); crm_notice("Instructing peers to remove references to node %s/%u", name, id); iov = calloc(1, sizeof(struct iovec)); iov->iov_base = dump_xml_unformatted(msg); iov->iov_len = 1 + strlen(iov->iov_base); send_cpg_iov(iov); } else { update_process_clients(c); } free_xml(msg); return 0; } /* Error code means? */ static int32_t pcmk_ipc_closed(qb_ipcs_connection_t * c) { crm_client_t *client = crm_client_get(c); if (client == NULL) { return 0; } crm_trace("Connection %p", c); crm_client_destroy(client); return 0; } static void pcmk_ipc_destroy(qb_ipcs_connection_t * c) { crm_trace("Connection %p", c); pcmk_ipc_closed(c); } struct qb_ipcs_service_handlers mcp_ipc_callbacks = { .connection_accept = pcmk_ipc_accept, .connection_created = pcmk_ipc_created, .msg_process = pcmk_ipc_dispatch, .connection_closed = pcmk_ipc_closed, .connection_destroyed = pcmk_ipc_destroy }; /*! 
* \internal * \brief Send an XML message with process list of all known peers to client(s) * * \param[in] client Send message to this client, or all clients if NULL */ void update_process_clients(crm_client_t *client) { GHashTableIter iter; crm_node_t *node = NULL; xmlNode *update = create_xml_node(NULL, "nodes"); if (is_corosync_cluster()) { crm_xml_add_int(update, "quorate", pcmk_quorate); } g_hash_table_iter_init(&iter, crm_peer_cache); while (g_hash_table_iter_next(&iter, NULL, (gpointer *) & node)) { xmlNode *xml = create_xml_node(update, "node"); crm_xml_add_int(xml, "id", node->id); crm_xml_add(xml, "uname", node->uname); crm_xml_add(xml, "state", node->state); crm_xml_add_int(xml, "processes", node->processes); } if(client) { crm_trace("Sending process list to client %s", client->id); crm_ipcs_send(client, 0, update, crm_ipc_server_event); } else { crm_trace("Sending process list to %d clients", crm_hash_table_size(client_connections)); g_hash_table_iter_init(&iter, client_connections); while (g_hash_table_iter_next(&iter, NULL, (gpointer *) & client)) { crm_ipcs_send(client, 0, update, crm_ipc_server_event); } } free_xml(update); } /*! * \internal * \brief Send a CPG message with local node's process list to all peers */ void update_process_peers(void) { /* Do nothing for corosync-2 based clusters */ char buffer[1024]; struct iovec *iov; int rc = 0; if (local_name) { rc = snprintf(buffer, SIZEOF(buffer), "<node uname=\"%s\" proclist=\"%u\"/>", local_name, get_process_list()); } else { rc = snprintf(buffer, SIZEOF(buffer), "<node proclist=\"%u\"/>", get_process_list()); } crm_trace("Sending %s", buffer); iov = calloc(1, sizeof(struct iovec)); iov->iov_base = strdup(buffer); iov->iov_len = rc + 1; send_cpg_iov(iov); } /*! 
* \internal * \brief Update a node's process list, notifying clients and peers if needed * * \param[in] id Node ID of affected node * \param[in] uname Uname of affected node * \param[in] procs Affected node's process list mask * * \return TRUE if the process list changed, FALSE otherwise */ gboolean update_node_processes(uint32_t id, const char *uname, uint32_t procs) { gboolean changed = FALSE; crm_node_t *node = crm_get_peer(id, uname); if (procs != 0) { if (procs != node->processes) { crm_debug("Node %s now has process list: %.32x (was %.32x)", node->uname, procs, node->processes); node->processes = procs; changed = TRUE; /* If local node's processes have changed, notify clients/peers */ if (id == local_nodeid) { update_process_clients(NULL); update_process_peers(); } } else { crm_trace("Node %s still has process list: %.32x", node->uname, procs); } } return changed; } /* *INDENT-OFF* */ static struct crm_option long_options[] = { /* Top-level Options */ {"help", 0, 0, '?', "\tThis text"}, {"version", 0, 0, '$', "\tVersion information" }, {"verbose", 0, 0, 'V', "\tIncrease debug output"}, {"shutdown", 0, 0, 'S', "\tInstruct Pacemaker to shutdown on this machine"}, {"features", 0, 0, 'F', "\tDisplay the full version and list of features Pacemaker was built with"}, {"-spacer-", 1, 0, '-', "\nAdditional Options:"}, {"foreground", 0, 0, 'f', "\t(Ignored) Pacemaker always runs in the foreground"}, {"pid-file", 1, 0, 'p', "\t(Ignored) Daemon pid file location"}, {"standby", 0, 0, 's', "\tStart node in standby state"}, {NULL, 0, 0, 0} }; /* *INDENT-ON* */ static void mcp_chown(const char *path, uid_t uid, gid_t gid) { int rc = chown(path, uid, gid); if (rc < 0) { crm_warn("Cannot change the ownership of %s to user %s and gid %d: %s", path, CRM_DAEMON_USER, gid, pcmk_strerror(errno)); } } static gboolean check_active_before_startup_processes(gpointer user_data) { int start_seq = 1, lpc = 0; static int max = SIZEOF(pcmk_children); gboolean keep_tracking = FALSE; for 
(start_seq = 1; start_seq < max; start_seq++) { for (lpc = 0; lpc < max; lpc++) { if (pcmk_children[lpc].active_before_startup == FALSE) { /* we are already tracking it as a child process. */ continue; } else if (start_seq != pcmk_children[lpc].start_seq) { continue; } else { const char *name = pcmk_children[lpc].name; if (pcmk_children[lpc].flag == crm_proc_stonith_ng) { name = "stonithd"; } if (crm_pid_active(pcmk_children[lpc].pid, name) != 1) { crm_notice("Process %s terminated (pid=%d)", name, pcmk_children[lpc].pid); pcmk_process_exit(&(pcmk_children[lpc])); continue; } } /* at least one of the processes found at startup * is still going, so keep this recurring timer around */ keep_tracking = TRUE; } } return keep_tracking; } static bool find_and_track_existing_processes(void) { DIR *dp; struct dirent *entry; int start_tracker = 0; char entry_name[64]; dp = opendir("/proc"); if (!dp) { /* no proc directory to search through */ crm_notice("Can not read /proc directory to track existing components"); return FALSE; } while ((entry = readdir(dp)) != NULL) { int pid; int max = SIZEOF(pcmk_children); int i; if (crm_procfs_process_info(entry, entry_name, &pid) < 0) { continue; } for (i = 0; i < max; i++) { const char *name = pcmk_children[i].name; if (pcmk_children[i].start_seq == 0) { continue; } if (pcmk_children[i].flag == crm_proc_stonith_ng) { name = "stonithd"; } if (safe_str_eq(entry_name, name) && (crm_pid_active(pid, NULL) == 1)) { crm_notice("Tracking existing %s process (pid=%d)", name, pid); pcmk_children[i].pid = pid; pcmk_children[i].active_before_startup = TRUE; start_tracker = 1; break; } } } if (start_tracker) { g_timeout_add_seconds(PCMK_PROCESS_CHECK_INTERVAL, check_active_before_startup_processes, NULL); } closedir(dp); return start_tracker; } static void init_children_processes(void) { int start_seq = 1, lpc = 0; static int max = SIZEOF(pcmk_children); /* start any children that have not been detected */ for (start_seq = 1; start_seq < max; 
start_seq++) { /* don't start anything with start_seq < 1 */ for (lpc = 0; lpc < max; lpc++) { if (pcmk_children[lpc].pid) { /* we are already tracking it */ continue; } if (start_seq == pcmk_children[lpc].start_seq) { start_child(&(pcmk_children[lpc])); } } } /* From this point on, any daemons being started will be due to * respawning rather than node start. * * This may be useful for the daemons to know */ setenv("PCMK_respawned", "true", 1); } static void mcp_cpg_destroy(gpointer user_data) { crm_err("Connection destroyed"); crm_exit(CRM_EX_DISCONNECT); } /*! * \internal * \brief Process a CPG message (process list or manual peer cache removal) * * \param[in] handle CPG connection (ignored) * \param[in] groupName CPG group name (ignored) * \param[in] nodeid ID of affected node * \param[in] pid Process ID (ignored) * \param[in] msg CPG XML message * \param[in] msg_len Length of msg in bytes (ignored) */ static void mcp_cpg_deliver(cpg_handle_t handle, const struct cpg_name *groupName, uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len) { xmlNode *xml = string2xml(msg); const char *task = crm_element_value(xml, F_CRM_TASK); crm_trace("Received CPG message (%s): %.200s", (task? 
task : "process list"), (char*)msg); if (task == NULL) { if (nodeid == local_nodeid) { - crm_info("Ignoring process list sent by peer for local node"); + crm_debug("Ignoring message with local node's process list"); } else { uint32_t procs = 0; const char *uname = crm_element_value(xml, "uname"); crm_element_value_int(xml, "proclist", (int *)&procs); if (update_node_processes(nodeid, uname, procs)) { update_process_clients(NULL); } } } else if (crm_str_eq(task, CRM_OP_RM_NODE_CACHE, TRUE)) { int id = 0; const char *name = NULL; crm_element_value_int(xml, XML_ATTR_ID, &id); name = crm_element_value(xml, XML_ATTR_UNAME); reap_crm_member(id, name); } if (xml != NULL) { free_xml(xml); } } static void mcp_cpg_membership(cpg_handle_t handle, const struct cpg_name *groupName, const struct cpg_address *member_list, size_t member_list_entries, const struct cpg_address *left_list, size_t left_list_entries, const struct cpg_address *joined_list, size_t joined_list_entries) { /* Update peer cache if needed */ pcmk_cpg_membership(handle, groupName, member_list, member_list_entries, left_list, left_list_entries, joined_list, joined_list_entries); /* Always broadcast our own presence after any membership change */ update_process_peers(); } static gboolean mcp_quorum_callback(unsigned long long seq, gboolean quorate) { pcmk_quorate = quorate; return TRUE; } static void mcp_quorum_destroy(gpointer user_data) { crm_info("connection lost"); } int main(int argc, char **argv) { int rc; int flag; int argerr = 0; int option_index = 0; gboolean shutdown = FALSE; uid_t pcmk_uid = 0; gid_t pcmk_gid = 0; struct rlimit cores; crm_ipc_t *old_instance = NULL; qb_ipcs_service_t *ipcs = NULL; const char *facility = daemon_option("logfacility"); static crm_cluster_t cluster; crm_log_preinit(NULL, argc, argv); crm_set_options(NULL, "mode [options]", long_options, "Start/Stop Pacemaker\n"); mainloop_add_signal(SIGHUP, pcmk_ignore); mainloop_add_signal(SIGQUIT, pcmk_sigquit); while (1) { flag = 
crm_get_option(argc, argv, &option_index); if (flag == -1) break; switch (flag) { case 'V': crm_bump_log_level(argc, argv); break; case 'f': /* Legacy */ break; case 'p': pid_file = optarg; break; case 's': set_daemon_option("node_start_state", "standby"); break; case '$': case '?': crm_help(flag, CRM_EX_OK); break; case 'S': shutdown = TRUE; break; case 'F': printf("Pacemaker %s (Build: %s)\n Supporting v%s: %s\n", PACEMAKER_VERSION, BUILD_VERSION, CRM_FEATURE_SET, CRM_FEATURES); crm_exit(CRM_EX_OK); default: printf("Argument code 0%o (%c) is not (?yet?) supported\n", flag, flag); ++argerr; break; } } if (optind < argc) { printf("non-option ARGV-elements: "); while (optind < argc) printf("%s ", argv[optind++]); printf("\n"); } if (argerr) { crm_help('?', CRM_EX_USAGE); } setenv("LC_ALL", "C", 1); set_daemon_option("mcp", "true"); crm_log_init(NULL, LOG_INFO, TRUE, FALSE, argc, argv, FALSE); /* Restore the original facility so that mcp_read_config() does the right thing */ set_daemon_option("logfacility", facility); crm_debug("Checking for old instances of %s", CRM_SYSTEM_MCP); old_instance = crm_ipc_new(CRM_SYSTEM_MCP, 0); crm_ipc_connect(old_instance); if (shutdown) { crm_debug("Terminating previous instance"); while (crm_ipc_connected(old_instance)) { xmlNode *cmd = create_request(CRM_OP_QUIT, NULL, NULL, CRM_SYSTEM_MCP, CRM_SYSTEM_MCP, NULL); crm_debug("."); crm_ipc_send(old_instance, cmd, 0, 0, NULL); free_xml(cmd); sleep(2); } crm_ipc_close(old_instance); crm_ipc_destroy(old_instance); crm_exit(CRM_EX_OK); } else if (crm_ipc_connected(old_instance)) { crm_ipc_close(old_instance); crm_ipc_destroy(old_instance); crm_err("Pacemaker is already active, aborting startup"); crm_exit(CRM_EX_FATAL); } crm_ipc_close(old_instance); crm_ipc_destroy(old_instance); if (mcp_read_config() == FALSE) { crm_notice("Could not obtain corosync config data, exiting"); crm_exit(CRM_EX_UNAVAILABLE); } crm_notice("Starting Pacemaker %s "CRM_XS" build=%s features:%s", 
PACEMAKER_VERSION, BUILD_VERSION, CRM_FEATURES); mainloop = g_main_new(FALSE); sysrq_init(); rc = getrlimit(RLIMIT_CORE, &cores); if (rc < 0) { crm_perror(LOG_ERR, "Cannot determine current maximum core size."); } else { if (cores.rlim_max == 0 && geteuid() == 0) { cores.rlim_max = RLIM_INFINITY; } else { crm_info("Maximum core file size is: %lu", (unsigned long)cores.rlim_max); } cores.rlim_cur = cores.rlim_max; rc = setrlimit(RLIMIT_CORE, &cores); if (rc < 0) { crm_perror(LOG_ERR, "Core file generation will remain disabled." " Core files are an important diagnositic tool," " please consider enabling them by default."); } #if 0 /* system() is not thread-safe, can't call from here * Actually, it's a pretty hacky way to try and achieve this anyway */ if (system("echo 1 > /proc/sys/kernel/core_uses_pid") != 0) { crm_perror(LOG_ERR, "Could not enable /proc/sys/kernel/core_uses_pid"); } #endif } if (crm_user_lookup(CRM_DAEMON_USER, &pcmk_uid, &pcmk_gid) < 0) { crm_err("Cluster user %s does not exist, aborting Pacemaker startup", CRM_DAEMON_USER); crm_exit(CRM_EX_NOUSER); } mkdir(CRM_STATE_DIR, 0750); mcp_chown(CRM_STATE_DIR, pcmk_uid, pcmk_gid); /* Used to store core/blackbox/pengine/cib files in */ crm_build_path(CRM_PACEMAKER_DIR, 0750); mcp_chown(CRM_PACEMAKER_DIR, pcmk_uid, pcmk_gid); /* Used to store core files in */ crm_build_path(CRM_CORE_DIR, 0750); mcp_chown(CRM_CORE_DIR, pcmk_uid, pcmk_gid); /* Used to store blackbox dumps in */ crm_build_path(CRM_BLACKBOX_DIR, 0750); mcp_chown(CRM_BLACKBOX_DIR, pcmk_uid, pcmk_gid); /* Used to store policy engine inputs in */ crm_build_path(PE_STATE_DIR, 0750); mcp_chown(PE_STATE_DIR, pcmk_uid, pcmk_gid); /* Used to store the cluster configuration */ crm_build_path(CRM_CONFIG_DIR, 0750); mcp_chown(CRM_CONFIG_DIR, pcmk_uid, pcmk_gid); /* Resource agent paths are constructed by the lrmd */ ipcs = mainloop_add_ipc_server(CRM_SYSTEM_MCP, QB_IPC_NATIVE, &mcp_ipc_callbacks); if (ipcs == NULL) { crm_err("Couldn't start IPC server"); 
crm_exit(CRM_EX_OSERR); } /* Allows us to block shutdown */ if (cluster_connect_cfg(&local_nodeid) == FALSE) { crm_err("Couldn't connect to Corosync's CFG service"); crm_exit(CRM_EX_PROTOCOL); } if(pcmk_locate_sbd() > 0) { setenv("PCMK_watchdog", "true", 1); } else { setenv("PCMK_watchdog", "false", 1); } find_and_track_existing_processes(); cluster.destroy = mcp_cpg_destroy; cluster.cpg.cpg_deliver_fn = mcp_cpg_deliver; cluster.cpg.cpg_confchg_fn = mcp_cpg_membership; crm_set_autoreap(FALSE); rc = pcmk_ok; if (cluster_connect_cpg(&cluster) == FALSE) { crm_err("Couldn't connect to Corosync's CPG service"); rc = -ENOPROTOOPT; } else if (cluster_connect_quorum(mcp_quorum_callback, mcp_quorum_destroy) == FALSE) { rc = -ENOTCONN; } else { local_name = get_local_node_name(); update_node_processes(local_nodeid, local_name, get_process_list()); mainloop_add_signal(SIGTERM, pcmk_shutdown); mainloop_add_signal(SIGINT, pcmk_shutdown); init_children_processes(); crm_info("Starting mainloop"); g_main_run(mainloop); } if (ipcs) { crm_trace("Closing IPC server"); mainloop_del_ipc_server(ipcs); ipcs = NULL; } g_main_destroy(mainloop); cluster_disconnect_cpg(&cluster); cluster_disconnect_cfg(); crm_info("Exiting %s", crm_system_name); return crm_exit(crm_errno2exit(rc)); } diff --git a/pengine/constraints.c b/pengine/constraints.c index b2afa18c08..0ff88fd5cc 100644 --- a/pengine/constraints.c +++ b/pengine/constraints.c @@ -1,2913 +1,2916 @@ /* * Copyright (C) 2004 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
* * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include enum pe_order_kind { pe_order_kind_optional, pe_order_kind_mandatory, pe_order_kind_serialize, }; #define EXPAND_CONSTRAINT_IDREF(__set, __rsc, __name) do { \ __rsc = pe_find_constraint_resource(data_set->resources, __name); \ if(__rsc == NULL) { \ crm_config_err("%s: No resource found for %s", __set, __name); \ return FALSE; \ } \ } while(0) enum pe_ordering get_flags(const char *id, enum pe_order_kind kind, const char *action_first, const char *action_then, gboolean invert); enum pe_ordering get_asymmetrical_flags(enum pe_order_kind kind); static rsc_to_node_t *generate_location_rule(resource_t * rsc, xmlNode * rule_xml, const char *discovery, pe_working_set_t * data_set, pe_match_data_t * match_data); gboolean unpack_constraints(xmlNode * xml_constraints, pe_working_set_t * data_set) { xmlNode *xml_obj = NULL; xmlNode *lifetime = NULL; for (xml_obj = __xml_first_child(xml_constraints); xml_obj != NULL; xml_obj = __xml_next_element(xml_obj)) { const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *tag = crm_element_name(xml_obj); if (id == NULL) { crm_config_err("Constraint <%s...> must have an id", tag); continue; } crm_trace("Processing constraint %s %s", tag, id); lifetime = first_named_child(xml_obj, "lifetime"); if (lifetime) { crm_config_warn("Support for the lifetime tag, used by %s, is deprecated." 
" The rules it contains should instead be direct descendents of the constraint object", id); } if (test_ruleset(lifetime, NULL, data_set->now) == FALSE) { crm_info("Constraint %s %s is not active", tag, id); } else if (safe_str_eq(XML_CONS_TAG_RSC_ORDER, tag)) { unpack_rsc_order(xml_obj, data_set); } else if (safe_str_eq(XML_CONS_TAG_RSC_DEPEND, tag)) { unpack_rsc_colocation(xml_obj, data_set); } else if (safe_str_eq(XML_CONS_TAG_RSC_LOCATION, tag)) { unpack_location(xml_obj, data_set); } else if (safe_str_eq(XML_CONS_TAG_RSC_TICKET, tag)) { unpack_rsc_ticket(xml_obj, data_set); } else { pe_err("Unsupported constraint type: %s", tag); } } return TRUE; } static const char * invert_action(const char *action) { if (safe_str_eq(action, RSC_START)) { return RSC_STOP; } else if (safe_str_eq(action, RSC_STOP)) { return RSC_START; } else if (safe_str_eq(action, RSC_PROMOTE)) { return RSC_DEMOTE; } else if (safe_str_eq(action, RSC_DEMOTE)) { return RSC_PROMOTE; } else if (safe_str_eq(action, RSC_PROMOTED)) { return RSC_DEMOTED; } else if (safe_str_eq(action, RSC_DEMOTED)) { return RSC_PROMOTED; } else if (safe_str_eq(action, RSC_STARTED)) { return RSC_STOPPED; } else if (safe_str_eq(action, RSC_STOPPED)) { return RSC_STARTED; } crm_config_warn("Unknown action: %s", action); return NULL; } static enum pe_order_kind get_ordering_type(xmlNode * xml_obj) { enum pe_order_kind kind_e = pe_order_kind_mandatory; const char *kind = crm_element_value(xml_obj, XML_ORDER_ATTR_KIND); if (kind == NULL) { const char *score = crm_element_value(xml_obj, XML_RULE_ATTR_SCORE); kind_e = pe_order_kind_mandatory; if (score) { int score_i = char2score(score); if (score_i == 0) { kind_e = pe_order_kind_optional; } /* } else if(rsc_then->variant == pe_native && rsc_first->variant >= pe_clone) { */ /* kind_e = pe_order_kind_optional; */ } } else if (safe_str_eq(kind, "Mandatory")) { kind_e = pe_order_kind_mandatory; } else if (safe_str_eq(kind, "Optional")) { kind_e = pe_order_kind_optional; } else 
if (safe_str_eq(kind, "Serialize")) { kind_e = pe_order_kind_serialize; } else { const char *id = crm_element_value(xml_obj, XML_ATTR_ID); crm_config_err("Constraint %s: Unknown type '%s'", id, kind); } return kind_e; } static resource_t * pe_find_constraint_resource(GListPtr rsc_list, const char *id) { GListPtr rIter = NULL; for (rIter = rsc_list; id && rIter; rIter = rIter->next) { resource_t *parent = rIter->data; resource_t *match = parent->fns->find_rsc(parent, id, NULL, pe_find_renamed); if (match != NULL) { if(safe_str_neq(match->id, id)) { /* We found an instance of a clone instead */ match = uber_parent(match); crm_debug("Found %s for %s", match->id, id); } return match; } } crm_trace("No match for %s", id); return NULL; } static gboolean pe_find_constraint_tag(pe_working_set_t * data_set, const char * id, tag_t ** tag) { gboolean rc = FALSE; *tag = NULL; rc = g_hash_table_lookup_extended(data_set->template_rsc_sets, id, NULL, (gpointer*) tag); if (rc == FALSE) { rc = g_hash_table_lookup_extended(data_set->tags, id, NULL, (gpointer*) tag); if (rc == FALSE) { crm_config_warn("No template/tag named '%s'", id); return FALSE; } else if (*tag == NULL) { crm_config_warn("No resource is tagged with '%s'", id); return FALSE; } } else if (*tag == NULL) { crm_config_warn("No resource is derived from template '%s'", id); return FALSE; } return rc; } static gboolean valid_resource_or_tag(pe_working_set_t * data_set, const char * id, resource_t ** rsc, tag_t ** tag) { gboolean rc = FALSE; if (rsc) { *rsc = NULL; *rsc = pe_find_constraint_resource(data_set->resources, id); if (*rsc) { return TRUE; } } if (tag) { *tag = NULL; rc = pe_find_constraint_tag(data_set, id, tag); } return rc; } static gboolean unpack_simple_rsc_order(xmlNode * xml_obj, pe_working_set_t * data_set) { int order_id = 0; resource_t *rsc_then = NULL; resource_t *rsc_first = NULL; gboolean invert_bool = TRUE; int min_required_before = 0; enum pe_order_kind kind = pe_order_kind_mandatory; enum 
pe_ordering cons_weight = pe_order_optional; const char *id_first = NULL; const char *id_then = NULL; const char *action_then = NULL; const char *action_first = NULL; const char *instance_then = NULL; const char *instance_first = NULL; const char *require_all_s = NULL; const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *invert = crm_element_value(xml_obj, XML_CONS_ATTR_SYMMETRICAL); crm_str_to_boolean(invert, &invert_bool); if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } else if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } id_then = crm_element_value(xml_obj, XML_ORDER_ATTR_THEN); id_first = crm_element_value(xml_obj, XML_ORDER_ATTR_FIRST); action_then = crm_element_value(xml_obj, XML_ORDER_ATTR_THEN_ACTION); action_first = crm_element_value(xml_obj, XML_ORDER_ATTR_FIRST_ACTION); instance_then = crm_element_value(xml_obj, XML_ORDER_ATTR_THEN_INSTANCE); instance_first = crm_element_value(xml_obj, XML_ORDER_ATTR_FIRST_INSTANCE); if (action_first == NULL) { action_first = RSC_START; } if (action_then == NULL) { action_then = action_first; } if (id_then == NULL || id_first == NULL) { crm_config_err("Constraint %s needs two sides lh: %s rh: %s", id, crm_str(id_then), crm_str(id_first)); return FALSE; } rsc_then = pe_find_constraint_resource(data_set->resources, id_then); rsc_first = pe_find_constraint_resource(data_set->resources, id_first); if (rsc_then == NULL) { crm_config_err("Constraint %s: no resource found for name '%s'", id, id_then); return FALSE; } else if (rsc_first == NULL) { crm_config_err("Constraint %s: no resource found for name '%s'", id, id_first); return FALSE; } else if (instance_then && pe_rsc_is_clone(rsc_then) == FALSE) { crm_config_err("Invalid constraint '%s':" " Resource '%s' is not a clone but instance %s was requested", id, id_then, instance_then); return FALSE; } else if (instance_first && pe_rsc_is_clone(rsc_first) == 
FALSE) { crm_config_err("Invalid constraint '%s':" " Resource '%s' is not a clone but instance %s was requested", id, id_first, instance_first); return FALSE; } if (instance_then) { rsc_then = find_clone_instance(rsc_then, instance_then, data_set); if (rsc_then == NULL) { crm_config_warn("Invalid constraint '%s': No instance '%s' of '%s'", id, instance_then, id_then); return FALSE; } } if (instance_first) { rsc_first = find_clone_instance(rsc_first, instance_first, data_set); if (rsc_first == NULL) { crm_config_warn("Invalid constraint '%s': No instance '%s' of '%s'", id, instance_first, id_first); return FALSE; } } require_all_s = crm_element_value(xml_obj, "require-all"); if (require_all_s && crm_is_true(require_all_s) == FALSE && pe_rsc_is_clone(rsc_first)) { /* require-all=false means only one instance of the clone is required */ min_required_before = 1; } else if (pe_rsc_is_clone(rsc_first)) { const char *min_clones_s = g_hash_table_lookup(rsc_first->meta, XML_RSC_ATTR_INCARNATION_MIN); if (min_clones_s) { /* if clone min is set, we require at a minimum X number of instances * to be runnable before allowing dependencies to be runnable. */ min_required_before = crm_parse_int(min_clones_s, "0"); } } cons_weight = pe_order_optional; kind = get_ordering_type(xml_obj); if (kind == pe_order_kind_optional && rsc_then->restart_type == pe_restart_restart) { crm_trace("Upgrade : recovery - implies right"); cons_weight |= pe_order_implies_then; } if (invert_bool == FALSE) { cons_weight |= get_asymmetrical_flags(kind); } else { cons_weight |= get_flags(id, kind, action_first, action_then, FALSE); } /* If there is a minimum number of instances that must be runnable before * the 'then' action is runnable, we use a pseudo action as an intermediate step * start min number of clones -> pseudo action is runnable -> dependency runnable. 
*/ if (min_required_before) { GListPtr rIter = NULL; char *task = crm_concat(CRM_OP_RELAXED_CLONE, id, ':'); action_t *unordered_action = get_pseudo_op(task, data_set); free(task); /* require the pseudo action to have "min_required_before" number of * actions to be considered runnable before allowing the pseudo action * to be runnable. */ unordered_action->required_runnable_before = min_required_before; update_action_flags(unordered_action, pe_action_requires_any, __FUNCTION__, __LINE__); for (rIter = rsc_first->children; id && rIter; rIter = rIter->next) { resource_t *child = rIter->data; /* order each clone instance before the pseudo action */ custom_action_order(child, generate_op_key(child->id, action_first, 0), NULL, NULL, NULL, unordered_action, pe_order_one_or_more | pe_order_implies_then_printed, data_set); } /* order the "then" dependency to occur after the pseudo action only if * the pseudo action is runnable */ order_id = custom_action_order(NULL, NULL, unordered_action, rsc_then, generate_op_key(rsc_then->id, action_then, 0), NULL, cons_weight | pe_order_runnable_left, data_set); } else { order_id = new_rsc_order(rsc_first, action_first, rsc_then, action_then, cons_weight, data_set); } pe_rsc_trace(rsc_first, "order-%d (%s): %s_%s before %s_%s flags=0x%.6x", order_id, id, rsc_first->id, action_first, rsc_then->id, action_then, cons_weight); if (invert_bool == FALSE) { return TRUE; } else if (invert && kind == pe_order_kind_serialize) { crm_config_warn("Cannot invert serialized constraint set %s", id); return TRUE; } else if (kind == pe_order_kind_serialize) { return TRUE; } action_then = invert_action(action_then); action_first = invert_action(action_first); if (action_then == NULL || action_first == NULL) { crm_config_err("Cannot invert rsc_order constraint %s." 
" Please specify the inverse manually.", id); return TRUE; } cons_weight = pe_order_optional; if (kind == pe_order_kind_optional && rsc_then->restart_type == pe_restart_restart) { crm_trace("Upgrade : recovery - implies left"); cons_weight |= pe_order_implies_first; } cons_weight |= get_flags(id, kind, action_first, action_then, TRUE); order_id = new_rsc_order(rsc_then, action_then, rsc_first, action_first, cons_weight, data_set); pe_rsc_trace(rsc_then, "order-%d (%s): %s_%s before %s_%s flags=0x%.6x", order_id, id, rsc_then->id, action_then, rsc_first->id, action_first, cons_weight); return TRUE; } static gboolean expand_tags_in_sets(xmlNode * xml_obj, xmlNode ** expanded_xml, pe_working_set_t * data_set) { xmlNode *new_xml = NULL; xmlNode *set = NULL; gboolean any_refs = FALSE; const char *cons_id = NULL; *expanded_xml = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } new_xml = copy_xml(xml_obj); cons_id = ID(new_xml); for (set = __xml_first_child(new_xml); set != NULL; set = __xml_next_element(set)) { xmlNode *xml_rsc = NULL; GListPtr tag_refs = NULL; GListPtr gIter = NULL; if (safe_str_neq((const char *)set->name, XML_CONS_TAG_RSC_SET)) { continue; } for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { resource_t *rsc = NULL; tag_t *tag = NULL; const char *id = ID(xml_rsc); if (safe_str_neq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF)) { continue; } if (valid_resource_or_tag(data_set, id, &rsc, &tag) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", cons_id, id); free_xml(new_xml); return FALSE; } else if (rsc) { continue; } else if (tag) { /* The resource_ref under the resource_set references a template/tag */ xmlNode *last_ref = xml_rsc; /* A sample: Original XML: Now we are appending rsc2 and rsc3 which are tagged with tag1 right after it: */ for (gIter = tag->refs; gIter != NULL; gIter = gIter->next) { const char *obj_ref = (const char 
*) gIter->data; xmlNode *new_rsc_ref = NULL; new_rsc_ref = xmlNewDocRawNode(getDocPtr(set), NULL, (const xmlChar *)XML_TAG_RESOURCE_REF, NULL); crm_xml_add(new_rsc_ref, XML_ATTR_ID, obj_ref); xmlAddNextSibling(last_ref, new_rsc_ref); last_ref = new_rsc_ref; } any_refs = TRUE; /* Do not free the tag's own resource_ref node directly here. That would break the subsequent __xml_next_element(xml_rsc) call and cause an "Invalid read" reported by valgrind. Record it in a list so it can be freed later. */ tag_refs = g_list_append(tag_refs, xml_rsc); } } /* Now free the recorded tag references, leaving the set with each tag expanded in place. */ for (gIter = tag_refs; gIter != NULL; gIter = gIter->next) { xmlNode *tag_ref = gIter->data; free_xml(tag_ref); } g_list_free(tag_refs); } if (any_refs) { *expanded_xml = new_xml; } else { free_xml(new_xml); } return TRUE; } static gboolean tag_to_set(xmlNode * xml_obj, xmlNode ** rsc_set, const char * attr, gboolean convert_rsc, pe_working_set_t * data_set) { const char *cons_id = NULL; const char *id = NULL; resource_t *rsc = NULL; tag_t *tag = NULL; *rsc_set = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } if (attr == NULL) { crm_config_err("No attribute name to process."); return FALSE; } cons_id = crm_element_value(xml_obj, XML_ATTR_ID); if (cons_id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } id = crm_element_value(xml_obj, attr); if (id == NULL) { return TRUE; } if (valid_resource_or_tag(data_set, id, &rsc, &tag) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", cons_id, id); return FALSE; } else if (tag) { GListPtr gIter = NULL; /* A template/tag is referenced by the "attr" attribute (first, then, rsc or with-rsc). Add the template/tag's corresponding "resource_set" which contains the resources derived from it or tagged with it under the constraint. 
*/ *rsc_set = create_xml_node(xml_obj, XML_CONS_TAG_RSC_SET); crm_xml_add(*rsc_set, XML_ATTR_ID, id); for (gIter = tag->refs; gIter != NULL; gIter = gIter->next) { const char *obj_ref = (const char *) gIter->data; xmlNode *rsc_ref = NULL; rsc_ref = create_xml_node(*rsc_set, XML_TAG_RESOURCE_REF); crm_xml_add(rsc_ref, XML_ATTR_ID, obj_ref); } /* Set sequential="false" for the resource_set */ crm_xml_add(*rsc_set, "sequential", XML_BOOLEAN_FALSE); } else if (rsc && convert_rsc) { /* Even a regular resource is referenced by "attr", convert it into a resource_set. Because the other side of the constraint could be a template/tag reference. */ xmlNode *rsc_ref = NULL; *rsc_set = create_xml_node(xml_obj, XML_CONS_TAG_RSC_SET); crm_xml_add(*rsc_set, XML_ATTR_ID, id); rsc_ref = create_xml_node(*rsc_set, XML_TAG_RESOURCE_REF); crm_xml_add(rsc_ref, XML_ATTR_ID, id); } else { return TRUE; } /* Remove the "attr" attribute referencing the template/tag */ if (*rsc_set) { xml_remove_prop(xml_obj, attr); } return TRUE; } static gboolean unpack_rsc_location(xmlNode * xml_obj, resource_t * rsc_lh, const char * role, const char * score, pe_working_set_t * data_set, pe_match_data_t * match_data); static gboolean unpack_simple_location(xmlNode * xml_obj, pe_working_set_t * data_set) { const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *value = crm_element_value(xml_obj, XML_LOC_ATTR_SOURCE); if(value) { resource_t *rsc_lh = pe_find_constraint_resource(data_set->resources, value); return unpack_rsc_location(xml_obj, rsc_lh, NULL, NULL, data_set, NULL); } value = crm_element_value(xml_obj, XML_LOC_ATTR_SOURCE_PATTERN); if(value) { regex_t *r_patt = calloc(1, sizeof(regex_t)); bool invert = FALSE; GListPtr rIter = NULL; if(value[0] == '!') { value++; invert = TRUE; } if (regcomp(r_patt, value, REG_EXTENDED)) { crm_config_err("Bad regex '%s' for constraint '%s'", value, id); regfree(r_patt); free(r_patt); return FALSE; } for (rIter = data_set->resources; rIter; rIter = 
rIter->next) { resource_t *r = rIter->data; int nregs = 0; regmatch_t *pmatch = NULL; int status; if(r_patt->re_nsub > 0) { nregs = r_patt->re_nsub + 1; } else { nregs = 1; } pmatch = calloc(nregs, sizeof(regmatch_t)); status = regexec(r_patt, r->id, nregs, pmatch, 0); if(invert == FALSE && status == 0) { pe_re_match_data_t re_match_data = { .string = r->id, .nregs = nregs, .pmatch = pmatch }; pe_match_data_t match_data = { .re = &re_match_data, .params = r->parameters, .meta = r->meta, }; crm_debug("'%s' matched '%s' for %s", r->id, value, id); unpack_rsc_location(xml_obj, r, NULL, NULL, data_set, &match_data); } else if(invert && status != 0) { crm_debug("'%s' is an inverted match of '%s' for %s", r->id, value, id); unpack_rsc_location(xml_obj, r, NULL, NULL, data_set, NULL); } else { crm_trace("'%s' does not match '%s' for %s", r->id, value, id); } free(pmatch); } regfree(r_patt); free(r_patt); } return FALSE; } static gboolean unpack_rsc_location(xmlNode * xml_obj, resource_t * rsc_lh, const char * role, const char * score, pe_working_set_t * data_set, pe_match_data_t * match_data) { gboolean empty = TRUE; rsc_to_node_t *location = NULL; const char *id_lh = crm_element_value(xml_obj, XML_LOC_ATTR_SOURCE); const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *node = crm_element_value(xml_obj, XML_CIB_TAG_NODE); const char *discovery = crm_element_value(xml_obj, XML_LOCATION_ATTR_DISCOVERY); if (rsc_lh == NULL) { /* only a warn as BSC adds the constraint then the resource */ crm_config_warn("No resource (con=%s, rsc=%s)", id, id_lh); return FALSE; } if (score == NULL) { score = crm_element_value(xml_obj, XML_RULE_ATTR_SCORE); } if (node != NULL && score != NULL) { int score_i = char2score(score); node_t *match = pe_find_node(data_set->nodes, node); if (!match) { return FALSE; } location = rsc2node_new(id, rsc_lh, score_i, discovery, match, data_set); } else { xmlNode *rule_xml = NULL; for (rule_xml = __xml_first_child(xml_obj); rule_xml != NULL; 
rule_xml = __xml_next_element(rule_xml)) { if (crm_str_eq((const char *)rule_xml->name, XML_TAG_RULE, TRUE)) { empty = FALSE; crm_trace("Unpacking %s/%s", id, ID(rule_xml)); generate_location_rule(rsc_lh, rule_xml, discovery, data_set, match_data); } } if (empty) { crm_config_err("Invalid location constraint %s:" " rsc_location must contain at least one rule", ID(xml_obj)); } } if (role == NULL) { role = crm_element_value(xml_obj, XML_RULE_ATTR_ROLE); } if (location && role) { if (text2role(role) == RSC_ROLE_UNKNOWN) { pe_err("Invalid constraint %s: Bad role %s", id, role); return FALSE; } else { enum rsc_role_e r = text2role(role); switch(r) { case RSC_ROLE_UNKNOWN: case RSC_ROLE_STARTED: case RSC_ROLE_SLAVE: /* Applies to all */ location->role_filter = RSC_ROLE_UNKNOWN; break; default: location->role_filter = r; break; } } } return TRUE; } static gboolean unpack_location_tags(xmlNode * xml_obj, xmlNode ** expanded_xml, pe_working_set_t * data_set) { const char *id = NULL; const char *id_lh = NULL; const char *state_lh = NULL; resource_t *rsc_lh = NULL; tag_t *tag_lh = NULL; xmlNode *new_xml = NULL; xmlNode *rsc_set_lh = NULL; *expanded_xml = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } id = crm_element_value(xml_obj, XML_ATTR_ID); if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } /* Attempt to expand any template/tag references in possible resource sets. */ expand_tags_in_sets(xml_obj, &new_xml, data_set); if (new_xml) { /* There are resource sets referencing templates. Return with the expanded XML. 
*/ crm_log_xml_trace(new_xml, "Expanded rsc_location..."); *expanded_xml = new_xml; return TRUE; } id_lh = crm_element_value(xml_obj, XML_LOC_ATTR_SOURCE); if (id_lh == NULL) { return TRUE; } if (valid_resource_or_tag(data_set, id_lh, &rsc_lh, &tag_lh) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_lh); return FALSE; } else if (rsc_lh) { /* No template is referenced. */ return TRUE; } state_lh = crm_element_value(xml_obj, XML_RULE_ATTR_ROLE); new_xml = copy_xml(xml_obj); /* Convert the template/tag reference in "rsc" into a resource_set under the rsc_location constraint. */ if (tag_to_set(new_xml, &rsc_set_lh, XML_LOC_ATTR_SOURCE, FALSE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_lh) { if (state_lh) { /* A "rsc-role" is specified. Move it into the converted resource_set as a "role" attribute. */ crm_xml_add(rsc_set_lh, "role", state_lh); xml_remove_prop(new_xml, XML_RULE_ATTR_ROLE); } crm_log_xml_trace(new_xml, "Expanded rsc_location..."); *expanded_xml = new_xml; } else { /* No sets */ free_xml(new_xml); } return TRUE; } static gboolean unpack_location_set(xmlNode * location, xmlNode * set, pe_working_set_t * data_set) { xmlNode *xml_rsc = NULL; resource_t *resource = NULL; const char *set_id; const char *role; const char *local_score; if (set == NULL) { crm_config_err("No resource_set object to process."); return FALSE; } set_id = ID(set); if (set_id == NULL) { crm_config_err("resource_set must have an id"); return FALSE; } role = crm_element_value(set, "role"); local_score = crm_element_value(set, XML_RULE_ATTR_SCORE); for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(set_id, resource, ID(xml_rsc)); unpack_rsc_location(location, resource, role, local_score, data_set, NULL); } } return TRUE; } gboolean unpack_location(xmlNode * xml_obj, pe_working_set_t * 
data_set) { xmlNode *set = NULL; gboolean any_sets = FALSE; xmlNode *orig_xml = NULL; xmlNode *expanded_xml = NULL; if (unpack_location_tags(xml_obj, &expanded_xml, data_set) == FALSE) { return FALSE; } if (expanded_xml) { orig_xml = xml_obj; xml_obj = expanded_xml; } for (set = __xml_first_child(xml_obj); set != NULL; set = __xml_next_element(set)) { if (crm_str_eq((const char *)set->name, XML_CONS_TAG_RSC_SET, TRUE)) { any_sets = TRUE; set = expand_idref(set, data_set->input); if (unpack_location_set(xml_obj, set, data_set) == FALSE) { if (expanded_xml) { free_xml(expanded_xml); } return FALSE; } } } if (expanded_xml) { free_xml(expanded_xml); xml_obj = orig_xml; } if (any_sets == FALSE) { return unpack_simple_location(xml_obj, data_set); } return TRUE; } static int get_node_score(const char *rule, const char *score, gboolean raw, node_t * node, resource_t *rsc) { int score_f = 0; if (score == NULL) { pe_err("Rule %s: no score specified. Assuming 0.", rule); } else if (raw) { score_f = char2score(score); } else { const char *attr_score = pe_node_attribute_calculated(node, score, rsc); if (attr_score == NULL) { crm_debug("Rule %s: node %s did not have a value for %s", rule, node->details->uname, score); score_f = -INFINITY; } else { crm_debug("Rule %s: node %s had value %s for %s", rule, node->details->uname, attr_score, score); score_f = char2score(attr_score); } } return score_f; } static rsc_to_node_t * generate_location_rule(resource_t * rsc, xmlNode * rule_xml, const char *discovery, pe_working_set_t * data_set, pe_match_data_t * match_data) { const char *rule_id = NULL; const char *score = NULL; const char *boolean = NULL; const char *role = NULL; GListPtr gIter = NULL; GListPtr match_L = NULL; gboolean do_and = TRUE; gboolean accept = TRUE; gboolean raw_score = TRUE; gboolean score_allocated = FALSE; rsc_to_node_t *location_rule = NULL; rule_xml = expand_idref(rule_xml, data_set->input); rule_id = crm_element_value(rule_xml, XML_ATTR_ID); boolean = 
crm_element_value(rule_xml, XML_RULE_ATTR_BOOLEAN_OP); role = crm_element_value(rule_xml, XML_RULE_ATTR_ROLE); crm_trace("Processing rule: %s", rule_id); if (role != NULL && text2role(role) == RSC_ROLE_UNKNOWN) { pe_err("Bad role specified for %s: %s", rule_id, role); return NULL; } score = crm_element_value(rule_xml, XML_RULE_ATTR_SCORE); if (score == NULL) { score = crm_element_value(rule_xml, XML_RULE_ATTR_SCORE_ATTRIBUTE); if (score != NULL) { raw_score = FALSE; } } if (safe_str_eq(boolean, "or")) { do_and = FALSE; } location_rule = rsc2node_new(rule_id, rsc, 0, discovery, NULL, data_set); if (location_rule == NULL) { return NULL; } if (match_data && match_data->re && match_data->re->nregs > 0 && match_data->re->pmatch[0].rm_so != -1) { if (raw_score == FALSE) { char *result = pe_expand_re_matches(score, match_data->re); if (result) { score = (const char *) result; score_allocated = TRUE; } } } if (role != NULL) { crm_trace("Setting role filter: %s", role); location_rule->role_filter = text2role(role); if (location_rule->role_filter == RSC_ROLE_SLAVE) { /* Any master/slave cannot be promoted without being a slave first * Ergo, any constraint for the slave role applies to every role */ location_rule->role_filter = RSC_ROLE_UNKNOWN; } } if (do_and) { GListPtr gIter = NULL; match_L = node_list_dup(data_set->nodes, TRUE, FALSE); for (gIter = match_L; gIter != NULL; gIter = gIter->next) { node_t *node = (node_t *) gIter->data; node->weight = get_node_score(rule_id, score, raw_score, node, rsc); } } for (gIter = data_set->nodes; gIter != NULL; gIter = gIter->next) { int score_f = 0; node_t *node = (node_t *) gIter->data; accept = pe_test_rule_full(rule_xml, node->details->attrs, RSC_ROLE_UNKNOWN, data_set->now, match_data); crm_trace("Rule %s %s on %s", ID(rule_xml), accept ? 
"passed" : "failed", node->details->uname); score_f = get_node_score(rule_id, score, raw_score, node, rsc); /* if(accept && score_f == -INFINITY) { */ /* accept = FALSE; */ /* } */ if (accept) { node_t *local = pe_find_node_id(match_L, node->details->id); if (local == NULL && do_and) { continue; } else if (local == NULL) { local = node_copy(node); match_L = g_list_append(match_L, local); } if (do_and == FALSE) { local->weight = merge_weights(local->weight, score_f); } crm_trace("node %s now has weight %d", node->details->uname, local->weight); } else if (do_and && !accept) { /* remove it */ node_t *delete = pe_find_node_id(match_L, node->details->id); if (delete != NULL) { match_L = g_list_remove(match_L, delete); crm_trace("node %s did not match", node->details->uname); } free(delete); } } if (score_allocated == TRUE) { free((char *)score); } location_rule->node_list_rh = match_L; if (location_rule->node_list_rh == NULL) { crm_trace("No matching nodes for rule %s", rule_id); return NULL; } crm_trace("%s: %d nodes matched", rule_id, g_list_length(location_rule->node_list_rh)); return location_rule; } static gint sort_cons_priority_lh(gconstpointer a, gconstpointer b) { const rsc_colocation_t *rsc_constraint1 = (const rsc_colocation_t *)a; const rsc_colocation_t *rsc_constraint2 = (const rsc_colocation_t *)b; if (a == NULL) { return 1; } if (b == NULL) { return -1; } CRM_ASSERT(rsc_constraint1->rsc_lh != NULL); CRM_ASSERT(rsc_constraint1->rsc_rh != NULL); if (rsc_constraint1->rsc_lh->priority > rsc_constraint2->rsc_lh->priority) { return -1; } if (rsc_constraint1->rsc_lh->priority < rsc_constraint2->rsc_lh->priority) { return 1; } /* Process clones before primitives and groups */ if (rsc_constraint1->rsc_lh->variant > rsc_constraint2->rsc_lh->variant) { return -1; } else if (rsc_constraint1->rsc_lh->variant < rsc_constraint2->rsc_lh->variant) { return 1; } return strcmp(rsc_constraint1->rsc_lh->id, rsc_constraint2->rsc_lh->id); } static gint 
sort_cons_priority_rh(gconstpointer a, gconstpointer b) { const rsc_colocation_t *rsc_constraint1 = (const rsc_colocation_t *)a; const rsc_colocation_t *rsc_constraint2 = (const rsc_colocation_t *)b; if (a == NULL) { return 1; } if (b == NULL) { return -1; } CRM_ASSERT(rsc_constraint1->rsc_lh != NULL); CRM_ASSERT(rsc_constraint1->rsc_rh != NULL); if (rsc_constraint1->rsc_rh->priority > rsc_constraint2->rsc_rh->priority) { return -1; } if (rsc_constraint1->rsc_rh->priority < rsc_constraint2->rsc_rh->priority) { return 1; } /* Process clones before primitives and groups */ if (rsc_constraint1->rsc_rh->variant > rsc_constraint2->rsc_rh->variant) { return -1; } else if (rsc_constraint1->rsc_rh->variant < rsc_constraint2->rsc_rh->variant) { return 1; } return strcmp(rsc_constraint1->rsc_rh->id, rsc_constraint2->rsc_rh->id); } static void anti_colocation_order(resource_t * first_rsc, int first_role, resource_t * then_rsc, int then_role, pe_working_set_t * data_set) { const char *first_tasks[] = { NULL, NULL }; const char *then_tasks[] = { NULL, NULL }; int first_lpc = 0; int then_lpc = 0; /* Actions to make first_rsc lose first_role */ if (first_role == RSC_ROLE_MASTER) { first_tasks[0] = CRMD_ACTION_DEMOTE; } else { first_tasks[0] = CRMD_ACTION_STOP; if (first_role == RSC_ROLE_SLAVE) { first_tasks[1] = CRMD_ACTION_PROMOTE; } } /* Actions to make then_rsc gain then_role */ if (then_role == RSC_ROLE_MASTER) { then_tasks[0] = CRMD_ACTION_PROMOTE; } else { then_tasks[0] = CRMD_ACTION_START; if (then_role == RSC_ROLE_SLAVE) { then_tasks[1] = CRMD_ACTION_DEMOTE; } } for (first_lpc = 0; first_lpc <= 1 && first_tasks[first_lpc] != NULL; first_lpc++) { for (then_lpc = 0; then_lpc <= 1 && then_tasks[then_lpc] != NULL; then_lpc++) { new_rsc_order(first_rsc, first_tasks[first_lpc], then_rsc, then_tasks[then_lpc], pe_order_anti_colocation, data_set); } } } gboolean rsc_colocation_new(const char *id, const char *node_attr, int score, resource_t * rsc_lh, resource_t * rsc_rh, const 
char *state_lh, const char *state_rh, pe_working_set_t * data_set) { rsc_colocation_t *new_con = NULL; if (rsc_lh == NULL) { crm_config_err("No resource found for LHS %s", id); return FALSE; } else if (rsc_rh == NULL) { crm_config_err("No resource found for RHS of %s", id); return FALSE; } new_con = calloc(1, sizeof(rsc_colocation_t)); if (new_con == NULL) { return FALSE; } if (state_lh == NULL || safe_str_eq(state_lh, RSC_ROLE_STARTED_S)) { state_lh = RSC_ROLE_UNKNOWN_S; } if (state_rh == NULL || safe_str_eq(state_rh, RSC_ROLE_STARTED_S)) { state_rh = RSC_ROLE_UNKNOWN_S; } new_con->id = id; new_con->rsc_lh = rsc_lh; new_con->rsc_rh = rsc_rh; new_con->score = score; new_con->role_lh = text2role(state_lh); new_con->role_rh = text2role(state_rh); new_con->node_attribute = node_attr; if (node_attr == NULL) { node_attr = CRM_ATTR_UNAME; } pe_rsc_trace(rsc_lh, "%s ==> %s (%s %d)", rsc_lh->id, rsc_rh->id, node_attr, score); rsc_lh->rsc_cons = g_list_insert_sorted(rsc_lh->rsc_cons, new_con, sort_cons_priority_rh); rsc_rh->rsc_cons_lhs = g_list_insert_sorted(rsc_rh->rsc_cons_lhs, new_con, sort_cons_priority_lh); data_set->colocation_constraints = g_list_append(data_set->colocation_constraints, new_con); if (score <= -INFINITY) { anti_colocation_order(rsc_lh, new_con->role_lh, rsc_rh, new_con->role_rh, data_set); anti_colocation_order(rsc_rh, new_con->role_rh, rsc_lh, new_con->role_lh, data_set); } return TRUE; } /* LHS before RHS */ int new_rsc_order(resource_t * lh_rsc, const char *lh_task, resource_t * rh_rsc, const char *rh_task, enum pe_ordering type, pe_working_set_t * data_set) { char *lh_key = NULL; char *rh_key = NULL; CRM_CHECK(lh_rsc != NULL, return -1); CRM_CHECK(lh_task != NULL, return -1); CRM_CHECK(rh_rsc != NULL, return -1); CRM_CHECK(rh_task != NULL, return -1); /* We no longer need to test if these reference stonith resources * now that stonithd has access to them even when they're not "running" * if (validate_order_resources(lh_rsc, lh_task, rh_rsc, 
rh_task)) { return -1; } */ lh_key = generate_op_key(lh_rsc->id, lh_task, 0); rh_key = generate_op_key(rh_rsc->id, rh_task, 0); return custom_action_order(lh_rsc, lh_key, NULL, rh_rsc, rh_key, NULL, type, data_set); } static char * task_from_action_or_key(action_t *action, const char *key) { char *res = NULL; char *rsc_id = NULL; char *op_type = NULL; int interval = 0; if (action) { res = strdup(action->task); } else if (key) { int rc = 0; rc = parse_op_key(key, &rsc_id, &op_type, &interval); if (rc == TRUE) { res = op_type; op_type = NULL; } free(rsc_id); free(op_type); } return res; } /* when order constraints are made between two resources start and stop actions * those constraints have to be mirrored against the corresponding * migration actions to ensure start/stop ordering is preserved during * a migration */ static void handle_migration_ordering(order_constraint_t *order, pe_working_set_t *data_set) { char *lh_task = NULL; char *rh_task = NULL; gboolean rh_migratable; gboolean lh_migratable; if (order->lh_rsc == NULL || order->rh_rsc == NULL) { return; } else if (order->lh_rsc == order->rh_rsc) { return; /* don't mess with those constraints built between parent * resources and the children */ } else if (is_parent(order->lh_rsc, order->rh_rsc)) { return; } else if (is_parent(order->rh_rsc, order->lh_rsc)) { return; } lh_migratable = is_set(order->lh_rsc->flags, pe_rsc_allow_migrate); rh_migratable = is_set(order->rh_rsc->flags, pe_rsc_allow_migrate); /* one of them has to be migratable for * the migrate ordering logic to be applied */ if (lh_migratable == FALSE && rh_migratable == FALSE) { return; } /* at this point we have two resources which allow migrations that have an * order dependency set between them. If those order dependencies involve * start/stop actions, we need to mirror the corresponding migrate actions * so order will be preserved. 
*/ lh_task = task_from_action_or_key(order->lh_action, order->lh_action_task); rh_task = task_from_action_or_key(order->rh_action, order->rh_action_task); if (lh_task == NULL || rh_task == NULL) { goto cleanup_order; } if (safe_str_eq(lh_task, RSC_START) && safe_str_eq(rh_task, RSC_START)) { int flags = pe_order_optional; if (lh_migratable && rh_migratable) { /* A start then B start * A migrate_from then B migrate_to */ custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_MIGRATED, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATE, 0), NULL, flags, data_set); } if (rh_migratable) { if (lh_migratable) { flags |= pe_order_apply_first_non_migratable; } /* A start then B start * A start then B migrate_to... only if A start is not a part of a migration*/ custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_START, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATE, 0), NULL, flags, data_set); } } else if (rh_migratable == TRUE && safe_str_eq(lh_task, RSC_STOP) && safe_str_eq(rh_task, RSC_STOP)) { int flags = pe_order_optional; if (lh_migratable) { flags |= pe_order_apply_first_non_migratable; } /* rh side is at the bottom of the stack during a stop. If we have a constraint * stop B then stop A, if B is migrating via stop/start, and A is migrating using migration actions, * we need to enforce that A's migrate_to action occurs after B's stop action. */ custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_STOP, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATE, 0), NULL, flags, data_set); /* We need to build the stop constraint against migrate_from as well * to account for partial migrations. 
*/ if (order->rh_rsc->partial_migration_target) { custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_STOP, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATED, 0), NULL, flags, data_set); } } else if (safe_str_eq(lh_task, RSC_PROMOTE) && safe_str_eq(rh_task, RSC_START)) { int flags = pe_order_optional; if (rh_migratable) { /* A promote then B start * A promote then B migrate_to */ custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_PROMOTE, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATE, 0), NULL, flags, data_set); } } else if (safe_str_eq(lh_task, RSC_DEMOTE) && safe_str_eq(rh_task, RSC_STOP)) { int flags = pe_order_optional; if (rh_migratable) { /* A demote then B stop * A demote then B migrate_to */ custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_DEMOTE, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATE, 0), NULL, flags, data_set); /* We need to build the demote constraint against migrate_from as well * to account for partial migrations. 
*/ if (order->rh_rsc->partial_migration_target) { custom_action_order(order->lh_rsc, generate_op_key(order->lh_rsc->id, RSC_DEMOTE, 0), NULL, order->rh_rsc, generate_op_key(order->rh_rsc->id, RSC_MIGRATED, 0), NULL, flags, data_set); } } } cleanup_order: free(lh_task); free(rh_task); } /* LHS before RHS */ int custom_action_order(resource_t * lh_rsc, char *lh_action_task, action_t * lh_action, resource_t * rh_rsc, char *rh_action_task, action_t * rh_action, enum pe_ordering type, pe_working_set_t * data_set) { order_constraint_t *order = NULL; if (lh_rsc == NULL && lh_action) { lh_rsc = lh_action->rsc; } if (rh_rsc == NULL && rh_action) { rh_rsc = rh_action->rsc; } if ((lh_action == NULL && lh_rsc == NULL) || (rh_action == NULL && rh_rsc == NULL)) { crm_config_err("Invalid inputs %p.%p %p.%p", lh_rsc, lh_action, rh_rsc, rh_action); free(lh_action_task); free(rh_action_task); return -1; } order = calloc(1, sizeof(order_constraint_t)); crm_trace("Creating[%d] %s %s %s - %s %s %s", data_set->order_id, lh_rsc?lh_rsc->id:"NA", lh_action_task, lh_action?lh_action->uuid:"NA", rh_rsc?rh_rsc->id:"NA", rh_action_task, rh_action?rh_action->uuid:"NA"); /* CRM_ASSERT(data_set->order_id != 291); */ order->id = data_set->order_id++; order->type = type; order->lh_rsc = lh_rsc; order->rh_rsc = rh_rsc; order->lh_action = lh_action; order->rh_action = rh_action; order->lh_action_task = lh_action_task; order->rh_action_task = rh_action_task; if (order->lh_action_task == NULL && lh_action) { order->lh_action_task = strdup(lh_action->uuid); } if (order->rh_action_task == NULL && rh_action) { order->rh_action_task = strdup(rh_action->uuid); } if (order->lh_rsc == NULL && lh_action) { order->lh_rsc = lh_action->rsc; } if (order->rh_rsc == NULL && rh_action) { order->rh_rsc = rh_action->rsc; } data_set->ordering_constraints = g_list_prepend(data_set->ordering_constraints, order); handle_migration_ordering(order, data_set); return order->id; } enum pe_ordering get_asymmetrical_flags(enum 
pe_order_kind kind) { enum pe_ordering flags = pe_order_optional; if (kind == pe_order_kind_mandatory) { flags |= pe_order_asymmetrical; } else if (kind == pe_order_kind_serialize) { flags |= pe_order_serialize_only; } return flags; } enum pe_ordering get_flags(const char *id, enum pe_order_kind kind, const char *action_first, const char *action_then, gboolean invert) { enum pe_ordering flags = pe_order_optional; if (invert && kind == pe_order_kind_mandatory) { crm_trace("Upgrade %s: implies left", id); flags |= pe_order_implies_first; } else if (kind == pe_order_kind_mandatory) { crm_trace("Upgrade %s: implies right", id); flags |= pe_order_implies_then; if (safe_str_eq(action_first, RSC_START) || safe_str_eq(action_first, RSC_PROMOTE)) { crm_trace("Upgrade %s: runnable", id); flags |= pe_order_runnable_left; } } else if (kind == pe_order_kind_serialize) { flags |= pe_order_serialize_only; } return flags; } static gboolean unpack_order_set(xmlNode * set, enum pe_order_kind kind, resource_t ** rsc, action_t ** begin, action_t ** end, action_t ** inv_begin, action_t ** inv_end, const char *symmetrical, pe_working_set_t * data_set) { xmlNode *xml_rsc = NULL; GListPtr set_iter = NULL; GListPtr resources = NULL; resource_t *last = NULL; resource_t *resource = NULL; int local_kind = kind; gboolean sequential = FALSE; enum pe_ordering flags = pe_order_optional; char *key = NULL; const char *id = ID(set); const char *action = crm_element_value(set, "action"); const char *sequential_s = crm_element_value(set, "sequential"); const char *kind_s = crm_element_value(set, XML_ORDER_ATTR_KIND); /* char *pseudo_id = NULL; char *end_id = NULL; char *begin_id = NULL; */ if (action == NULL) { action = RSC_START; } if (kind_s) { local_kind = get_ordering_type(set); } if (sequential_s == NULL) { sequential_s = "1"; } sequential = crm_is_true(sequential_s); if (crm_is_true(symmetrical)) { flags = get_flags(id, local_kind, action, action, FALSE); } else { flags = 
get_asymmetrical_flags(local_kind); } for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, resource, ID(xml_rsc)); resources = g_list_append(resources, resource); } } if (g_list_length(resources) == 1) { crm_trace("Single set: %s", id); *rsc = resource; *end = NULL; *begin = NULL; *inv_end = NULL; *inv_begin = NULL; goto done; } /* pseudo_id = crm_concat(id, action, '-'); end_id = crm_concat(pseudo_id, "end", '-'); begin_id = crm_concat(pseudo_id, "begin", '-'); */ *rsc = NULL; /* *end = get_pseudo_op(end_id, data_set); *begin = get_pseudo_op(begin_id, data_set); free(pseudo_id); free(begin_id); free(end_id); */ set_iter = resources; while (set_iter != NULL) { resource = (resource_t *) set_iter->data; set_iter = set_iter->next; key = generate_op_key(resource->id, action, 0); /* custom_action_order(NULL, NULL, *begin, resource, strdup(key), NULL, flags|pe_order_implies_first_printed, data_set); custom_action_order(resource, strdup(key), NULL, NULL, NULL, *end, flags|pe_order_implies_then_printed, data_set); */ if (local_kind == pe_order_kind_serialize) { /* Serialize before everything that comes after */ GListPtr gIter = NULL; for (gIter = set_iter; gIter != NULL; gIter = gIter->next) { resource_t *then_rsc = (resource_t *) gIter->data; char *then_key = generate_op_key(then_rsc->id, action, 0); custom_action_order(resource, strdup(key), NULL, then_rsc, then_key, NULL, flags, data_set); } } else if (sequential) { if (last != NULL) { new_rsc_order(last, action, resource, action, flags, data_set); } last = resource; } free(key); } if (crm_is_true(symmetrical) == FALSE) { goto done; } else if (symmetrical && local_kind == pe_order_kind_serialize) { crm_config_warn("Cannot invert serialized constraint set %s", id); goto done; } else if (local_kind == pe_order_kind_serialize) { goto done; } last = NULL; action = 
invert_action(action); /* pseudo_id = crm_concat(id, action, '-'); end_id = crm_concat(pseudo_id, "end", '-'); begin_id = crm_concat(pseudo_id, "begin", '-'); *inv_end = get_pseudo_op(end_id, data_set); *inv_begin = get_pseudo_op(begin_id, data_set); free(pseudo_id); free(begin_id); free(end_id); */ flags = get_flags(id, local_kind, action, action, TRUE); set_iter = resources; while (set_iter != NULL) { resource = (resource_t *) set_iter->data; set_iter = set_iter->next; /* key = generate_op_key(resource->id, action, 0); custom_action_order(NULL, NULL, *inv_begin, resource, strdup(key), NULL, flags|pe_order_implies_first_printed, data_set); custom_action_order(resource, key, NULL, NULL, NULL, *inv_end, flags|pe_order_implies_then_printed, data_set); */ if (sequential) { if (last != NULL) { new_rsc_order(resource, action, last, action, flags, data_set); } last = resource; } } done: g_list_free(resources); return TRUE; } static gboolean order_rsc_sets(const char *id, xmlNode * set1, xmlNode * set2, enum pe_order_kind kind, pe_working_set_t * data_set, gboolean invert, gboolean symmetrical) { xmlNode *xml_rsc = NULL; xmlNode *xml_rsc_2 = NULL; resource_t *rsc_1 = NULL; resource_t *rsc_2 = NULL; const char *action_1 = crm_element_value(set1, "action"); const char *action_2 = crm_element_value(set2, "action"); const char *sequential_1 = crm_element_value(set1, "sequential"); const char *sequential_2 = crm_element_value(set2, "sequential"); const char *require_all_s = crm_element_value(set1, "require-all"); gboolean require_all = require_all_s ? 
crm_is_true(require_all_s) : TRUE; enum pe_ordering flags = pe_order_none; if (action_1 == NULL) { action_1 = RSC_START; } if (action_2 == NULL) { action_2 = RSC_START; } if (invert) { action_1 = invert_action(action_1); action_2 = invert_action(action_2); } if(safe_str_eq(RSC_STOP, action_1) || safe_str_eq(RSC_DEMOTE, action_1)) { /* Assuming: A -> ( B || C) -> D * The one-or-more logic only applies during the start/promote phase. * During shutdown neither B nor C can shut down until D is down, so simply turn require_all back on. */ require_all = TRUE; } if (symmetrical == FALSE) { flags = get_asymmetrical_flags(kind); } else { flags = get_flags(id, kind, action_2, action_1, invert); } /* If we have an un-ordered set1, whether it is sequential or not is irrelevant with regard to set2. */ if (!require_all) { char *task = crm_concat(CRM_OP_RELAXED_SET, ID(set1), ':'); action_t *unordered_action = get_pseudo_op(task, data_set); free(task); update_action_flags(unordered_action, pe_action_requires_any, __FUNCTION__, __LINE__); for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (!crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { continue; } EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); /* Add an ordering constraint between every element in set1 and the pseudo action. * If any action in set1 is runnable the pseudo action will be runnable. */ custom_action_order(rsc_1, generate_op_key(rsc_1->id, action_1, 0), NULL, NULL, NULL, unordered_action, pe_order_one_or_more | pe_order_implies_then_printed, data_set); } for (xml_rsc_2 = __xml_first_child(set2); xml_rsc_2 != NULL; xml_rsc_2 = __xml_next_element(xml_rsc_2)) { if (!crm_str_eq((const char *)xml_rsc_2->name, XML_TAG_RESOURCE_REF, TRUE)) { continue; } EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc_2)); /* Add an ordering constraint between the pseudo action and every element in set2. 
* If the pseudo action is runnable, every action in set2 will be runnable */ custom_action_order(NULL, NULL, unordered_action, rsc_2, generate_op_key(rsc_2->id, action_2, 0), NULL, flags | pe_order_runnable_left, data_set); } return TRUE; } if (crm_is_true(sequential_1)) { if (invert == FALSE) { /* get the last one */ const char *rid = NULL; for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { rid = ID(xml_rsc); } } EXPAND_CONSTRAINT_IDREF(id, rsc_1, rid); } else { /* get the first one */ for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); break; } } } } if (crm_is_true(sequential_2)) { if (invert == FALSE) { /* get the first one */ for (xml_rsc = __xml_first_child(set2); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc)); break; } } } else { /* get the last one */ const char *rid = NULL; for (xml_rsc = __xml_first_child(set2); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { rid = ID(xml_rsc); } } EXPAND_CONSTRAINT_IDREF(id, rsc_2, rid); } } if (rsc_1 != NULL && rsc_2 != NULL) { new_rsc_order(rsc_1, action_1, rsc_2, action_2, flags, data_set); } else if (rsc_1 != NULL) { for (xml_rsc = __xml_first_child(set2); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc)); new_rsc_order(rsc_1, action_1, rsc_2, action_2, flags, data_set); } } } else if (rsc_2 != NULL) { xmlNode *xml_rsc = NULL; for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; 
xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); new_rsc_order(rsc_1, action_1, rsc_2, action_2, flags, data_set); } } } else { for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { xmlNode *xml_rsc_2 = NULL; EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); for (xml_rsc_2 = __xml_first_child(set2); xml_rsc_2 != NULL; xml_rsc_2 = __xml_next_element(xml_rsc_2)) { if (crm_str_eq((const char *)xml_rsc_2->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc_2)); new_rsc_order(rsc_1, action_1, rsc_2, action_2, flags, data_set); } } } } } return TRUE; } static gboolean unpack_order_tags(xmlNode * xml_obj, xmlNode ** expanded_xml, pe_working_set_t * data_set) { const char *id = NULL; const char *id_first = NULL; const char *id_then = NULL; const char *action_first = NULL; const char *action_then = NULL; resource_t *rsc_first = NULL; resource_t *rsc_then = NULL; tag_t *tag_first = NULL; tag_t *tag_then = NULL; xmlNode *new_xml = NULL; xmlNode *rsc_set_first = NULL; xmlNode *rsc_set_then = NULL; gboolean any_sets = FALSE; *expanded_xml = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } id = crm_element_value(xml_obj, XML_ATTR_ID); if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } /* Attempt to expand any template/tag references in possible resource sets. */ expand_tags_in_sets(xml_obj, &new_xml, data_set); if (new_xml) { /* There are resource sets referencing templates/tags. Return with the expanded XML. 
*/ crm_log_xml_trace(new_xml, "Expanded rsc_order..."); *expanded_xml = new_xml; return TRUE; } id_first = crm_element_value(xml_obj, XML_ORDER_ATTR_FIRST); id_then = crm_element_value(xml_obj, XML_ORDER_ATTR_THEN); if (id_first == NULL || id_then == NULL) { return TRUE; } if (valid_resource_or_tag(data_set, id_first, &rsc_first, &tag_first) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_first); return FALSE; } if (valid_resource_or_tag(data_set, id_then, &rsc_then, &tag_then) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_then); return FALSE; } if (rsc_first && rsc_then) { /* Neither side references any template/tag. */ return TRUE; } action_first = crm_element_value(xml_obj, XML_ORDER_ATTR_FIRST_ACTION); action_then = crm_element_value(xml_obj, XML_ORDER_ATTR_THEN_ACTION); new_xml = copy_xml(xml_obj); /* Convert the template/tag reference in "first" into a resource_set under the order constraint. */ if (tag_to_set(new_xml, &rsc_set_first, XML_ORDER_ATTR_FIRST, TRUE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_first) { if (action_first) { /* A "first-action" is specified. Move it into the converted resource_set as an "action" attribute. */ crm_xml_add(rsc_set_first, "action", action_first); xml_remove_prop(new_xml, XML_ORDER_ATTR_FIRST_ACTION); } any_sets = TRUE; } /* Convert the template/tag reference in "then" into a resource_set under the order constraint. */ if (tag_to_set(new_xml, &rsc_set_then, XML_ORDER_ATTR_THEN, TRUE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_then) { if (action_then) { /* A "then-action" is specified. Move it into the converted resource_set as an "action" attribute. 
*/ crm_xml_add(rsc_set_then, "action", action_then); xml_remove_prop(new_xml, XML_ORDER_ATTR_THEN_ACTION); } any_sets = TRUE; } if (any_sets) { crm_log_xml_trace(new_xml, "Expanded rsc_order..."); *expanded_xml = new_xml; } else { free_xml(new_xml); } return TRUE; } gboolean unpack_rsc_order(xmlNode * xml_obj, pe_working_set_t * data_set) { gboolean any_sets = FALSE; resource_t *rsc = NULL; /* resource_t *last_rsc = NULL; */ action_t *set_end = NULL; action_t *set_begin = NULL; action_t *set_inv_end = NULL; action_t *set_inv_begin = NULL; xmlNode *set = NULL; xmlNode *last = NULL; xmlNode *orig_xml = NULL; xmlNode *expanded_xml = NULL; /* action_t *last_end = NULL; action_t *last_begin = NULL; action_t *last_inv_end = NULL; action_t *last_inv_begin = NULL; */ const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *invert = crm_element_value(xml_obj, XML_CONS_ATTR_SYMMETRICAL); enum pe_order_kind kind = get_ordering_type(xml_obj); gboolean invert_bool = TRUE; gboolean rc = TRUE; if (invert == NULL) { invert = "true"; } invert_bool = crm_is_true(invert); rc = unpack_order_tags(xml_obj, &expanded_xml, data_set); if (expanded_xml) { orig_xml = xml_obj; xml_obj = expanded_xml; } else if (rc == FALSE) { return FALSE; } for (set = __xml_first_child(xml_obj); set != NULL; set = __xml_next_element(set)) { if (crm_str_eq((const char *)set->name, XML_CONS_TAG_RSC_SET, TRUE)) { any_sets = TRUE; set = expand_idref(set, data_set->input); if (unpack_order_set(set, kind, &rsc, &set_begin, &set_end, &set_inv_begin, &set_inv_end, invert, data_set) == FALSE) { return FALSE; /* Expand orders in order_rsc_sets() instead of via pseudo actions. 
*/ /* } else if(last) { const char *set_action = crm_element_value(set, "action"); const char *last_action = crm_element_value(last, "action"); enum pe_ordering flags = get_flags(id, kind, last_action, set_action, FALSE); if(!set_action) { set_action = RSC_START; } if(!last_action) { last_action = RSC_START; } if(rsc == NULL && last_rsc == NULL) { order_actions(last_end, set_begin, flags); } else { custom_action_order( last_rsc, null_or_opkey(last_rsc, last_action), last_end, rsc, null_or_opkey(rsc, set_action), set_begin, flags, data_set); } if(crm_is_true(invert)) { set_action = invert_action(set_action); last_action = invert_action(last_action); flags = get_flags(id, kind, last_action, set_action, TRUE); if(rsc == NULL && last_rsc == NULL) { order_actions(last_inv_begin, set_inv_end, flags); } else { custom_action_order( last_rsc, null_or_opkey(last_rsc, last_action), last_inv_begin, rsc, null_or_opkey(rsc, set_action), set_inv_end, flags, data_set); } } */ } else if ( /* never called -- Now call it for supporting clones in resource sets */ last) { if (order_rsc_sets(id, last, set, kind, data_set, FALSE, invert_bool) == FALSE) { return FALSE; } if (invert_bool && order_rsc_sets(id, set, last, kind, data_set, TRUE, invert_bool) == FALSE) { return FALSE; } } last = set; /* last_rsc = rsc; last_end = set_end; last_begin = set_begin; last_inv_end = set_inv_end; last_inv_begin = set_inv_begin; */ } } if (expanded_xml) { free_xml(expanded_xml); xml_obj = orig_xml; } if (any_sets == FALSE) { return unpack_simple_rsc_order(xml_obj, data_set); } return TRUE; } static gboolean unpack_colocation_set(xmlNode * set, int score, pe_working_set_t * data_set) { xmlNode *xml_rsc = NULL; resource_t *with = NULL; resource_t *resource = NULL; const char *set_id = ID(set); const char *role = crm_element_value(set, "role"); const char *sequential = crm_element_value(set, "sequential"); const char *ordering = crm_element_value(set, "ordering"); int local_score = score; const char 
*score_s = crm_element_value(set, XML_RULE_ATTR_SCORE); if (score_s) { local_score = char2score(score_s); } if(ordering == NULL) { ordering = "group"; } if (sequential != NULL && crm_is_true(sequential) == FALSE) { return TRUE; } else if (local_score >= 0 && safe_str_eq(ordering, "group")) { for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(set_id, resource, ID(xml_rsc)); if (with != NULL) { pe_rsc_trace(resource, "Colocating %s with %s", resource->id, with->id); rsc_colocation_new(set_id, NULL, local_score, resource, with, role, role, data_set); } with = resource; } } } else if (local_score >= 0) { resource_t *last = NULL; for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(set_id, resource, ID(xml_rsc)); if (last != NULL) { pe_rsc_trace(resource, "Colocating %s with %s", last->id, resource->id); rsc_colocation_new(set_id, NULL, local_score, last, resource, role, role, data_set); } last = resource; } } } else { /* Anti-colocating with every prior resource is * the only way to ensure the intuitive result * (i.e. 
that no one in the set can run with anyone else in the set) */ for (xml_rsc = __xml_first_child(set); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { xmlNode *xml_rsc_with = NULL; EXPAND_CONSTRAINT_IDREF(set_id, resource, ID(xml_rsc)); for (xml_rsc_with = __xml_first_child(set); xml_rsc_with != NULL; xml_rsc_with = __xml_next_element(xml_rsc_with)) { if (crm_str_eq((const char *)xml_rsc_with->name, XML_TAG_RESOURCE_REF, TRUE)) { if (safe_str_eq(resource->id, ID(xml_rsc_with))) { break; } else if (resource == NULL) { crm_config_err("%s: No resource found for %s", set_id, ID(xml_rsc_with)); return FALSE; } EXPAND_CONSTRAINT_IDREF(set_id, with, ID(xml_rsc_with)); pe_rsc_trace(resource, "Anti-Colocating %s with %s", resource->id, with->id); rsc_colocation_new(set_id, NULL, local_score, resource, with, role, role, data_set); } } } } } return TRUE; } static gboolean colocate_rsc_sets(const char *id, xmlNode * set1, xmlNode * set2, int score, pe_working_set_t * data_set) { xmlNode *xml_rsc = NULL; resource_t *rsc_1 = NULL; resource_t *rsc_2 = NULL; const char *role_1 = crm_element_value(set1, "role"); const char *role_2 = crm_element_value(set2, "role"); const char *sequential_1 = crm_element_value(set1, "sequential"); const char *sequential_2 = crm_element_value(set2, "sequential"); if (sequential_1 == NULL || crm_is_true(sequential_1)) { /* get the first one */ for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); break; } } } if (sequential_2 == NULL || crm_is_true(sequential_2)) { /* get the last one */ const char *rid = NULL; for (xml_rsc = __xml_first_child(set2); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { rid = ID(xml_rsc); } } 
EXPAND_CONSTRAINT_IDREF(id, rsc_2, rid); } if (rsc_1 != NULL && rsc_2 != NULL) { rsc_colocation_new(id, NULL, score, rsc_1, rsc_2, role_1, role_2, data_set); } else if (rsc_1 != NULL) { for (xml_rsc = __xml_first_child(set2); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc)); rsc_colocation_new(id, NULL, score, rsc_1, rsc_2, role_1, role_2, data_set); } } } else if (rsc_2 != NULL) { for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); rsc_colocation_new(id, NULL, score, rsc_1, rsc_2, role_1, role_2, data_set); } } } else { for (xml_rsc = __xml_first_child(set1); xml_rsc != NULL; xml_rsc = __xml_next_element(xml_rsc)) { if (crm_str_eq((const char *)xml_rsc->name, XML_TAG_RESOURCE_REF, TRUE)) { xmlNode *xml_rsc_2 = NULL; EXPAND_CONSTRAINT_IDREF(id, rsc_1, ID(xml_rsc)); for (xml_rsc_2 = __xml_first_child(set2); xml_rsc_2 != NULL; xml_rsc_2 = __xml_next_element(xml_rsc_2)) { if (crm_str_eq((const char *)xml_rsc_2->name, XML_TAG_RESOURCE_REF, TRUE)) { EXPAND_CONSTRAINT_IDREF(id, rsc_2, ID(xml_rsc_2)); rsc_colocation_new(id, NULL, score, rsc_1, rsc_2, role_1, role_2, data_set); } } } } } return TRUE; } static gboolean unpack_simple_colocation(xmlNode * xml_obj, pe_working_set_t * data_set) { int score_i = 0; const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *score = crm_element_value(xml_obj, XML_RULE_ATTR_SCORE); const char *id_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE); const char *id_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET); const char *state_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_ROLE); const char *state_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET_ROLE); - const char *instance_lh = 
crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_INSTANCE); - const char *instance_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET_INSTANCE); const char *attr = crm_element_value(xml_obj, XML_COLOC_ATTR_NODE_ATTR); - const char *symmetrical = crm_element_value(xml_obj, XML_CONS_ATTR_SYMMETRICAL); + // experimental syntax from pacemaker-next (unlikely to be adopted as-is) + const char *instance_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_INSTANCE); + const char *instance_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET_INSTANCE); + resource_t *rsc_lh = pe_find_constraint_resource(data_set->resources, id_lh); resource_t *rsc_rh = pe_find_constraint_resource(data_set->resources, id_rh); if (rsc_lh == NULL) { crm_config_err("Invalid constraint '%s': No resource named '%s'", id, id_lh); return FALSE; } else if (rsc_rh == NULL) { crm_config_err("Invalid constraint '%s': No resource named '%s'", id, id_rh); return FALSE; } else if (instance_lh && pe_rsc_is_clone(rsc_lh) == FALSE) { crm_config_err ("Invalid constraint '%s': Resource '%s' is not a clone but instance %s was requested", id, id_lh, instance_lh); return FALSE; } else if (instance_rh && pe_rsc_is_clone(rsc_rh) == FALSE) { crm_config_err ("Invalid constraint '%s': Resource '%s' is not a clone but instance %s was requested", id, id_rh, instance_rh); return FALSE; } if (instance_lh) { rsc_lh = find_clone_instance(rsc_lh, instance_lh, data_set); if (rsc_lh == NULL) { crm_config_warn("Invalid constraint '%s': No instance '%s' of '%s'", id, instance_lh, id_lh); return FALSE; } } if (instance_rh) { rsc_rh = find_clone_instance(rsc_rh, instance_rh, data_set); if (rsc_rh == NULL) { crm_config_warn("Invalid constraint '%s': No instance '%s' of '%s'", id, instance_rh, id_rh); return FALSE; } } if (crm_is_true(symmetrical)) { crm_config_warn("The %s colocation constraint attribute has been removed." 
" It didn't do what you think it did anyway.", XML_CONS_ATTR_SYMMETRICAL); } if (score) { score_i = char2score(score); } rsc_colocation_new(id, attr, score_i, rsc_lh, rsc_rh, state_lh, state_rh, data_set); return TRUE; } static gboolean unpack_colocation_tags(xmlNode * xml_obj, xmlNode ** expanded_xml, pe_working_set_t * data_set) { const char *id = NULL; const char *id_lh = NULL; const char *id_rh = NULL; const char *state_lh = NULL; const char *state_rh = NULL; resource_t *rsc_lh = NULL; resource_t *rsc_rh = NULL; tag_t *tag_lh = NULL; tag_t *tag_rh = NULL; xmlNode *new_xml = NULL; xmlNode *rsc_set_lh = NULL; xmlNode *rsc_set_rh = NULL; gboolean any_sets = FALSE; *expanded_xml = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } id = crm_element_value(xml_obj, XML_ATTR_ID); if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } /* Attempt to expand any template/tag references in possible resource sets. */ expand_tags_in_sets(xml_obj, &new_xml, data_set); if (new_xml) { /* There are resource sets referencing templates/tags. Return with the expanded XML. */ crm_log_xml_trace(new_xml, "Expanded rsc_colocation..."); *expanded_xml = new_xml; return TRUE; } id_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE); id_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET); if (id_lh == NULL || id_rh == NULL) { return TRUE; } if (valid_resource_or_tag(data_set, id_lh, &rsc_lh, &tag_lh) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_lh); return FALSE; } if (valid_resource_or_tag(data_set, id_rh, &rsc_rh, &tag_rh) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_rh); return FALSE; } if (rsc_lh && rsc_rh) { /* Neither side references any template/tag. */ return TRUE; } if (tag_lh && tag_rh) { /* A colocation constraint between two templates/tags makes no sense. 
*/ crm_config_err("Either LHS or RHS of %s should be a normal resource instead of a template/tag", id); return FALSE; } state_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_ROLE); state_rh = crm_element_value(xml_obj, XML_COLOC_ATTR_TARGET_ROLE); new_xml = copy_xml(xml_obj); /* Convert the template/tag reference in "rsc" into a resource_set under the colocation constraint. */ if (tag_to_set(new_xml, &rsc_set_lh, XML_COLOC_ATTR_SOURCE, TRUE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_lh) { if (state_lh) { /* A "rsc-role" is specified. Move it into the converted resource_set as a "role" attribute. */ crm_xml_add(rsc_set_lh, "role", state_lh); xml_remove_prop(new_xml, XML_COLOC_ATTR_SOURCE_ROLE); } any_sets = TRUE; } /* Convert the template/tag reference in "with-rsc" into a resource_set under the colocation constraint. */ if (tag_to_set(new_xml, &rsc_set_rh, XML_COLOC_ATTR_TARGET, TRUE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_rh) { if (state_rh) { /* A "with-rsc-role" is specified. Move it into the converted resource_set as a "role" attribute. 
*/ crm_xml_add(rsc_set_rh, "role", state_rh); xml_remove_prop(new_xml, XML_COLOC_ATTR_TARGET_ROLE); } any_sets = TRUE; } if (any_sets) { crm_log_xml_trace(new_xml, "Expanded rsc_colocation..."); *expanded_xml = new_xml; } else { free_xml(new_xml); } return TRUE; } gboolean unpack_rsc_colocation(xmlNode * xml_obj, pe_working_set_t * data_set) { int score_i = 0; xmlNode *set = NULL; xmlNode *last = NULL; gboolean any_sets = FALSE; xmlNode *orig_xml = NULL; xmlNode *expanded_xml = NULL; const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *score = crm_element_value(xml_obj, XML_RULE_ATTR_SCORE); gboolean rc = TRUE; if (score) { score_i = char2score(score); } rc = unpack_colocation_tags(xml_obj, &expanded_xml, data_set); if (expanded_xml) { orig_xml = xml_obj; xml_obj = expanded_xml; } else if (rc == FALSE) { return FALSE; } for (set = __xml_first_child(xml_obj); set != NULL; set = __xml_next_element(set)) { if (crm_str_eq((const char *)set->name, XML_CONS_TAG_RSC_SET, TRUE)) { any_sets = TRUE; set = expand_idref(set, data_set->input); if (unpack_colocation_set(set, score_i, data_set) == FALSE) { return FALSE; } else if (last && colocate_rsc_sets(id, last, set, score_i, data_set) == FALSE) { return FALSE; } last = set; } } if (expanded_xml) { free_xml(expanded_xml); xml_obj = orig_xml; } if (any_sets == FALSE) { return unpack_simple_colocation(xml_obj, data_set); } return TRUE; } gboolean rsc_ticket_new(const char *id, resource_t * rsc_lh, ticket_t * ticket, const char *state_lh, const char *loss_policy, pe_working_set_t * data_set) { rsc_ticket_t *new_rsc_ticket = NULL; if (rsc_lh == NULL) { crm_config_err("No resource found for LHS %s", id); return FALSE; } new_rsc_ticket = calloc(1, sizeof(rsc_ticket_t)); if (new_rsc_ticket == NULL) { return FALSE; } if (state_lh == NULL || safe_str_eq(state_lh, RSC_ROLE_STARTED_S)) { state_lh = RSC_ROLE_UNKNOWN_S; } new_rsc_ticket->id = id; new_rsc_ticket->ticket = ticket; new_rsc_ticket->rsc_lh = rsc_lh; 
new_rsc_ticket->role_lh = text2role(state_lh); if (safe_str_eq(loss_policy, "fence")) { if (is_set(data_set->flags, pe_flag_stonith_enabled)) { new_rsc_ticket->loss_policy = loss_ticket_fence; } else { crm_config_err("Resetting %s loss-policy to 'stop': fencing is not configured", ticket->id); loss_policy = "stop"; } } if (new_rsc_ticket->loss_policy == loss_ticket_fence) { crm_debug("On loss of ticket '%s': Fence the nodes running %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); } else if (safe_str_eq(loss_policy, "freeze")) { crm_debug("On loss of ticket '%s': Freeze %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); new_rsc_ticket->loss_policy = loss_ticket_freeze; } else if (safe_str_eq(loss_policy, "demote")) { crm_debug("On loss of ticket '%s': Demote %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); new_rsc_ticket->loss_policy = loss_ticket_demote; } else if (safe_str_eq(loss_policy, "stop")) { crm_debug("On loss of ticket '%s': Stop %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); new_rsc_ticket->loss_policy = loss_ticket_stop; } else { if (new_rsc_ticket->role_lh == RSC_ROLE_MASTER) { crm_debug("On loss of ticket '%s': Default to demote %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); new_rsc_ticket->loss_policy = loss_ticket_demote; } else { crm_debug("On loss of ticket '%s': Default to stop %s (%s)", new_rsc_ticket->ticket->id, new_rsc_ticket->rsc_lh->id, role2text(new_rsc_ticket->role_lh)); new_rsc_ticket->loss_policy = loss_ticket_stop; } } pe_rsc_trace(rsc_lh, "%s (%s) ==> %s", rsc_lh->id, role2text(new_rsc_ticket->role_lh), ticket->id); rsc_lh->rsc_tickets = g_list_append(rsc_lh->rsc_tickets, new_rsc_ticket); data_set->ticket_constraints = g_list_append(data_set->ticket_constraints, 
new_rsc_ticket); if (new_rsc_ticket->ticket->granted == FALSE || new_rsc_ticket->ticket->standby) { rsc_ticket_constraint(rsc_lh, new_rsc_ticket, data_set); } return TRUE; } static gboolean unpack_rsc_ticket_set(xmlNode * set, ticket_t * ticket, const char *loss_policy, pe_working_set_t * data_set) { xmlNode *xml_rsc = NULL; resource_t *resource = NULL; const char *set_id = NULL; const char *role = NULL; CRM_CHECK(set != NULL, return FALSE); CRM_CHECK(ticket != NULL, return FALSE); set_id = ID(set); if (set_id == NULL) { crm_config_err("resource_set must have an id"); return FALSE; } role = crm_element_value(set, "role"); for (xml_rsc = first_named_child(set, XML_TAG_RESOURCE_REF); xml_rsc != NULL; xml_rsc = crm_next_same_xml(xml_rsc)) { EXPAND_CONSTRAINT_IDREF(set_id, resource, ID(xml_rsc)); pe_rsc_trace(resource, "Resource '%s' depends on ticket '%s'", resource->id, ticket->id); rsc_ticket_new(set_id, resource, ticket, role, loss_policy, data_set); } return TRUE; } static gboolean unpack_simple_rsc_ticket(xmlNode * xml_obj, pe_working_set_t * data_set) { const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *ticket_str = crm_element_value(xml_obj, XML_TICKET_ATTR_TICKET); const char *loss_policy = crm_element_value(xml_obj, XML_TICKET_ATTR_LOSS_POLICY); ticket_t *ticket = NULL; const char *id_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE); const char *state_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_ROLE); + + // experimental syntax from pacemaker-next (unlikely to be adopted as-is) const char *instance_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_INSTANCE); resource_t *rsc_lh = NULL; if (xml_obj == NULL) { crm_config_err("No rsc_ticket constraint object to process."); return FALSE; } if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } if (ticket_str == NULL) { crm_config_err("Invalid constraint '%s': No ticket specified", id); return FALSE; } else { ticket = 
g_hash_table_lookup(data_set->tickets, ticket_str); } if (ticket == NULL) { crm_config_err("Invalid constraint '%s': No ticket named '%s'", id, ticket_str); return FALSE; } if (id_lh == NULL) { crm_config_err("Invalid constraint '%s': No resource specified", id); return FALSE; } else { rsc_lh = pe_find_constraint_resource(data_set->resources, id_lh); } if (rsc_lh == NULL) { crm_config_err("Invalid constraint '%s': No resource named '%s'", id, id_lh); return FALSE; } else if (instance_lh && pe_rsc_is_clone(rsc_lh) == FALSE) { crm_config_err ("Invalid constraint '%s': Resource '%s' is not a clone but instance %s was requested", id, id_lh, instance_lh); return FALSE; } if (instance_lh) { rsc_lh = find_clone_instance(rsc_lh, instance_lh, data_set); if (rsc_lh == NULL) { crm_config_warn("Invalid constraint '%s': No instance '%s' of '%s'", id, instance_lh, id_lh); return FALSE; } } rsc_ticket_new(id, rsc_lh, ticket, state_lh, loss_policy, data_set); return TRUE; } static gboolean unpack_rsc_ticket_tags(xmlNode * xml_obj, xmlNode ** expanded_xml, pe_working_set_t * data_set) { const char *id = NULL; const char *id_lh = NULL; const char *state_lh = NULL; resource_t *rsc_lh = NULL; tag_t *tag_lh = NULL; xmlNode *new_xml = NULL; xmlNode *rsc_set_lh = NULL; gboolean any_sets = FALSE; *expanded_xml = NULL; if (xml_obj == NULL) { crm_config_err("No constraint object to process."); return FALSE; } id = crm_element_value(xml_obj, XML_ATTR_ID); if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } /* Attempt to expand any template/tag references in possible resource sets. */ expand_tags_in_sets(xml_obj, &new_xml, data_set); if (new_xml) { /* There are resource sets referencing templates/tags. Return with the expanded XML. 
*/ crm_log_xml_trace(new_xml, "Expanded rsc_ticket..."); *expanded_xml = new_xml; return TRUE; } id_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE); if (id_lh == NULL) { return TRUE; } if (valid_resource_or_tag(data_set, id_lh, &rsc_lh, &tag_lh) == FALSE) { crm_config_err("Constraint '%s': Invalid reference to '%s'", id, id_lh); return FALSE; } else if (rsc_lh) { /* No template/tag is referenced. */ return TRUE; } state_lh = crm_element_value(xml_obj, XML_COLOC_ATTR_SOURCE_ROLE); new_xml = copy_xml(xml_obj); /* Convert the template/tag reference in "rsc" into a resource_set under the rsc_ticket constraint. */ if (tag_to_set(new_xml, &rsc_set_lh, XML_COLOC_ATTR_SOURCE, FALSE, data_set) == FALSE) { free_xml(new_xml); return FALSE; } if (rsc_set_lh) { if (state_lh) { /* A "rsc-role" is specified. Move it into the converted resource_set as a "role" attribute. */ crm_xml_add(rsc_set_lh, "role", state_lh); xml_remove_prop(new_xml, XML_COLOC_ATTR_SOURCE_ROLE); } any_sets = TRUE; } if (any_sets) { crm_log_xml_trace(new_xml, "Expanded rsc_ticket..."); *expanded_xml = new_xml; } else { free_xml(new_xml); } return TRUE; } gboolean unpack_rsc_ticket(xmlNode * xml_obj, pe_working_set_t * data_set) { xmlNode *set = NULL; gboolean any_sets = FALSE; const char *id = crm_element_value(xml_obj, XML_ATTR_ID); const char *ticket_str = crm_element_value(xml_obj, XML_TICKET_ATTR_TICKET); const char *loss_policy = crm_element_value(xml_obj, XML_TICKET_ATTR_LOSS_POLICY); ticket_t *ticket = NULL; xmlNode *orig_xml = NULL; xmlNode *expanded_xml = NULL; gboolean rc = TRUE; if (xml_obj == NULL) { crm_config_err("No rsc_ticket constraint object to process."); return FALSE; } if (id == NULL) { crm_config_err("%s constraint must have an id", crm_element_name(xml_obj)); return FALSE; } if (data_set->tickets == NULL) { data_set->tickets = g_hash_table_new_full(crm_str_hash, g_str_equal, g_hash_destroy_str, destroy_ticket); } if (ticket_str == NULL) { crm_config_err("Invalid constraint 
'%s': No ticket specified", id); return FALSE; } else { ticket = g_hash_table_lookup(data_set->tickets, ticket_str); } if (ticket == NULL) { ticket = ticket_new(ticket_str, data_set); if (ticket == NULL) { return FALSE; } } rc = unpack_rsc_ticket_tags(xml_obj, &expanded_xml, data_set); if (expanded_xml) { orig_xml = xml_obj; xml_obj = expanded_xml; } else if (rc == FALSE) { return FALSE; } for (set = __xml_first_child(xml_obj); set != NULL; set = __xml_next_element(set)) { if (crm_str_eq((const char *)set->name, XML_CONS_TAG_RSC_SET, TRUE)) { any_sets = TRUE; set = expand_idref(set, data_set->input); if (unpack_rsc_ticket_set(set, ticket, loss_policy, data_set) == FALSE) { return FALSE; } } } if (expanded_xml) { free_xml(expanded_xml); xml_obj = orig_xml; } if (any_sets == FALSE) { return unpack_simple_rsc_ticket(xml_obj, data_set); } return TRUE; } gboolean is_active(rsc_to_node_t * cons) { return TRUE; } diff --git a/pengine/test10/clone-colocate-instance-1.xml b/pengine/test10/clone-colocate-instance-1.xml index 590c306443..ec9f45d891 100644 --- a/pengine/test10/clone-colocate-instance-1.xml +++ b/pengine/test10/clone-colocate-instance-1.xml @@ -1,45 +1,42 @@ - - - + diff --git a/tools/cib_shadow.c b/tools/cib_shadow.c index 99c2afa088..fd2cfdc96a 100644 --- a/tools/cib_shadow.c +++ b/tools/cib_shadow.c @@ -1,458 +1,458 @@ /* * Copyright (C) 2004 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
* * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include static int command_options = cib_sync_call; static cib_t *real_cib = NULL; static int force_flag = 0; static int batch_flag = 0; static char * get_shadow_prompt(const char *name) { return crm_strdup_printf("shadow[%.40s] # ", name); } static void shadow_setup(char *name, gboolean do_switch) { const char *prompt = getenv("PS1"); const char *shell = getenv("SHELL"); char *new_prompt = get_shadow_prompt(name); printf("Setting up shadow instance\n"); if (safe_str_eq(new_prompt, prompt)) { /* nothing to do */ goto done; } else if (batch_flag == FALSE && shell != NULL) { setenv("PS1", new_prompt, 1); setenv("CIB_shadow", name, 1); printf("Type Ctrl-D to exit the crm_shadow shell\n"); if (strstr(shell, "bash")) { execl(shell, shell, "--norc", "--noprofile", NULL); } else { execl(shell, shell, NULL); } } else if (do_switch) { printf("To switch to the named shadow instance, paste the following into your shell:\n"); } else { printf ("A new shadow instance was created. 
To begin using it paste the following into your shell:\n"); } printf(" CIB_shadow=%s ; export CIB_shadow\n", name); done: free(new_prompt); } static void shadow_teardown(char *name) { const char *prompt = getenv("PS1"); char *our_prompt = get_shadow_prompt(name); if (prompt != NULL && strstr(prompt, our_prompt)) { printf("Now type Ctrl-D to exit the crm_shadow shell\n"); } else { printf ("Please remember to unset the CIB_shadow variable by pasting the following into your shell:\n"); printf(" unset CIB_shadow\n"); } free(our_prompt); } /* *INDENT-OFF* */ static struct crm_option long_options[] = { /* Top-level Options */ {"help", 0, 0, '?', "\t\tThis text"}, {"version", 0, 0, '$', "\t\tVersion information" }, {"verbose", 0, 0, 'V', "\t\tIncrease debug output"}, {"-spacer-", 1, 0, '-', "\nQueries:"}, {"which", no_argument, NULL, 'w', "\t\tIndicate the active shadow copy"}, {"display", no_argument, NULL, 'p', "\t\tDisplay the contents of the active shadow copy"}, {"edit", no_argument, NULL, 'E', "\t\tEdit the contents of the active shadow copy with your favorite $EDITOR"}, {"diff", no_argument, NULL, 'd', "\t\tDisplay the changes in the active shadow copy\n"}, {"file", no_argument, NULL, 'F', "\t\tDisplay the location of the active shadow copy file\n"}, {"-spacer-", 1, 0, '-', "\nCommands:"}, {"create", required_argument, NULL, 'c', "\tCreate the named shadow copy of the active cluster configuration"}, {"create-empty", required_argument, NULL, 'e', "Create the named shadow copy with an empty cluster configuration. 
Optional: --validate-with"}, {"commit", required_argument, NULL, 'C', "\tUpload the contents of the named shadow copy to the cluster"}, {"delete", required_argument, NULL, 'D', "\tDelete the contents of the named shadow copy"}, {"reset", required_argument, NULL, 'r', "\tRecreate the named shadow copy from the active cluster configuration"}, {"switch", required_argument, NULL, 's', "\t(Advanced) Switch to the named shadow copy"}, {"-spacer-", 1, 0, '-', "\nAdditional Options:"}, {"force", no_argument, NULL, 'f', "\t\t(Advanced) Force the action to be performed"}, {"batch", no_argument, NULL, 'b', "\t\t(Advanced) Don't spawn a new shell" }, {"all", no_argument, NULL, 'a', "\t\t(Advanced) Upload the entire CIB, including status, with --commit" }, {"validate-with", required_argument, NULL, 'v', "(Advanced) Create an older configuration version" }, {"-spacer-", 1, 0, '-', "\nExamples:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', "Create a blank shadow configuration:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_shadow --create-empty myShadow", pcmk_option_example}, {"-spacer-", 1, 0, '-', "Create a shadow configuration from the running cluster:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_shadow --create myShadow", pcmk_option_example}, {"-spacer-", 1, 0, '-', "Display the current shadow configuration:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_shadow --display", pcmk_option_example}, {"-spacer-", 1, 0, '-', "Discard the current shadow configuration (named myShadow):", pcmk_option_paragraph}, - {"-spacer-", 1, 0, '-', " crm_shadow --delete myShadow", pcmk_option_example}, + {"-spacer-", 1, 0, '-', " crm_shadow --delete myShadow --force", pcmk_option_example}, {"-spacer-", 1, 0, '-', "Upload the current shadow configuration (named myShadow) to the running cluster:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_shadow --commit myShadow", pcmk_option_example}, {0, 0, 0, 0} }; /* *INDENT-ON* */ int main(int argc, char **argv) 
{ int rc = pcmk_ok; int flag; int argerr = 0; crm_exit_t exit_code = CRM_EX_OK; static int command = '?'; const char *validation = NULL; char *shadow = NULL; char *shadow_file = NULL; gboolean full_upload = FALSE; gboolean dangerous_cmd = FALSE; struct stat buf; int option_index = 0; crm_log_cli_init("crm_shadow"); crm_set_options(NULL, "(query|command) [modifiers]", long_options, "Perform configuration changes in a sandbox before updating the live cluster." "\n\nSets up an environment in which configuration tools (cibadmin, crm_resource, etc) work" " offline instead of against a live cluster, allowing changes to be previewed and tested" " for side-effects.\n"); if (argc < 2) { crm_help('?', CRM_EX_USAGE); } while (1) { flag = crm_get_option(argc, argv, &option_index); if (flag == -1 || flag == 0) break; switch (flag) { case 'a': full_upload = TRUE; break; case 'd': case 'E': case 'p': case 'w': case 'F': command = flag; free(shadow); shadow = NULL; { const char *env = getenv("CIB_shadow"); if(env) { shadow = strdup(env); } else { fprintf(stderr, "No active shadow configuration defined\n"); crm_exit(CRM_EX_NOSUCH); } } break; case 'v': validation = optarg; break; case 'e': case 'c': case 's': case 'r': command = flag; free(shadow); shadow = strdup(optarg); break; case 'C': case 'D': command = flag; dangerous_cmd = TRUE; free(shadow); shadow = strdup(optarg); break; case 'V': command_options = command_options | cib_verbose; crm_bump_log_level(argc, argv); break; case '$': case '?': crm_help(flag, CRM_EX_OK); break; case 'f': command_options |= cib_quorum_override; force_flag = 1; break; case 'b': batch_flag = 1; break; default: printf("Argument code 0%o (%c)" " is not (?yet?) 
supported\n", flag, flag); ++argerr; break; } } if (optind < argc) { printf("non-option ARGV-elements: "); while (optind < argc) printf("%s ", argv[optind++]); printf("\n"); crm_help('?', CRM_EX_USAGE); } if (optind > argc) { ++argerr; } if (argerr) { crm_help('?', CRM_EX_USAGE); } if (command == 'w') { /* which shadow instance is active? */ const char *local = getenv("CIB_shadow"); if (local == NULL) { fprintf(stderr, "No shadow instance provided\n"); exit_code = CRM_EX_NOSUCH; } else { fprintf(stdout, "%s\n", local); } goto done; } if (shadow == NULL) { fprintf(stderr, "No shadow instance provided\n"); fflush(stderr); exit_code = CRM_EX_NOSUCH; goto done; } else if (command != 's' && command != 'c') { const char *local = getenv("CIB_shadow"); if (local != NULL && safe_str_neq(local, shadow) && force_flag == FALSE) { fprintf(stderr, "The supplied shadow instance (%s) is not the same as the active one (%s).\n" " To prevent accidental destruction of the cluster," " the --force flag is required in order to proceed.\n", shadow, local); fflush(stderr); exit_code = CRM_EX_USAGE; goto done; } } if (dangerous_cmd && force_flag == FALSE) { fprintf(stderr, "The supplied command is considered dangerous." 
" To prevent accidental destruction of the cluster," " the --force flag is required in order to proceed.\n"); fflush(stderr); exit_code = CRM_EX_USAGE; goto done; } shadow_file = get_shadow_file(shadow); if (command == 'D') { /* delete the file */ if ((unlink(shadow_file) < 0) && (errno != ENOENT)) { exit_code = crm_errno2exit(errno); fprintf(stderr, "Could not remove shadow instance '%s': %s\n", shadow, strerror(errno)); } shadow_teardown(shadow); goto done; } else if (command == 'F') { printf("%s\n", shadow_file); goto done; } if (command == 'd' || command == 'r' || command == 'c' || command == 'C') { real_cib = cib_new_no_shadow(); rc = real_cib->cmds->signon(real_cib, crm_system_name, cib_command); if (rc != pcmk_ok) { fprintf(stderr, "Signon to CIB failed: %s\n", pcmk_strerror(rc)); exit_code = crm_errno2exit(rc); goto done; } } // File existence check rc = stat(shadow_file, &buf); if (command == 'e' || command == 'c') { if (rc == 0 && force_flag == FALSE) { fprintf(stderr, "A shadow instance '%s' already exists.\n" " To prevent accidental destruction of the cluster," " the --force flag is required in order to proceed.\n", shadow); exit_code = CRM_EX_CANTCREAT; goto done; } } else if (rc < 0) { fprintf(stderr, "Could not access shadow instance '%s': %s\n", shadow, strerror(errno)); exit_code = CRM_EX_NOSUCH; goto done; } if (command == 'c' || command == 'e' || command == 'r') { xmlNode *output = NULL; /* create a shadow instance based on the current cluster config */ if (command == 'c' || command == 'r') { rc = real_cib->cmds->query(real_cib, NULL, &output, command_options); if (rc != pcmk_ok) { fprintf(stderr, "Could not connect to the CIB: %s\n", pcmk_strerror(rc)); exit_code = crm_errno2exit(rc); goto done; } } else { output = createEmptyCib(0); if(validation) { crm_xml_add(output, XML_ATTR_VALIDATION, validation); } printf("Created new %s configuration\n", crm_element_value(output, XML_ATTR_VALIDATION)); } rc = write_xml_file(output, shadow_file, FALSE); 
free_xml(output); if (rc < 0) { fprintf(stderr, "Could not %s the shadow instance '%s': %s\n", command == 'r' ? "reset" : "create", shadow, pcmk_strerror(rc)); exit_code = crm_errno2exit(rc); goto done; } shadow_setup(shadow, FALSE); } else if (command == 'E') { char *editor = getenv("EDITOR"); if (editor == NULL) { fprintf(stderr, "No value for EDITOR defined\n"); exit_code = CRM_EX_NOT_CONFIGURED; goto done; } execlp(editor, "--", shadow_file, NULL); fprintf(stderr, "Could not invoke EDITOR (%s %s): %s\n", editor, shadow_file, strerror(errno)); exit_code = CRM_EX_OSFILE; goto done; } else if (command == 's') { shadow_setup(shadow, TRUE); goto done; } else if (command == 'p') { /* display the current contents */ char *output_s = NULL; xmlNode *output = filename2xml(shadow_file); output_s = dump_xml_formatted(output); printf("%s", output_s); free(output_s); free_xml(output); } else if (command == 'd') { /* diff against cluster */ xmlNode *diff = NULL; xmlNode *old_config = NULL; xmlNode *new_config = filename2xml(shadow_file); rc = real_cib->cmds->query(real_cib, NULL, &old_config, command_options); if (rc != pcmk_ok) { fprintf(stderr, "Could not query the CIB: %s\n", pcmk_strerror(rc)); exit_code = crm_errno2exit(rc); goto done; } xml_track_changes(new_config, NULL, new_config, FALSE); xml_calculate_changes(old_config, new_config); diff = xml_create_patchset(0, old_config, new_config, NULL, FALSE); xml_log_changes(LOG_INFO, __FUNCTION__, new_config); xml_accept_changes(new_config); if (diff != NULL) { xml_log_patchset(0, " ", diff); exit_code = CRM_EX_ERROR; } goto done; } else if (command == 'C') { /* commit to the cluster */ xmlNode *input = filename2xml(shadow_file); xmlNode *section_xml = input; const char *section = NULL; if (!full_upload) { section = XML_CIB_TAG_CONFIGURATION; section_xml = first_named_child(input, section); } rc = real_cib->cmds->replace(real_cib, section, section_xml, command_options); if (rc != pcmk_ok) { fprintf(stderr, "Could not commit 
shadow instance '%s' to the CIB: %s\n", shadow, pcmk_strerror(rc)); exit_code = crm_errno2exit(rc); } shadow_teardown(shadow); free_xml(input); } done: free(shadow_file); free(shadow); return crm_exit(exit_code); } diff --git a/tools/crm_failcount b/tools/crm_failcount index 5cf7e01935..d20bf09a59 100755 --- a/tools/crm_failcount +++ b/tools/crm_failcount @@ -1,269 +1,269 @@ #!/bin/bash USAGE_TEXT="Usage: crm_failcount [] Common options: --help Display this text, then exit --version Display version information, then exit -V, --verbose Specify multiple times to increase debug output -q, --quiet Print only the value (if querying) Commands: -G, --query Query the current value of the resource's fail count -D, --delete Delete resource's recorded failures Additional Options: -r, --resource=value Name of the resource to use (required) -n, --operation=value Name of operation to use (instead of all operations) - -I, --interval=value If operation is specified, its interval (MUST be in milliseconds) + -I, --interval=value If operation is specified, its interval -N, --node=value Use failcount on named node (instead of local node)" HELP_TEXT="crm_failcount - Query or delete resource fail counts $USAGE_TEXT" exit_usage() { if [ $# -gt 0 ]; then echo "error: $@" >&2 fi echo echo "$USAGE_TEXT" exit 1 } warn() { echo "warning: $@" >&2 } interval_re() { echo "^[[:blank:]]*([0-9]+)[[:blank:]]*(${1})[[:blank:]]*$" } # This function should follow crm_get_interval() as closely as possible parse_interval() { INT_S="$1" INT_8601RE="^P(([0-9]+)Y)?(([0-9]+)M)?(([0-9]+)D)?T?(([0-9]+)H)?(([0-9]+)M)?(([0-9]+)S)?$" if [[ $INT_S =~ $(interval_re "s|sec|") ]]; then echo $(( ${BASH_REMATCH[1]} * 1000 )) elif [[ $INT_S =~ $(interval_re "ms|msec") ]]; then echo "${BASH_REMATCH[1]}" elif [[ $INT_S =~ $(interval_re "m|min") ]]; then echo $(( ${BASH_REMATCH[1]} * 60000 )) elif [[ $INT_S =~ $(interval_re "h|hr") ]]; then echo $(( ${BASH_REMATCH[1]} * 3600000 )) elif [[ $INT_S =~ $(interval_re "us|usec") 
]]; then echo $(( ${BASH_REMATCH[1]} / 1000 )) elif [[ $INT_S =~ ^P([0-9]+)W$ ]]; then echo $(( ${BASH_REMATCH[1]} * 604800000 )) elif [[ $INT_S =~ $INT_8601RE ]]; then echo $(( ( ${BASH_REMATCH[2]:-0} * 31536000000 ) \ + ( ${BASH_REMATCH[4]:-0} * 2592000000 ) \ + ( ${BASH_REMATCH[6]:-0} * 86400000 ) \ + ( ${BASH_REMATCH[8]:-0} * 3600000 ) \ + ( ${BASH_REMATCH[10]:-0} * 60000 ) \ + ( ${BASH_REMATCH[12]:-0} * 1000 ) )) else warn "Unrecognized interval, using 0" echo "0" fi } query_single_attr() { QSR_TARGET="$1" QSR_ATTR="$2" crm_attribute $VERBOSE --quiet --query -t status -d 0 \ -N "$QSR_TARGET" -n "$QSR_ATTR" } query_attr_sum() { QAS_TARGET="$1" QAS_PREFIX="$2" # Build xpath to match all transient node attributes with prefix QAS_XPATH="/cib/status/node_state[@uname='${QAS_TARGET}']" QAS_XPATH="${QAS_XPATH}/transient_attributes/instance_attributes" QAS_XPATH="${QAS_XPATH}/nvpair[starts-with(@name,'$QAS_PREFIX')]" # Query attributes that match xpath # @TODO We ignore stderr because we don't want "no results" to look # like an error, but that also makes $VERBOSE pointless. QAS_ALL=$(cibadmin --query --sync-call --local \ --xpath="$QAS_XPATH" 2>/dev/null) # @TODO There is currently no reliable way to distinguish "no results" # from actual CIB errors. For now, treat any error as "no results". # #if [ $? 
-ne 0 ]; then # echo error >&2 # return #fi # Extract the attribute values (one per line) from the output QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \ 's/.*.*/\1/p') # Sum the values QAS_SUM=0 for i in 0 $QAS_VALUE; do QAS_SUM=$(($QAS_SUM + $i)) done echo $QAS_SUM } query_failcount() { QF_TARGET="$1" QF_RESOURCE="$2" QF_OPERATION="$3" QF_INTERVAL="$4" QF_ATTR_RSC="fail-count-${QF_RESOURCE}" if [ -n "$QF_OPERATION" ]; then QF_ATTR_DISPLAY="${QF_ATTR_RSC}#${QF_OPERATION}_${QF_INTERVAL}" QF_COUNT=$(query_single_attr "$QF_TARGET" "$QF_ATTR_DISPLAY") else QF_ATTR_DISPLAY="$QF_ATTR_RSC" QF_COUNT=$(query_attr_sum "$QF_TARGET" "${QF_ATTR_RSC}#") fi # @COMPAT attributes set < 1.1.17: # If we didn't find any per-operation failcount, # check whether there is a legacy per-resource failcount. if [ "$QF_COUNT" = "0" ]; then QF_COUNT=$(query_single_attr "$QF_TARGET" "$QF_ATTR_RSC") if [ "$QF_COUNT" != "0" ]; then QF_ATTR_DISPLAY="$QF_ATTR_RSC" fi fi # Echo result (comparable to crm_attribute, for backward compatibility) if [ -n "$QUIET" ]; then echo $QF_COUNT else echo "scope=status name=$QF_ATTR_DISPLAY value=$QF_COUNT" fi } clear_failcount() { CF_TARGET="$1" CF_RESOURCE="$2" CF_OPERATION="$3" CF_INTERVAL="$4" if [ -n "$CF_OPERATION" ]; then CF_OPERATION="-n $CF_OPERATION -I ${CF_INTERVAL}ms" fi crm_resource $QUIET $VERBOSE --cleanup \ -N "$CF_TARGET" -r "$CF_RESOURCE" $CF_OPERATION } QUIET="" VERBOSE="" command="" resource="" operation="" interval="0" target=$(crm_node -n 2>/dev/null) SHORTOPTS="qDGQVN:U:v:i:l:r:n:I:" LONGOPTS_COMMON="help,version,verbose,quiet" LONGOPTS_COMMANDS="query,delete" LONGOPTS_OTHER="resource:,node:,operation:,interval:" LONGOPTS_COMPAT="delete-attr,get-value,resource-id:,uname:,lifetime:,attr-value:,attr-id:" LONGOPTS="$LONGOPTS_COMMON,$LONGOPTS_COMMANDS,$LONGOPTS_OTHER,$LONGOPTS_COMPAT" TEMP=$(getopt -o $SHORTOPTS --long $LONGOPTS -n crm_failcount -- "$@") if [ $? 
-ne 0 ]; then exit_usage fi eval set -- "$TEMP" # Quotes around $TEMP are essential while true ; do case "$1" in --help) echo "$HELP_TEXT" exit 0 ;; --version) crm_attribute --version exit $? ;; -q|-Q|--quiet) QUIET="--quiet" shift ;; -V|--verbose) VERBOSE="$VERBOSE $1" shift ;; -G|--query|--get-value) command="--query" shift ;; -D|--delete|--delete-attr) command="--delete" shift ;; -r|--resource|--resource-id) resource="$2" shift 2 ;; -n|--operation) operation="$2" shift 2 ;; -I|--interval) interval="$2" shift 2 ;; -N|--node|-U|--uname) target="$2" shift 2 ;; -v|--attr-value) if [ "$2" = "0" ]; then command="--delete" else warn "ignoring deprecated option '$1' with nonzero value" fi shift 2 ;; -i|--attr-id|-l|--lifetime) warn "ignoring deprecated option '$1'" shift 2 ;; --) shift break ;; *) exit_usage "unknown option '$1'" ;; esac done [ -n "$command" ] || exit_usage "must specify a command" [ -n "$resource" ] || exit_usage "resource name required" [ -n "$target" ] || exit_usage "node name required" interval=$(parse_interval $interval) if [ "$command" = "--query" ]; then query_failcount "$target" "$resource" "$operation" "$interval" else clear_failcount "$target" "$resource" "$operation" "$interval" fi diff --git a/tools/crm_verify.c b/tools/crm_verify.c index f90f605147..9eea07019e 100644 --- a/tools/crm_verify.c +++ b/tools/crm_verify.c @@ -1,274 +1,276 @@ /* * Copyright (C) 2004 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
* * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include gboolean USE_LIVE_CIB = FALSE; char *cib_save = NULL; extern gboolean stage0(pe_working_set_t * data_set); extern void cleanup_alloc_calculations(pe_working_set_t * data_set); extern xmlNode *do_calculations(pe_working_set_t * data_set, xmlNode * xml_input, crm_time_t * now); /* *INDENT-OFF* */ static struct crm_option long_options[] = { /* Top-level Options */ {"help", 0, 0, '?', "\tThis text"}, {"version", 0, 0, '$', "\tVersion information" }, {"verbose", 0, 0, 'V', "\tIncrease debug output\n"}, {"-spacer-", 1, 0, '-', "\nData sources:"}, {"live-check", 0, 0, 'L', "Check the configuration used by the running cluster\n"}, {"xml-file", 1, 0, 'x', "Check the configuration in the named file"}, {"xml-text", 1, 0, 'X', "Check the configuration in the supplied string"}, {"xml-pipe", 0, 0, 'p', "Check the configuration piped in via stdin"}, {"-spacer-", 1, 0, '-', "\nAdditional Options:"}, {"save-xml", 1, 0, 'S', "Save the verified XML to the named file. 
Most useful with -L"}, {"-spacer-", 1, 0, '-', "\nExamples:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', "Check the consistency of the configuration in the running cluster:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_verify --live-check", pcmk_option_example}, {"-spacer-", 1, 0, '-', "Check the consistency of the configuration in a given file and produce verbose output:", pcmk_option_paragraph}, {"-spacer-", 1, 0, '-', " crm_verify --xml-file file.xml --verbose", pcmk_option_example}, {0, 0, 0, 0} }; /* *INDENT-ON* */ int main(int argc, char **argv) { xmlNode *cib_object = NULL; xmlNode *status = NULL; int argerr = 0; int flag; int option_index = 0; pe_working_set_t data_set; cib_t *cib_conn = NULL; int rc = pcmk_ok; bool verbose = FALSE; gboolean xml_stdin = FALSE; const char *xml_tag = NULL; const char *xml_file = NULL; const char *xml_string = NULL; crm_log_cli_init("crm_verify"); crm_set_options(NULL, "[modifiers] data_source", long_options, - "\n\nChecks the well-formedness of an XML configuration, its conformance to the configured schema and for the presence of common misconfigurations." - "\n\nIt reports two classes of problems, errors and warnings." - " Errors must be fixed before the cluster will work properly." - " However, it is left up to the administrator to decide if the warnings should also be fixed."); + "check a Pacemaker configuration for errors" + "\n\nCheck the well-formedness of a complete Pacemaker XML configuration," + "\nits conformance to the configured schema, and the presence of common" + "\nmisconfigurations. Problems reported as errors must be fixed before the" + "\ncluster will work properly.
It is left to the administrator to decide" + "\nwhether to fix problems reported as warnings."); while (1) { flag = crm_get_option(argc, argv, &option_index); if (flag == -1) break; switch (flag) { case 'X': crm_trace("Option %c => %s", flag, optarg); xml_string = optarg; break; case 'x': crm_trace("Option %c => %s", flag, optarg); xml_file = optarg; break; case 'p': xml_stdin = TRUE; break; case 'S': cib_save = optarg; break; case 'V': verbose = TRUE; crm_bump_log_level(argc, argv); break; case 'L': USE_LIVE_CIB = TRUE; break; case '$': case '?': crm_help(flag, CRM_EX_OK); break; default: fprintf(stderr, "Option -%c is not yet supported\n", flag); ++argerr; break; } } if (optind < argc) { printf("non-option ARGV-elements: "); while (optind < argc) { printf("%s ", argv[optind++]); } printf("\n"); } if (optind > argc) { ++argerr; } if (argerr) { crm_err("%d errors in option parsing", argerr); crm_help(flag, CRM_EX_USAGE); } crm_info("=#=#=#=#= Getting XML =#=#=#=#="); if (USE_LIVE_CIB) { cib_conn = cib_new(); rc = cib_conn->cmds->signon(cib_conn, crm_system_name, cib_command); } if (USE_LIVE_CIB) { if (rc == pcmk_ok) { int options = cib_scope_local | cib_sync_call; crm_info("Reading XML from: live cluster"); rc = cib_conn->cmds->query(cib_conn, NULL, &cib_object, options); } if (rc != pcmk_ok) { fprintf(stderr, "Live CIB query failed: %s\n", pcmk_strerror(rc)); goto done; } if (cib_object == NULL) { fprintf(stderr, "Live CIB query failed: empty result\n"); rc = -ENOMSG; goto done; } } else if (xml_file != NULL) { cib_object = filename2xml(xml_file); if (cib_object == NULL) { fprintf(stderr, "Couldn't parse input file: %s\n", xml_file); rc = -ENODATA; goto done; } } else if (xml_string != NULL) { cib_object = string2xml(xml_string); if (cib_object == NULL) { fprintf(stderr, "Couldn't parse input string: %s\n", xml_string); rc = -ENODATA; goto done; } } else if (xml_stdin) { cib_object = stdin2xml(); if (cib_object == NULL) { fprintf(stderr, "Couldn't parse input
from STDIN.\n"); rc = -ENODATA; goto done; } } else { fprintf(stderr, "No configuration source specified." " Use --help for usage information.\n"); rc = -ENODATA; goto done; } xml_tag = crm_element_name(cib_object); if (safe_str_neq(xml_tag, XML_TAG_CIB)) { fprintf(stderr, "This tool can only check complete configurations (i.e. those starting with <cib>).\n"); rc = -EBADMSG; goto done; } if (cib_save != NULL) { write_xml_file(cib_object, cib_save, FALSE); } status = get_object_root(XML_CIB_TAG_STATUS, cib_object); if (status == NULL) { create_xml_node(cib_object, XML_CIB_TAG_STATUS); } if (validate_xml(cib_object, NULL, FALSE) == FALSE) { crm_config_err("CIB did not pass schema validation"); free_xml(cib_object); cib_object = NULL; } else if (cli_config_update(&cib_object, NULL, FALSE) == FALSE) { crm_config_error = TRUE; free_xml(cib_object); cib_object = NULL; fprintf(stderr, "The cluster will NOT be able to use this configuration.\n"); fprintf(stderr, "Please manually update the configuration to conform to the %s syntax.\n", xml_latest_schema()); } set_working_set_defaults(&data_set); if (cib_object == NULL) { } else if (status != NULL || USE_LIVE_CIB) { /* live queries will always have a status section and can do a full simulation */ do_calculations(&data_set, cib_object, NULL); cleanup_alloc_calculations(&data_set); } else { data_set.now = crm_time_new(NULL); data_set.input = cib_object; stage0(&data_set); cleanup_alloc_calculations(&data_set); } if (crm_config_error) { fprintf(stderr, "Errors found during check: config not valid\n"); if (verbose == FALSE) { fprintf(stderr, " -V may provide more details\n"); } rc = -pcmk_err_generic; } else if (crm_config_warning) { fprintf(stderr, "Warnings found during check: config may not be valid\n"); if (verbose == FALSE) { fprintf(stderr, " Use -V for more details\n"); } rc = -pcmk_err_generic; } if (USE_LIVE_CIB && cib_conn) { cib_conn->cmds->signoff(cib_conn); cib_delete(cib_conn); } done: return rc; }
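Note on the tools/crm_failcount hunk: the help text for -I/--interval no longer says "MUST be in milliseconds" because the script's parse_interval() already normalizes the value to milliseconds before it is used (clear_failcount appends "ms" itself). The unit handling can be sketched roughly as below. This is a simplified illustration, not the script's code: `to_ms` is a hypothetical name, it uses POSIX parameter expansion instead of the script's bash regexes, and it skips surrounding whitespace and the ISO 8601 forms (P1W, PT5M, ...) that parse_interval() also accepts.

```shell
#!/bin/sh
# Hypothetical helper: convert "<number><unit>" to milliseconds.
# A bare number is treated as seconds, as in parse_interval().
to_ms() {
    value=${1%%[!0-9]*}      # leading run of digits
    unit=${1#"$value"}       # whatever follows the digits

    # No leading digits at all: unrecognized, fall back to 0
    if [ -z "$value" ]; then
        echo 0
        return
    fi

    case "$unit" in
        ""|s|sec) echo $(( value * 1000 )) ;;
        ms|msec)  echo "$value" ;;
        m|min)    echo $(( value * 60000 )) ;;
        h|hr)     echo $(( value * 3600000 )) ;;
        us|usec)  echo $(( value / 1000 )) ;;   # integer division
        *)        echo 0 ;;                     # unknown unit
    esac
}
```

With this sketch, `to_ms 5min` prints 300000 and `to_ms 250ms` prints 250, which are the values that would end up in the per-operation fail-count attribute name or in the `-I ...ms` argument passed to crm_resource.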