diff --git a/doc/sphinx/Pacemaker_Explained/fencing.rst b/doc/sphinx/Pacemaker_Explained/fencing.rst index d9b8f21d72..df928b5dbc 100644 --- a/doc/sphinx/Pacemaker_Explained/fencing.rst +++ b/doc/sphinx/Pacemaker_Explained/fencing.rst @@ -1,1026 +1,1170 @@ +.. index:: + single: fencing + single: STONITH + +.. _fencing: + Fencing ------- -.. Convert_to_RST: - - anchor:ch-fencing[Chapter 6, Fencing] - indexterm:[Fencing, Configuration] - indexterm:[STONITH, Configuration] - - == What Is Fencing? == - - 'Fencing' is the ability to make a node unable to run resources, even when that - node is unresponsive to cluster commands. - - Fencing is also known as 'STONITH', an acronym for "Shoot The Other Node In The - Head", since the most common fencing method is cutting power to the node. - Another method is "fabric fencing", cutting the node's access to some - capability required to run resources (such as network access or a shared disk). - - == Why Is Fencing Necessary? == - - Fencing protects your data from being corrupted by malfunctioning nodes or - unintentional concurrent access to shared resources. - - Fencing protects against the "split brain" failure scenario, where cluster - nodes have lost the ability to reliably communicate with each other but are - still able to run resources. If the cluster just assumed that uncommunicative - nodes were down, then multiple instances of a resource could be started on - different nodes. - - The effect of split brain depends on the resource type. For example, an IP - address brought up on two hosts on a network will cause packets to randomly be - sent to one or the other host, rendering the IP useless. For a database or - clustered file system, the effect could be much more severe, causing data - corruption or divergence. - - Fencing also is used when a resource cannot otherwise be stopped. If a failed - resource fails to stop, it cannot be recovered elsewhere. Fencing the - resource's node is the only way to ensure the resource is recoverable. - - Users may also configure the +on-fail+ property of any resource operation to - +fencing+, in which case the cluster will fence the resource's node if the - operation fails. - - == Fence Devices == - - A 'fence device' (or 'fencing device') is a special type of resource that - provides the means to fence a node. - - Examples of fencing devices include intelligent power switches and IPMI devices - that accept SNMP commands to cut power to a node, and iSCSI controllers that - allow SCSI reservations to be used to cut a node's access to a shared disk. - - Since fencing devices will be used to recover from loss of networking - connectivity to other nodes, it is essential that they do not rely on the same - network as the cluster itself, otherwise that network becomes a single point of - failure. - - Since loss of a node due to power outage is indistinguishable from loss of - network connectivity to that node, it is also essential that at least one fence - device for a node does not share power with that node. For example, an on-board - IPMI controller that shares power with its host should not be used as the sole - fencing device for that host. - - Since fencing is used to isolate malfunctioning nodes, no fence device should - rely on its target functioning properly. This includes, for example, devices - that ssh into a node and issue a shutdown command (such devices might be - suitable for testing, but never for production). 
- - == Fence Agents == - - A 'fence agent' (or 'fencing agent') is a +stonith+-class resource agent. - - The fence agent standard provides commands (such as +off+ and +reboot+) that - the cluster can use to fence nodes. As with other resource agent classes, - this allows a layer of abstraction so that Pacemaker doesn't need any knowledge - about specific fencing technologies -- that knowledge is isolated in the agent. - - == When a Fence Device Can Be Used == - - Fencing devices do not actually "run" like most services. Typically, they just - provide an interface for sending commands to an external device. - - Additionally, fencing may be initiated by Pacemaker, by other cluster-aware software - such as DRBD or DLM, or manually by an administrator, at any point in the - cluster life cycle, including before any resources have been started. - - To accommodate this, Pacemaker does not require the fence device resource to be - "started" in order to be used. Whether a fence device is started or not - determines whether a node runs any recurring monitor for the device, and gives - the node a slight preference for being chosen to execute fencing using that - device. - - By default, any node can execute any fencing device. If a fence device is - disabled by setting its +target-role+ to Stopped, then no node can use that - device. If mandatory location constraints prevent a specific node from - "running" a fence device, then that node will never be chosen to execute - fencing using the device. A node may fence itself, but the cluster will choose - that only if no other nodes can do the fencing. - - A common configuration scenario is to have one fence device per target node. - In such a case, users often configure anti-location constraints so that - the target node does not monitor its own device. The best practice is to make - the constraint optional (i.e. a finite negative score rather than +-INFINITY+), - so that the node can fence itself if no other nodes can. - - == Limitations of Fencing Resources == - - Fencing resources have certain limitations that other resource classes don't: - - * They may have only one set of meta-attributes and one set of instance - attributes. - * If <> are used to determine fencing resource options, these - may only be evaluated when first read, meaning that later changes to the - rules will have no effect. Therefore, it is better to avoid confusion and not - use rules at all with fencing resources. - - These limitations could be revisited if there is significant user demand. - - == Special Options for Fencing Resources == - - The table below lists special instance attributes that may be set for any - fencing resource ('not' meta-attributes, even though they are interpreted by - pacemaker rather than the fence agent). These are also listed in the man page - for +pacemaker-fenced+. - - .Additional Properties of Fencing Resources - [width="95%",cols="8m,3,6,<12",options="header",align="center"] - |========================================================= - - |Field - |Type - |Default - |Description - - |stonith-timeout - |NA - |NA - a|Older versions used this to override the default period to wait for a STONITH (reboot, on, off) action to complete for this device. - It has been replaced by the +pcmk_reboot_timeout+ and +pcmk_off_timeout+ properties. - indexterm:[stonith-timeout,Fencing] - indexterm:[Fencing,Property,stonith-timeout] - - //// - (not yet implemented) - priority - integer - 0 - The priority of the STONITH resource. 
Devices are tried in order of highest priority to lowest. - indexterm priority,Fencing - indexterm Fencing,Property,priority - //// - - |provides - |string - | - |Any special capability provided by the fence device. Currently, only one such - capability is meaningful: +unfencing+ (see <>). - indexterm:[provides,Fencing] - indexterm:[Fencing,Property,provides] - - |pcmk_host_map - |string - | - |A mapping of host names to ports numbers for devices that do not support host names. - Example: +node1:1;node2:2,3+ tells the cluster to use port 1 for - *node1* and ports 2 and 3 for *node2*. If +pcmk_host_check+ is explicitly set - to +static-list+, either this or +pcmk_host_list+ must be set. - indexterm:[pcmk_host_map,Fencing] - indexterm:[Fencing,Property,pcmk_host_map] - - |pcmk_host_list - |string - | - |A list of machines controlled by this device. If +pcmk_host_check+ is - explicitly set to +static-list+, either this or +pcmk_host_map+ must be set. - indexterm:[pcmk_host_list,Fencing] - indexterm:[Fencing,Property,pcmk_host_list] - - |pcmk_host_check - |string - |A value appropriate to other configuration options and - device capabilities (see note below) - a|How to determine which machines are controlled by the device. - Allowed values: - - * +dynamic-list:+ query the device via the "list" command - * +static-list:+ check the +pcmk_host_list+ or +pcmk_host_map+ attribute - * +status:+ query the device via the "status" command - * +none:+ assume every device can fence every machine - - indexterm:[pcmk_host_check,Fencing] - indexterm:[Fencing,Property,pcmk_host_check] - - |pcmk_delay_max - |time - |0s - |Enable a random delay of up to the time specified before executing fencing - actions. This is sometimes used in two-node clusters to ensure that the - nodes don't fence each other at the same time. The overall delay introduced - by pacemaker is derived from this random delay value adding a static delay so - that the sum is kept below the maximum delay. - - indexterm:[pcmk_delay_max,Fencing] - indexterm:[Fencing,Property,pcmk_delay_max] - - |pcmk_delay_base - |time - |0s - |Enable a static delay before executing fencing actions. This can be used - e.g. in two-node clusters to ensure that the nodes don't fence each other, - by having separate fencing resources with different values. The node that is - fenced with the shorter delay will lose a fencing race. The overall delay - introduced by pacemaker is derived from this value plus a random delay such - that the sum is kept below the maximum delay. - - indexterm:[pcmk_delay_base,Fencing] - indexterm:[Fencing,Property,pcmk_delay_base] - - |pcmk_action_limit - |integer - |1 - |The maximum number of actions that can be performed in parallel on this - device, if the cluster option +concurrent-fencing+ is +true+. -1 is unlimited. - - indexterm:[pcmk_action_limit,Fencing] - indexterm:[Fencing,Property,pcmk_action_limit] - - |pcmk_host_argument - |string - |+port+ otherwise +plug+ if supported according to the metadata of the fence agent - |'Advanced use only.' Which parameter should be supplied to the fence agent to - identify the node to be fenced. Some devices support neither the standard +plug+ - nor the deprecated +port+ parameter, or may provide additional ones. Use this to - specify an alternate, device-specific parameter. A value of +none+ tells the - cluster not to supply any additional parameters. 
- indexterm:[pcmk_host_argument,Fencing] - indexterm:[Fencing,Property,pcmk_host_argument] - - |pcmk_reboot_action - |string - |reboot - |'Advanced use only.' The command to send to the resource agent in order to - reboot a node. Some devices do not support the standard commands or may provide - additional ones. Use this to specify an alternate, device-specific command. - indexterm:[pcmk_reboot_action,Fencing] - indexterm:[Fencing,Property,pcmk_reboot_action] - - |pcmk_reboot_timeout - |time - |60s - |'Advanced use only.' Specify an alternate timeout to use for `reboot` actions - instead of the value of +stonith-timeout+. Some devices need much more or less - time to complete than normal. Use this to specify an alternate, device-specific - timeout. - indexterm:[pcmk_reboot_timeout,Fencing] - indexterm:[Fencing,Property,pcmk_reboot_timeout] - indexterm:[stonith-timeout,Fencing] - indexterm:[Fencing,Property,stonith-timeout] - - |pcmk_reboot_retries - |integer - |2 - |'Advanced use only.' The maximum number of times to retry the `reboot` command - within the timeout period. Some devices do not support multiple connections, and - operations may fail if the device is busy with another task, so Pacemaker will - automatically retry the operation, if there is time remaining. Use this option - to alter the number of times Pacemaker retries before giving up. - indexterm:[pcmk_reboot_retries,Fencing] - indexterm:[Fencing,Property,pcmk_reboot_retries] - - |pcmk_off_action - |string - |off - |'Advanced use only.' The command to send to the resource agent in order to - shut down a node. Some devices do not support the standard commands or may provide - additional ones. Use this to specify an alternate, device-specific command. - indexterm:[pcmk_off_action,Fencing] - indexterm:[Fencing,Property,pcmk_off_action] - - |pcmk_off_timeout - |time - |60s - |'Advanced use only.' Specify an alternate timeout to use for `off` actions - instead of the value of +stonith-timeout+. Some devices need much more or less - time to complete than normal. Use this to specify an alternate, device-specific - timeout. - indexterm:[pcmk_off_timeout,Fencing] - indexterm:[Fencing,Property,pcmk_off_timeout] - indexterm:[stonith-timeout,Fencing] - indexterm:[Fencing,Property,stonith-timeout] - - |pcmk_off_retries - |integer - |2 - |'Advanced use only.' The maximum number of times to retry the `off` command - within the timeout period. Some devices do not support multiple connections, and - operations may fail if the device is busy with another task, so Pacemaker will - automatically retry the operation, if there is time remaining. Use this option - to alter the number of times Pacemaker retries before giving up. - indexterm:[pcmk_off_retries,Fencing] - indexterm:[Fencing,Property,pcmk_off_retries] - - |pcmk_list_action - |string - |list - |'Advanced use only.' The command to send to the resource agent in order to - list nodes. Some devices do not support the standard commands or may provide - additional ones. Use this to specify an alternate, device-specific command. - indexterm:[pcmk_list_action,Fencing] - indexterm:[Fencing,Property,pcmk_list_action] - - |pcmk_list_timeout - |time - |60s - |'Advanced use only.' Specify an alternate timeout to use for `list` actions - instead of the value of +stonith-timeout+. Some devices need much more or less - time to complete than normal. Use this to specify an alternate, device-specific - timeout. 
- indexterm:[pcmk_list_timeout,Fencing] - indexterm:[Fencing,Property,pcmk_list_timeout] - - |pcmk_list_retries - |integer - |2 - |'Advanced use only.' The maximum number of times to retry the `list` command - within the timeout period. Some devices do not support multiple connections, and - operations may fail if the device is busy with another task, so Pacemaker will - automatically retry the operation, if there is time remaining. Use this option - to alter the number of times Pacemaker retries before giving up. - indexterm:[pcmk_list_retries,Fencing] - indexterm:[Fencing,Property,pcmk_list_retries] - - |pcmk_monitor_action - |string - |monitor - |'Advanced use only.' The command to send to the resource agent in order to - report extended status. Some devices do not support the standard commands or may provide - additional ones. Use this to specify an alternate, device-specific command. - indexterm:[pcmk_monitor_action,Fencing] - indexterm:[Fencing,Property,pcmk_monitor_action] - - |pcmk_monitor_timeout - |time - |60s - |'Advanced use only.' Specify an alternate timeout to use for `monitor` actions - instead of the value of +stonith-timeout+. Some devices need much more or less - time to complete than normal. Use this to specify an alternate, device-specific - timeout. - indexterm:[pcmk_monitor_timeout,Fencing] - indexterm:[Fencing,Property,pcmk_monitor_timeout] - - |pcmk_monitor_retries - |integer - |2 - |'Advanced use only.' The maximum number of times to retry the `monitor` command - within the timeout period. Some devices do not support multiple connections, and - operations may fail if the device is busy with another task, so Pacemaker will - automatically retry the operation, if there is time remaining. Use this option - to alter the number of times Pacemaker retries before giving up. - indexterm:[pcmk_monitor_retries,Fencing] - indexterm:[Fencing,Property,pcmk_monitor_retries] - - |pcmk_status_action - |string - |status - |'Advanced use only.' The command to send to the resource agent in order to - report status. Some devices do not support the standard commands or may provide - additional ones. Use this to specify an alternate, device-specific command. - indexterm:[pcmk_status_action,Fencing] - indexterm:[Fencing,Property,pcmk_status_action] - - |pcmk_status_timeout - |time - |60s - |'Advanced use only.' Specify an alternate timeout to use for `status` actions - instead of the value of +stonith-timeout+. Some devices need much more or less - time to complete than normal. Use this to specify an alternate, device-specific - timeout. - indexterm:[pcmk_status_timeout,Fencing] - indexterm:[Fencing,Property,pcmk_status_timeout] - - |pcmk_status_retries - |integer - |2 - |'Advanced use only.' The maximum number of times to retry the `status` command - within the timeout period. Some devices do not support multiple connections, and - operations may fail if the device is busy with another task, so Pacemaker will - automatically retry the operation, if there is time remaining. Use this option - to alter the number of times Pacemaker retries before giving up. - indexterm:[pcmk_status_retries,Fencing] - indexterm:[Fencing,Property,pcmk_status_retries] - - |========================================================= - - [NOTE] - ==== - The default value for +pcmk_host_check+ is +static-list+ if either - +pcmk_host_list+ or +pcmk_host_map+ is configured. 
If neither of those are - configured, the default is +dynamic-list+ if the fence device supports the list - action, or +status+ if the fence device supports the status action but not the - list action. If none of those conditions apply, the default is +none+. - ==== - - [[s-unfencing]] - == Unfencing == - - With fabric fencing (such as cutting network or shared disk access rather than - power), it is expected that the cluster will fence the node, and - then a system administrator must manually investigate what went wrong, correct - any issues found, then reboot (or restart the cluster services on) the node. - - Once the node reboots and rejoins the cluster, some fabric fencing devices - require an explicit command to restore the node's access. This capability is - called 'unfencing' and is typically implemented as the fence agent's +on+ - command. - - If any cluster resource has +requires+ set to +unfencing+, then that resource - will not be probed or started on a node until that node has been unfenced. - - == Fence Devices Dependent on Other Resources == - - In some cases, a fence device may require some other cluster resource (such as - an IP address) to be active in order to function properly. - - This is obviously undesirable in general: fencing may be required when the - depended-on resource is not active, or fencing may be required because the node - running the depended-on resource is no longer responding. - - However, this may be acceptable under certain conditions: - - * The dependent fence device should not be able to target any node that is - allowed to run the depended-on resource. - - * The depended-on resource should not be disabled during production operation. - - * The +concurrent-fencing+ cluster property should be set to +true+. Otherwise, - if both the node running the depended-on resource and some node targeted by - the dependent fence device need to be fenced, the fencing of the node - running the depended-on resource might be ordered first, making the second - fencing impossible and blocking further recovery. With concurrent fencing, - the dependent fence device might fail at first due to the depended-on - resource being unavailable, but it will be retried and eventually succeed - once the resource is brought back up. - - Even under those conditions, there is one unlikely problem scenario. The DC - always schedules fencing of itself after any other fencing needed, to avoid - unnecessary repeated DC elections. If the dependent fence device targets the - DC, and both the DC and a different node running the depended-on resource need - to be fenced, the DC fencing will always fail and block further recovery. Note, - however, that losing a DC node entirely causes some other node to become DC and - schedule the fencing, so this is only a risk when a stop or other operation - with +on-fail+ set to +fencing+ fails on the DC. - - == Configuring Fencing == - - . Find the correct driver: - + - ---- - # stonith_admin --list-installed - ---- - - . Find the required parameters associated with the device - (replacing $AGENT_NAME with the name obtained from the previous step): - + - ---- - # stonith_admin --metadata --agent $AGENT_NAME - ---- - - . Create a file called +stonith.xml+ containing a primitive resource - with a class of +stonith+, a type equal to the agent name obtained earlier, - and a parameter for each of the values returned in the previous step. - - . 
If the device does not know how to fence nodes based on their uname, - you may also need to set the special +pcmk_host_map+ parameter. See - `man pacemaker-fenced` for details. - - . If the device does not support the `list` command, you may also need - to set the special +pcmk_host_list+ and/or +pcmk_host_check+ - parameters. See `man pacemaker-fenced` for details. - - . If the device does not expect the victim to be specified with the - `port` parameter, you may also need to set the special - +pcmk_host_argument+ parameter. See `man pacemaker-fenced` for details. - - . Upload it into the CIB using cibadmin: - + - ---- - # cibadmin -C -o resources --xml-file stonith.xml - ---- - - . Set +stonith-enabled+ to true: - + - ---- - # crm_attribute -t crm_config -n stonith-enabled -v true - ---- - - . Once the stonith resource is running, you can test it by executing the - following (although you might want to stop the cluster on that machine - first): - + - ---- - # stonith_admin --reboot nodename - ---- - - === Example Fencing Configuration === - - Assume we have a chassis containing four nodes and an IPMI device - active on 192.0.2.1. We would choose the `fence_ipmilan` driver, - and obtain the following list of parameters: - - .Obtaining a list of Fence Agent Parameters - ==== - ---- - # stonith_admin --metadata -a fence_ipmilan - ---- - - [source,XML] - ---- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---- - ==== - - Based on that, we would create a fencing resource fragment that might look - like this: - - .An IPMI-based Fencing Resource - ==== - [source,XML] - ---- - - - - - - - - - - - - ---- - ==== - - Finally, we need to enable fencing: - ---- - # crm_attribute -t crm_config -n stonith-enabled -v true - ---- - - == Fencing Topologies == - - Pacemaker supports fencing nodes with multiple devices through a feature called - 'fencing topologies'. Fencing topologies may be used to provide alternative - devices in case one fails, or to require multiple devices to all be executed - successfully in order to consider the node successfully fenced, or even a - combination of the two. - - Create the individual devices as you normally would, then define one or more - +fencing-level+ entries in the +fencing-topology+ section of the configuration. - - * Each fencing level is attempted in order of ascending +index+. Allowed - values are 1 through 9. - * If a device fails, processing terminates for the current level. - No further devices in that level are exercised, and the next level is attempted instead. - * If the operation succeeds for all the listed devices in a level, the level is deemed to have passed. - * The operation is finished when a level has passed (success), or all levels have been attempted (failed). - * If the operation failed, the next step is determined by the scheduler - and/or the controller. 
- - Some possible uses of topologies include: - - * Try on-board IPMI, then an intelligent power switch if that fails - * Try fabric fencing of both disk and network, then fall back to power fencing - if either fails - * Wait up to a certain time for a kernel dump to complete, then cut power to - the node - - .Properties of Fencing Levels - [width="95%",cols="1m,<3",options="header",align="center"] - |========================================================= - - |Field - |Description - - |id - |A unique name for the level - indexterm:[id,fencing-level] - indexterm:[Fencing,fencing-level,id] - - |target - |The name of a single node to which this level applies - indexterm:[target,fencing-level] - indexterm:[Fencing,fencing-level,target] - - |target-pattern - |An extended regular expression (as defined in - http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04[POSIX]) - matching the names of nodes to which this level applies - indexterm:[target-pattern,fencing-level] - indexterm:[Fencing,fencing-level,target-pattern] - - |target-attribute - |The name of a node attribute that is set (to +target-value+) for nodes to - which this level applies - indexterm:[target-attribute,fencing-level] - indexterm:[Fencing,fencing-level,target-attribute] - - |target-value - |The node attribute value (of +target-attribute+) that is set for nodes to - which this level applies - indexterm:[target-attribute,fencing-level] - indexterm:[Fencing,fencing-level,target-attribute] - - |index - |The order in which to attempt the levels. - Levels are attempted in ascending order 'until one succeeds'. - Valid values are 1 through 9. - indexterm:[index,fencing-level] - indexterm:[Fencing,fencing-level,index] - - |devices - |A comma-separated list of devices that must all be tried for this level - indexterm:[devices,fencing-level] - indexterm:[Fencing,fencing-level,devices] - - |========================================================= - - .Fencing topology with different devices for different nodes - ==== - [source,XML] - ---- - - - ... - - - - - - - - - - ... - - - - ---- - ==== - - === Example Dual-Layer, Dual-Device Fencing Topologies === - - The following example illustrates an advanced use of +fencing-topology+ in a cluster with the following properties: - - * 3 nodes (2 active prod-mysql nodes, 1 prod_mysql-rep in standby for quorum purposes) - * the active nodes have an IPMI-controlled power board reached at 192.0.2.1 and 192.0.2.2 - * the active nodes also have two independent PSUs (Power Supply Units) - connected to two independent PDUs (Power Distribution Units) reached at - 198.51.100.1 (port 10 and port 11) and 203.0.113.1 (port 10 and port 11) - * the first fencing method uses the `fence_ipmi` agent - * the second fencing method uses the `fence_apc_snmp` agent targetting 2 fencing devices (one per PSU, either port 10 or 11) - * fencing is only implemented for the active nodes and has location constraints - * fencing topology is set to try IPMI fencing first then default to a "sure-kill" dual PDU fencing - - In a normal failure scenario, STONITH will first select +fence_ipmi+ to try to kill the faulty node. - Using a fencing topology, if that first method fails, STONITH will then move on to selecting +fence_apc_snmp+ twice: - - * once for the first PDU - * again for the second PDU - - The fence action is considered successful only if both PDUs report the required status. 
If any of them fails, STONITH loops back to the first fencing method, +fence_ipmi+, and so on until the node is fenced or fencing action is cancelled. - - .First fencing method: single IPMI device - - Each cluster node has it own dedicated IPMI channel that can be called for fencing using the following primitives: - [source,XML] - ---- - - - - - - - - - - - - - - - - - - - - - - - ---- - - .Second fencing method: dual PDU devices - - Each cluster node also has two distinct power channels controlled by two - distinct PDUs. That means a total of 4 fencing devices configured as follows: - - - Node 1, PDU 1, PSU 1 @ port 10 - - Node 1, PDU 2, PSU 2 @ port 10 - - Node 2, PDU 1, PSU 1 @ port 11 - - Node 2, PDU 2, PSU 2 @ port 11 - - The matching fencing agents are configured as follows: - [source,XML] - ---- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---- - - .Location Constraints - - To prevent STONITH from trying to run a fencing agent on the same node it is - supposed to fence, constraints are placed on all the fencing primitives: - [source,XML] - ---- - - - - - - - - - ---- - - .Fencing topology - - Now that all the fencing resources are defined, it's time to create the right topology. - We want to first fence using IPMI and if that does not work, fence both PDUs to effectively and surely kill the node. - [source,XML] - ---- - - - - - - - ---- - Please note, in +fencing-topology+, the lowest +index+ value determines the priority of the first fencing method. - - .Final configuration - - Put together, the configuration looks like this: - [source,XML] - ---- - - - - - - - +What Is Fencing? +################ + +*Fencing* is the ability to make a node unable to run resources, even when that +node is unresponsive to cluster commands. + +Fencing is also known as *STONITH*, an acronym for "Shoot The Other Node In The +Head", since the most common fencing method is cutting power to the node. +Another method is "fabric fencing", cutting the node's access to some +capability required to run resources (such as network access or a shared disk). + +.. index:: + single: fencing; why necessary + +Why Is Fencing Necessary? +######################### + +Fencing protects your data from being corrupted by malfunctioning nodes or +unintentional concurrent access to shared resources. + +Fencing protects against the "split brain" failure scenario, where cluster +nodes have lost the ability to reliably communicate with each other but are +still able to run resources. If the cluster just assumed that uncommunicative +nodes were down, then multiple instances of a resource could be started on +different nodes. + +The effect of split brain depends on the resource type. For example, an IP +address brought up on two hosts on a network will cause packets to randomly be +sent to one or the other host, rendering the IP useless. For a database or +clustered file system, the effect could be much more severe, causing data +corruption or divergence. + +Fencing is also used when a resource cannot otherwise be stopped. If a +resource fails to stop on a node, it cannot be started on a different node +without risking the same type of conflict as split-brain. Fencing the +original node ensures the resource can be safely started elsewhere. + +Users may also configure the ``on-fail`` property of :ref:`operation` or the +``loss-policy`` property of +:ref:`ticket constraints ` to ``fence``, in which +case the cluster will fence the resource's node if the operation fails or the +ticket is lost. 
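+
+For example, a hypothetical resource definition (a minimal sketch; the agent,
+resource, and operation IDs are illustrative) might explicitly request fencing
+whenever its ``stop`` action fails:
+
+.. code-block:: xml
+
+   <primitive id="my-db" class="ocf" provider="heartbeat" type="pgsql">
+     <operations>
+       <!-- on-fail="fence" asks the cluster to fence this node if the stop fails -->
+       <op id="my-db-stop" name="stop" interval="0s" timeout="120s" on-fail="fence"/>
+     </operations>
+   </primitive>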
+ +.. index:: + single: fencing; device + +Fence Devices +############# + +A *fence device* or *fencing device* is a special type of resource that +provides the means to fence a node. + +Examples of fencing devices include intelligent power switches and IPMI devices +that accept SNMP commands to cut power to a node, and iSCSI controllers that +allow SCSI reservations to be used to cut a node's access to a shared disk. + +Since fencing devices will be used to recover from loss of networking +connectivity to other nodes, it is essential that they do not rely on the same +network as the cluster itself, otherwise that network becomes a single point of +failure. + +Since loss of a node due to power outage is indistinguishable from loss of +network connectivity to that node, it is also essential that at least one fence +device for a node does not share power with that node. For example, an on-board +IPMI controller that shares power with its host should not be used as the sole +fencing device for that host. + +Since fencing is used to isolate malfunctioning nodes, no fence device should +rely on its target functioning properly. This includes, for example, devices +that ssh into a node and issue a shutdown command (such devices might be +suitable for testing, but never for production). + +.. index:: + single: fencing; agent + +Fence Agents +############ + +A *fence agent* or *fencing agent* is a ``stonith``-class resource agent. + +The fence agent standard provides commands (such as ``off`` and ``reboot``) +that the cluster can use to fence nodes. As with other resource agent classes, +this allows a layer of abstraction so that Pacemaker doesn't need any knowledge +about specific fencing technologies -- that knowledge is isolated in the agent. + +When a Fence Device Can Be Used +############################### + +Fencing devices do not actually "run" like most services. Typically, they just +provide an interface for sending commands to an external device. + +Additionally, fencing may be initiated by Pacemaker, by other cluster-aware +software such as DRBD or DLM, or manually by an administrator, at any point in +the cluster life cycle, including before any resources have been started. + +To accommodate this, Pacemaker does not require the fence device resource to be +"started" in order to be used. Whether a fence device is started or not +determines whether a node runs any recurring monitor for the device, and gives +the node a slight preference for being chosen to execute fencing using that +device. + +By default, any node can execute any fencing device. If a fence device is +disabled by setting its ``target-role`` to ``Stopped``, then no node can use +that device. If a location constraint with a negative score prevents a specific +node from "running" a fence device, then that node will never be chosen to +execute fencing using the device. A node may fence itself, but the cluster will +choose that only if no other nodes can do the fencing. + +A common configuration scenario is to have one fence device per target node. +In such a case, users often configure anti-location constraints so that +the target node does not monitor its own device. + +Limitations of Fencing Resources +################################ + +Fencing resources have certain limitations that other resource classes don't: + +* They may have only one set of meta-attributes and one set of instance + attributes. 
+* If :ref:`rules` are used to determine fencing resource options, these + might be evaluated only when first read, meaning that later changes to the + rules will have no effect. Therefore, it is better to avoid confusion and not + use rules at all with fencing resources. + +These limitations could be revisited if there is sufficient user demand. + +.. index:: + single: fencing; special instance attributes + +.. _fencing-attributes: + +Special Options for Fencing Resources +##################################### + +The table below lists special instance attributes that may be set for any +fencing resource (*not* meta-attributes, even though they are interpreted by +Pacemaker rather than the fence agent). These are also listed in the man page +for ``pacemaker-fenced``. + +.. Not_Yet_Implemented: + + +----------------------+---------+--------------------+----------------------------------------+ + | priority | integer | 0 | .. index:: | + | | | | single: priority | + | | | | | + | | | | The priority of the fence device. | + | | | | Devices are tried in order of highest | + | | | | priority to lowest. | + +----------------------+---------+--------------------+----------------------------------------+ + +.. table:: **Additional Properties of Fencing Resources** + + +----------------------+---------+--------------------+----------------------------------------+ + | Field | Type | Default | Description | + +======================+=========+====================+========================================+ + | stonith-timeout | time | | .. index:: | + | | | | single: stonith-timeout | + | | | | | + | | | | Older versions used this to override | + | | | | the default period to wait for a fence | + | | | | action (reboot, on, or off) to | + | | | | complete for this device. It has been | + | | | | replaced by the | + | | | | ``pcmk_reboot_timeout`` and | + | | | | ``pcmk_off_timeout`` properties. | + +----------------------+---------+--------------------+----------------------------------------+ + | provides | string | | .. index:: | + | | | | single: provides | + | | | | | + | | | | Any special capability provided by the | + | | | | fence device. Currently, only one such | + | | | | capability is meaningful: | + | | | | :ref:`unfencing `. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_host_map | string | | .. index:: | + | | | | single: pcmk_host_map | + | | | | | + | | | | A mapping of host names to ports | + | | | | numbers for devices that do not | + | | | | support host names. | + | | | | | + | | | | Example: ``node1:1;node2:2,3`` tells | + | | | | the cluster to use port 1 for | + | | | | ``node1`` and ports 2 and 3 for | + | | | | ``node2``. If ``pcmk_host_check`` is | + | | | | explicitly set to ``static-list``, | + | | | | either this or ``pcmk_host_list`` must | + | | | | be set. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_host_list | string | | .. index:: | + | | | | single: pcmk_host_list | + | | | | | + | | | | A list of machines controlled by this | + | | | | device. If ``pcmk_host_check`` is | + | | | | explicitly set to ``static-list``, | + | | | | either this or ``pcmk_host_map`` must | + | | | | be set. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_host_check | string | The default is | .. 
index:: | + | | | ``static-list`` if | single: pcmk_host_check | + | | | either | | + | | | ``pcmk_host_list`` | How to determine which machines are | + | | | or | controlled by the device. Allowed | + | | | ``pcmk_host_map`` | values: | + | | | is configured. If | | + | | | neither of those | * ``dynamic-list:`` query the device | + | | | are configured, | via the agent's ``list`` action | + | | | the default is | * ``static-list:`` check the | + | | | ``dynamic-list`` | ``pcmk_host_list`` or | + | | | if the fence | ``pcmk_host_map`` attribute | + | | | device supports | * ``status:`` query the device via the | + | | | the list action, | "status" command | + | | | or ``status`` if | * ``none:`` assume the device can | + | | | the fence device | fence any node | + | | | supports the | | + | | | status action but | | + | | | not the list | | + | | | action. If none of | | + | | | those conditions | | + | | | apply, the default | | + | | | is ``none``. | | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_delay_max | time | 0s | .. index:: | + | | | | single: pcmk_delay_max | + | | | | | + | | | | Enable a random delay of up to the | + | | | | time specified before executing | + | | | | fencing actions. This is sometimes | + | | | | used in two-node clusters to ensure | + | | | | that the nodes don't fence each other | + | | | | at the same time. The overall delay | + | | | | introduced by pacemaker is derived | + | | | | from this random delay value adding a | + | | | | static delay so that the sum is kept | + | | | | below the maximum delay. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_delay_base | time | 0s | .. index:: | + | | | | single: pcmk_delay_base | + | | | | | + | | | | Enable a static delay before executing | + | | | | fencing actions. This can be used, for | + | | | | example, in two-node clusters to | + | | | | ensure that the nodes don't fence each | + | | | | other, by having separate fencing | + | | | | resources with different values. The | + | | | | node that is fenced with the shorter | + | | | | delay will lose a fencing race. The | + | | | | overall delay introduced by pacemaker | + | | | | is derived from this value plus a | + | | | | random delay such that the sum is kept | + | | | | below the maximum delay. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_action_limit | integer | 1 | .. index:: | + | | | | single: pcmk_action_limit | + | | | | | + | | | | The maximum number of actions that can | + | | | | be performed in parallel on this | + | | | | device, if the cluster option | + | | | | ``concurrent-fencing`` is ``true``. A | + | | | | value of -1 means unlimited. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_host_argument | string | ``port`` otherwise | .. index:: | + | | | ``plug`` if | single: pcmk_host_argument | + | | | supported | | + | | | according to the | *Advanced use only.* Which parameter | + | | | metadata of the | should be supplied to the fence agent | + | | | fence agent | to identify the node to be fenced. | + | | | | Some devices support neither the | + | | | | standard ``plug`` nor the deprecated | + | | | | ``port`` parameter, or may provide | + | | | | additional ones. Use this to specify | + | | | | an alternate, device-specific | + | | | | parameter. 
A value of ``none`` tells | + | | | | the cluster not to supply any | + | | | | additional parameters. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_reboot_action | string | reboot | .. index:: | + | | | | single: pcmk_reboot_action | + | | | | | + | | | | *Advanced use only.* The command to | + | | | | send to the resource agent in order to | + | | | | reboot a node. Some devices do not | + | | | | support the standard commands or may | + | | | | provide additional ones. Use this to | + | | | | specify an alternate, device-specific | + | | | | command. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_reboot_timeout | time | 60s | .. index:: | + | | | | single: pcmk_reboot_timeout | + | | | | | + | | | | *Advanced use only.* Specify an | + | | | | alternate timeout to use for | + | | | | ``reboot`` actions instead of the | + | | | | value of ``stonith-timeout``. Some | + | | | | devices need much more or less time to | + | | | | complete than normal. Use this to | + | | | | specify an alternate, device-specific | + | | | | timeout. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_reboot_retries | integer | 2 | .. index:: | + | | | | single: pcmk_reboot_retries | + | | | | | + | | | | *Advanced use only.* The maximum | + | | | | number of times to retry the | + | | | | ``reboot`` command within the timeout | + | | | | period. Some devices do not support | + | | | | multiple connections, and operations | + | | | | may fail if the device is busy with | + | | | | another task, so Pacemaker will | + | | | | automatically retry the operation, if | + | | | | there is time remaining. Use this | + | | | | option to alter the number of times | + | | | | Pacemaker retries before giving up. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_off_action | string | off | .. index:: | + | | | | single: pcmk_off_action | + | | | | | + | | | | *Advanced use only.* The command to | + | | | | send to the resource agent in order to | + | | | | shut down a node. Some devices do not | + | | | | support the standard commands or may | + | | | | provide additional ones. Use this to | + | | | | specify an alternate, device-specific | + | | | | command. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_off_timeout | time | 60s | .. index:: | + | | | | single: pcmk_off_timeout | + | | | | | + | | | | *Advanced use only.* Specify an | + | | | | alternate timeout to use for | + | | | | ``off`` actions instead of the | + | | | | value of ``stonith-timeout``. Some | + | | | | devices need much more or less time to | + | | | | complete than normal. Use this to | + | | | | specify an alternate, device-specific | + | | | | timeout. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_off_retries | integer | 2 | .. index:: | + | | | | single: pcmk_off_retries | + | | | | | + | | | | *Advanced use only.* The maximum | + | | | | number of times to retry the | + | | | | ``off`` command within the timeout | + | | | | period. 
Some devices do not support | + | | | | multiple connections, and operations | + | | | | may fail if the device is busy with | + | | | | another task, so Pacemaker will | + | | | | automatically retry the operation, if | + | | | | there is time remaining. Use this | + | | | | option to alter the number of times | + | | | | Pacemaker retries before giving up. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_list_action | string | list | .. index:: | + | | | | single: pcmk_list_action | + | | | | | + | | | | *Advanced use only.* The command to | + | | | | send to the resource agent in order to | + | | | | list nodes. Some devices do not | + | | | | support the standard commands or may | + | | | | provide additional ones. Use this to | + | | | | specify an alternate, device-specific | + | | | | command. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_list_timeout | time | 60s | .. index:: | + | | | | single: pcmk_list_timeout | + | | | | | + | | | | *Advanced use only.* Specify an | + | | | | alternate timeout to use for | + | | | | ``list`` actions instead of the | + | | | | value of ``stonith-timeout``. Some | + | | | | devices need much more or less time to | + | | | | complete than normal. Use this to | + | | | | specify an alternate, device-specific | + | | | | timeout. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_list_retries | integer | 2 | .. index:: | + | | | | single: pcmk_list_retries | + | | | | | + | | | | *Advanced use only.* The maximum | + | | | | number of times to retry the | + | | | | ``list`` command within the timeout | + | | | | period. Some devices do not support | + | | | | multiple connections, and operations | + | | | | may fail if the device is busy with | + | | | | another task, so Pacemaker will | + | | | | automatically retry the operation, if | + | | | | there is time remaining. Use this | + | | | | option to alter the number of times | + | | | | Pacemaker retries before giving up. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_monitor_action | string | monitor | .. index:: | + | | | | single: pcmk_monitor_action | + | | | | | + | | | | *Advanced use only.* The command to | + | | | | send to the resource agent in order to | + | | | | report extended status. Some devices do| + | | | | not support the standard commands or | + | | | | may provide additional ones. Use this | + | | | | to specify an alternate, | + | | | | device-specific command. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_monitor_timeout | time | 60s | .. index:: | + | | | | single: pcmk_monitor_timeout | + | | | | | + | | | | *Advanced use only.* Specify an | + | | | | alternate timeout to use for | + | | | | ``monitor`` actions instead of the | + | | | | value of ``stonith-timeout``. Some | + | | | | devices need much more or less time to | + | | | | complete than normal. Use this to | + | | | | specify an alternate, device-specific | + | | | | timeout. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_monitor_retries | integer | 2 | .. 
index:: | + | | | | single: pcmk_monitor_retries | + | | | | | + | | | | *Advanced use only.* The maximum | + | | | | number of times to retry the | + | | | | ``monitor`` command within the timeout | + | | | | period. Some devices do not support | + | | | | multiple connections, and operations | + | | | | may fail if the device is busy with | + | | | | another task, so Pacemaker will | + | | | | automatically retry the operation, if | + | | | | there is time remaining. Use this | + | | | | option to alter the number of times | + | | | | Pacemaker retries before giving up. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_status_action | string | status | .. index:: | + | | | | single: pcmk_status_action | + | | | | | + | | | | *Advanced use only.* The command to | + | | | | send to the resource agent in order to | + | | | | report status. Some devices do | + | | | | not support the standard commands or | + | | | | may provide additional ones. Use this | + | | | | to specify an alternate, | + | | | | device-specific command. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_status_timeout | time | 60s | .. index:: | + | | | | single: pcmk_status_timeout | + | | | | | + | | | | *Advanced use only.* Specify an | + | | | | alternate timeout to use for | + | | | | ``status`` actions instead of the | + | | | | value of ``stonith-timeout``. Some | + | | | | devices need much more or less time to | + | | | | complete than normal. Use this to | + | | | | specify an alternate, device-specific | + | | | | timeout. | + +----------------------+---------+--------------------+----------------------------------------+ + | pcmk_status_retries | integer | 2 | .. index:: | + | | | | single: pcmk_status_retries | + | | | | | + | | | | *Advanced use only.* The maximum | + | | | | number of times to retry the | + | | | | ``status`` command within the timeout | + | | | | period. Some devices do not support | + | | | | multiple connections, and operations | + | | | | may fail if the device is busy with | + | | | | another task, so Pacemaker will | + | | | | automatically retry the operation, if | + | | | | there is time remaining. Use this | + | | | | option to alter the number of times | + | | | | Pacemaker retries before giving up. | + +----------------------+---------+--------------------+----------------------------------------+ + +.. index:: + single: unfencing + single: fencing; unfencing + +.. _unfencing: + +Unfencing +######### + +With fabric fencing (such as cutting network or shared disk access rather than +power), it is expected that the cluster will fence the node, and then a system +administrator must manually investigate what went wrong, correct any issues +found, then reboot (or restart the cluster services on) the node. + +Once the node reboots and rejoins the cluster, some fabric fencing devices +require an explicit command to restore the node's access. This capability is +called *unfencing* and is typically implemented as the fence agent's ``on`` +command. + +If any cluster resource has ``requires`` set to ``unfencing``, then that +resource will not be probed or started on a node until that node has been +unfenced. + +Fence Devices Dependent on Other Resources +########################################## + +In some cases, a fence device may require some other cluster resource (such as +an IP address) to be active in order to function properly. 
+ +This is obviously undesirable in general: fencing may be required when the +depended-on resource is not active, or fencing may be required because the node +running the depended-on resource is no longer responding. + +However, this may be acceptable under certain conditions: + +* The dependent fence device should not be able to target any node that is + allowed to run the depended-on resource. + +* The depended-on resource should not be disabled during production operation. + +* The ``concurrent-fencing`` cluster property should be set to ``true``. + Otherwise, if both the node running the depended-on resource and some node + targeted by the dependent fence device need to be fenced, the fencing of the + node running the depended-on resource might be ordered first, making the + second fencing impossible and blocking further recovery. With concurrent + fencing, the dependent fence device might fail at first due to the + depended-on resource being unavailable, but it will be retried and eventually + succeed once the resource is brought back up. + +Even under those conditions, there is one unlikely problem scenario. The DC +always schedules fencing of itself after any other fencing needed, to avoid +unnecessary repeated DC elections. If the dependent fence device targets the +DC, and both the DC and a different node running the depended-on resource need +to be fenced, the DC fencing will always fail and block further recovery. Note, +however, that losing a DC node entirely causes some other node to become DC and +schedule the fencing, so this is only a risk when a stop or other operation +with ``on-fail`` set to ``fencing`` fails on the DC. + +.. index:: + single: fencing; configuration + +Configuring Fencing +################### + +Higher-level tools can provide simpler interfaces to this process, but using +Pacemaker command-line tools, this is how you could configure a fence device. + +#. Find the correct driver: + + .. code-block:: none + + # stonith_admin --list-installed + + .. note:: + + You may have to install packages to make fence agents available on your + host. Searching your available packages for ``fence-`` is usually + helpful. Ensure the packages providing the fence agents you require are + installed on every cluster node. + +#. Find the required parameters associated with the device + (replacing ``$AGENT_NAME`` with the name obtained from the previous step): + + .. code-block:: none + + # stonith_admin --metadata --agent $AGENT_NAME + +#. Create a file called ``stonith.xml`` containing a primitive resource + with a class of ``stonith``, a type equal to the agent name obtained earlier, + and a parameter for each of the values returned in the previous step. + +#. If the device does not know how to fence nodes based on their uname, + you may also need to set the special ``pcmk_host_map`` parameter. See + :ref:`fencing-attributes` for details. + +#. If the device does not support the ``list`` command, you may also need + to set the special ``pcmk_host_list`` and/or ``pcmk_host_check`` + parameters. See :ref:`fencing-attributes` for details. + +#. If the device does not expect the victim to be specified with the + ``port`` parameter, you may also need to set the special + ``pcmk_host_argument`` parameter. See :ref:`fencing-attributes` for details. + +#. Upload it into the CIB using cibadmin: + + .. code-block:: none + + # cibadmin --create --scope resources --xml-file stonith.xml + +#. Set ``stonith-enabled`` to true: + + .. 
code-block:: none + + # crm_attribute --type crm_config --name stonith-enabled --update true + +#. Once the stonith resource is running, you can test it by executing the + following, replacing ``$NODE_NAME`` with the name of the node to fence + (although you might want to stop the cluster on that machine first): + + .. code-block:: none + + # stonith_admin --reboot $NODE_NAME + + +Example Fencing Configuration +_____________________________ + +For this example, we assume we have a cluster node, ``pcmk-1``, whose IPMI +controller is reachable at the IP address 192.0.2.1. The IPMI controller uses +the username ``testuser`` and the password ``abc123``. + +#. Looking at what's installed, we may see a variety of available agents: + + .. code-block:: none + + # stonith_admin --list-installed + + .. code-block:: none + + (... some output omitted ...) + fence_idrac + fence_ilo3 + fence_ilo4 + fence_ilo5 + fence_imm + fence_ipmilan + (... some output omitted ...) + + Perhaps after some reading some man pages and doing some Internet searches, + we might decide ``fence_ipmilan`` is our best choice. + +#. Next, we would check what parameters ``fence_ipmilan`` provides: + + .. code-block:: none + + # stonith_admin --metadata -a fence_ipmilan + + .. code-block:: xml + + + + + + + + fence_ipmilan is an I/O Fencing agentwhich can be used with machines controlled by IPMI.This agent calls support software ipmitool (http://ipmitool.sf.net/). WARNING! This fence agent might report success before the node is powered off. You should use -m/method onoff if your fence device works correctly with that option. + + + + + + Fencing action + + + + + + IPMI Lan Auth type. + + + + + Ciphersuite to use (same as ipmitool -C parameter) + + + + + Hexadecimal-encoded Kg key for IPMIv2 authentication + + + + + IP address or hostname of fencing device + + + + + IP address or hostname of fencing device + + + + + TCP/UDP port to use for connection with device + + + + + Use Lanplus to improve security of connection + + + + + Login name + + + + + + Method to fence + + + + + Login password or passphrase + + + + + Script to run to retrieve password + + + + + Login password or passphrase + + + + + Script to run to retrieve password + + + + + IP address or hostname of fencing device (together with --port-as-ip) + + + + + IP address or hostname of fencing device (together with --port-as-ip) + + + + + + Privilege level on IPMI device + + + + + Bridge IPMI requests to the remote target address + + + + + Login name + + + + + Disable logging to stderr. Does not affect --verbose or --debug-file or logging to syslog. 
+ + + + + Verbose mode + + + + + Write debug information to given file + + + + + Write debug information to given file + + + + + Display version information and exit + + + + + Display help and exit + + + + + Wait X seconds before fencing is started + + + + + Path to ipmitool binary + + + + + Wait X seconds for cmd prompt after login + + + + + Make "port/plug" to be an alias to IP address + + + + + Test X seconds for status change after ON/OFF + + + + + Wait X seconds after issuing ON/OFF + + + + + Wait X seconds for cmd prompt after issuing command + + + + + Count of attempts to retry power on + + + + + Use sudo (without password) when calling 3rd party software + + + + + Use sudo (without password) when calling 3rd party software + + + + + Path to sudo binary + + + + + + + + + + + + + + + + + + Once we've decided what parameter values we think we need, it is a good idea + to run the fence agent's status action manually, to verify that our values + work correctly: + + .. code-block:: none + + # fence_ipmilan --lanplus -a 192.0.2.1 -l testuser -p abc123 -o status + + Chassis Power is on + +#. Based on that, we might create a fencing resource configuration like this in + ``stonith.xml`` (or any file name, just use the same name with ``cibadmin`` + later): + + .. code-block:: xml + + + + + + + + + + + + + + .. note:: + + Even though the man page shows that the ``action`` parameter is + supported, we do not provide that in the resource configuration. + Pacemaker will supply an appropriate action whenever the fence device + must be used. + +#. In this case, we don't need to configure ``pcmk_host_map`` because + ``fence_ipmilan`` ignores the target node name and instead uses its + ``ip`` parameter to know how to contact the IPMI controller. + +#. We do need to let Pacemaker know which cluster node can be fenced by this + device, since ``fence_ipmilan`` doesn't support the ``list`` action. Add + a line like this to the agent's instance attributes: + + .. code-block:: xml + + + +#. We don't need to configure ``pcmk_host_argument`` since ``ip`` is all the + fence agent needs (it ignores the target name). + +#. Make the configuration active: + + .. code-block:: none + + # cibadmin --create --scope resources --xml-file stonith.xml + +#. Set ``stonith-enabled`` to true (this only has to be done once): + + .. code-block:: none + + # crm_attribute --type crm_config --name stonith-enabled --update true + +#. Since our cluster is still in testing, we can reboot ``pcmk-1`` without + bothering anyone, so we'll test our fencing configuration by running this + from one of the other cluster nodes: + + .. code-block:: none + + # stonith_admin --reboot pcmk-1 + + Then we will verify that the node did, in fact, reboot. + +We can repeat that process to create a separate fencing resource for each node. + +With some other fence device types, a single fencing resource is able to be +used for all nodes. In fact, we could do that with ``fence_ipmilan``, using the +``port-as-ip`` parameter along with ``pcmk_host_map``. Either approach is +fine. + +.. index:: + single: fencing; topology + single: fencing-topology + single: fencing-level + +Fencing Topologies +################## + +Pacemaker supports fencing nodes with multiple devices through a feature called +*fencing topologies*. Fencing topologies may be used to provide alternative +devices in case one fails, or to require multiple devices to all be executed +successfully in order to consider the node successfully fenced, or even a +combination of the two. 
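+For instance, a topology that tries a node's IPMI device first and falls back
+to a pair of power switches (both of which must then succeed) might look
+roughly like the sketch below. The device IDs ``ipmi-pcmk-1``, ``pdu1-pcmk-1``,
+and ``pdu2-pcmk-1`` are hypothetical placeholders for fence devices configured
+as described above; a complete worked example follows later in this section.
+
+.. code-block:: xml
+
+   <fencing-topology>
+     <!-- Level 1: try the node's IPMI fence device first -->
+     <fencing-level id="fl-pcmk-1-1" target="pcmk-1" index="1"
+                    devices="ipmi-pcmk-1"/>
+     <!-- Level 2: if that fails, both power switches must succeed -->
+     <fencing-level id="fl-pcmk-1-2" target="pcmk-1" index="2"
+                    devices="pdu1-pcmk-1,pdu2-pcmk-1"/>
+   </fencing-topology>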
+ +Create the individual devices as you normally would, then define one or more +``fencing-level`` entries in the ``fencing-topology`` section of the +configuration. + +* Each fencing level is attempted in order of ascending ``index``. Allowed + values are 1 through 9. +* If a device fails, processing terminates for the current level. No further + devices in that level are exercised, and the next level is attempted instead. +* If the operation succeeds for all the listed devices in a level, the level is + deemed to have passed. +* The operation is finished when a level has passed (success), or all levels + have been attempted (failed). +* If the operation failed, the next step is determined by the scheduler and/or + the controller. + +Some possible uses of topologies include: + +* Try on-board IPMI, then an intelligent power switch if that fails +* Try fabric fencing of both disk and network, then fall back to power fencing + if either fails +* Wait up to a certain time for a kernel dump to complete, then cut power to + the node + +.. table:: **Attributes of a fencing-level Element** + + +------------------+-----------------------------------------------------------------------------------------+ + | Attribute | Description | + +==================+=========================================================================================+ + | id | .. index:: | + | | pair: fencing-level; id | + | | | + | | A unique name for this element (required) | + +------------------+-----------------------------------------------------------------------------------------+ + | target | .. index:: | + | | pair: fencing-level; target | + | | | + | | The name of a single node to which this level applies | + +------------------+-----------------------------------------------------------------------------------------+ + | target-pattern | .. index:: | + | | pair: fencing-level; target-pattern | + | | | + | | An extended regular expression (as defined in `POSIX | + | | `_) | + | | matching the names of nodes to which this level applies | + +------------------+-----------------------------------------------------------------------------------------+ + | target-attribute | .. index:: | + | | pair: fencing-level; target-attribute | + | | | + | | The name of a node attribute that is set (to ``target-value``) for nodes to which this | + | | level applies | + +------------------+-----------------------------------------------------------------------------------------+ + | target-value | .. index:: | + | | pair: fencing-level; target-value | + | | | + | | The node attribute value (of ``target-attribute``) that is set for nodes to which this | + | | level applies | + +------------------+-----------------------------------------------------------------------------------------+ + | index | .. index:: | + | | pair: fencing-level; index | + | | | + | | The order in which to attempt the levels. Levels are attempted in ascending order | + | | *until one succeeds*. Valid values are 1 through 9. | + +------------------+-----------------------------------------------------------------------------------------+ + | devices | .. index:: | + | | pair: fencing-level; devices | + | | | + | | A comma-separated list of devices that must all be tried for this level | + +------------------+-----------------------------------------------------------------------------------------+ + +.. note:: **Fencing topology with different devices for different nodes** + + .. code-block:: xml + + + + ... + + + + + + + + + ... 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ... - - - ---- - - == Remapping Reboots == - - When the cluster needs to reboot a node, whether because +stonith-action+ is +reboot+ or because - a reboot was manually requested (such as by `stonith_admin --reboot`), it will remap that to - other commands in two cases: - - . If the chosen fencing device does not support the +reboot+ command, the cluster - will ask it to perform +off+ instead. - - . If a fencing topology level with multiple devices must be executed, the cluster - will ask all the devices to perform +off+, then ask the devices to perform +on+. - - To understand the second case, consider the example of a node with redundant - power supplies connected to intelligent power switches. Rebooting one switch - and then the other would have no effect on the node. Turning both switches off, - and then on, actually reboots the node. - - In such a case, the fencing operation will be treated as successful as long as - the +off+ commands succeed, because then it is safe for the cluster to recover - any resources that were on the node. Timeouts and errors in the +on+ phase will - be logged but ignored. - - When a reboot operation is remapped, any action-specific timeout for the - remapped action will be used (for example, +pcmk_off_timeout+ will be used when - executing the +off+ command, not +pcmk_reboot_timeout+). + + + + +Example Dual-Layer, Dual-Device Fencing Topologies +__________________________________________________ + +The following example illustrates an advanced use of ``fencing-topology`` in a +cluster with the following properties: + +* 2 nodes (prod-mysql1 and prod-mysql2) +* the nodes have IPMI controllers reachable at 192.0.2.1 and 192.0.2.2 +* the nodes each have two independent Power Supply Units (PSUs) connected to + two independent Power Distribution Units (PDUs) reachable at 198.51.100.1 + (port 10 and port 11) and 203.0.113.1 (port 10 and port 11) +* fencing via the IPMI controller uses the ``fence_ipmilan`` agent (1 fence device + per controller, with each device targeting a separate node) +* fencing via the PDUs uses the ``fence_apc_snmp`` agent (1 fence device per + PDU, with both devices targeting both nodes) +* a random delay is used to lessen the chance of a "death match" +* fencing topology is set to try IPMI fencing first then dual PDU fencing if + that fails + +In a node failure scenario, Pacemaker will first select ``fence_ipmilan`` to +try to kill the faulty node. Using the fencing topology, if that method fails, +it will then move on to selecting ``fence_apc_snmp`` twice (once for the first +PDU, then again for the second PDU). + +The fence action is considered successful only if both PDUs report the required +status. If any of them fails, fencing loops back to the first fencing method, +``fence_ipmilan``, and so on, until the node is fenced or the fencing action is +cancelled. + +.. note:: **First fencing method: single IPMI device per target** + + Each cluster node has it own dedicated IPMI controller that can be contacted + for fencing using the following primitives: + + .. code-block:: xml + + + + + + + + + + + + + + + + + + + + + + +.. 
note:: **Second fencing method: dual PDU devices** + + Each cluster node also has 2 distinct power supplies controlled by 2 + distinct PDUs: + + * Node 1: PDU 1 port 10 and PDU 2 port 10 + * Node 2: PDU 1 port 11 and PDU 2 port 11 + + The matching fencing agents are configured as follows: + + .. code-block:: xml + + + + + + + + + + + + + + + + + + + + +.. note:: **Fencing topology** + + Now that all the fencing resources are defined, it's time to create the + right topology. We want to first fence using IPMI and if that does not work, + fence both PDUs to effectively and surely kill the node. + + .. code-block:: xml + + + + + + + + + In ``fencing-topology``, the lowest ``index`` value for a target determines + its first fencing method. + +Remapping Reboots +################# + +When the cluster needs to reboot a node, whether because ``stonith-action`` is +``reboot`` or because a reboot was requested externally (such as by +``stonith_admin --reboot``), it will remap that to other commands in two cases: + +* If the chosen fencing device does not support the ``reboot`` command, the + cluster will ask it to perform ``off`` instead. + +* If a fencing topology level with multiple devices must be executed, the + cluster will ask all the devices to perform ``off``, then ask the devices to + perform ``on``. + +To understand the second case, consider the example of a node with redundant +power supplies connected to intelligent power switches. Rebooting one switch +and then the other would have no effect on the node. Turning both switches off, +and then on, actually reboots the node. + +In such a case, the fencing operation will be treated as successful as long as +the ``off`` commands succeed, because then it is safe for the cluster to +recover any resources that were on the node. Timeouts and errors in the ``on`` +phase will be logged but ignored. + +When a reboot operation is remapped, any action-specific timeout for the +remapped action will be used (for example, ``pcmk_off_timeout`` will be used +when executing the ``off`` command, not ``pcmk_reboot_timeout``). diff --git a/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst index 5a7554c4b9..133e79096f 100644 --- a/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst +++ b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst @@ -1,342 +1,345 @@ Multi-Site Clusters and Tickets ------------------------------- .. Convert_to_RST: Apart from local clusters, Pacemaker also supports multi-site clusters. That means you can have multiple, geographically dispersed sites, each with a local cluster. Failover between these clusters can be coordinated manually by the administrator, or automatically by a higher-level entity called a 'Cluster Ticket Registry (CTR)'. == Challenges for Multi-Site Clusters == Typically, multi-site environments are too far apart to support synchronous communication and data replication between the sites. That leads to significant challenges: - How do we make sure that a cluster site is up and running? - How do we make sure that resources are only started once? - How do we make sure that quorum can be reached between the different sites and a split-brain scenario avoided? - How do we manage failover between sites? - How do we deal with high latency in case of resources that need to be stopped? In the following sections, learn how to meet these challenges. 
    == Conceptual Overview ==

    Multi-site clusters can be considered “overlay” clusters where each cluster
    site corresponds to a cluster node in a traditional cluster. The overlay
    cluster can be managed by a CTR in order to guarantee that any cluster
    resource will be active on no more than one cluster site. This is achieved
    by using 'tickets' that are treated as failover domains between cluster
    sites, in case a site should be down.

    The following sections explain the individual components and mechanisms
    that were introduced for multi-site clusters in more detail.

    === Ticket ===

    Tickets are, essentially, cluster-wide attributes. A ticket grants the
    right to run certain resources on a specific cluster site. Resources can be
    bound to a certain ticket by +rsc_ticket+ constraints. Only if the ticket
    is available at a site can the respective resources be started there.
    Conversely, if the ticket is revoked, the resources depending on that
    ticket must be stopped.

    The ticket is thus similar to a 'site quorum', i.e. the permission to
    manage/own resources associated with that site. (One can also think of the
    current +have-quorum+ flag as a special, cluster-wide ticket that is
    granted in case of node majority.)

    Tickets can be granted and revoked either manually by administrators
    (which could be the default for classic enterprise clusters), or via the
    automated CTR mechanism described below.

    A ticket can only be owned by one site at a time. Initially, none of the
    sites has a ticket. Each ticket must be granted once by the cluster
    administrator.

    The presence or absence of tickets for a site is stored in the CIB as a
    cluster status. With regard to a certain ticket, there are only two states
    for a site: +true+ (the site has the ticket) or +false+ (the site does not
    have the ticket). The absence of a certain ticket (during the initial state
    of the multi-site cluster) is the same as the value +false+.

    === Dead Man Dependency ===

    A site can only activate resources safely if it can be sure that the other
    site has deactivated them. However, after a ticket is revoked, it can take
    a long time until all resources depending on that ticket are stopped
    "cleanly", especially in the case of cascaded resources. To cut that
    process short, the concept of a 'Dead Man Dependency' was introduced.

    If a dead man dependency is in force and a ticket is revoked from a site,
    the nodes that are hosting dependent resources are fenced. This
    considerably speeds up the recovery process of the cluster and makes sure
    that resources can be migrated more quickly.

    This can be configured by specifying a +loss-policy="fence"+ in
    +rsc_ticket+ constraints.

    === Cluster Ticket Registry ===

    A CTR is a coordinated group of network daemons that automatically handles
    granting, revoking, and timing out tickets (instead of the administrator
    revoking the ticket somewhere, waiting for everything to stop, and then
    granting it on the desired site).

    Pacemaker does not implement its own CTR, but interoperates with external
    software designed for that purpose (similar to how resource and fencing
    agents are not directly part of pacemaker).

    Participating clusters run the CTR daemons, which connect to each other,
    exchange information about their connectivity, and vote on which site gets
    which tickets.

    A ticket is granted to a site only once the CTR is sure that the ticket
    has been relinquished by the previous owner; in most scenarios, this is
    implemented via a timer. If a site loses connection to its peers, its
    tickets time out and recovery occurs.
After the connection timeout plus the recovery timeout has passed, the other sites are allowed to re-acquire the ticket and start the resources again. This can also be thought of as a "quorum server", except that it is not a single quorum ticket, but several. === Configuration Replication === As usual, the CIB is synchronized within each cluster, but it is 'not' synchronized across cluster sites of a multi-site cluster. You have to configure the resources that will be highly available across the multi-site cluster for every site accordingly. - - - [[s-ticket-constraints]] - == Configuring Ticket Dependencies == + +.. _ticket-constraints: + +Configuring Ticket Dependencies +############################### + +.. Convert_to_RST_2: The `rsc_ticket` constraint lets you specify the resources depending on a certain ticket. Together with the constraint, you can set a `loss-policy` that defines what should happen to the respective resources if the ticket is revoked. The attribute `loss-policy` can have the following values: * +fence:+ Fence the nodes that are running the relevant resources. * +stop:+ Stop the relevant resources. * +freeze:+ Do nothing to the relevant resources. * +demote:+ Demote relevant resources that are running in master mode to slave mode. .Constraint that fences node if +ticketA+ is revoked ==== [source,XML] ------- ------- ==== The example above creates a constraint with the ID +rsc1-req-ticketA+. It defines that the resource +rsc1+ depends on +ticketA+ and that the node running the resource should be fenced if +ticketA+ is revoked. If resource +rsc1+ were a promotable resource (i.e. it could run in master or slave mode), you might want to configure that only master mode depends on +ticketA+. With the following configuration, +rsc1+ will be demoted to slave mode if +ticketA+ is revoked: .Constraint that demotes +rsc1+ if +ticketA+ is revoked ==== [source,XML] ------- ------- ==== You can create multiple `rsc_ticket` constraints to let multiple resources depend on the same ticket. However, `rsc_ticket` also supports resource sets (see <>), so one can easily list all the resources in one `rsc_ticket` constraint instead. .Ticket constraint for multiple resources ==== [source,XML] ------- ------- ==== In the example above, there are two resource sets, so we can list resources with different roles in a single +rsc_ticket+ constraint. There's no dependency between the two resource sets, and there's no dependency among the resources within a resource set. Each of the resources just depends on +ticketA+. Referencing resource templates in +rsc_ticket+ constraints, and even referencing them within resource sets, is also supported. If you want other resources to depend on further tickets, create as many constraints as necessary with +rsc_ticket+. == Managing Multi-Site Clusters == === Granting and Revoking Tickets Manually === You can grant tickets to sites or revoke them from sites manually. If you want to re-distribute a ticket, you should wait for the dependent resources to stop cleanly at the previous site before you grant the ticket to the new site. Use the `crm_ticket` command line tool to grant and revoke tickets. //// These commands will actually just print a message telling the user that they require '--force'. That is probably a good exercise rather than letting novice users cut and paste '--force' here. 
    ////

    To grant a ticket to this site:
    -------
    # crm_ticket --ticket ticketA --grant
    -------

    To revoke a ticket from this site:
    -------
    # crm_ticket --ticket ticketA --revoke
    -------

    [IMPORTANT]
    ====
    If you are managing tickets manually, use the `crm_ticket` command with
    great care, because it cannot check whether the same ticket is already
    granted elsewhere.
    ====

    === Granting and Revoking Tickets via a Cluster Ticket Registry ===

    We will use https://github.com/ClusterLabs/booth[Booth] here as an example
    of software that can be used with pacemaker as a Cluster Ticket Registry.
    Booth implements the
    http://en.wikipedia.org/wiki/Raft_%28computer_science%29[Raft] algorithm to
    guarantee distributed consensus among the different cluster sites, and
    manages the ticket distribution (and thus the failover process between
    sites).

    Each of the participating clusters and 'arbitrators' runs the Booth daemon
    `boothd`. An 'arbitrator' is the multi-site equivalent of a quorum-only
    node in a local cluster. If you have a setup with an even number of sites,
    you need an additional instance to reach consensus about decisions such as
    failover of resources across sites. In this case, add one or more
    arbitrators running at additional sites. Arbitrators are single machines
    that run a booth instance in a special mode. An arbitrator is especially
    important for a two-site scenario; otherwise, there is no way for one site
    to distinguish between a network failure between it and the other site,
    and a failure of the other site.

    The most common multi-site scenario is probably a multi-site cluster with
    two sites and a single arbitrator on a third site. However, technically,
    there are no limitations with regard to the number of sites and the number
    of arbitrators involved.

    `Boothd` at each site connects to its peers running at the other sites and
    exchanges connectivity details. Once a ticket is granted to a site, the
    booth mechanism will manage the ticket automatically: if the site that
    holds the ticket is out of service, the booth daemons will vote on which
    of the other sites will get the ticket.

    To protect against brief connection failures, sites that lose the vote
    (either explicitly or implicitly by being disconnected from the voting
    body) need to relinquish the ticket after a time-out. This ensures that a
    ticket will only be redistributed after it has been relinquished by the
    previous site. The resources that depend on that ticket will fail over to
    the new site holding the ticket. The nodes that ran the resources before
    will be treated according to the `loss-policy` you set within the
    `rsc_ticket` constraint.

    Before booth can manage a certain ticket within the multi-site cluster,
    you initially need to grant it to a site manually via the `booth`
    command-line tool. After you have initially granted a ticket to a site,
    `boothd` will take over and manage the ticket automatically.

    [IMPORTANT]
    ====
    The `booth` command-line tool can be used to grant, list, or revoke
    tickets and can be run on any machine where `boothd` is running. If you
    are managing tickets via Booth, use only `booth` for manual intervention,
    not `crm_ticket`. That ensures the same ticket will only be owned by one
    cluster site at a time.
    ====

    ==== Booth Requirements ====

    * All clusters that will be part of the multi-site cluster must be based
      on Pacemaker.
    * Booth must be installed on all cluster nodes and on all arbitrators that
      will be part of the multi-site cluster.
    * Nodes belonging to the same cluster site should be synchronized via NTP.
      However, time synchronization is not required between the individual
      cluster sites.

    === General Management of Tickets ===

    Display information about tickets:
    -------
    # crm_ticket --info
    -------

    Or you can monitor them with:
    -------
    # crm_mon --tickets
    -------

    Display the +rsc_ticket+ constraints that apply to a ticket:
    -------
    # crm_ticket --ticket ticketA --constraints
    -------

    When you want to do maintenance or a manual switch-over of a ticket,
    revoking the ticket directly would trigger the loss policies. With
    +loss-policy="fence"+, the dependent resources would not be stopped or
    demoted gracefully, and even unrelated resources could be affected.

    The proper way is to put the ticket in 'standby' mode first:
    -------
    # crm_ticket --ticket ticketA --standby
    -------

    Then the dependent resources will be stopped or demoted gracefully without
    triggering the loss policies.

    If you have finished the maintenance and want to activate the ticket
    again, you can run:
    -------
    # crm_ticket --ticket ticketA --activate
    -------

    == For more information ==

    * https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html[SUSE's Geo Clustering quick start]

    * https://github.com/ClusterLabs/booth[Booth]