diff --git a/doc/Pacemaker_Explained/en-US/Ch-Fencing.txt b/doc/Pacemaker_Explained/en-US/Ch-Fencing.txt index 64202b3c85..53d8793024 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Fencing.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Fencing.txt @@ -1,946 +1,1021 @@ :compat-mode: legacy -= STONITH = += Fencing = //// -We prefer [[ch-stonith]], but older versions of asciidoc don't deal well +We prefer [[ch-fencing]], but older versions of asciidoc don't deal well with that construct for chapter headings //// -anchor:ch-stonith[Chapter 6, STONITH] +anchor:ch-fencing[Chapter 6, Fencing] +indexterm:[Fencing, Configuration] indexterm:[STONITH, Configuration] -== What Is STONITH? == +== What Is Fencing? == -STONITH (an acronym for "Shoot The Other Node In The Head"), also called -'fencing', protects your data from being corrupted by rogue nodes or concurrent -access. +'Fencing' is the ability to make a node unable to run resources, even when that +node is unresponsive to cluster commands. -Just because a node is unresponsive, this doesn't mean it isn't -accessing your data. The only way to be 100% sure that your data is -safe, is to use STONITH so we can be certain that the node is truly -offline, before allowing the data to be accessed from another node. +Fencing is also known as 'STONITH', an acronym for "Shoot The Other Node In The +Head", since the most common fencing method is cutting power to the node. +Another method is "fabric fencing", cutting the node's access to some +capability required to run resources (such as network access or a shared disk). -STONITH also has a role to play in the event that a clustered service -cannot be stopped. In this case, the cluster uses STONITH to force the -whole node offline, thereby making it safe to start the service -elsewhere. +== Why Is Fencing Necessary? == -== What STONITH Device Should You Use? == +Fencing protects your data from being corrupted by malfunctioning nodes or +unintentional concurrent access to shared resources. -It is crucial that the STONITH device can allow the cluster to -differentiate between a node failure and a network one. +Fencing protects against the "split brain" failure scenario, where cluster +nodes have lost the ability to reliably communicate with each other but are +still able to run resources. If the cluster just assumed that uncommunicative +nodes were down, then multiple instances of a resource could be started on +different nodes. -The biggest mistake people make in choosing a STONITH device is to -use a remote power switch (such as many on-board IPMI controllers) that -shares power with the node it controls. In such cases, the cluster -cannot be sure if the node is really offline, or active and suffering -from a network fault. +The effect of split brain depends on the resource type. For example, an IP +address brought up on two hosts on a network will cause packets to randomly be +sent to one or the other host, rendering the IP useless. For a database or +clustered file system, the effect could be much more severe, causing data +corruption or divergence. -Likewise, any device that relies on the machine being active (such as -SSH-based "devices" used during testing) are inappropriate. +Fencing is also used when a resource cannot otherwise be stopped. If a +resource fails to stop, it cannot be recovered elsewhere. Fencing the +resource's node is the only way to ensure the resource is recoverable.
-== Special Treatment of STONITH Resources == +Users may also configure the +on-fail+ property of any resource operation to ++fencing+, in which case the cluster will fence the resource's node if the +operation fails. -STONITH resources are somewhat special in Pacemaker. +== Fence Devices == -STONITH may be initiated by pacemaker or by other parts of the cluster -(such as resources like DRBD or DLM). To accommodate this, pacemaker -does not require the STONITH resource to be in the 'started' state -in order to be used, thus allowing reliable use of STONITH devices in such a -case. +A 'fence device' (or 'fencing device') is a special type of resource that +provides the means to fence a node. -All nodes have access to STONITH devices' definitions and instantiate them -on-the-fly when needed, but preference is given to 'verified' instances, which -are the ones that are 'started' according to the cluster's knowledge. +Examples of fencing devices include intelligent power switches (often +controlled via SNMP) and IPMI controllers that can cut power to a node, and +iSCSI controllers that allow SCSI reservations to be used to cut a node's +access to a shared disk. -In the case of a cluster split, the partition with a verified instance -will have a slight advantage, because the STONITH daemon in the other partition -will have to hear from all its current peers before choosing a node to -perform the fencing. +Since fencing devices will be used to recover from loss of networking +connectivity to other nodes, it is essential that they do not rely on the same +network as the cluster itself, otherwise that network becomes a single point of +failure. -Fencing resources do work the same as regular resources in some respects: +Since loss of a node due to power outage is indistinguishable from loss of +network connectivity to that node, it is also essential that at least one fence +device for a node does not share power with that node. For example, an on-board +IPMI controller that shares power with its host should not be used as the sole +fencing device for that host. -* +target-role+ can be used to enable or disable the resource -* Location constraints can be used to prevent a specific node from using the resource +Since fencing is used to isolate malfunctioning nodes, no fence device should +rely on its target functioning properly. This includes, for example, devices +that ssh into a node and issue a shutdown command (such devices might be +suitable for testing, but never for production). -[IMPORTANT] -=========== -Currently there is a limitation that fencing resources may only have -one set of meta-attributes and one set of instance attributes. This -can be revisited if it becomes a significant limitation for people. -=========== +== Fence Agents == -See the table below or run `man pacemaker-fenced` to see special instance attributes -that may be set for any fencing resource, regardless of fence agent. +A 'fence agent' (or 'fencing agent') is a +stonith+-class resource agent. + +The fence agent standard provides commands (such as +off+ and +reboot+) that +the cluster can use to fence nodes. As with other resource agent classes, +this allows a layer of abstraction so that Pacemaker doesn't need any knowledge +about specific fencing technologies -- that knowledge is isolated in the agent. + +== When a Fence Device Can Be Used == + +Fencing devices do not actually "run" like most services. Typically, they just +provide an interface for sending commands to an external device.
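For illustration, a fence device is configured like any other primitive resource, with +class="stonith"+ and agent-specific instance attributes. The following is only a minimal sketch; the agent and parameter names used here (`fence_ipmilan` with +ip+, +username+ and +password+) are assumptions that should be checked against the agent's metadata:

[source,XML]
----
<primitive id="Fencing" class="stonith" type="fence_ipmilan">
  <instance_attributes id="Fencing-params">
    <!-- illustrative parameters; query the agent's metadata for the real names -->
    <nvpair id="Fencing-ip" name="ip" value="192.0.2.1"/>
    <nvpair id="Fencing-username" name="username" value="testuser"/>
    <nvpair id="Fencing-password" name="password" value="abc123"/>
  </instance_attributes>
  <operations>
    <op id="Fencing-monitor-60s" name="monitor" interval="60s"/>
  </operations>
</primitive>
----

Whether and where such a device is "started" matters less than for ordinary resources, as described next.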
+ +Additionally, fencing may be initiated by Pacemaker, by other cluster-aware software such as DRBD or DLM, or manually by an administrator, at any point in the cluster life cycle, including before any resources have been started. + +To accommodate this, Pacemaker does not require the fence device resource to be "started" in order to be used. Whether a fence device is started or not determines whether a node runs any recurring monitor for the device, and gives the node a slight preference for being chosen to execute fencing using that device. + +By default, any node can execute any fencing device. If a fence device is disabled by setting its +target-role+ to Stopped, then no node can use that device. If mandatory location constraints prevent a specific node from "running" a fence device, then that node will never be chosen to execute fencing using the device. A node may fence itself, but the cluster will choose that only if no other nodes can do the fencing. + +A common configuration scenario is to have one fence device per target node. In such a case, users often configure anti-location constraints so that the target node does not monitor its own device. The best practice is to make the constraint optional (i.e. a finite negative score rather than +-INFINITY+), so that the node can fence itself if no other nodes can. + +== Limitations of Fencing Resources == + +Fencing resources have certain limitations that other resource classes don't: + +* They may have only one set of meta-attributes and one set of instance + attributes. +* If <> are used to determine fencing resource options, these + may only be evaluated when first read, meaning that later changes to the + rules will have no effect. Therefore, it is better to avoid confusion and not + use rules at all with fencing resources. + +These limitations could be revisited if there is significant user demand. + +== Special Options for Fencing Resources == + +The table below lists special instance attributes that may be set for any +fencing resource ('not' meta-attributes, even though they are interpreted by +pacemaker rather than the fence agent). These are also listed in the man page +for +pacemaker-fenced+. .Additional Properties of Fencing Resources [width="95%",cols="8m,3,6,<12",options="header",align="center"] |========================================================= |Field |Type |Default |Description |stonith-timeout |NA |NA a|Older versions used this to override the default period to wait for a STONITH (reboot, on, off) action to complete for this device. It has been replaced by the +pcmk_reboot_timeout+ and +pcmk_off_timeout+ properties. indexterm:[stonith-timeout,Fencing] indexterm:[Fencing,Property,stonith-timeout] //// (not yet implemented) priority integer 0 The priority of the STONITH resource. Devices are tried in order of highest priority to lowest. indexterm priority,Fencing indexterm Fencing,Property,priority //// |provides |string | |Any special capability provided by the fence device. Currently, only one such capability is meaningful: +unfencing+ (see <<s-unfencing>>). indexterm:[provides,Fencing] indexterm:[Fencing,Property,provides] |pcmk_host_map |string | |A mapping of host names to port numbers for devices that do not support host names. Example: +node1:1;node2:2,3+ tells the cluster to use port 1 for *node1* and ports 2 and 3 for *node2*. If +pcmk_host_check+ is explicitly set to +static-list+, either this or +pcmk_host_list+ must be set.
indexterm:[pcmk_host_map,Fencing] indexterm:[Fencing,Property,pcmk_host_map] |pcmk_host_list |string | |A list of machines controlled by this device. If +pcmk_host_check+ is explicitly set to +static-list+, either this or +pcmk_host_map+ must be set. indexterm:[pcmk_host_list,Fencing] indexterm:[Fencing,Property,pcmk_host_list] |pcmk_host_check |string |static-list if either +pcmk_host_list+ or +pcmk_host_map+ is set, otherwise dynamic-list if the fence device supports the list action, otherwise status if the fence device supports the status action, otherwise none a|How to determine which machines are controlled by the device. Allowed values: * +dynamic-list:+ query the device via the "list" command * +static-list:+ check the +pcmk_host_list+ or +pcmk_host_map+ attribute * +status:+ query the device via the "status" command * +none:+ assume every device can fence every machine indexterm:[pcmk_host_check,Fencing] indexterm:[Fencing,Property,pcmk_host_check] |pcmk_delay_max |time |0s -|Enable a random delay of up to the time specified before executing stonith +|Enable a random delay of up to the time specified before executing fencing actions. This is sometimes used in two-node clusters to ensure that the nodes don't fence each other at the same time. The overall delay introduced by pacemaker is derived from this random delay value plus a static delay, such that the sum is kept below the maximum delay. indexterm:[pcmk_delay_max,Fencing] indexterm:[Fencing,Property,pcmk_delay_max] |pcmk_delay_base |time |0s -|Enable a static delay before executing stonith actions. This can be used +|Enable a static delay before executing fencing actions. This can be used e.g. in two-node clusters to ensure that the nodes don't fence each other, by having separate fencing resources with different values. The node that is fenced with the shorter delay will lose a fencing race. The overall delay introduced by pacemaker is derived from this value plus a random delay such that the sum is kept below the maximum delay. indexterm:[pcmk_delay_base,Fencing] indexterm:[Fencing,Property,pcmk_delay_base] |pcmk_action_limit |integer |1 |The maximum number of actions that can be performed in parallel on this device, if the cluster option +concurrent-fencing+ is +true+. -1 is unlimited. indexterm:[pcmk_action_limit,Fencing] indexterm:[Fencing,Property,pcmk_action_limit] |pcmk_host_argument |string |port |'Advanced use only.' Which parameter should be supplied to the resource agent to identify the node to be fenced. Some devices do not support the standard +port+ parameter or may provide additional ones. Use this to specify an alternate, device-specific parameter. A value of +none+ tells the cluster not to supply any additional parameters. indexterm:[pcmk_host_argument,Fencing] indexterm:[Fencing,Property,pcmk_host_argument] |pcmk_reboot_action |string |reboot |'Advanced use only.' The command to send to the resource agent in order to reboot a node. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_reboot_action,Fencing] indexterm:[Fencing,Property,pcmk_reboot_action] |pcmk_reboot_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `reboot` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout.
indexterm:[pcmk_reboot_timeout,Fencing] indexterm:[Fencing,Property,pcmk_reboot_timeout] indexterm:[stonith-timeout,Fencing] indexterm:[Fencing,Property,stonith-timeout] |pcmk_reboot_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `reboot` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_reboot_retries,Fencing] indexterm:[Fencing,Property,pcmk_reboot_retries] |pcmk_off_action |string |off |'Advanced use only.' The command to send to the resource agent in order to shut down a node. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_off_action,Fencing] indexterm:[Fencing,Property,pcmk_off_action] |pcmk_off_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `off` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_off_timeout,Fencing] indexterm:[Fencing,Property,pcmk_off_timeout] indexterm:[stonith-timeout,Fencing] indexterm:[Fencing,Property,stonith-timeout] |pcmk_off_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `off` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_off_retries,Fencing] indexterm:[Fencing,Property,pcmk_off_retries] |pcmk_list_action |string |list |'Advanced use only.' The command to send to the resource agent in order to list nodes. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_list_action,Fencing] indexterm:[Fencing,Property,pcmk_list_action] |pcmk_list_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `list` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_list_timeout,Fencing] indexterm:[Fencing,Property,pcmk_list_timeout] |pcmk_list_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `list` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_list_retries,Fencing] indexterm:[Fencing,Property,pcmk_list_retries] |pcmk_monitor_action |string |monitor |'Advanced use only.' The command to send to the resource agent in order to report extended status. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. 
indexterm:[pcmk_monitor_action,Fencing] indexterm:[Fencing,Property,pcmk_monitor_action] |pcmk_monitor_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `monitor` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_monitor_timeout,Fencing] indexterm:[Fencing,Property,pcmk_monitor_timeout] |pcmk_monitor_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `monitor` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_monitor_retries,Fencing] indexterm:[Fencing,Property,pcmk_monitor_retries] |pcmk_status_action |string |status |'Advanced use only.' The command to send to the resource agent in order to report status. Some devices do not support the standard commands or may provide additional ones. Use this to specify an alternate, device-specific command. indexterm:[pcmk_status_action,Fencing] indexterm:[Fencing,Property,pcmk_status_action] |pcmk_status_timeout |time |60s |'Advanced use only.' Specify an alternate timeout to use for `status` actions instead of the value of +stonith-timeout+. Some devices need much more or less time to complete than normal. Use this to specify an alternate, device-specific timeout. indexterm:[pcmk_status_timeout,Fencing] indexterm:[Fencing,Property,pcmk_status_timeout] |pcmk_status_retries |integer |2 |'Advanced use only.' The maximum number of times to retry the `status` command within the timeout period. Some devices do not support multiple connections, and operations may fail if the device is busy with another task, so Pacemaker will automatically retry the operation, if there is time remaining. Use this option to alter the number of times Pacemaker retries before giving up. indexterm:[pcmk_status_retries,Fencing] indexterm:[Fencing,Property,pcmk_status_retries] |========================================================= [[s-unfencing]] == Unfencing == -Most fence devices cut the power to the target. By contrast, fence devices that -perform 'fabric fencing' cut off a node's access to some critical resource, -such as a shared disk or a network switch. - -With fabric fencing, it is expected that the cluster will fence the node, and +With fabric fencing (such as cutting network or shared disk access rather than +power), it is expected that the cluster will fence the node, and then a system administrator must manually investigate what went wrong, correct any issues found, then reboot (or restart the cluster services on) the node. Once the node reboots and rejoins the cluster, some fabric fencing devices -require that an explicit command to restore the node's access to the critical -resource. This capability is called 'unfencing' and is typically implemented -as the fence agent's +on+ command. +require an explicit command to restore the node's access. This capability is +called 'unfencing' and is typically implemented as the fence agent's +on+ +command. If any cluster resource has +requires+ set to +unfencing+, then that resource will not be probed or started on a node until that node has been unfenced. 
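Putting these pieces together, the following sketch shows a fabric fencing device that advertises unfencing support, and a resource that must wait for unfencing (the device, agent, and resource names here are illustrative assumptions):

[source,XML]
----
<!-- a fabric fencing device whose agent implements "on" for unfencing -->
<primitive id="scsi-fencing" class="stonith" type="fence_scsi">
  <instance_attributes id="scsi-fencing-params">
    <nvpair id="scsi-fencing-provides" name="provides" value="unfencing"/>
  </instance_attributes>
</primitive>
<!-- a resource that will not be probed or started on a node until that
     node has been unfenced -->
<primitive id="shared-fs" class="ocf" provider="heartbeat" type="Filesystem">
  <meta_attributes id="shared-fs-meta">
    <nvpair id="shared-fs-requires" name="requires" value="unfencing"/>
  </meta_attributes>
</primitive>
----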
-== Configuring STONITH == +== Fence Devices Dependent on Other Resources == + +In some cases, a fence device may require some other cluster resource (such as +an IP address) to be active in order to function properly. + +This is obviously undesirable in general: fencing may be required when the +depended-on resource is not active, or fencing may be required because the node +running the depended-on resource is no longer responding. + +However, this may be acceptable under certain conditions: -[NOTE] -=========== -Higher-level configuration shells include functionality to simplify the -process below, particularly the step for deciding which parameters are -required. However since this document deals only with core -components, you should refer to the STONITH chapter of the -http://www.clusterlabs.org/doc/[Clusters from Scratch] guide for those details. -=========== +* The dependent fence device should not be able to target any node that is + allowed to run the depended-on resource. + +* The depended-on resource should not be disabled during production operation. + +* The +concurrent-fencing+ cluster property should be set to +true+. Otherwise, + if both the node running the depended-on resource and some node targeted by + the dependent fence device need to be fenced, the fencing of the node + running the depended-on resource might be ordered first, making the second + fencing impossible and blocking further recovery. With concurrent fencing, + the dependent fence device might fail at first due to the depended-on + resource being unavailable, but it will be retried and eventually succeed + once the resource is brought back up. + +Even under those conditions, there is one unlikely problem scenario. The DC +always schedules fencing of itself after any other fencing needed, to avoid +unnecessary repeated DC elections. If the dependent fence device targets the +DC, and both the DC and a different node running the depended-on resource need +to be fenced, the DC fencing will always fail and block further recovery. Note, +however, that losing a DC node entirely causes some other node to become DC and +schedule the fencing, so this is only a risk when a stop or other operation +with +on-fail+ set to +fencing+ fails on the DC. + +== Configuring Fencing == . Find the correct driver: + ---- # stonith_admin --list-installed ---- . Find the required parameters associated with the device (replacing $AGENT_NAME with the name obtained from the previous step): + ---- # stonith_admin --metadata --agent $AGENT_NAME ---- . Create a file called +stonith.xml+ containing a primitive resource with a class of +stonith+, a type equal to the agent name obtained earlier, and a parameter for each of the values returned in the previous step. . If the device does not know how to fence nodes based on their uname, you may also need to set the special +pcmk_host_map+ parameter. See `man pacemaker-fenced` for details. . If the device does not support the `list` command, you may also need to set the special +pcmk_host_list+ and/or +pcmk_host_check+ parameters. See `man pacemaker-fenced` for details. . If the device does not expect the victim to be specified with the `port` parameter, you may also need to set the special +pcmk_host_argument+ parameter. See `man pacemaker-fenced` for details. . Upload it into the CIB using cibadmin: + ---- # cibadmin -C -o resources --xml-file stonith.xml ---- . Set +stonith-enabled+ to true: + ---- # crm_attribute -t crm_config -n stonith-enabled -v true ---- . 
Once the stonith resource is running, you can test it by executing the following (although you might want to stop the cluster on that machine first): + ---- # stonith_admin --reboot nodename ---- -=== Example STONITH Configuration === +=== Example Fencing Configuration === -Assume we have an chassis containing four nodes and an IPMI device +Assume we have a chassis containing four nodes and an IPMI device active on 192.0.2.1. We would choose the `fence_ipmilan` driver, and obtain the following list of parameters: -.Obtaining a list of STONITH Parameters +.Obtaining a list of Fence Agent Parameters ==== ---- # stonith_admin --metadata -a fence_ipmilan ---- [source,XML] ---- ---- ==== -Based on that, we would create a STONITH resource fragment that might look +Based on that, we would create a fencing resource fragment that might look like this: -.An IPMI-based STONITH Resource +.An IPMI-based Fencing Resource ==== [source,XML] ---- ---- ==== -Finally, we need to enable STONITH: +Finally, we need to enable fencing: ---- # crm_attribute -t crm_config -n stonith-enabled -v true ---- -== Advanced STONITH Configurations == - -Some people consider that having one fencing device is a single point -of failure footnote:[Not true, since a node or resource must fail -before fencing even has a chance to]; others prefer removing the node -from the storage and network instead of turning it off. +== Fencing Topologies == -Whatever the reason, Pacemaker supports fencing nodes with multiple -devices through a feature called 'fencing topologies'. +Pacemaker supports fencing nodes with multiple devices through a feature called +'fencing topologies'. Fencing topologies may be used to provide alternative +devices in case one fails, or to require multiple devices to all be executed +successfully in order to consider the node successfully fenced, or even a +combination of the two. -Simply create the individual devices as you normally would, then -define one or more +fencing-level+ entries in the +fencing-topology+ section of -the configuration. +Create the individual devices as you normally would, then define one or more ++fencing-level+ entries in the +fencing-topology+ section of the configuration. * Each fencing level is attempted in order of ascending +index+. Allowed values are 1 through 9. * If a device fails, processing terminates for the current level. No further devices in that level are exercised, and the next level is attempted instead. * If the operation succeeds for all the listed devices in a level, the level is deemed to have passed. * The operation is finished when a level has passed (success), or all levels have been attempted (failed). * If the operation failed, the next step is determined by the scheduler and/or the controller. 
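As a sketch of the syntax (node and device names are illustrative), a topology that tries an IPMI device for *node1* first, then falls back to a pair of PDU devices that must both succeed, could look like:

[source,XML]
----
<fencing-topology>
  <!-- level 1: try the node's IPMI device alone -->
  <fencing-level id="fl-node1-ipmi" target="node1" index="1" devices="ipmi-node1"/>
  <!-- level 2: both PDU devices must succeed for this level to pass -->
  <fencing-level id="fl-node1-pdus" target="node1" index="2" devices="pdu1-node1,pdu2-node1"/>
</fencing-topology>
----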
Some possible uses of topologies include: -* Try poison-pill and fail back to power -* Try disk and network, and fall back to power if either fails -* Initiate a kdump and then poweroff the node +* Try on-board IPMI, then an intelligent power switch if that fails +* Try fabric fencing of both disk and network, then fall back to power fencing + if either fails +* Wait up to a certain time for a kernel dump to complete, then cut power to + the node .Properties of Fencing Levels [width="95%",cols="1m,<3",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the level indexterm:[id,fencing-level] indexterm:[Fencing,fencing-level,id] |target |The name of a single node to which this level applies indexterm:[target,fencing-level] indexterm:[Fencing,fencing-level,target] |target-pattern |An extended regular expression (as defined in http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04[POSIX]) matching the names of nodes to which this level applies indexterm:[target-pattern,fencing-level] indexterm:[Fencing,fencing-level,target-pattern] |target-attribute |The name of a node attribute that is set (to +target-value+) for nodes to which this level applies indexterm:[target-attribute,fencing-level] indexterm:[Fencing,fencing-level,target-attribute] |target-value |The node attribute value (of +target-attribute+) that is set for nodes to which this level applies indexterm:[target-value,fencing-level] indexterm:[Fencing,fencing-level,target-value] |index |The order in which to attempt the levels. Levels are attempted in ascending order 'until one succeeds'. Valid values are 1 through 9. indexterm:[index,fencing-level] indexterm:[Fencing,fencing-level,index] |devices |A comma-separated list of devices that must all be tried for this level indexterm:[devices,fencing-level] indexterm:[Fencing,fencing-level,devices] |========================================================= .Fencing topology with different devices for different nodes ==== [source,XML] ---- ... ... ---- ==== === Example Dual-Layer, Dual-Device Fencing Topologies === The following example illustrates an advanced use of +fencing-topology+ in a cluster with the following properties: * 3 nodes (2 active prod-mysql nodes, 1 prod_mysql-rep in standby for quorum purposes) * the active nodes have an IPMI-controlled power board reached at 192.0.2.1 and 192.0.2.2 * the active nodes also have two independent PSUs (Power Supply Units) connected to two independent PDUs (Power Distribution Units) reached at 198.51.100.1 (port 10 and port 11) and 203.0.113.1 (port 10 and port 11) * the first fencing method uses the `fence_ipmi` agent * the second fencing method uses the `fence_apc_snmp` agent targeting 2 fencing devices (one per PSU, either port 10 or 11) * fencing is only implemented for the active nodes and has location constraints * fencing topology is set to try IPMI fencing first, then fall back to a "sure-kill" dual PDU fencing In a normal failure scenario, STONITH will first select +fence_ipmi+ to try to kill the faulty node. Using a fencing topology, if that first method fails, STONITH will then move on to selecting +fence_apc_snmp+ twice: * once for the first PDU * again for the second PDU The fence action is considered successful only if both PDUs report the required status. If any of them fails, STONITH loops back to the first fencing method, +fence_ipmi+, and so on until the node is fenced or the fencing action is cancelled.
.First fencing method: single IPMI device Each cluster node has its own dedicated IPMI channel that can be called for fencing using the following primitives: [source,XML] ---- ---- .Second fencing method: dual PDU devices Each cluster node also has two distinct power channels controlled by two distinct PDUs. That means a total of 4 fencing devices configured as follows: - Node 1, PDU 1, PSU 1 @ port 10 - Node 1, PDU 2, PSU 2 @ port 10 - Node 2, PDU 1, PSU 1 @ port 11 - Node 2, PDU 2, PSU 2 @ port 11 The matching fencing agents are configured as follows: [source,XML] ---- ---- .Location Constraints To prevent STONITH from trying to run a fencing agent on the same node it is supposed to fence, constraints are placed on all the fencing primitives: [source,XML] ---- ---- .Fencing topology Now that all the fencing resources are defined, it's time to create the right topology. We want to first fence using IPMI and, if that does not work, fence both PDUs to reliably kill the node. [source,XML] ---- ---- Note that in +fencing-topology+, the level with the lowest +index+ value is attempted first. .Final configuration Put together, the configuration looks like this: [source,XML] ---- ... ... ---- == Remapping Reboots == When the cluster needs to reboot a node, whether because +stonith-action+ is +reboot+ or because a reboot was manually requested (such as by `stonith_admin --reboot`), it will remap that to other commands in two cases: . If the chosen fencing device does not support the +reboot+ command, the cluster will ask it to perform +off+ instead. . If a fencing topology level with multiple devices must be executed, the cluster will ask all the devices to perform +off+, then ask the devices to perform +on+. To understand the second case, consider the example of a node with redundant power supplies connected to intelligent power switches. Rebooting one switch and then the other would have no effect on the node. Turning both switches off, and then on, actually reboots the node. In such a case, the fencing operation will be treated as successful as long as the +off+ commands succeed, because then it is safe for the cluster to recover any resources that were on the node. Timeouts and errors in the +on+ phase will be logged but ignored. When a reboot operation is remapped, any action-specific timeout for the remapped action will be used (for example, +pcmk_off_timeout+ will be used when executing the +off+ command, not +pcmk_reboot_timeout+). diff --git a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt index 77d83194ef..2d4566b117 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt @@ -1,903 +1,903 @@ :compat-mode: legacy = Cluster Resources = [[s-resource-primitive]] == What is a Cluster Resource? == indexterm:[Resource] A resource is a service made highly available by a cluster. The simplest type of resource, a 'primitive' resource, is described in this chapter. More complex forms, such as groups and clones, are described in later chapters. Every primitive resource has a 'resource agent'. A resource agent is an external program that abstracts the service it provides and presents a consistent view to the cluster. This allows the cluster to be agnostic about the resources it manages. The cluster doesn't need to understand how the resource works because it relies on the resource agent to do the right thing when given a `start`, `stop` or `monitor` command.
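To get a feel for this contract, an OCF resource agent can be invoked by hand, passing parameters as environment variables just as the cluster does. This is only a sketch, assuming the `Dummy` agent shipped with Pacemaker is installed under +/usr/lib/ocf+:

----
# export OCF_ROOT=/usr/lib/ocf
# OCF_RESKEY_state=/tmp/Dummy.state $OCF_ROOT/resource.d/pacemaker/Dummy start
# echo $?    # 0 means success, per the OCF exit code definitions
----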
For this reason, it is crucial that resource agents are well-tested. Typically, resource agents come in the form of shell scripts. However, they can be written using any technology (such as C, Python or Perl) that the author is comfortable with. [[s-resource-supported]] == Resource Classes == indexterm:[Resource,class] Pacemaker supports several classes of agents: * OCF * LSB * Upstart * Systemd * Service * Fencing * Nagios Plugins === Open Cluster Framework === indexterm:[Resource,OCF] indexterm:[OCF,Resources] indexterm:[Open Cluster Framework,Resources] The OCF standard footnote:[See https://github.com/ClusterLabs/OCF-spec/tree/master/ra . The Pacemaker implementation has been somewhat extended from the OCF specs.] is basically an extension of the Linux Standard Base conventions for init scripts to: * support parameters, * make them self-describing, and * make them extensible OCF specs have strict definitions of the exit codes that actions must return. footnote:[ The resource-agents source code includes the `ocf-tester` script, which can be useful in this regard. ] The cluster follows these specifications exactly, and giving the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying. In particular, the cluster needs to distinguish a completely stopped resource from one which is in some erroneous and indeterminate state. Parameters are passed to the resource agent as environment variables, with the special prefix +OCF_RESKEY_+. So, a parameter which the user thinks of as +ip+ will be passed to the resource agent as +OCF_RESKEY_ip+. The number and purpose of the parameters are left to the resource agent; however, the resource agent should use the `meta-data` command to advertise any that it supports. The OCF class is the most preferred as it is an industry standard, highly flexible (allowing parameters to be passed to agents in a non-positional manner) and self-describing. For more information, see the http://www.linux-ha.org/wiki/OCF_Resource_Agents[reference] and the 'Resource Agents' chapter of 'Pacemaker Administration'. === Linux Standard Base === indexterm:[Resource,LSB] indexterm:[LSB,Resources] indexterm:[Linux Standard Base,Resources] 'LSB' resource agents are more commonly known as 'init scripts'. If a full path is not given, they are assumed to be located in +/etc/init.d+. Commonly, they are provided by the OS distribution. In order to be used with a Pacemaker cluster, they must conform to the LSB specification. footnote:[ See http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html for the LSB Spec as it relates to init scripts. ] [WARNING] ==== Many distributions or particular software packages claim LSB compliance but ship with broken init scripts. For details on how to check whether your init script is LSB-compatible, see the 'Resource Agents' chapter of 'Pacemaker Administration'. Common problematic violations of the LSB standard include: * Not implementing the +status+ operation at all * Not observing the correct exit status codes for +start+/+stop+/+status+ actions * Returning an error when starting an already-started resource * Returning an error when stopping an already-stopped resource ==== [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster.
==== [[s-resource-supported-systemd]] === Systemd === indexterm:[Resource,Systemd] indexterm:[Systemd,Resources] Some newer distributions have replaced the old http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of initialization daemons and scripts with an alternative called http://www.freedesktop.org/wiki/Software/systemd[Systemd]. Pacemaker is able to manage these services _if they are present_. Instead of init scripts, systemd has 'unit files'. Generally, the services (unit files) are provided by the OS distribution, but there are online guides for converting from init scripts. footnote:[For example, http://0pointer.de/blog/projects/systemd-for-admins-3.html] [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster. ==== === Upstart === indexterm:[Resource,Upstart] indexterm:[Upstart,Resources] Some newer distributions have replaced the old http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of initialization daemons (and scripts) with an alternative called http://upstart.ubuntu.com/[Upstart]. Pacemaker is able to manage these services _if they are present_. Instead of init scripts, upstart has 'jobs'. Generally, the services (jobs) are provided by the OS distribution. [IMPORTANT] ==== Remember to make sure the computer is _not_ configured to start any services at boot time -- that should be controlled by the cluster. ==== === System Services === indexterm:[Resource,System Services] indexterm:[System Service,Resources] Since there are various types of system services (+systemd+, +upstart+, and +lsb+), Pacemaker supports a special +service+ alias which intelligently figures out which one applies to a given cluster node. This is particularly useful when the cluster contains a mix of +systemd+, +upstart+, and +lsb+. In order, Pacemaker will try to find the named service as: . an LSB init script . a Systemd unit file . an Upstart job === STONITH === indexterm:[Resource,STONITH] indexterm:[STONITH,Resources] The STONITH class is used exclusively for fencing-related resources. This is -discussed later in <>. +discussed later in <>. === Nagios Plugins === indexterm:[Resource,Nagios Plugins] indexterm:[Nagios Plugins,Resources] Nagios Plugins footnote:[The project has two independent forks, hosted at https://www.nagios-plugins.org/ and https://www.monitoring-plugins.org/. Output from both projects' plugins is similar, so plugins from either project can be used with pacemaker.] allow us to monitor services on remote hosts. Pacemaker is able to do remote monitoring with the plugins _if they are present_. A common use case is to configure them as resources belonging to a resource container (usually a virtual machine), and the container will be restarted if any of them has failed. Another use is to configure them as ordinary resources to be used for monitoring hosts or services via the network. The supported parameters are the same as the long options of the plugin. [[primitive-resource]] == Resource Properties == These values tell the cluster which resource agent to use for the resource, where to find that resource agent and what standards it conforms to. .Properties of a Primitive Resource [width="95%",cols="1m,<6",options="header",align="center"] |========================================================= |Field |Description |id |Your name for the resource indexterm:[id,Resource] indexterm:[Resource,Property,id] |class |The standard the resource agent conforms to.
Allowed values: +lsb+, +nagios+, +ocf+, +service+, +stonith+, +systemd+, +upstart+ indexterm:[class,Resource] indexterm:[Resource,Property,class] |type |The name of the Resource Agent you wish to use. E.g. +IPaddr+ or +Filesystem+ indexterm:[type,Resource] indexterm:[Resource,Property,type] |provider |The OCF spec allows multiple vendors to supply the same resource agent. To use the OCF resource agents supplied by the Heartbeat project, you would specify +heartbeat+ here. indexterm:[provider,Resource] indexterm:[Resource,Property,provider] |========================================================= The XML definition of a resource can be queried with the `crm_resource` tool. For example: ---- # crm_resource --resource Email --query-xml ---- might produce: .A system resource definition ===== [source,XML] ===== [NOTE] ===== One of the main drawbacks to system services (LSB, systemd or Upstart) resources is that they do not allow any parameters! ===== //// See https://tools.ietf.org/html/rfc5737 for choice of example IP address //// .An OCF resource definition ===== [source,XML] ------- ------- ===== [[s-resource-options]] == Resource Options == Resources have two types of options: 'meta-attributes' and 'instance attributes'. Meta-attributes apply to any type of resource, while instance attributes are specific to each resource agent. === Resource Meta-Attributes === Meta-attributes are used by the cluster to decide how a resource should behave and can be easily set using the `--meta` option of the `crm_resource` command. .Meta-attributes of a Primitive Resource [width="95%",cols="2m,2,<5",options="header",align="center"] |========================================================= |Field |Default |Description |priority |0 |If not all resources can be active, the cluster will stop lower priority resources in order to keep higher priority ones active. indexterm:[priority,Resource Option] indexterm:[Resource,Option,priority] |target-role |Started a|What state should the cluster attempt to keep this resource in? Allowed values: * +Stopped:+ Force the resource to be stopped * +Started:+ Allow the resource to be started (and in the case of <>, promoted to master if appropriate) * +Slave:+ Allow the resource to be started, but only in Slave mode if the resource is <> * +Master:+ Equivalent to +Started+ indexterm:[target-role,Resource Option] indexterm:[Resource,Option,target-role] |is-managed |TRUE |Is the cluster allowed to start and stop the resource? Allowed values: +true+, +false+ indexterm:[is-managed,Resource Option] indexterm:[Resource,Option,is-managed] |maintenance |FALSE |Similar to the +maintenance-mode+ <>, but for a single resource. If true, the resource will not be started, stopped, or monitored on any node. This differs from +is-managed+ in that monitors will not be run. Allowed values: +true+, +false+ indexterm:[maintenance,Resource Option] indexterm:[Resource,Option,maintenance] |resource-stickiness |1 for individual clone instances, 0 for all other resources |A score that will be added to the current node when a resource is already active. This allows running resources to stay where they are, even if they would be placed elsewhere if they were being started from a stopped state. 
indexterm:[resource-stickiness,Resource Option] indexterm:[Resource,Option,resource-stickiness] |requires |+quorum+ for resources with a +class+ of +stonith+, otherwise +unfencing+ if unfencing is active in the cluster, otherwise +fencing+ if +stonith-enabled+ is true, otherwise +quorum+ a|Conditions under which the resource can be started Allowed values: * +nothing:+ can always be started * +quorum:+ The cluster can only start this resource if a majority of the configured nodes are active * +fencing:+ The cluster can only start this resource if a majority of the configured nodes are active _and_ any failed or unknown nodes - have been <> + have been <> * +unfencing:+ The cluster can only start this resource if a majority of the configured nodes are active _and_ any failed or unknown nodes have been fenced _and_ only on nodes that have been <> indexterm:[requires,Resource Option] indexterm:[Resource,Option,requires] |migration-threshold |INFINITY |How many failures may occur for this resource on a node, before this node is marked ineligible to host this resource. A value of 0 indicates that this feature is disabled (the node will never be marked ineligible); by contrast, the cluster treats INFINITY (the default) as a very large but finite number. This option has an effect only if the failed operation specifies +on-fail+ as +restart+ (the default), and additionally for failed +start+ operations, if the cluster property +start-failure-is-fatal+ is +false+. indexterm:[migration-threshold,Resource Option] indexterm:[Resource,Option,migration-threshold] |failure-timeout |0 |How many seconds to wait before acting as if the failure had not occurred, and potentially allowing the resource back to the node on which it failed. A value of 0 indicates that this feature is disabled. As with any time-based actions, this is not guaranteed to be checked more frequently than the value of +cluster-recheck-interval+ (see <>). indexterm:[failure-timeout,Resource Option] indexterm:[Resource,Option,failure-timeout] |multiple-active |stop_start a|What should the cluster do if it ever finds the resource active on more than one node? Allowed values: * +block:+ mark the resource as unmanaged * +stop_only:+ stop all active instances and leave them that way * +stop_start:+ stop all active instances and start the resource in one location only indexterm:[multiple-active,Resource Option] indexterm:[Resource,Option,multiple-active] |allow-migrate |TRUE for ocf:pacemaker:remote resources, FALSE otherwise |Whether the cluster should try to "live migrate" this resource when it needs to be moved (see <>) |container-attribute-target | |Specific to bundle resources; see <> |remote-node | |The name of the Pacemaker Remote guest node this resource is associated with, if any. If specified, this both enables the resource as a guest node and defines the unique name used to identify the guest node. The guest must be configured to run the Pacemaker Remote daemon when it is started. +WARNING:+ This value cannot overlap with any resource or node IDs. |remote-port |3121 |If +remote-node+ is specified, the port on the guest used for its Pacemaker Remote connection. The Pacemaker Remote daemon on the guest must be configured to listen on this port. |remote-addr |value of +remote-node+ |If +remote-node+ is specified, the IP address or hostname used to connect to the guest via Pacemaker Remote. The Pacemaker Remote daemon on the guest must be configured to accept connections on this address.
|remote-connect-timeout |60s |If +remote-node+ is specified, how long before a pending guest connection will time out. |========================================================= As an example of setting resource options, if you performed the following commands on an LSB Email resource: ------- # crm_resource --meta --resource Email --set-parameter priority --parameter-value 100 # crm_resource -m -r Email -p multiple-active -v block ------- the resulting resource definition might be: .An LSB resource with cluster options ===== [source,XML] ------- ------- ===== In addition to the cluster-defined meta-attributes described above, you may also configure arbitrary meta-attributes of your own choosing. Most commonly, this would be done for use in <>. For example, an IT department might define a custom meta-attribute to indicate which company department each resource is intended for. To reduce the chance of name collisions with cluster-defined meta-attributes added in the future, it is recommended to use a unique, organization-specific prefix for such attributes. [[s-resource-defaults]] === Setting Global Defaults for Resource Meta-Attributes === To set a default value for a resource option, add it to the +rsc_defaults+ section with `crm_attribute`. For example, ---- # crm_attribute --type rsc_defaults --name is-managed --update false ---- would prevent the cluster from starting or stopping any of the resources in the configuration (unless of course the individual resources were specifically enabled by having their +is-managed+ set to +true+). === Resource Instance Attributes === The resource agents of some resource classes (lsb, systemd and upstart 'not' among them) can be given parameters which determine how they behave and which instance of a service they control. If your resource agent supports parameters, you can add them with the `crm_resource` command. For example, ---- # crm_resource --resource Public-IP --set-parameter ip --parameter-value 192.0.2.2 ---- would create an entry in the resource like this: .An example OCF resource with instance attributes ===== [source,XML] ------- ------- ===== For an OCF resource, the result would be an environment variable called +OCF_RESKEY_ip+ with a value of +192.0.2.2+. The list of instance attributes supported by an OCF resource agent can be found by calling the resource agent with the `meta-data` command. The output contains an XML description of all the supported attributes, their purpose and default values. .Displaying the metadata for the Dummy resource agent template ===== ---- # export OCF_ROOT=/usr/lib/ocf # $OCF_ROOT/resource.d/pacemaker/Dummy meta-data ---- [source,XML] -------
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="Dummy" version="1.0">
<version>1.0</version>

<longdesc lang="en">
This is a Dummy Resource Agent. It does absolutely nothing except
keep track of whether its running or not.
Its purpose in life is for testing and to serve as a template for RA writers.

NB: Please pay attention to the timeouts specified in the actions
section below. They should be meaningful for the kind of resource
the agent manages. They should be the minimum advised timeouts, but they
shouldn't/cannot cover _all_ possible resource
instances. So, try to be neither overly generous nor too stingy,
but moderate. The minimum timeouts should never be below 10 seconds.
</longdesc>
<shortdesc lang="en">Example stateless resource agent</shortdesc>

<parameters>
<parameter name="state" unique="1">
<longdesc lang="en">
Location to store the resource state in.
</longdesc>
<shortdesc lang="en">State file</shortdesc>
<content type="string" default="/var/run/Dummy-default.state" />
</parameter>

<parameter name="fake" unique="0">
<longdesc lang="en">
Fake attribute that can be changed to cause a reload
</longdesc>
<shortdesc lang="en">Fake attribute that can be changed to cause a reload</shortdesc>
<content type="string" default="dummy" />
</parameter>

<parameter name="op_sleep" unique="1">
<longdesc lang="en">
Number of seconds to sleep during operations.  This can be used to test how
the cluster reacts to operation timeouts.
</longdesc>
<shortdesc lang="en">Operation sleep duration in seconds.</shortdesc>
<content type="string" default="0" />
</parameter>

</parameters>

<actions>
<action name="start"        timeout="20" />
<action name="stop"         timeout="20" />
<action name="monitor"      timeout="20" interval="10" depth="0"/>
<action name="reload"       timeout="20" />
<action name="migrate_to"   timeout="20" />
<action name="migrate_from" timeout="20" />
<action name="validate-all" timeout="20" />
<action name="meta-data"    timeout="5" />
</actions>
</resource-agent>
------- ===== == Resource Operations == indexterm:[Resource,Action] 'Operations' are actions the cluster can perform on a resource by calling the resource agent. Resource agents must support certain common operations such as start, stop, and monitor, and may implement any others. Operations may be explicitly configured for two purposes: to override defaults for options (such as timeout) that the cluster will use whenever it initiates the operation, and to run an operation on a recurring basis (for example, to monitor the resource for failure). .An OCF resource with a non-default start timeout ===== [source,XML] ------- ------- ===== Pacemaker identifies operations by a combination of name and interval, so this combination must be unique for each resource. That is, you should not configure two operations for the same resource with the same name and interval. .Properties of an Operation [width="95%",cols="2m,3,<6",options="header",align="center"] |========================================================= |Field |Default |Description |id | |A unique name for the operation. indexterm:[id,Action Property] indexterm:[Action,Property,id] |name | |The action to perform. This can be any action supported by the agent; common values include +monitor+, +start+, and +stop+. indexterm:[name,Action Property] indexterm:[Action,Property,name] |interval |0 |How frequently (in seconds) to perform the operation. A value of 0 means "when needed". A positive value defines a 'recurring action', which is typically used with <>. indexterm:[interval,Action Property] indexterm:[Action,Property,interval] |timeout | |How long to wait before declaring the action has failed indexterm:[timeout,Action Property] indexterm:[Action,Property,timeout] |on-fail |restart '(except for +stop+ operations, which default to' fence 'when STONITH is enabled and' block 'otherwise)' a|The action to take if this action ever fails. Allowed values: * +ignore:+ Pretend the resource did not fail. * +block:+ Don't perform any further operations on the resource. * +stop:+ Stop the resource and do not start it elsewhere. * +restart:+ Stop the resource and start it again (possibly on a different node). * +fence:+ STONITH the node on which the resource failed. * +standby:+ Move _all_ resources away from the node on which the resource failed. indexterm:[on-fail,Action Property] indexterm:[Action,Property,on-fail] |enabled |TRUE |If +false+, ignore this operation definition. This is typically used to pause a particular recurring +monitor+ operation; for instance, it can complement the respective resource being unmanaged (+is-managed=false+), as this alone will <>. Disabling the operation does not suppress all actions of the given type. Allowed values: +true+, +false+. indexterm:[enabled,Action Property] indexterm:[Action,Property,enabled] |record-pending |TRUE |If +true+, the intention to perform the operation is recorded so that GUIs and CLI tools can indicate that an operation is in progress. This is best set as an _operation default_ (see <>). Allowed values: +true+, +false+. indexterm:[record-pending,Action Property] indexterm:[Action,Property,record-pending] |role | |Run the operation only on node(s) that the cluster thinks should be in the specified role. This only makes sense for recurring +monitor+ operations. Allowed (case-sensitive) values: +Stopped+, +Started+, and in the case of <>, +Slave+ and +Master+.
indexterm:[role,Action Property] indexterm:[Action,Property,role] |========================================================= [[s-resource-monitoring]] === Monitoring Resources for Failure === When Pacemaker first starts a resource, it runs one-time +monitor+ operations (referred to as 'probes') to ensure the resource is running where it's supposed to be, and not running where it's not supposed to be. (This behavior can be affected by the +resource-discovery+ location constraint property.) Other than those initial probes, Pacemaker will 'not' (by default) check that the resource continues to stay healthy. footnote:[Currently, anyway. Automatic monitoring operations may be added in a future version of Pacemaker.] You must configure +monitor+ operations explicitly to perform these checks. .An OCF resource with a recurring health check ===== [source,XML] ------- ------- ===== By default, a +monitor+ operation will ensure that the resource is running where it is supposed to. The +target-role+ property can be used for further checking. For example, if a resource has one +monitor+ operation with +interval=10 role=Started+ and a second +monitor+ operation with +interval=11 role=Stopped+, the cluster will run the first monitor on any nodes it thinks 'should' be running the resource, and the second monitor on any nodes that it thinks 'should not' be running the resource (for the truly paranoid, who want to know when an administrator manually starts a service by mistake). [NOTE] ==== Currently, monitors with +role=Stopped+ are not implemented for <> resources. ==== [[s-monitoring-unmanaged]] === Monitoring Resources When Administration is Disabled === Recurring +monitor+ operations behave differently under various administrative settings: * When a resource is unmanaged (by setting +is-managed=false+): No monitors will be stopped. + If the unmanaged resource is stopped on a node where the cluster thinks it should be running, the cluster will detect and report that it is not, but it will not consider the monitor failed, and will not try to start the resource until it is managed again. + Starting the unmanaged resource on a different node is strongly discouraged and will at least cause the cluster to consider the resource failed, and may require the resource's +target-role+ to be set to +Stopped+ then +Started+ to be recovered. * When a node is put into standby: All resources will be moved away from the node, and all +monitor+ operations will be stopped on the node, except those specifying +role+ as +Stopped+ (which will be newly initiated if appropriate). * When the cluster is put into maintenance mode: All resources will be marked as unmanaged. All monitor operations will be stopped, except those specifying +role+ as +Stopped+ (which will be newly initiated if appropriate). As with single unmanaged resources, starting a resource on a node other than where the cluster expects it to be will cause problems. [[s-operation-defaults]] === Setting Global Defaults for Operations === You can change the global default values for operation properties in a given cluster. These are defined in an +op_defaults+ section of the CIB's +configuration+ section, and can be set with `crm_attribute`. For example, ---- # crm_attribute --type op_defaults --name timeout --update 20s ---- would default each operation's +timeout+ to 20 seconds. If an operation's definition also includes a value for +timeout+, then that value would be used for that operation instead. 
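In the CIB, the resulting fragment would look something like the following sketch (the +id+ values here are illustrative):

[source,XML]
----
<op_defaults>
  <meta_attributes id="op_defaults-options">
    <nvpair id="op_defaults-timeout" name="timeout" value="20s"/>
  </meta_attributes>
</op_defaults>
----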
=== When Implicit Operations Take a Long Time === The cluster will always perform a number of implicit operations: +start+, +stop+ and a non-recurring +monitor+ operation used at startup to check whether the resource is already active. If one of these is taking too long, then you can create an entry for them and specify a longer timeout. .An OCF resource with custom timeouts for its implicit actions ===== [source,XML] ------- ------- ===== === Multiple Monitor Operations === Provided no two operations (for a single resource) have the same name and interval, you can have as many +monitor+ operations as you like. In this way, you can do a superficial health check every minute and progressively more intense ones at higher intervals. To tell the resource agent what kind of check to perform, you need to provide each monitor with a different value for a common parameter. The OCF standard creates a special parameter called +OCF_CHECK_LEVEL+ for this purpose and dictates that it is "made available to the resource agent without the normal +OCF_RESKEY+ prefix". Whatever name you choose, you can specify it by adding an +instance_attributes+ block to the +op+ tag. It is up to each resource agent to look for the parameter and decide how to use it. .An OCF resource with two recurring health checks, performing different levels of checks specified via +OCF_CHECK_LEVEL+. ===== [source,XML] ------- ------- ===== === Disabling a Monitor Operation === The easiest way to stop a recurring monitor is to just delete it. However, there can be times when you only want to disable it temporarily. In such cases, simply add +enabled=false+ to the operation's definition. .Example of an OCF resource with a disabled health check ===== [source,XML] ------- ------- ===== This can be achieved from the command line by executing: ---- # cibadmin --modify --xml-text '' ---- Once you've done whatever you needed to do, you can then re-enable it with ---- # cibadmin --modify --xml-text '' ----
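As a concrete sketch, assuming the recurring monitor's +id+ is +public-ip-check+ (an illustrative name), the disable and re-enable commands would look like:

----
# cibadmin --modify --xml-text '<op id="public-ip-check" interval="60s" enabled="false"/>'
# cibadmin --modify --xml-text '<op id="public-ip-check" interval="60s" enabled="true"/>'
----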