diff --git a/doc/Pacemaker_Explained/en-US/Ch-Status.xml b/doc/Pacemaker_Explained/en-US/Ch-Status.xml index 59469362ef..90fedb466b 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Status.xml +++ b/doc/Pacemaker_Explained/en-US/Ch-Status.xml @@ -1,298 +1,298 @@ Status - Here be dragons Most users never need to understand the contents of the status section and can be content with the output from crm_mon. - However for those with a curious inclination, the following attempts to proved an overview of its contents. + However, for those with a curious inclination, the following attempts to provide an overview of its contents.
Node Status In addition to the cluster's configuration, the CIB holds an up-to-date representation of each cluster node in the status section.
A bare-bones status entry for a healthy node called cl-virt-1 ]]>
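Such an entry might look like the following sketch; the attribute values are illustrative (assuming a healthy, Corosync-based node), and the meaning of each field is described in the table below:

```xml
<node_state id="cl-virt-1" uname="cl-virt-1"
            ha="active" in_ccm="true" crmd="online"
            join="member" expected="member">
  <!-- populated by attrd while the node is online -->
  <transient_attributes id="cl-virt-1"/>
  <!-- populated by the lrmd with the node's resource history -->
  <lrm id="cl-virt-1"/>
</node_state>
```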
Users are highly recommended not to modify any part of a node's state directly. The cluster will periodically regenerate the entire section from its authoritative sources, so any changes should be made with the tools appropriate to those subsystems. Authoritative Sources for State Information Dataset Authoritative Source node_state fields crmd transient_attributes tag attrd lrm tag lrmd
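Since the cluster owns this section, it is best treated as read-only. A sketch of how to inspect just the status section with cibadmin (assuming a running cluster and the cibadmin of this era of Pacemaker) is:

```sh
# Query (read-only) the status section of the live CIB
cibadmin -Q -o status
```

Any changes written here by hand would simply be overwritten the next time the owning subsystem (crmd, attrd or lrmd) refreshes its data.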
The fields used in the node_state objects are named as they are largely for historical reasons, rooted in Pacemaker's origins as the Heartbeat resource manager. They have remained unchanged to preserve compatibility with older versions. Node Status Fields Field Description id Unique identifier for the node. Corosync-based clusters use the same value as uname; Heartbeat clusters use a human-readable (but annoying) UUID. uname The node's machine name (output from uname -n). ha Is the cluster software active on the node? Allowed values: active, dead in_ccm Is the node part of the cluster's membership? Allowed values: true, false crmd Is the crmd process active on the node? Allowed values: online, offline join Is the node participating in hosting resources? Allowed values: down, pending, member, banned expected Expected value for join crm-debug-origin Diagnostic indicator. The origin of the most recent change(s).
The cluster uses these fields to determine if, at the node level, the node is healthy or is in a failed state and needs to be fenced.
Transient Node Attributes Like regular node attributes, the name/value pairs listed here also help describe the node. However they are forgotten by the cluster when the node goes offline. This can be useful, for instance, when you only want a node to be in standby mode (not able to run resources) until the next reboot. In addition to any values the administrator sets, the cluster will also store information about failed resources here.
Example set of transient node attributes for node "cl-virt-1" ]]>
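A sketch of such a set follows. The values are illustrative, chosen to match the discussion below; the instance_attributes wrapper and the fail-count-*/last-failure-* attribute names follow the historical CIB layout and are assumptions here:

```xml
<transient_attributes id="cl-virt-1">
  <instance_attributes id="status-cl-virt-1">
    <!-- set by the pingd daemon: connectivity to its peers -->
    <nvpair id="status-cl-virt-1-pingd" name="pingd" value="3"/>
    <!-- all configured resources have been probed on this node -->
    <nvpair id="status-cl-virt-1-probe_complete" name="probe_complete" value="true"/>
    <!-- one failure of the pingd:0 resource, recorded by the cluster -->
    <nvpair id="status-cl-virt-1-fail-count-pingd:0" name="fail-count-pingd:0" value="1"/>
    <!-- time of that failure, in seconds since epoch (illustrative) -->
    <nvpair id="status-cl-virt-1-last-failure-pingd:0" name="last-failure-pingd:0" value="1239009742"/>
  </instance_attributes>
</transient_attributes>
```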
In the above example, we can see that the pingd:0 resource has failed once, at Mon Apr 6 11:22:22 2009. You can use the following Perl one-liner to print a human-readable version of any seconds-since-epoch value: perl -e 'print scalar(localtime($seconds))."\n"' We can also see that the node is connected to three "pingd" peers and that all known resources have been checked for on this machine (probe_complete).
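On systems with GNU date, the same conversion can be done without Perl (-d @&lt;epoch&gt; is a GNU extension; -u prints the result in UTC rather than local time):

```shell
# Convert a seconds-since-epoch value to a human-readable date (GNU date)
date -u -d @1239009742
# Mon Apr  6 09:22:22 UTC 2009
```

Without -u, the output is rendered in the machine's local timezone, matching what the Perl one-liner prints.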
Operation History A node's resource history is held in the lrm_resources tag (a child of the lrm tag). The information stored here is sufficient for the cluster to stop the resource safely if it is removed from the configuration section. Specifically, we store the resource's id, class, type and provider.
A record of the apcstonith resource <lrm_resource id="apcstonith" type="apcmastersnmp" class="stonith">
Additionally, we store the last job for every combination of resource, action and interval. The concatenation of the values in this tuple is used to create the id of the lrm_rsc_op object. Contents of an lrm_rsc_op job. Field Description id Identifier for the job, constructed from the resource id, operation and interval. call-id The job's ticket number. Used as a sort key to determine the order in which the jobs were executed. operation The action the resource agent was invoked with. interval The frequency, in milliseconds, at which the operation will be repeated. 0 indicates a one-off job. op-status The job's status. Generally this will be either 0 (done) or -1 (pending). Rarely used; rc-code is preferred. rc-code The job's result. Refer to for details on what the values here mean and how they are interpreted. last-run Diagnostic indicator. Machine-local date/time, in seconds since epoch, at which the job was executed. last-rc-change Diagnostic indicator. Machine-local date/time, in seconds since epoch, at which the job first returned the current value of rc-code. exec-time Diagnostic indicator. Time, in seconds, that the job ran for. queue-time Diagnostic indicator. Time, in seconds, that the job was queued for in the LRMd. crm_feature_set The version which this job description conforms to. Used when processing op-digest. transition-key A concatenation of the job's graph action number, the graph number, the expected result and the UUID of the crmd instance that scheduled it. This is used to construct transition-magic (below). transition-magic A concatenation of the job's op-status, rc-code and transition-key. Guaranteed to be unique for the life of the cluster (which ensures it is part of CIB update notifications) and contains all the information needed for the crmd to correctly analyze and process the completed job. Most importantly, the decomposed elements tell the crmd whether the job entry was expected and whether it failed.
op-digest An MD5 sum representing the parameters passed to the job. Used to detect changes to the configuration and restart resources if necessary. crm-debug-origin Diagnostic indicator. The origin of the current values.
Simple Example
A monitor operation performed by the cluster to determine the current state of the apcstonith resource ]]>
In the above example, the job is a non-recurring monitor operation, often referred to as a "probe", for the apcstonith resource. The cluster schedules probes for every configured resource when a new node starts, in order to determine the resource's current state before it takes any further action. From the transition-key, we can see that this was the 22nd action of the 2nd graph produced by this instance of the crmd (2668bbeb-06d5-40f9-936d-24cb7f87006a). The third field of the transition-key contains a 7; this indicates that the job expects to find the resource inactive. By looking at the rc-code property, we see that this was the case. Evidently, the cluster started the resource elsewhere, as that is the only job recorded for this node.
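The decomposition described above can be reproduced by hand with shell string operations. The sketch below assumes the historical op-status:rc-code;action:graph:expected-rc:uuid layout of transition-magic and uses the values from this example:

```shell
# Split a transition-magic string into its components.
# Assumed format: <op-status>:<rc-code>;<transition-key>
# where transition-key = <action>:<graph>:<expected-rc>:<crmd-uuid>
magic="0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"

op_status=${magic%%:*}      # 0 -> done
rest=${magic#*:}
rc_code=${rest%%;*}         # 7 -> not running
key=${rest#*;}              # the embedded transition-key
IFS=: read -r action graph expected uuid <<EOF
$key
EOF

echo "action=$action graph=$graph expected=$expected rc=$rc_code"
# action=22 graph=2 expected=7 rc=7
```

Since expected and rc-code are both 7 here, the probe found exactly what the scheduler predicted: the resource was not running.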
Complex Resource History Example
Resource history of a pingd clone with multiple jobs ]]>
When more than one job record exists, it is important to first sort them by call-id before interpreting them. Once sorted, the above example can be summarized as: A non-recurring monitor operation returning 7 (not running), with a call-id of 3 A stop operation returning 0 (success), with a call-id of 32 A start operation returning 0 (success), with a call-id of 33 A recurring monitor returning 0 (success), with a call-id of 34 The cluster processes each job record to build up a picture of the resource's state. After the first and second entries, it is considered stopped, and after the third it is considered active. Based on the last operation, we can tell that the resource is currently active. Additionally, from the presence of a stop operation with a lower call-id than that of the start operation, we can conclude that the resource has been restarted. Specifically, this occurred as part of actions 11 and 31 of transition 11 from the crmd instance with the key 2668bbeb-06d5-40f9-936d-24cb7f87006a. This information can be helpful for locating the relevant section of the logs when looking for the source of a failure.
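As a toy illustration of the sort-then-read rule, the four jobs above can be reduced to "call-id operation rc" triples (the text lines themselves are made up for this sketch):

```shell
# Job records arrive in no particular order; sort numerically by
# call-id, then the last line reflects the resource's current state.
ops="33 start 0
3 monitor 7
34 monitor 0
32 stop 0"

current=$(printf '%s\n' "$ops" | sort -n | tail -n 1)
echo "$current"
# 34 monitor 0
```

The highest call-id is a recurring monitor returning 0, so the resource is currently active, matching the conclusion drawn above.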