diff --git a/doc/Pacemaker_Explained/en-US/Ch-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
index a2fbfe2473..df6b71aae4 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Options.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
@@ -1,409 +1,410 @@
= Cluster-Wide Configuration =
== CIB Properties ==
Certain settings are defined by CIB properties (that is, attributes of the
+cib+ tag) rather than with the rest of the cluster configuration in the
+configuration+ section.
The reason is simply a matter of parsing. These options are used by the
configuration database which is, by design, mostly ignorant of the content it
holds. So the decision was made to place them in an easy-to-find location.
.CIB Properties
[width="95%",cols="2m,5<",options="header",align="center"]
|=========================================================
|Field |Description
| admin_epoch |
indexterm:[Configuration Version,Cluster]
indexterm:[Cluster,Option,Configuration Version]
indexterm:[admin_epoch,Cluster Option]
indexterm:[Cluster,Option,admin_epoch]
When a node joins the cluster, the cluster performs a check to see
which node has the best configuration. It asks the node with the highest
(+admin_epoch+, +epoch+, +num_updates+) tuple to replace the configuration on
all the nodes -- which makes setting them, and setting them correctly, very
important. +admin_epoch+ is never modified by the cluster; you can use this
to make the configurations on any inactive nodes obsolete. _Never set this
value to zero_. In such cases, the cluster cannot tell the difference between
your configuration and the "empty" one used when nothing is found on disk.
| epoch |
indexterm:[epoch,Cluster Option]
indexterm:[Cluster,Option,epoch]
The cluster increments this every time the configuration is updated (usually by
the administrator).
| num_updates |
indexterm:[num_updates,Cluster Option]
indexterm:[Cluster,Option,num_updates]
The cluster increments this every time the configuration or status is updated
(usually by the cluster) and resets it to 0 when epoch changes.
| validate-with |
indexterm:[validate-with,Cluster Option]
indexterm:[Cluster,Option,validate-with]
Determines the type of XML validation that will be done on the configuration.
If set to +none+, the cluster will not verify that updates conform to the
DTD (nor reject ones that don't). This option can be useful when
operating a mixed-version cluster during an upgrade.
|cib-last-written |
indexterm:[cib-last-written,Cluster Property]
indexterm:[Cluster,Property,cib-last-written]
Indicates when the configuration was last written to disk. Maintained by the
cluster; for informational purposes only.
|have-quorum |
indexterm:[have-quorum,Cluster Property]
indexterm:[Cluster,Property,have-quorum]
Indicates if the cluster has quorum. If false, this may mean that the
cluster cannot start resources or fence other nodes (see
+no-quorum-policy+ below). Maintained by the cluster.
|dc-uuid |
indexterm:[dc-uuid,Cluster Property]
indexterm:[Cluster,Property,dc-uuid]
Indicates which cluster node is the current leader. Used by the
cluster when placing resources and determining the order of some
events. Maintained by the cluster.
|=========================================================
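The comparison of (+admin_epoch+, +epoch+, +num_updates+) tuples described in the table can be pictured with a short shell sketch. It is an illustration only: the function name +cib_newer+ is hypothetical, and it is not how the cluster itself implements the check. The fields are compared in order, so a higher +admin_epoch+ wins regardless of the other two values.

[source,sh]
-------
# Illustration only (hypothetical helper, not part of Pacemaker).
# Arguments 1-3 are the first tuple, arguments 4-6 the second tuple,
# each given as: admin_epoch epoch num_updates
# Succeeds if the first tuple is strictly newer than the second.
cib_newer() {
    [ "$1" -gt "$4" ] && return 0
    [ "$1" -lt "$4" ] && return 1
    [ "$2" -gt "$5" ] && return 0
    [ "$2" -lt "$5" ] && return 1
    [ "$3" -gt "$6" ]
}

# A higher admin_epoch outranks larger epoch and num_updates values.
if cib_newer 2 0 0 1 57 12; then
    echo "first configuration wins"
fi
-------

This is why raising +admin_epoch+ on one node is enough to make the configurations on inactive nodes obsolete.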
=== Working with CIB Properties ===
Although these fields can be written to by the user, in
most cases the cluster will overwrite any values specified by the
user with the "correct" ones.
To change the ones that can be specified by the user,
for example +admin_epoch+, one should use:
----
# cibadmin --modify --crm_xml ''
----
A complete set of CIB properties will look something like this:
.Attributes set for a cib object
======
[source,XML]
-------
-------
======
+[[s-cluster-options]]
== Cluster Options ==
Cluster options, as you might expect, control how the cluster behaves
when confronted with certain situations.
They are grouped into sets within the +crm_config+ section, and, in advanced
configurations, there may be more than one set. (This will be described later
in the section on <> where we will show how to have the cluster use
different sets of options during working hours than during weekends.) For now,
we will describe the simple case where each option is present at most once.
You can obtain an up-to-date list of cluster options, including
their default values, by running the `man pengine` and `man crmd` commands.
.Cluster Options
[width="95%",cols="5m,2,11>).
| enable-startup-probes | TRUE |
indexterm:[enable-startup-probes,Cluster Option]
indexterm:[Cluster,Option,enable-startup-probes]
Should the cluster check for active resources during startup?
| maintenance-mode | FALSE |
indexterm:[maintenance-mode,Cluster Option]
indexterm:[Cluster,Option,maintenance-mode]
Should the cluster refrain from monitoring, starting and stopping resources?
| stonith-enabled | TRUE |
indexterm:[stonith-enabled,Cluster Option]
indexterm:[Cluster,Option,stonith-enabled]
Should failed nodes and nodes with resources that can't be stopped be
shot? If you value your data, set up a STONITH device and enable this.
If true, or unset, the cluster will refuse to start resources unless
one or more STONITH resources have been configured.
If false, unresponsive nodes are immediately assumed to be running no
resources, and resource takeover to online nodes starts without any
further protection (which means _data loss_ if the unresponsive node
still accesses shared storage, for example). See also the +requires+
meta-attribute in <>.
| stonith-action | reboot |
indexterm:[stonith-action,Cluster Option]
indexterm:[Cluster,Option,stonith-action]
Action to send to STONITH device. Allowed values are +reboot+ and +off+.
The value +poweroff+ is also allowed, but is only used for
legacy devices.
| stonith-timeout | 60s |
indexterm:[stonith-timeout,Cluster Option]
indexterm:[Cluster,Option,stonith-timeout]
How long to wait for STONITH actions (reboot, on, off) to complete.
| concurrent-fencing | FALSE |
indexterm:[concurrent-fencing,Cluster Option]
indexterm:[Cluster,Option,concurrent-fencing]
Is the cluster allowed to initiate multiple fence actions concurrently?
| cluster-delay | 60s |
indexterm:[cluster-delay,Cluster Option]
indexterm:[Cluster,Option,cluster-delay]
Estimated maximum round-trip delay over the network (excluding action
execution). If the TE requires an action to be executed on another node,
it will consider the action failed if it does not get a response
from the other node in this time (after considering the action's
own timeout). The "correct" value will depend on the speed and load of your
network and cluster nodes.
| dc-deadtime | 20s |
indexterm:[dc-deadtime,Cluster Option]
indexterm:[Cluster,Option,dc-deadtime]
How long to wait for a response from other nodes during startup.
The "correct" value will depend on the speed/load of your network and the type of switches used.
| cluster-recheck-interval | 15min |
indexterm:[cluster-recheck-interval,Cluster Option]
indexterm:[Cluster,Option,cluster-recheck-interval]
Polling interval for time-based changes to options, resource parameters and constraints.
The cluster is primarily event-driven, but your configuration can have
elements that take effect based on the time of day. To ensure these changes
take effect, we can optionally poll the cluster's status for changes. A value
of 0 disables polling. Positive values are an interval (in seconds unless other
SI units are specified, e.g. 5min).
| pe-error-series-max | -1 |
indexterm:[pe-error-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-error-series-max]
The number of PE inputs resulting in ERRORs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
| pe-warn-series-max | -1 |
indexterm:[pe-warn-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-warn-series-max]
The number of PE inputs resulting in WARNINGs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
| pe-input-series-max | -1 |
indexterm:[pe-input-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-input-series-max]
The number of "normal" PE inputs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
| remove-after-stop | FALSE |
indexterm:[remove-after-stop,Cluster Option]
indexterm:[Cluster,Option,remove-after-stop]
_Advanced Use Only:_ Should the cluster remove resources from the LRM after
they are stopped? Values other than the default are, at best, poorly tested and
potentially dangerous.
| startup-fencing | TRUE |
indexterm:[startup-fencing,Cluster Option]
indexterm:[Cluster,Option,startup-fencing]
_Advanced Use Only:_ Should the cluster shoot unseen nodes?
Not using the default is very unsafe!
| election-timeout | 2min |
indexterm:[election-timeout,Cluster Option]
indexterm:[Cluster,Option,election-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| shutdown-escalation | 20min |
indexterm:[shutdown-escalation,Cluster Option]
indexterm:[Cluster,Option,shutdown-escalation]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-integration-timeout | 3min |
indexterm:[crmd-integration-timeout,Cluster Option]
indexterm:[Cluster,Option,crmd-integration-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-finalization-timeout | 30min |
indexterm:[crmd-finalization-timeout,Cluster Option]
indexterm:[Cluster,Option,crmd-finalization-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-transition-delay | 0s |
indexterm:[crmd-transition-delay,Cluster Option]
indexterm:[Cluster,Option,crmd-transition-delay]
_Advanced Use Only:_ Delay cluster recovery for the configured interval to
allow for additional/related events to occur. Useful if your configuration is
sensitive to the order in which ping updates arrive.
Enabling this option will slow down cluster recovery under
all conditions.
|default-resource-stickiness | 0 |
indexterm:[default-resource-stickiness,Cluster Option]
indexterm:[Cluster,Option,default-resource-stickiness]
_Deprecated:_ See <> instead
| is-managed-default | TRUE |
indexterm:[is-managed-default,Cluster Option]
indexterm:[Cluster,Option,is-managed-default]
_Deprecated:_ See <> instead
| default-action-timeout | 20s |
indexterm:[default-action-timeout,Cluster Option]
indexterm:[Cluster,Option,default-action-timeout]
_Deprecated:_ See <> instead
|=========================================================
=== Querying and Setting Cluster Options ===
indexterm:[Querying,Cluster Option]
indexterm:[Setting,Cluster Option]
indexterm:[Cluster,Querying Options]
indexterm:[Cluster,Setting Options]
Cluster options can be queried and modified using the `crm_attribute` tool. To
get the current value of +cluster-delay+, you can run:
----
# crm_attribute --query --name cluster-delay
----
which is more simply written as
----
# crm_attribute -G -n cluster-delay
----
If a value is found, you'll see a result like this:
----
# crm_attribute -G -n cluster-delay
scope=crm_config name=cluster-delay value=60s
----
If no value is found, the tool will display an error:
----
# crm_attribute -G -n clusta-deway
scope=crm_config name=clusta-deway value=(null)
Error performing operation: No such device or address
----
To use a different value (for example, 30 seconds), simply run:
----
# crm_attribute --name cluster-delay --update 30s
----
To go back to the cluster's default value, you can delete the value, for example:
----
# crm_attribute --name cluster-delay --delete
Deleted crm_config option: id=cib-bootstrap-options-cluster-delay name=cluster-delay
----
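When scripting around `crm_attribute`, the query output shown earlier in this section can be reduced to just the value. The following is a minimal sketch that assumes the output format is exactly the +scope=... name=... value=...+ line shown above; the helper name +attr_value+ is hypothetical, and the sample line is copied from this section rather than read from a live cluster.

[source,sh]
-------
# Hypothetical helper: print the value= field from one line of
# crm_attribute query output, assuming the format shown in this section.
attr_value() {
    printf '%s\n' "$1" | sed -n 's/^.*[[:space:]]value=\(.*\)$/\1/p'
}

# Sample line taken from the example output above.
line='scope=crm_config name=cluster-delay value=60s'
attr_value "$line"
-------

On a real cluster, the line would come from `crm_attribute -G -n cluster-delay` instead of a fixed string.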
=== When Options are Listed More Than Once ===
If you ever see something like the following, it means that the option you're modifying is present more than once.
.Deleting an option that is listed twice
=======
-------
# crm_attribute --name batch-limit --delete
Multiple attributes match name=batch-limit in crm_config:
Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit)
Value: 100 (set=custom, id=custom-batch-limit)
Please choose from one of the matches above and supply the 'id' with --id
-------
=======
In such cases, follow the on-screen instructions to perform the
requested action. To determine which value is currently being used by
the cluster, refer to <>.
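For the output above, removing the entry from the +custom+ set could look like the following. This is a sketch that only combines the `--id` option named in the on-screen message with the +id+ printed for the second match; substitute the +id+ of whichever entry you actually want to delete.

----
# crm_attribute --id custom-batch-limit --name batch-limit --delete
----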
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt
index 9453954d8e..bb4f101013 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Resources.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Resources.txt
@@ -1,835 +1,838 @@
= Cluster Resources =
== What is a Cluster Resource? ==
indexterm:[Resource]
A resource is a service made highly available by a cluster.
The simplest type of resource, a 'primitive' resource, is described
in this chapter. More complex forms, such as groups and clones,
are described in later chapters.
Every primitive resource has a 'resource agent'. A resource agent is an
external program that abstracts the service it provides and presents a
consistent view to the cluster.
This allows the cluster to be agnostic about the resources it manages.
The cluster doesn't need to understand how the resource works because
it relies on the resource agent to do the right thing when given a
`start`, `stop` or `monitor` command. For this reason, it is crucial that
resource agents are well-tested.
Typically, resource agents come in the form of shell scripts. However,
they can be written using any technology (such as C, Python or Perl)
that the author is comfortable with.
[[s-resource-supported]]
== Resource Classes ==
indexterm:[Resource,class]
Pacemaker supports several classes of agents:
* OCF
* LSB
* Upstart
* Systemd
* Service
* Fencing
* Nagios Plugins
=== Open Cluster Framework ===
indexterm:[Resource,OCF]
indexterm:[OCF,Resources]
indexterm:[Open Cluster Framework,Resources]
The OCF standard
footnote:[See
http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD
-- at least as it relates to resource agents. The Pacemaker implementation has
been somewhat extended from the OCF specs, but none of those changes are
incompatible with the original OCF specification.]
is basically an extension of the Linux Standard Base conventions for
init scripts to:
* support parameters,
* make them self-describing, and
* make them extensible
OCF specs have strict definitions of the exit codes that actions must return.
footnote:[
The resource-agents source code includes the `ocf-tester` script, which
can be useful in this regard.
]
The cluster follows these specifications exactly, and giving the wrong
exit code will cause the cluster to behave in ways you will likely
find puzzling and annoying. In particular, the cluster needs to
distinguish a completely stopped resource from one which is in some
erroneous and indeterminate state.
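To make that distinction concrete, here is a minimal sketch of a monitor action. The exit code values (0, 1 and 7) are the ones defined by the OCF resource agent API; everything else, including the function name +demo_monitor+ and the idea of tracking the service through a PID file, is an assumption made for the example.

[source,sh]
-------
# Exit codes defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical monitor action for a service that records its PID in a file.
# $1 is the path of that PID file (an assumption made for this sketch).
demo_monitor() {
    pidfile=$1
    if [ ! -e "$pidfile" ]; then
        return $OCF_NOT_RUNNING    # cleanly stopped
    fi
    pid=$(cat "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS        # running
    fi
    return $OCF_ERR_GENERIC        # PID file without a live process: failed
}
-------

Returning 7 tells the cluster the resource is safely stopped, while returning 1 reports a failure that needs recovery.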
Parameters are passed to the resource agent as environment variables, with the
special prefix +OCF_RESKEY_+. So, a parameter which the user thinks
of as +ip+ will be passed to the resource agent as +OCF_RESKEY_ip+. The
number and purpose of the parameters is left to the resource agent; however,
the resource agent should use the `meta-data` command to advertise any that it
supports.
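The parameter passing described above can be sketched as follows. The function name +demo_start+ and its messages are hypothetical; only the +OCF_RESKEY_+ prefix comes from the text, and the return value 6 is the OCF code for a configuration error.

[source,sh]
-------
# Hypothetical fragment of an agent's start action. The user-visible
# parameter "ip" arrives as the environment variable OCF_RESKEY_ip.
demo_start() {
    if [ -z "${OCF_RESKEY_ip:-}" ]; then
        echo "required parameter 'ip' is not set" >&2
        return 6    # OCF_ERR_CONFIGURED
    fi
    echo "would bring up address ${OCF_RESKEY_ip}"
}

# Simulate the environment the cluster would provide for ip=192.0.2.2.
OCF_RESKEY_ip=192.0.2.2
export OCF_RESKEY_ip
demo_start
-------

A real agent would perform the action instead of printing a message, but it would read its parameters the same way.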
The OCF class is the preferred one, as it is an industry standard,
highly flexible (allowing parameters to be passed to agents in a
non-positional manner) and self-describing.
For more information, see the
http://www.linux-ha.org/wiki/OCF_Resource_Agents[reference] and
<>.
=== Linux Standard Base ===
indexterm:[Resource,LSB]
indexterm:[LSB,Resources]
indexterm:[Linux Standard Base,Resources]
LSB resource agents are those found in +/etc/init.d+.
Generally, they are provided by the OS distribution and, in order to be used
with the cluster, they must conform to the LSB Spec.
footnote:[
See
http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
for the LSB Spec as it relates to init scripts.
]
[WARNING]
====
Many distributions claim LSB compliance but ship with broken init
scripts. For details on how to check whether your init script is
LSB-compatible, see <>. Common problematic violations of
the LSB standard include:
* Not implementing the status operation at all
* Not observing the correct exit status codes for `start/stop/status` actions
* Starting a started resource returns an error
* Stopping a stopped resource returns an error
====
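A few of the violations listed above can be detected with a short script. The following is a rough sketch, not a complete compliance test: the function name +check_lsb_basics+ is hypothetical, and the expected status codes (0 for running, 3 for not running) are taken from the LSB specification for init scripts.

[source,sh]
-------
# Hypothetical checker for a few of the LSB rules listed above.
# All arguments form the command that invokes the init script.
# The service really is stopped and started, so use a test system.
check_lsb_basics() {
    failed=0

    # Stopping an already stopped service must succeed.
    "$@" stop >/dev/null 2>&1 || true
    "$@" stop >/dev/null 2>&1 \
        || { echo "FAIL: stop of a stopped service returned an error"; failed=1; }

    # LSB: status of a service that is not running returns 3.
    rc=0
    "$@" status >/dev/null 2>&1 || rc=$?
    [ "$rc" -eq 3 ] \
        || { echo "FAIL: status of a stopped service returned $rc, expected 3"; failed=1; }

    # Starting an already started service must succeed.
    "$@" start >/dev/null 2>&1 || true
    "$@" start >/dev/null 2>&1 \
        || { echo "FAIL: start of a started service returned an error"; failed=1; }

    # LSB: status of a running service returns 0.
    "$@" status >/dev/null 2>&1 \
        || { echo "FAIL: status of a running service returned an error"; failed=1; }

    return $failed
}
-------

It would be called with the path of the script to test, for example +check_lsb_basics /etc/init.d/some-service+ (a placeholder name). Note that the service is left running afterwards.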
[IMPORTANT]
====
Remember to make sure the computer is _not_ configured to start any
services at boot time -- that should be controlled by the cluster.
====
=== Systemd ===
indexterm:[Resource,Systemd]
indexterm:[Systemd,Resources]
Some newer distributions have replaced the old
http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of
initialization daemons and scripts with an alternative called
http://www.freedesktop.org/wiki/Software/systemd[Systemd].
Pacemaker is able to manage these services _if they are present_.
Instead of init scripts, systemd has 'unit files'. Generally, the
services (unit files) are provided by the OS distribution, but there
are online guides for converting from init scripts.
footnote:[For example,
http://0pointer.de/blog/projects/systemd-for-admins-3.html]
[IMPORTANT]
====
Remember to make sure the computer is _not_ configured to start any
services at boot time -- that should be controlled by the cluster.
====
=== Upstart ===
indexterm:[Resource,Upstart]
indexterm:[Upstart,Resources]
Some newer distributions have replaced the old
http://en.wikipedia.org/wiki/Init#SysV-style["SysV"] style of
initialization daemons (and scripts) with an alternative called
http://upstart.ubuntu.com/[Upstart].
Pacemaker is able to manage these services _if they are present_.
Instead of init scripts, upstart has 'jobs'. Generally, the
services (jobs) are provided by the OS distribution.
[IMPORTANT]
====
Remember to make sure the computer is _not_ configured to start any
services at boot time -- that should be controlled by the cluster.
====
=== System Services ===
indexterm:[Resource,System Services]
indexterm:[System Service,Resources]
Since there are various types of system services (+systemd+,
+upstart+, and +lsb+), Pacemaker supports a special +service+ alias which
intelligently figures out which one applies to a given cluster node.
This is particularly useful when the cluster contains a mix of
+systemd+, +upstart+, and +lsb+.
In order, Pacemaker will try to find the named service as:
. an LSB init script
. a Systemd unit file
. an Upstart job
=== STONITH ===
indexterm:[Resource,STONITH]
indexterm:[STONITH,Resources]
The STONITH class is used exclusively for fencing-related resources. This is
discussed later in <>.
=== Nagios Plugins ===
indexterm:[Resource,Nagios Plugins]
indexterm:[Nagios Plugins,Resources]
Nagios Plugins
footnote:[The project has two independent forks, hosted at
https://www.nagios-plugins.org/ and https://www.monitoring-plugins.org/. Output
from both projects' plugins is similar, so plugins from either project can be
used with pacemaker.]
allow us to monitor services on remote hosts.
Pacemaker is able to do remote monitoring with the plugins _if they are
present_.
A common use case is to configure them as resources belonging to a resource
container (usually a virtual machine), and the container will be restarted
if any of them has failed. Another use is to configure them as ordinary
resources to be used for monitoring hosts or services via the network.
The supported parameters are the same as the long options of the plugin.
[[primitive-resource]]
== Resource Properties ==
These values tell the cluster which resource agent to use for the resource,
where to find that resource agent and what standards it conforms to.
.Properties of a Primitive Resource
[width="95%",cols="1m,6<",options="header",align="center"]
|=========================================================
|Field
|Description
|id
|Your name for the resource
indexterm:[id,Resource]
indexterm:[Resource,Property,id]
|class
|The standard the resource agent conforms to. Allowed values:
+lsb+, +nagios+, +ocf+, +service+, +stonith+, +systemd+, +upstart+
indexterm:[class,Resource]
indexterm:[Resource,Property,class]
|type
|The name of the Resource Agent you wish to use. E.g. +IPaddr+ or +Filesystem+
indexterm:[type,Resource]
indexterm:[Resource,Property,type]
|provider
|The OCF spec allows multiple vendors to supply the same
resource agent. To use the OCF resource agents supplied by
the Heartbeat project, you would specify +heartbeat+ here.
indexterm:[provider,Resource]
indexterm:[Resource,Property,provider]
|=========================================================
The XML definition of a resource can be queried with the `crm_resource` tool.
For example:
----
# crm_resource --resource Email --query-xml
----
might produce:
.A system resource definition
=====
[source,XML]
=====
[NOTE]
=====
One of the main drawbacks to system services (LSB, systemd or
Upstart) resources is that they do not allow any parameters!
=====
////
See https://tools.ietf.org/html/rfc5737 for choice of example IP address
////
.An OCF resource definition
=====
[source,XML]
-------
-------
=====
[[s-resource-options]]
== Resource Options ==
Resources have two types of options: 'meta-attributes' and 'instance attributes'.
Meta-attributes apply to any type of resource, while instance attributes
are specific to each resource agent.
=== Resource Meta-Attributes ===
Meta-attributes are used by the cluster to decide how a resource should
behave and can be easily set using the `--meta` option of the
`crm_resource` command.
.Meta-attributes of a Primitive Resource
[width="95%",cols="2m,2,5> resources, they will not be promoted to
master)
* +master:+ Allow the resource to be started and, if appropriate, promoted
indexterm:[target-role,Resource Option]
indexterm:[Resource,Option,target-role]
|is-managed
|TRUE
|Is the cluster allowed to start and stop the resource? Allowed
values: +true+, +false+
indexterm:[is-managed,Resource Option]
indexterm:[Resource,Option,is-managed]
|resource-stickiness
|value of +resource-stickiness+ in the +rsc_defaults+ section
|How much does the resource prefer to stay where it is?
indexterm:[resource-stickiness,Resource Option]
indexterm:[Resource,Option,resource-stickiness]
|requires
|fencing (unless +stonith-enabled+ is +false+ or +class+ is
+stonith+, in which case it defaults to quorum)
|Conditions under which the resource can be started ('Since 1.1.8')
Allowed values:
* +nothing:+ can always be started
* +quorum:+ The cluster can only start this resource if a majority of
the configured nodes are active
* +fencing:+ The cluster can only start this resource if a majority
of the configured nodes are active _and_ any failed or unknown nodes
have been powered off
* +unfencing:+ The cluster can only start this resource if a majority
of the configured nodes are active _and_ any failed or unknown nodes
have been powered off _and_ only on nodes that have been 'unfenced'
indexterm:[requires,Resource Option]
indexterm:[Resource,Option,requires]
|migration-threshold
|INFINITY
|How many failures may occur for this resource on a node, before this
node is marked ineligible to host this resource. A value of INFINITY
indicates that this feature is disabled.
indexterm:[migration-threshold,Resource Option]
indexterm:[Resource,Option,migration-threshold]
|failure-timeout
|0
|How many seconds to wait before acting as if the failure had not
occurred, and potentially allowing the resource back to the node on
which it failed. A value of 0 indicates that this feature is disabled.
+ As with any time-based actions, this is not guaranteed to be checked more
+ frequently than the value of +cluster-recheck-interval+ (see
+ <>).
indexterm:[failure-timeout,Resource Option]
indexterm:[Resource,Option,failure-timeout]
|multiple-active
|stop_start
|What should the cluster do if it ever finds the resource active on
more than one node? Allowed values:
* +block:+ mark the resource as unmanaged
* +stop_only:+ stop all active instances and leave them that way
* +stop_start:+ stop all active instances and start the resource in
one location only
indexterm:[multiple-active,Resource Option]
indexterm:[Resource,Option,multiple-active]
|remote-node
|
|The name of the remote-node this resource defines. This both enables the
resource as a remote-node and defines the unique name used to identify the
remote-node. If no other parameters are set, this value will also be assumed as
the hostname to connect to at the port specified by +remote-port+. +WARNING:+
This value cannot overlap with any resource or node IDs. If not specified,
this feature is disabled.
|remote-port
|3121
|Port to use for the guest connection to pacemaker_remote
|remote-addr
|value of +remote-node+
|The IP address or hostname to connect to if remote-node's name is not the
hostname of the guest.
|+remote-connect-timeout+
|60s
|How long before a pending guest connection will time out.
|=========================================================
[NOTE]
====
Support for remote nodes was added in pacemaker 1.1.10. If you are using an
earlier version, options related to remote nodes will not be available.
====
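Returning to +migration-threshold+ and +failure-timeout+ from the table above, their interaction can be summarized in a few lines of shell. This is a simplified illustration of the rule as described in the table, not Pacemaker's implementation; the function name +node_eligible+ and its argument order are hypothetical.

[source,sh]
-------
# Illustration only (hypothetical helper, not part of Pacemaker).
# $1 = failures recorded for the resource on the node
# $2 = migration-threshold (a number or INFINITY)
# $3 = seconds since the last failure
# $4 = failure-timeout in seconds (0 disables expiry)
# Succeeds if the node may still host the resource.
node_eligible() {
    failures=$1 threshold=$2 age=$3 timeout=$4

    # Expired failures are treated as if they had not occurred.
    if [ "$timeout" -gt 0 ] && [ "$age" -ge "$timeout" ]; then
        failures=0
    fi

    # INFINITY disables the threshold entirely.
    [ "$threshold" = INFINITY ] && return 0

    [ "$failures" -lt "$threshold" ]
}
-------

In a running cluster, the expiry is only noticed when the cluster next re-evaluates its state, so it may take effect later than the timeout itself.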
As an example of setting resource options, if you performed the following
commands on an LSB Email resource:
-------
# crm_resource --meta --resource Email --set-parameter priority --parameter-value 100
# crm_resource -m -r Email -p multiple-active -v block
-------
the resulting resource definition might be:
.An LSB resource with cluster options
=====
[source,XML]
-------
-------
=====
[[s-resource-defaults]]
=== Setting Global Defaults for Resource Meta-Attributes ===
To set a default value for a resource option, add it to the
+rsc_defaults+ section with `crm_attribute`. For example,
----
# crm_attribute --type rsc_defaults --name is-managed --update false
----
would prevent the cluster from starting or stopping any of the
resources in the configuration (unless of course the individual
resources were specifically enabled by having their +is-managed+ set to
+true+).
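As a sketch of such an override, the following reuses the `crm_resource` syntax shown earlier in this chapter to set +is-managed+ to +true+ for a single resource. The resource name +Email+ is only an example.

----
# crm_resource --meta --resource Email --set-parameter is-managed --parameter-value true
----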
=== Resource Instance Attributes ===
The resource agents of some resource classes (lsb, systemd and upstart 'not' among them)
can be given parameters which determine how they behave and which instance
of a service they control.
If your resource agent supports parameters, you can add them with the
`crm_resource` command. For example,
----
# crm_resource --resource Public-IP --set-parameter ip --parameter-value 192.0.2.2
----
would create an entry in the resource like this:
.An example OCF resource with instance attributes
=====
[source,XML]
-------
-------
=====
For an OCF resource, the result would be an environment variable
called +OCF_RESKEY_ip+ with a value of +192.0.2.2+.
The list of instance attributes supported by an OCF resource agent can be
found by calling the resource agent with the `meta-data` command.
The output contains an XML description of all the supported
attributes, their purpose and default values.
.Displaying the metadata for the Dummy resource agent template
=====
----
# export OCF_ROOT=/usr/lib/ocf
# $OCF_ROOT/resource.d/pacemaker/Dummy meta-data
----
[source,XML]
-------
1.0
This is a Dummy Resource Agent. It does absolutely nothing except
keep track of whether its running or not.
Its purpose in life is for testing and to serve as a template for RA writers.
NB: Please pay attention to the timeouts specified in the actions
section below. They should be meaningful for the kind of resource
the agent manages. They should be the minimum advised timeouts,
but they shouldn't/cannot cover _all_ possible resource
instances. So, try to be neither overly generous nor too stingy,
but moderate. The minimum timeouts should never be below 10 seconds.
Example stateless resource agent
Location to store the resource state in.
State file
Fake attribute that can be changed to cause a reload
Fake attribute that can be changed to cause a reload
Number of seconds to sleep during operations. This can be used to test how
the cluster reacts to operation timeouts.
Operation sleep duration in seconds.
-------
=====
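To list only the parameter names from such metadata, the output can be filtered. The sketch below assumes that each parameter is declared on a line containing +<parameter name="...">+, as in the OCF metadata format; the helper name +list_params+ is hypothetical, and the input is an abbreviated, hand-written sample rather than real agent output.

[source,sh]
-------
# Hypothetical helper: print the name of every <parameter> element
# found in OCF metadata read from standard input.
list_params() {
    sed -n 's/.*<parameter name="\([^"]*\)".*/\1/p'
}

# Abbreviated sample input, written by hand for this example.
list_params <<'EOF'
<parameters>
<parameter name="state" unique="1">
</parameter>
<parameter name="op_sleep" unique="0">
</parameter>
</parameters>
EOF
-------

With a real agent, the input would come from a pipe instead, for example +$OCF_ROOT/resource.d/pacemaker/Dummy meta-data | list_params+.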
== Resource Operations ==
indexterm:[Resource,Action]
'Operations' are actions the cluster can perform on a resource by calling the
resource agent. Resource agents must support certain common operations such as
start, stop and monitor, and may implement any others.
Some operations are generated by the cluster itself, for example, stopping and
starting resources as needed.
You can configure operations in the cluster configuration. As an example, by
default the cluster will 'not' ensure your resources stay healthy once they are
started. footnote:[Currently, anyway. Automatic monitoring operations may be
added in a future version of Pacemaker.] To instruct the cluster to do this,
you need to add a +monitor+ operation to the resource's definition.
.An OCF resource with a recurring health check
=====
[source,XML]
-------
-------
=====
.Properties of an Operation
[width="95%",cols="2m,3,6>.
indexterm:[interval,Action Property]
indexterm:[Action,Property,interval]
|timeout
|
|How long to wait before declaring the action has failed
indexterm:[timeout,Action Property]
indexterm:[Action,Property,timeout]
|on-fail
|restart '(except for stop operations, which default to' fence 'when
STONITH is enabled and' block 'otherwise)'
|The action to take if this action ever fails. Allowed values:
* +ignore:+ Pretend the resource did not fail.
* +block:+ Don't perform any further operations on the resource.
* +stop:+ Stop the resource and do not start it elsewhere.
* +restart:+ Stop the resource and start it again (possibly on a different node).
* +fence:+ STONITH the node on which the resource failed.
* +standby:+ Move _all_ resources away from the node on which the resource failed.
indexterm:[on-fail,Action Property]
indexterm:[Action,Property,on-fail]
|enabled
|TRUE
|If +false+, ignore this operation definition. This is typically used to pause
a particular recurring monitor operation; for instance, it can complement
the respective resource being unmanaged (+is-managed=false+), as this alone
will <>.
Disabling the operation does not suppress all actions of the given type.
Allowed values: +true+, +false+.
indexterm:[enabled,Action Property]
indexterm:[Action,Property,enabled]
|record-pending
|
|If +true+, the intention to perform the operation is recorded so that
GUIs and CLI tools can indicate that an operation is in progress.
This is best set as an 'operation default' (see next section).
Allowed values: +true+, +false+.
indexterm:[record-pending,Action Property]
indexterm:[Action,Property,record-pending]
|role
|
|Run the operation only on node(s) that the cluster thinks should be in
the specified role. This only makes sense for recurring monitor operations.
Allowed (case-sensitive) values: +Stopped+, +Started+, and in the
case of <> resources, +Slave+ and +Master+.
indexterm:[role,Action Property]
indexterm:[Action,Property,role]
|=========================================================
[[s-resource-monitoring]]
=== Monitoring Resources for Failure ===
When Pacemaker first starts a resource, it runs one-time monitor operations
(referred to as 'probes') to ensure the resource is running where it's
supposed to be, and not running where it's not supposed to be. (This behavior
can be affected by the +resource-discovery+ location constraint property.)
Other than those initial probes, Pacemaker will not (by default) check that
the resource continues to stay healthy. As in the example above, you must
configure monitor operations explicitly to perform these checks.
By default, a monitor operation will ensure that the resource is running
where it is supposed to. The +role+ operation property can be used for further
checking.
For example, if a resource has one monitor operation with
+interval=10 role=Started+ and a second monitor operation with
+interval=11 role=Stopped+, the cluster will run the first monitor on any nodes
it thinks 'should' be running the resource, and the second monitor on any nodes
that it thinks 'should not' be running the resource (for the truly paranoid,
who want to know when an administrator manually starts a service by mistake).
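As a minimal sketch of the configuration just described, the two monitors
could be declared as shown below. The resource name (+Public-IP+), its agent
(+ocf:heartbeat:IPaddr+), the address, and every +id+ value are illustrative
assumptions rather than values taken from a real cluster.

.Sketch of monitors for the +Started+ and +Stopped+ roles (illustrative names)
=====
[source,XML]
-------
<primitive id="Public-IP" class="ocf" type="IPaddr" provider="heartbeat">
  <operations>
    <!-- Runs on nodes where the cluster expects the resource to be active -->
    <op id="public-ip-monitor-started" name="monitor" interval="10" role="Started"/>
    <!-- Runs on nodes where the cluster expects the resource to be inactive -->
    <op id="public-ip-monitor-stopped" name="monitor" interval="11" role="Stopped"/>
  </operations>
  <instance_attributes id="params-public-ip">
    <nvpair id="public-ip-addr" name="ip" value="192.0.2.2"/>
  </instance_attributes>
</primitive>
-------
=====

The two intervals differ because no two operations of a single resource may
share the same name and interval.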
[[s-monitoring-unmanaged]]
=== Monitoring Resources When Administration is Disabled ===
Recurring monitor operations behave differently under various administrative
settings:
* When a resource is unmanaged (by setting +is-managed=false+): No monitors
will be stopped.
+
If the unmanaged resource is stopped on a node where the cluster thinks it
should be running, the cluster will detect and report that it is not, but it
will not consider the monitor failed, and will not try to start the resource
until it is managed again.
+
Starting the unmanaged resource on a different node is strongly discouraged
and will at least cause the cluster to consider the resource failed, and
may require the resource's +target-role+ to be set to +Stopped+ then +Started+
to be recovered.
* When a node is put into standby: All resources will be moved away from the
node, and all monitor operations will be stopped on the node, except those
with +role=Stopped+. Monitor operations with +role=Stopped+ will be started
on the node if appropriate.
* When the cluster is put into maintenance mode: All resources will be marked
as unmanaged. All monitor operations will be stopped, except those with
+role=Stopped+. As with single unmanaged resources, starting a resource
on a node other than where the cluster expects it to be will cause problems.
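For orientation, the first and last settings in the list above correspond to
CIB fragments along the lines of the following sketch. The +id+ values are
hypothetical; the first fragment belongs inside the resource's definition and
the second inside the +crm_config+ section.

.Sketch of an unmanaged resource and of cluster-wide maintenance mode (illustrative ids)
=====
[source,XML]
-------
<!-- Resource meta-attribute: leave this one resource unmanaged -->
<meta_attributes id="public-ip-meta">
  <nvpair id="public-ip-is-managed" name="is-managed" value="false"/>
</meta_attributes>

<!-- Cluster property: put the whole cluster into maintenance mode -->
<cluster_property_set id="cib-bootstrap-options">
  <nvpair id="cib-bootstrap-options-maintenance-mode"
          name="maintenance-mode" value="true"/>
</cluster_property_set>
-------
=====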
[[s-operation-defaults]]
=== Setting Global Defaults for Operations ===
You can change the global default values for operation properties
in a given cluster. These are defined in an +op_defaults+ section
of the CIB's +configuration+ section, and can be set with `crm_attribute`.
For example,
----
# crm_attribute --type op_defaults --name timeout --update 20s
----
would default each operation's +timeout+ to 20 seconds. If an
operation's definition also includes a value for +timeout+, then that
value would be used for that operation instead.
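After a command like the one above, the +configuration+ section would be
expected to contain an +op_defaults+ entry roughly like the following sketch.
The +id+ values shown are assumptions; the actual ones are chosen by the tool
that creates the entry.

.Sketch of an +op_defaults+ section setting a default timeout (illustrative ids)
=====
[source,XML]
-------
<op_defaults>
  <meta_attributes id="op_defaults-options">
    <!-- Applies to every operation that does not set its own timeout -->
    <nvpair id="op_defaults-options-timeout" name="timeout" value="20s"/>
  </meta_attributes>
</op_defaults>
-------
=====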
=== When Implicit Operations Take a Long Time ===
The cluster will always perform a number of implicit operations: +start+,
+stop+ and a non-recurring +monitor+ operation used at startup to check
whether the resource is already active. If one of these is taking too long,
then you can create an entry for them and specify a longer timeout.
.An OCF resource with custom timeouts for its implicit actions
=====
[source,XML]
-------
-------
=====
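The listing above is empty in this copy. As a hedged sketch of what such a
definition can look like, the following declares explicit entries for the
probe (a +monitor+ with +interval="0"+), +start+ and +stop+. The resource
name, the +id+ values and the timeout values are illustrative assumptions.

.Sketch of custom timeouts for implicit actions (illustrative names and values)
=====
[source,XML]
-------
<primitive id="Public-IP" class="ocf" type="IPaddr" provider="heartbeat">
  <operations>
    <!-- interval="0" marks the non-recurring monitor used as a probe -->
    <op id="public-ip-startup" name="monitor" interval="0" timeout="90s"/>
    <op id="public-ip-start" name="start" interval="0" timeout="180s"/>
    <op id="public-ip-stop" name="stop" interval="0" timeout="15min"/>
  </operations>
  <instance_attributes id="params-public-ip">
    <nvpair id="public-ip-addr" name="ip" value="192.0.2.2"/>
  </instance_attributes>
</primitive>
-------
=====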
=== Multiple Monitor Operations ===
Provided no two operations (for a single resource) have the same name
and interval, you can have as many monitor operations as you like. In
this way, you can do a superficial health check every minute and
progressively more intense ones at higher intervals.
To tell the resource agent what kind of check to perform, you need to
provide each monitor with a different value for a common parameter.
The OCF standard creates a special parameter called +OCF_CHECK_LEVEL+
for this purpose and dictates that it is "made available to the
resource agent without the normal +OCF_RESKEY+ prefix".
Whatever name you choose, you can specify it by adding an
+instance_attributes+ block to the +op+ tag. It is up to each
resource agent to look for the parameter and decide how to use it.
.An OCF resource with two recurring health checks, performing different levels of checks specified via +OCF_CHECK_LEVEL+.
=====
[source,XML]
-------
-------
=====
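The listing above is empty in this copy. A minimal sketch of the idea, assuming
a resource agent that honours +OCF_CHECK_LEVEL+ values of 10 and 20, could look
like this. The resource name, +id+ values, intervals and check levels are all
illustrative.

.Sketch of two monitors with different +OCF_CHECK_LEVEL+ values (illustrative names)
=====
[source,XML]
-------
<primitive id="Public-IP" class="ocf" type="IPaddr" provider="heartbeat">
  <operations>
    <!-- Cheap check every minute -->
    <op id="public-ip-health-60" name="monitor" interval="60">
      <instance_attributes id="params-public-ip-depth-60">
        <nvpair id="public-ip-depth-60" name="OCF_CHECK_LEVEL" value="10"/>
      </instance_attributes>
    </op>
    <!-- More thorough check every five minutes -->
    <op id="public-ip-health-300" name="monitor" interval="300">
      <instance_attributes id="params-public-ip-depth-300">
        <nvpair id="public-ip-depth-300" name="OCF_CHECK_LEVEL" value="20"/>
      </instance_attributes>
    </op>
  </operations>
  <instance_attributes id="params-public-ip">
    <nvpair id="public-ip-addr" name="ip" value="192.0.2.2"/>
  </instance_attributes>
</primitive>
-------
=====

Because the two monitors use different intervals, they satisfy the rule that
no two operations of a resource share the same name and interval.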
=== Disabling a Monitor Operation ===
The easiest way to stop a recurring monitor is to just delete it.
However, there can be times when you only want to disable it
temporarily. In such cases, simply add +enabled="false"+ to the
operation's definition.
.Example of an OCF resource with a disabled health check
=====
[source,XML]
-------
-------
=====
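The listing above is empty in this copy. As a sketch, a recurring monitor that
stays in the configuration but is ignored by the cluster could be written as
follows; the resource name, the operation +id+ (+public-ip-check+) and the
interval are hypothetical.

.Sketch of a resource whose health check is disabled (illustrative names)
=====
[source,XML]
-------
<primitive id="Public-IP" class="ocf" type="IPaddr" provider="heartbeat">
  <operations>
    <!-- enabled="false" keeps the definition but pauses the monitor -->
    <op id="public-ip-check" name="monitor" interval="60s" enabled="false"/>
  </operations>
  <instance_attributes id="params-public-ip">
    <nvpair id="public-ip-addr" name="ip" value="192.0.2.2"/>
  </instance_attributes>
</primitive>
-------
=====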
This can be achieved from the command line by executing:
----
# cibadmin --modify --xml-text ''
----
Once you've done whatever you needed to do, you can then re-enable it with
----
# cibadmin --modify --xml-text ''
----
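The XML arguments of the two commands above are missing in this copy. Assuming
the operation to be toggled has the hypothetical +id+ +public-ip-check+, the
fragments passed via +--xml-text+ would be minimal +op+ elements of roughly
this shape:

[source,XML]
-------
<!-- Fragment for disabling the monitor (id is an assumption) -->
<op id="public-ip-check" enabled="false"/>

<!-- Fragment for re-enabling it -->
<op id="public-ip-check" enabled="true"/>
-------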
diff --git a/doc/Pacemaker_Remote/en-US/Ch-Options.txt b/doc/Pacemaker_Remote/en-US/Ch-Options.txt
index f04b8b6e94..5faaaf2713 100644
--- a/doc/Pacemaker_Remote/en-US/Ch-Options.txt
+++ b/doc/Pacemaker_Remote/en-US/Ch-Options.txt
@@ -1,121 +1,121 @@
= Configuration Explained =
The walk-through examples use some of these options, but don't explain exactly
what they mean or do. This section is meant to be the go-to resource for all
the options available for configuring pacemaker_remote-based nodes.
(((configuration)))
== Resource Meta-Attributes for Guest Nodes ==
When configuring a virtual machine to use as a guest node, these are the
metadata options available to enable the resource as a guest node and
define its connection parameters.
.Meta-attributes for configuring VM resources as guest nodes
[width="95%",cols="2m,1,4<",options="header",align="center"]
|=========================================================
|Option
|Default
|Description
|remote-node
|'none'
|The node name of the guest node this resource defines. This both enables the
resource as a guest node and defines the unique name used to identify the
guest node. If no other parameters are set, this value will also be assumed as
the hostname to use when connecting to pacemaker_remote on the VM. This value
*must not* overlap with any resource or node IDs.
|remote-port
|3121
|The port on the virtual machine that the cluster will use to connect to
pacemaker_remote.
|remote-addr
|'value of' +remote-node+
|The IP address or hostname to use when connecting to pacemaker_remote on the VM.
|remote-connect-timeout
|60s
|How long before a pending guest connection will time out.
|=========================================================
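To illustrate how these meta-attributes fit together, the following sketch
defines a virtual machine resource that doubles as a guest node. The agent
(+ocf:heartbeat:VirtualDomain+) is a common choice but an assumption here, and
the resource name, node name, paths, address and +id+ values are all
hypothetical.

.Sketch of a VM resource enabled as a guest node (illustrative names)
=====
[source,XML]
-------
<primitive id="vm-guest1" class="ocf" provider="heartbeat" type="VirtualDomain">
  <instance_attributes id="vm-guest1-params">
    <nvpair id="vm-guest1-config" name="config"
            value="/etc/libvirt/qemu/guest1.xml"/>
    <nvpair id="vm-guest1-hypervisor" name="hypervisor" value="qemu:///system"/>
  </instance_attributes>
  <meta_attributes id="vm-guest1-meta">
    <!-- Enables the resource as a guest node and names that node -->
    <nvpair id="vm-guest1-remote-node" name="remote-node" value="guest1"/>
    <!-- Optional: connect to this address instead of the node name -->
    <nvpair id="vm-guest1-remote-addr" name="remote-addr" value="192.168.122.10"/>
    <!-- Optional: shown here with the default values from the table -->
    <nvpair id="vm-guest1-remote-port" name="remote-port" value="3121"/>
    <nvpair id="vm-guest1-remote-timeout" name="remote-connect-timeout" value="60s"/>
  </meta_attributes>
</primitive>
-------
=====

Note that +guest1+ must not collide with any existing resource or node ID,
which is why the resource itself is named +vm-guest1+ in this sketch.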
== Connection Resources for Remote Nodes ==
A remote node is defined by a connection resource. That connection resource
has instance attributes that define where the remote node is located on the
network and how to communicate with it.
Descriptions of these instance attributes can be retrieved using the following
`pcs` command:
----
# pcs resource describe remote
ocf:pacemaker:remote - remote resource agent
Resource options:
server: Server location to connect to. This can be an ip address or hostname.
port: tcp port to connect to.
- reconnect_interval: Time in seconds to wait before attempting to reconnect to
- a remote node after an active connection to the remote
- node has been severed. This wait is recurring. If
- reconnect fails after the wait period, a new reconnect
- attempt will be made after observing the wait time. When
- this option is in use, pacemaker will keep attempting to
- reach out and connect to the remote node indefinitely
- after each wait interval.
+ reconnect_interval: Interval in seconds at which Pacemaker will attempt to
+ reconnect to a remote node after an active connection to
+ the remote node has been severed. When this value is
+ nonzero, Pacemaker will retry the connection
+ indefinitely, at the specified interval. As with any
+ time-based actions, this is not guaranteed to be checked
+ more frequently than the value of the
+ cluster-recheck-interval cluster option.
----
When defining a remote node's connection resource, it is common and recommended
to name the connection resource the same as the remote node's hostname. By
default, if no *server* option is provided, the cluster will attempt to contact
the remote node using the resource name as the hostname.
Example defining a remote node with the hostname *remote1*:
----
# pcs resource create remote1 remote
----
Example defining a remote node to connect to a specific IP address and port:
----
# pcs resource create remote1 remote server=192.168.122.200 port=8938
----
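For readers who work with the CIB directly, the second command corresponds to a
connection resource roughly like the following sketch. The +id+ values are
assumptions (tools generate their own), and the +reconnect_interval+ value of
60 seconds is an arbitrary illustration that the command above does not set.

.Sketch of a remote node connection resource in XML (illustrative ids)
=====
[source,XML]
-------
<primitive id="remote1" class="ocf" provider="pacemaker" type="remote">
  <instance_attributes id="remote1-instance_attributes">
    <nvpair id="remote1-server" name="server" value="192.168.122.200"/>
    <nvpair id="remote1-port" name="port" value="8938"/>
    <!-- Retry a severed connection every 60 seconds, indefinitely -->
    <nvpair id="remote1-reconnect_interval" name="reconnect_interval" value="60"/>
  </instance_attributes>
</primitive>
-------
=====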
== Environment Variables for Daemon Start-up ==
Authentication and encryption of the connection between cluster nodes
and nodes running pacemaker_remote is achieved using
https://en.wikipedia.org/wiki/TLS-PSK[TLS-PSK] encryption/authentication
over TCP (port 3121 by default). This means that both the cluster node and
remote node must share the same private key. By default, this
key is placed at +/etc/pacemaker/authkey+ on each node.
You can change the default port and/or key location for Pacemaker and
pacemaker_remote via environment variables. These environment variables can be
enabled by placing them in the +/etc/sysconfig/pacemaker+ file.
----
#==#==# Pacemaker Remote
# Use a custom directory for finding the authkey.
PCMK_authkey_location=/etc/pacemaker/authkey
#
# Specify a custom port for Pacemaker Remote connections
PCMK_remote_port=3121
----
== Removing Remote Nodes and Guest Nodes ==
If the resource creating a guest node, or the *ocf:pacemaker:remote* resource
creating a connection to a remote node, is removed from the configuration, the
affected node will continue to show up in output as an offline node.
If you want to get rid of that output, run (replacing $NODE_NAME appropriately):
----
# crm_node --force --remove $NODE_NAME
----
[WARNING]
=========
Be absolutely sure that the node's resource has been deleted from the
configuration first.
=========
diff --git a/extra/resources/ClusterMon b/extra/resources/ClusterMon
index 8efdf1beae..5d1472d1a9 100644
--- a/extra/resources/ClusterMon
+++ b/extra/resources/ClusterMon
@@ -1,267 +1,267 @@
#!/bin/bash
#
#
# ClusterMon OCF RA.
# Starts crm_mon in background which logs cluster status as
# html to the specified file.
#
# Copyright (c) 2004 SUSE LINUX AG, Lars Marowsky-Brée
# All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# OCF instance parameters:
# OCF_RESKEY_user
# OCF_RESKEY_pidfile
# OCF_RESKEY_update
# OCF_RESKEY_extra_options
# OCF_RESKEY_htmlfile
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
#######################################################################
meta_data() {
cat <
-
+
1.0
This is a ClusterMon Resource Agent.
It outputs current cluster status to the html.
Runs crm_mon in the background, recording the cluster status to an HTML file
The user we want to run crm_mon as
The user we want to run crm_mon as
How frequently should we update the cluster status
Update interval
Additional options to pass to crm_mon. Eg. -n -r
Extra options
PID file location to ensure only one instance is running
PID file
Location to write HTML output to.
HTML output
END
}
#######################################################################
ClusterMon_usage() {
cat </dev/null 2>&1; rc=$?
case $rc in
0) exit $OCF_SUCCESS;;
1) exit $OCF_NOT_RUNNING;;
*) exit $OCF_ERR_GENERIC;;
esac
fi
fi
exit $OCF_NOT_RUNNING
}
CheckOptions() {
while getopts Vi:nrh:cdp: OPTION
do
case $OPTION in
V|n|r|c|d);;
i) ocf_log warn "You should not have specified the -i option, since OCF_RESKEY_update is set already!";;
h) ocf_log warn "You should not have specified the -h option, since OCF_RESKEY_htmlfile is set already!";;
p) ocf_log warn "You should not have specified the -p option, since OCF_RESKEY_pidfile is set already!";;
*) return $OCF_ERR_ARGS;;
esac
done
if [ $? -ne 0 ]; then
return $OCF_ERR_ARGS
fi
# We should have eaten all options at this stage
shift $(($OPTIND -1))
if [ $# -gt 0 ]; then
false
else
true
fi
}
ClusterMon_validate() {
# Existence of the user
if [ ! -z $OCF_RESKEY_user ]; then
getent passwd "$OCF_RESKEY_user" >/dev/null
if [ $? -eq 0 ]; then
: Yes, user exists. We can further check his permission on crm_mon if necessary
else
ocf_log err "The user $OCF_RESKEY_user does not exist!"
exit $OCF_ERR_ARGS
fi
fi
# Pidfile better be an absolute path
case $OCF_RESKEY_pidfile in
/*) ;;
*) ocf_log warn "You should have pidfile($OCF_RESKEY_pidfile) of absolute path!" ;;
esac
# Check the update interval
if ocf_is_decimal "$OCF_RESKEY_update" && [ $OCF_RESKEY_update -gt 0 ]; then
:
else
ocf_log err "Invalid update interval $OCF_RESKEY_update. It should be positive integer!"
exit $OCF_ERR_ARGS
fi
if CheckOptions $OCF_RESKEY_extra_options; then
:
else
ocf_log err "Invalid options $OCF_RESKEY_extra_options!"
exit $OCF_ERR_ARGS
fi
# Htmlfile better be an absolute path
case $OCF_RESKEY_htmlfile in
/*) ;;
*) ocf_log warn "You should have htmlfile($OCF_RESKEY_htmlfile) of absolute path!" ;;
esac
echo "Validate OK"
return $OCF_SUCCESS
}
if [ $# -ne 1 ]; then
ClusterMon_usage
exit $OCF_ERR_ARGS
fi
: ${OCF_RESKEY_update:="15000"}
: ${OCF_RESKEY_pidfile:="/tmp/ClusterMon_${OCF_RESOURCE_INSTANCE}.pid"}
: ${OCF_RESKEY_htmlfile:="/tmp/ClusterMon_${OCF_RESOURCE_INSTANCE}.html"}
OCF_RESKEY_update=`expr $OCF_RESKEY_update / 1000`
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) ClusterMon_start
;;
stop) ClusterMon_stop
;;
monitor) ClusterMon_monitor
;;
validate-all) ClusterMon_validate
;;
usage|help) ClusterMon_usage
exit $OCF_SUCCESS
;;
*) ClusterMon_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
exit $?
diff --git a/extra/resources/HealthCPU b/extra/resources/HealthCPU
index 32a10ad3e7..c5fbb5372a 100644
--- a/extra/resources/HealthCPU
+++ b/extra/resources/HealthCPU
@@ -1,222 +1,222 @@
#!/bin/sh
#
#
# HealthCPU OCF RA. Measures CPUs idling and writes
# #health-cpu status into the CIB
#
# Copyright (c) 2009 Michael Schwartzkopff
# in collaboration with the Bull company. Merci!
#
# All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
################################
#
# TODO: Enter default values
# Error handling in getting uptime
#
##################################
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
#######################################################################
meta_data() {
cat <
-0.1
+1.0
System health agent that measures the CPU idling and updates the #health-cpu attribute.
System health CPU usage
Location to store the resource state in.
State file
Lower (!) limit of idle percentage to switch the health attribute to yellow. I.e.
the #health-cpu will go yellow if the %idle of the CPU falls below 50%.
Lower limit for yellow health attribute
Lower (!) limit of idle percentage to switch the health attribute to red. I.e.
the #health-cpu will go red if the %idle of the CPU falls below 10%.
Lower limit for red health attribute
END
}
#######################################################################
# don't exit on TERM, to test that lrmd makes sure that we do exit
trap sigterm_handler TERM
sigterm_handler() {
ocf_log info "They use TERM to bring us down. No such luck."
return
}
dummy_usage() {
cat <
-0.1
+1.0
System health agent that checks the S.M.A.R.T. status of the given drives and
updates the #health-smart attribute.
SMART health status
Location to store the resource state in.
State file
The drive(s) to check as a SPACE separated list. Enter the full path to the device, e.g. "/dev/sda".
Drives to check
The device type(s) to assume for the drive(s) being tested as a SPACE separated list.
Device types
Lower limit of the temperature in deg C of the drive(s). Below this limit the status will be red.
Lower limit for the red smart attribute
Upper limit of the temperature in deg C of the drive(s). If the drive reports
a temperature higher than this value the status of #health-smart will be red.
Upper limit for red smart attribute
Number of deg C below/above the upper/lower temp limits at which point the status of #health-smart will change to yellow.
Deg C below/above the upper limits for yellow smart attribute
END
}
#######################################################################
check_temperature() {
if [ $1 -lt ${lower_red_limit} ] ; then
ocf_log info "Drive ${DRIVE} ${DEVICE} too cold: ${1} C"
$ATTRDUP -n "#health-smart" -U "red" -d "5s"
return 1
fi
if [ $1 -gt ${upper_red_limit} ] ; then
ocf_log info "Drive ${DRIVE} ${DEVICE} too hot: ${1} C"
$ATTRDUP -n "#health-smart" -U "red" -d "5s"
return 1
fi
if [ $1 -lt ${lower_yellow_limit} ] ; then
ocf_log info "Drive ${DRIVE} ${DEVICE} quite cold: ${1} C"
$ATTRDUP -n "#health-smart" -U "yellow" -d "5s"
return 1
fi
if [ $1 -gt ${upper_yellow_limit} ] ; then
ocf_log info "Drive ${DRIVE} ${DEVICE} quite hot: ${1} C"
$ATTRDUP -n "#health-smart" -U "yellow" -d "5s"
return 1
fi
}
init_smart() {
#Set temperature defaults
if [ -z ${OCF_RESKEY_temp_warning} ]; then
yellow_threshold=5
else
yellow_threshold=${OCF_RESKEY_temp_warning}
fi
if [ -z ${OCF_RESKEY_temp_lower_limit} ] ; then
lower_red_limit=0
else
lower_red_limit=${OCF_RESKEY_temp_lower_limit}
fi
lower_yellow_limit=$((${lower_red_limit}+${yellow_threshold}))
if [ -z ${OCF_RESKEY_temp_upper_limit} ] ; then
upper_red_limit=60
else
upper_red_limit=${OCF_RESKEY_temp_upper_limit}
fi
upper_yellow_limit=$((${upper_red_limit}-${yellow_threshold}))
#Set disk defaults
if [ -z "${OCF_RESKEY_drives}" ] ; then
DRIVES="/dev/sda"
else
DRIVES=${OCF_RESKEY_drives}
fi
#Test for presence of smartctl
if [ ! -x $SMARTCTL ] ; then
ocf_log err "${SMARTCTL} not installed."
exit $OCF_ERR_INSTALLED
fi
for DRIVE in $DRIVES; do
if [ "${OCF_RESKEY_devices}" ]; then
for DEVICE in ${OCF_RESKEY_devices}; do
$SMARTCTL -d $DEVICE -i ${DRIVE} | grep -q "SMART support is: Enabled"
if [ $? -ne "0" ] ; then
ocf_log err "S.M.A.R.T. not enabled for drive ${DRIVE}"
exit $OCF_ERR_INSTALLED
fi
done
else
$SMARTCTL -i ${DRIVE} | grep -q "SMART support is: Enabled"
if [ $? -ne "0" ] ; then
ocf_log err "S.M.A.R.T. not enabled for drive ${DRIVE}"
exit $OCF_ERR_INSTALLED
fi
fi
done
}
HealthSMART_usage() {
cat <
-
+
1.0
This is a SysInfo Resource Agent.
It records (in the CIB) various attributes of a node
Sample Linux output:
arch: i686
os: Linux-2.4.26-gentoo-r14
free_swap: 1999
cpu_info: Intel(R) Celeron(R) CPU 2.40GHz
cpu_speed: 4771.02
cpu_cores: 1
cpu_load: 0.00
ram_total: 513
ram_free: 117
root_free: 2.4
#health_disk: red
Sample Darwin output:
arch: i386
os: Darwin-8.6.2
cpu_info: Intel Core Duo
cpu_speed: 2.16
cpu_cores: 2
cpu_load: 0.18
ram_total: 2016
ram_free: 787
root_free: 13
#health_disk: green
Units:
free_swap: Mb
ram_*: Mb
cpu_speed (Linux): bogomips
cpu_speed (Darwin): Ghz
*_free: GB (or user-defined: disk_unit)
SysInfo resource agent
PID file
PID file
Interval to allow values to stabilize
Dampening Delay
Filesystems or Paths to be queried for free disk space as a SPACE
separated list - e.g "/dev/sda1 /tmp".
Results will be written to an attribute with leading slashes
removed, and other slashes replaced with underscore, and the word
'free' appended - e.g for /dev/sda1 it would be 'dev_sda1_free'.
Note: The root filesystem '/' is always queried to an attribute
named 'root_free'
List of Filesystems/Paths to query for free disk space
Unit to report disk free space in.
Can be one of: B, K, M, G, T, P (case-insensitive)
Unit to report disk free space in
The amount of free space required in monitored disks. If any
of the monitored disks has less than this amount of free space,
the node attribute "#health_disk" changes to "red" and
all resources will move away from the node. Set the node-health-strategy
property appropriately for this to take effect.
If the unit is not specified, it defaults to disk_unit.
minimum disk free space required
END
}
#######################################################################
UpdateStat() {
name=$1; shift
value="$*"
printf "%s:\t%s\n" "$name" "$value"
if [ "$__OCF_ACTION" = "start" ] ; then
${HA_SBIN_DIR}/attrd_updater ${OCF_RESKEY_delay} -S status -n $name -B "$value"
else
${HA_SBIN_DIR}/attrd_updater ${OCF_RESKEY_delay} -S status -n $name -v "$value"
fi
}
SysInfoStats() {
UpdateStat arch "`uname -m`"
UpdateStat os "`uname -s`-`uname -r`"
case `uname -s` in
"Darwin")
mem=`top -l 1 | grep Mem: | awk '{print $10}'`
mem_used=`top -l 1 | grep Mem: | awk '{print $8}'`
mem=`SysInfo_mem_units $mem`
mem_used=`SysInfo_mem_units $mem_used`
mem_total=`expr $mem_used + $mem`
cpu_type=`system_profiler SPHardwareDataType | awk -F': ' '/^CPU Type/ {print $2; exit}'`
cpu_speed=`system_profiler SPHardwareDataType | awk -F': ' '/^CPU Speed/ {print $2; exit}'`
cpu_cores=`system_profiler SPHardwareDataType | awk -F': ' '/^Number Of/ {print $2; exit}'`
;;
"Linux")
if [ -f /proc/cpuinfo ]; then
cpu_type=`awk -F': ' '/model name/ {print $2; exit}' /proc/cpuinfo`
cpu_speed=`awk -F': ' '/bogomips/ {print $2; exit}' /proc/cpuinfo`
cpu_cores=`grep "^processor" /proc/cpuinfo | wc -l`
fi
if [ -f /proc/meminfo ]; then
# meminfo results are in kB
mem=`grep "SwapFree" /proc/meminfo | awk '{print $2"k"}'`
if [ ! -z $mem ]; then
UpdateStat free_swap `SysInfo_mem_units $mem`
fi
mem=`grep "Inactive" /proc/meminfo | awk '{print $2"k"}'`
mem_total=`grep "MemTotal" /proc/meminfo | awk '{print $2"k"}'`
else
mem=`top -n 1 | grep Mem: | awk '{print $7}'`
fi
;;
*)
esac
if [ x != x"$cpu_type" ]; then
UpdateStat cpu_info "$cpu_type"
fi
if [ x != x"$cpu_speed" ]; then
UpdateStat cpu_speed "$cpu_speed"
fi
if [ x != x"$cpu_cores" ]; then
UpdateStat cpu_cores "$cpu_cores"
fi
loads=`uptime`
load15=`echo ${loads} | awk '{print $10}'`
UpdateStat cpu_load $load15
if [ ! -z "$mem" ]; then
# Massage the memory values
UpdateStat ram_total `SysInfo_mem_units $mem_total`
UpdateStat ram_free `SysInfo_mem_units $mem`
fi
# Portability notes:
# o tail: explicit "-n" not available in Solaris; instead simplify
# 'tail -n N' to the equivalent 'tail -N'.
for disk in "/" ${OCF_RESKEY_disks}; do
unset disk_free disk_label
disk_free=`df -h ${disk} | tail -1 | awk '{print $4}'`
if [ x != x"$disk_free" ]; then
disk_label=`echo $disk | sed -e 's#^/$#root#;s#^/*##;s#/#_#g'`
disk_free=`SysInfo_hdd_units $disk_free`
UpdateStat ${disk_label}_free $disk_free
if [ -n "$MIN_FREE" ]; then
if [ $disk_free -le $MIN_FREE ]; then
UpdateStat "#health_disk" "red"
else
UpdateStat "#health_disk" "green"
fi
fi
fi
done
}
SysInfo_megabytes() {
# Size in megabytes
echo $1 | awk '{ n = $0;
sub(/[0-9]+(.[0-9]+)?/, "");
split(n, a, $0);
n=a[1];
if ($0 == "G" || $0 == "") { n *= 1024 };
if (/^kB?/) { n /= 1024 };
printf "%d\n", n }' # Intentionally round to an integer
}
SysInfo_mem_units() {
mem=$1
if [ -z $1 ]; then
return
fi
mem=$(SysInfo_megabytes "$1")
# Round to the next multiple of 50
r=$(($mem % 50))
if [ $r != 0 ]; then
mem=$(($mem + 50 - $r))
fi
echo $mem
}
SysInfo_hdd_units() {
# Defaults to size in gigabytes
case $OCF_RESKEY_disk_unit in
[Pp]) echo $(($(SysInfo_megabytes "$1") / 1024 / 1024 / 1024));;
[Tt]) echo $(($(SysInfo_megabytes "$1") / 1024 / 1024));;
[Gg]) echo $(($(SysInfo_megabytes "$1") / 1024));;
[Mm]) echo $(SysInfo_megabytes "$1");;
[Kk]) echo $(($(SysInfo_megabytes "$1") * 1024));;
[Bb]) echo $(($(SysInfo_megabytes "$1") * 1024 * 1024));;
*)
ocf_log err "Invalid value for disk_unit: $OCF_RESKEY_disk_unit"
echo $(($(SysInfo_megabytes "$1") / 1024));;
esac
}
SysInfo_usage() {
cat < $OCF_RESKEY_pidfile
SysInfoStats
exit $OCF_SUCCESS
}
SysInfo_stop() {
rm $OCF_RESKEY_pidfile
exit $OCF_SUCCESS
}
SysInfo_monitor() {
if [ -f $OCF_RESKEY_pidfile ]; then
clone=`cat $OCF_RESKEY_pidfile`
fi
if [ x$clone = x ]; then
rm $OCF_RESKEY_pidfile
exit $OCF_NOT_RUNNING
elif [ $clone = $OCF_RESKEY_clone ]; then
SysInfoStats
exit $OCF_SUCCESS
elif [ x$OCF_RESKEY_CRM_meta_globally_unique = xtrue \
-o x$OCF_RESKEY_CRM_meta_globally_unique = xTrue \
-o x$OCF_RESKEY_CRM_meta_globally_unique = xyes \
-o x$OCF_RESKEY_CRM_meta_globally_unique = xYes \
]; then
SysInfoStats
exit $OCF_SUCCESS
fi
exit $OCF_NOT_RUNNING
}
SysInfo_validate() {
return $OCF_SUCCESS
}
if [ $# -ne 1 ]; then
SysInfo_usage
exit $OCF_ERR_ARGS
fi
: ${OCF_RESKEY_pidfile:="${HA_VARRUN%%/}/SysInfo-${OCF_RESOURCE_INSTANCE}"}
: ${OCF_RESKEY_disk_unit:="G"}
: ${OCF_RESKEY_clone:="0"}
if [ x != x${OCF_RESKEY_delay} ]; then
OCF_RESKEY_delay="-d ${OCF_RESKEY_delay}"
else
OCF_RESKEY_delay="-d 0"
fi
MIN_FREE=""
if [ -n "$OCF_RESKEY_min_disk_free" ]; then
ocf_is_decimal "$OCF_RESKEY_min_disk_free" &&
OCF_RESKEY_min_disk_free="$OCF_RESKEY_min_disk_free$OCF_RESKEY_disk_unit"
MIN_FREE=`SysInfo_hdd_units $OCF_RESKEY_min_disk_free`
fi
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) SysInfo_start
;;
stop) SysInfo_stop
;;
monitor) SysInfo_monitor
;;
validate-all) SysInfo_validate
;;
usage|help) SysInfo_usage
exit $OCF_SUCCESS
;;
*) SysInfo_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
exit $?
diff --git a/extra/resources/SystemHealth b/extra/resources/SystemHealth
index 658d446273..3e76fc3221 100644
--- a/extra/resources/SystemHealth
+++ b/extra/resources/SystemHealth
@@ -1,252 +1,252 @@
#!/bin/sh
#
# SystemHealth OCF RA.
#
# Copyright (c) 2009 International Business Machines (IBM), Mark Hamzy
# All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
#######################################################################
meta_data() {
cat <
-0.1
+1.0
This is a SystemHealth Resource Agent. It is used to monitor
the health of a system via IPMI.
SystemHealth resource agent
END
}
#######################################################################
SystemHealth_usage() {
cat < /dev/null 2>&1
RC=$?
if [ $RC != 0 ]; then
ocf_log err "servicelog_notify not found!"
return $OCF_ERR_INSTALLED
fi
which ipmiservicelogd > /dev/null 2>&1
RC=$?
if [ $RC != 0 ]; then
ocf_log err "ipmiservicelogd not found!"
return $OCF_ERR_INSTALLED
fi
test -x $OCF_RESKEY_program
RC=$?
if [ $RC != 0 ]; then
ocf_log err "$OCF_RESKEY_program not found!"
return $OCF_ERR_INSTALLED
fi
}
SystemHealth_start() {
SystemHealth_monitor
RC=$?
if [ $RC = $OCF_ERR_GENERIC ]; then
return $OCF_ERR_GENERIC
elif [ $RC = $OCF_SUCCESS ]; then
ocf_log warn "starting an already started SystemHealth"
return $OCF_SUCCESS
fi
service ipmi start > /dev/null 2>&1
RC=$?
if [ $RC != 0 ]; then
ocf_log err "Could not start service IPMI!"
return $OCF_ERR_GENERIC
fi
ipmiservicelogd smi 0 > /dev/null 2>&1 &
RC=$?
if [ $RC != 0 ]; then
ocf_log err "Could not start ipmiservicelogd!"
return $OCF_ERR_GENERIC
fi
servicelog_notify --add --type=EVENT --command="$OCF_RESKEY_program" --method=num_arg --match='type=4' > /dev/null 2>&1
RC=$?
if [ $RC != 0 ]; then
ocf_log err "servicelog_notify register handler failed!"
return $OCF_ERR_GENERIC
fi
return $OCF_SUCCESS
}
SystemHealth_stop() {
SystemHealth_monitor
RC=$?
if [ $RC = $OCF_ERR_GENERIC ]; then
return $OCF_ERR_GENERIC
elif [ $RC = $OCF_SUCCESS ]; then
killall ipmiservicelogd
RC1=$?
if [ $RC1 != 0 ]; then
ocf_log err "Could not stop ipmiservicelogd!"
fi
servicelog_notify --remove --command="$OCF_RESKEY_program" > /dev/null 2>&1
RC2=$?
if [ $RC2 != 0 ]; then
ocf_log err "servicelog_notify remove handler failed!"
fi
if [ $RC1 = 0 -a $RC2 = 0 ]; then
return $OCF_SUCCESS
else
return $OCF_ERR_GENERIC
fi
elif [ $RC = $OCF_NOT_RUNNING ]; then
ocf_log warn "stopping an already stopped SystemHealth"
return $OCF_SUCCESS
else
ocf_log err "SystemHealth_stop: should not be here!"
return $OCF_ERR_GENERIC
fi
}
SystemHealth_monitor() {
# Monitor _MUST!_ differentiate correctly between running
# (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
# That is THREE states, not just yes/no.
if [ ! -f /var/run/ipmiservicelogd.pid0 ]; then
ocf_log debug "ipmiservicelogd is not running!"
return $OCF_NOT_RUNNING
fi
ps -p `cat /var/run/ipmiservicelogd.pid0` > /dev/null 2>&1
RC=$?
if [ $RC != 0 ]; then
ocf_log debug "ipmiservicelogd's pid `cat /var/run/ipmiservicelogd.pid0` is not running!"
rm /var/run/ipmiservicelogd.pid0
return $OCF_ERR_GENERIC
fi
servicelog_notify --list --command="$OCF_RESKEY_program" > /dev/null 2>&1
RC=$?
if [ $RC = 0 ]; then
return $OCF_SUCCESS
else
return $OCF_NOT_RUNNING
fi
}
SystemHealth_validate() {
SystemHealth_check_tools
RC=$?
if [ $RC != 0 ]; then
return $RC
fi
return $OCF_SUCCESS
}
: ${OCF_RESKEY_program=/usr/sbin/notifyServicelogEvent}
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
usage|help) SystemHealth_usage
exit $OCF_SUCCESS
;;
esac
SystemHealth_check_tools
RC=$?
if [ $RC != 0 ]; then
case $__OCF_ACTION in
stop) exit $OCF_SUCCESS;;
*) exit $RC;;
esac
fi
case $__OCF_ACTION in
start) SystemHealth_start;;
stop) SystemHealth_stop;;
monitor) SystemHealth_monitor;;
reload) ocf_log info "Reloading..."
SystemHealth_start
;;
validate-all) ;;
*) SystemHealth_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc
diff --git a/extra/resources/ifspeed b/extra/resources/ifspeed
index a9390dc312..a41377c371 100644
--- a/extra/resources/ifspeed
+++ b/extra/resources/ifspeed
@@ -1,458 +1,458 @@
#!/bin/bash
#
# OCF resource agent which monitors state of network interface and records it
# as a value in CIB based on sum of speeds of its active (up, link detected,
# not blocked) underlying interfaces.
#
# Copyright (c) 2011 Vladislav Bogdanov
# Partially based on 'ping' RA by Andrew Beekhof
#
# OCF instance parameters:
# OCF_RESKEY_name: name of attribute to set in CIB
# OCF_RESKEY_iface: network interface to monitor
# OCF_RESKEY_bridge_ports: if not null and OCF_RESKEY_iface is a bridge, list of
# bridge ports to consider.
# Default is all ports which have designated_bridge=root_id
# OCF_RESKEY_weight_base: Relative weight of 1Gbps. This can be used to tune
# value of resulting CIB attribute.
#
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
# Defaults
OCF_RESKEY_name_default="ifspeed"
OCF_RESKEY_bridge_ports_default="detect"
OCF_RESKEY_weight_base_default=1000
OCF_RESKEY_dampen_default=5
: ${OCF_RESKEY_name=${OCF_RESKEY_name_default}}
: ${OCF_RESKEY_bridge_ports=${OCF_RESKEY_bridge_ports_default}}
: ${OCF_RESKEY_weight_base=${OCF_RESKEY_weight_base_default}}
: ${OCF_RESKEY_dampen=${OCF_RESKEY_dampen_default}}
meta_data() {
cat <
-
+
1.0
Every time the monitor action is run, this resource agent records (in the CIB)
the (relative) speed of the network interface it monitors.
This RA can monitor physical interfaces, bonds, bridges, vlans and (hopefully)
any combination of them.
Examples:
*) Bridge on top of one 10Gbps interface (eth2) and 802.3ad bonding (bond0) built
on two 1Gbps interfaces (eth0 and eth1).
*) Active-backup bonding built on top of one physical interface and one vlan on
another interface.
For STP-enabled bridges this RA tries to guess the network topology and by
default looks only at ports which are connected to an upstream switch. This can be
overridden by the 'bridge_ports' parameter. Active interfaces in this case are those
in "forwarding" state.
For balancing bonds this RA sums speeds of underlying "up" slave interfaces
(and applies coefficient 0.8 to result).
For non-balancing bonds ('active-backup' and probably 'broadcast') only the speed of
the currently active slave is used.
Network interface speed monitor
The name of the attribute to set. This is the name to be used in the constraints.
Attribute name
Network interface to monitor.
Network interface
If not null and OCF_RESKEY_iface is a bridge, list of bridge ports to consider.
Default is all ports which have designated_bridge=root_id.
Bridge ports
Relative weight of 1Gbps in interface speed.
Can be used to tune how big attribute value will be.
Weight of 1Gbps
The time to wait (dampening) for further changes to occur.
Dampening interval
Log what has been done more verbosely.
Verbose logging
END
}
usage() {
cat </dev/null)"
if [ -z "$SP_OUT" ]
then
modprobe -s ocfs2_stack_user
if [ $? != 0 ]; then
ocf_log err "Could not load ocfs2_stack_user"
return $OCF_ERR_INSTALLED
fi
fi
SP_OUT="$(awk '/^'user'$/{print; exit}' "$LOADED_PLUGINS_FILE" 2>/dev/null)"
if [ -z "$SP_OUT" ]; then
ocf_log err "Switch to userspace stack unsuccessful"
return $OCF_ERR_INSTALLED
fi
if [ -f "$CLUSTER_STACK_FILE" ]; then
echo "$OCF_RESKEY_stack" >"$CLUSTER_STACK_FILE"
if [ $? != 0 ]; then
ocf_log err "Userspace stack '$OCF_RESKEY_stack' not supported"
return $OCF_ERR_INSTALLED
fi
else
ocf_log err "Switch to userspace stack not supported"
return $OCF_ERR_INSTALLED
fi
driver_filesystem ocfs2; rc=$?
if [ $rc != 0 ]; then
modprobe -s ocfs2
if [ "$?" != 0 ]; then
ocf_log err "Unable to load ocfs2 module"
return $OCF_ERR_INSTALLED
fi
fi
bringup_daemon
return $?
}
o2cb_stop() {
o2cb_monitor; rc=$?
case $rc in
$OCF_NOT_RUNNING) return $OCF_SUCCESS;;
esac
ocf_log info "Stopping $OCF_RESOURCE_INSTANCE"
kill_daemon
if [ $? != 0 ]; then
ocf_log err "Unable to unload modules: the cluster is still online"
return $OCF_ERR_GENERIC
fi
unload_filesystem ocfs2
if [ $? = 1 ]; then
ocf_log err "Unable to unload ocfs2 module"
return $OCF_ERR_GENERIC
fi
# If we can't find the stack glue, we have nothing to do.
[ ! -e "$LOADED_PLUGINS_FILE" ] && return $OCF_SUCCESS
while read plugin
do
unload_module "ocfs2_stack_${plugin}"
if [ $? = 1 ]; then
ocf_log err "Unable to unload ocfs2_stack_${plugin}"
return $OCF_ERR_GENERIC
fi
done <"$LOADED_PLUGINS_FILE"
unload_module "ocfs2_stackglue"
if [ $? = 1 ]; then
ocf_log err "Unable to unload ocfs2_stackglue"
return $OCF_ERR_GENERIC
fi
# Don't unmount configfs - it's always in use by libdlm
}
o2cb_monitor() {
o2cb_validate
# Assume that ocfs2_controld will terminate if any of the conditions below are met
driver_filesystem configfs; rc=$?
if [ $rc != 0 ]; then
ocf_log info "configfs not loaded"
return $OCF_NOT_RUNNING
fi
check_filesystem configfs "${OCF_RESKEY_configfs}"; rc=$?
if [ $rc != 0 ]; then
ocf_log info "configfs not mounted"
return $OCF_NOT_RUNNING
fi
if [ ! -e "$LOADED_PLUGINS_FILE" ]; then
ocf_log info "Stack glue driver not loaded"
return $OCF_NOT_RUNNING
fi
grep user "$LOADED_PLUGINS_FILE" >/dev/null 2>&1; rc=$?
if [ $rc != 0 ]; then
ocf_log err "Wrong stack `cat $LOADED_PLUGINS_FILE`"
return $OCF_ERR_INSTALLED
fi
driver_filesystem ocfs2; rc=$?
if [ $rc != 0 ]; then
ocf_log info "ocfs2 not loaded"
return $OCF_NOT_RUNNING
fi
status_daemon
return $?
}
o2cb_usage() {
echo "usage: $0 {start|stop|monitor|validate-all|meta-data}"
echo " Expects to have a fully populated OCF RA-compliant environment set."
echo " In particular, a value for OCF_ROOT"
}
o2cb_validate() {
check_binary ${DAEMON}
case ${OCF_RESKEY_CRM_meta_globally_unique} in
yes|Yes|true|True|1)
ocf_log err "$OCF_RESOURCE_INSTANCE must be configured with the globally_unique=false meta attribute"
exit $OCF_ERR_CONFIGURED
;;
esac
return $OCF_SUCCESS
}
meta_data() {
cat <
-
+
1.0
OCFS2 daemon resource agent
This Resource Agent controls the userspace daemon needed by OCFS2.
Location where sysfs is mounted
Sysfs location
Location where configfs is mounted
Configfs location
Which userspace stack to use. Known values: pcmk, cman
Userspace stack
Number of seconds to allow the control daemon to come up
Daemon Timeout
END
}
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) o2cb_start
;;
stop) o2cb_stop
;;
monitor) o2cb_monitor
;;
validate-all) o2cb_validate
;;
usage|help) o2cb_usage
exit $OCF_SUCCESS
;;
*) o2cb_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
exit $?
diff --git a/extra/resources/ping b/extra/resources/ping
index e2c5e9eefb..26cc0cc5d5 100755
--- a/extra/resources/ping
+++ b/extra/resources/ping
@@ -1,436 +1,436 @@
#!/bin/sh
#
#
# Ping OCF RA that utilizes the system ping
#
# Copyright (c) 2009 Andrew Beekhof
# All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
#######################################################################
meta_data() {
cat <
-
+
1.0
Every time the monitor action is run, this resource agent records (in the CIB) the current number of nodes the host can connect to using the system fping (preferred) or ping tool.
node connectivity
PID file
PID file
The time to wait (dampening) for further changes to occur
Dampening interval
The name of the attributes to set. This is the name to be used in the constraints.
Attribute name
The number by which to multiply the number of connected ping nodes
Value multiplier
A space separated list of ping nodes to count.
Host list
Number of ping attempts, per host, before declaring it dead
no. of ping attempts
How long, in seconds, to wait before declaring a ping lost
ping timeout in seconds
A catch all for any other options that need to be passed to ping.
Extra Options
Resource is failed if the score is less than failure_score.
Default never fails.
failure_score
Use fping rather than ping, if found. If set to 0, fping
will not be used even if present.
Use fping if available
Enables the default attrd_updater verbose logging on every call.
Verbose logging
END
}
#######################################################################
ping_conditional_log() {
level=$1; shift
if [ ${OCF_RESKEY_debug} = "true" ]; then
ocf_log $level "$*"
fi
}
ping_usage() {
cat <$f_out 2>$f_err; rc=$?
active=`grep alive $f_out|wc -l`
case $rc in
0)
;;
1)
for h in `grep unreachable $f_out | awk '{print $1}'`; do
ping_conditional_log warn "$h is inactive"
done
;;
*)
ocf_log err "Unexpected result for '$cmd' $rc: `tr '\n' ';' < $f_err`"
;;
esac
rm -f $f_out $f_err
return $active
}
ping_check() {
active=0
for host in $OCF_RESKEY_host_list; do
p_exe=ping
case `uname` in
Linux) p_args="-n -q -W $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts";;
Darwin) p_args="-n -q -t $OCF_RESKEY_timeout -c $OCF_RESKEY_attempts -o";;
*) ocf_log err "Unknown host type: `uname`"; exit $OCF_ERR_INSTALLED;;
esac
case $host in
*:*) p_exe=ping6
esac
p_out=`$p_exe $p_args $OCF_RESKEY_options $host 2>&1`; rc=$?
case $rc in
0) active=`expr $active + 1`;;
1) ping_conditional_log warn "$host is inactive: $p_out";;
*) ocf_log err "Unexpected result for '$p_exe $p_args $OCF_RESKEY_options $host' $rc: $p_out";;
esac
done
return $active
}
ping_update() {
if use_fping; then
fping_check
active=$?
else
ping_check
active=$?
fi
score=`expr $active \* $OCF_RESKEY_multiplier`
if [ "$__OCF_ACTION" = "start" ] ; then
attrd_updater -n $OCF_RESKEY_name -B $score -d $OCF_RESKEY_dampen $attrd_options
else
attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen $attrd_options
fi
rc=$?
case $rc in
0) ping_conditional_log debug "Updated $OCF_RESKEY_name = $score" ;;
*) ocf_log warn "Could not update $OCF_RESKEY_name = $score: rc=$rc";;
esac
if [ $rc -ne 0 ]; then
return $rc
fi
if [ -n "$OCF_RESKEY_failure_score" -a "$score" -lt "$OCF_RESKEY_failure_score" ]; then
ocf_log warn "$OCF_RESKEY_name is less than failure_score($OCF_RESKEY_failure_score)"
return 1
fi
return 0
}
use_fping() {
ocf_is_true "$OCF_RESKEY_use_fping" && have_binary fping;
}
# return values:
# 4 IPv4
# 6 IPv6
# 0 indefinite (i.e. hostname)
host_family() {
case $1 in
*[0-9].*[0-9].*[0-9].*[0-9]) return 4 ;;
*:*) return 6 ;;
*) return 0 ;;
esac
}
# return values same as host_family plus
# 99 ambiguous families
hosts_family() {
# For fping allow only same IP versions or hostnames
family=0
for host in $OCF_RESKEY_host_list; do
host_family $host
f=$?
if [ $family -ne 0 -a $f -ne 0 -a $f -ne $family ] ; then
family=99
break
fi
[ $f -ne 0 ] && family=$f
done
return $family
}
: ${OCF_RESKEY_name:="pingd"}
: ${OCF_RESKEY_dampen:="5s"}
: ${OCF_RESKEY_attempts:="3"}
: ${OCF_RESKEY_multiplier:="1"}
: ${OCF_RESKEY_debug:="false"}
: ${OCF_RESKEY_failure_score:="0"}
: ${OCF_RESKEY_use_fping:="1"}
: ${OCF_RESKEY_CRM_meta_timeout:="20000"}
: ${OCF_RESKEY_CRM_meta_globally_unique:="false"}
integer=`echo ${OCF_RESKEY_timeout} | egrep -o '[0-9]*'`
case ${OCF_RESKEY_timeout} in
*[0-9]ms|*[0-9]msec) OCF_RESKEY_timeout=`expr $integer / 1000`;;
*[0-9]m|*[0-9]min) OCF_RESKEY_timeout=`expr $integer \* 60`;;
*[0-9]h|*[0-9]hr) OCF_RESKEY_timeout=`expr $integer \* 60 \* 60`;;
*) OCF_RESKEY_timeout=$integer;;
esac
if [ -z ${OCF_RESKEY_timeout} ]; then
if [ x"$OCF_RESKEY_host_list" != x ]; then
host_count=`echo $OCF_RESKEY_host_list | awk '{print NF}'`
OCF_RESKEY_timeout=`expr $OCF_RESKEY_CRM_meta_timeout / $host_count / $OCF_RESKEY_attempts`
OCF_RESKEY_timeout=`expr $OCF_RESKEY_timeout / 1100` # Convert to seconds and finish 10% early
else
OCF_RESKEY_timeout=5
fi
fi
if [ ${OCF_RESKEY_timeout} -lt 1 ]; then
OCF_RESKEY_timeout=5
elif [ ${OCF_RESKEY_timeout} -gt 1000 ]; then
# ping actually complains if this value is too high, 5 minutes is plenty
OCF_RESKEY_timeout=300
fi
if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
: ${OCF_RESKEY_pidfile:="${HA_VARRUN%%/}/ping-${OCF_RESKEY_name}"}
else
: ${OCF_RESKEY_pidfile:="${HA_VARRUN%%/}/ping-${OCF_RESOURCE_INSTANCE}"}
fi
# Check the debug option
case "${OCF_RESKEY_debug}" in
true|True|TRUE|1) OCF_RESKEY_debug=true;;
false|False|FALSE|0) OCF_RESKEY_debug=false;;
*)
ocf_log warn "Value for 'debug' is incorrect. Please specify 'true' or 'false' not: ${OCF_RESKEY_debug}"
OCF_RESKEY_debug=false
;;
esac
attrd_options='-q'
if [ ${OCF_RESKEY_debug} = "true" ]; then
attrd_options=''
fi
case $__OCF_ACTION in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) ping_start;;
stop) ping_stop;;
monitor) ping_monitor;;
validate-all) ping_validate;;
usage|help) ping_usage
exit $OCF_SUCCESS
;;
*) ping_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
exit $?
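The timeout handling in the ping agent above accepts values with unit suffixes and reduces them to whole seconds. The conversion can be sketched on its own roughly as follows. `to_seconds` is a hypothetical helper, and it extracts only the leading digits with `sed`, whereas the agent uses `egrep -o`; the `case` patterns and their order mirror the agent.

```shell
#!/bin/sh
# to_seconds VALUE  (hypothetical helper for illustration only)
# Prints VALUE converted to whole seconds; returns 1 if VALUE has no leading digits.
to_seconds() {
    value=$1
    # Keep only the leading run of digits
    integer=$(printf '%s\n' "$value" | sed 's/^\([0-9]*\).*/\1/')
    [ -n "$integer" ] || return 1
    case $value in
        # The millisecond patterns must come before the minute patterns,
        # otherwise "500ms" would never reach them
        *[0-9]ms|*[0-9]msec) echo $((integer / 1000));;
        *[0-9]m|*[0-9]min)   echo $((integer * 60));;
        *[0-9]h|*[0-9]hr)    echo $((integer * 60 * 60));;
        # Plain numbers and any other suffix (such as "s") are taken as seconds
        *)                   echo "$integer";;
    esac
}
```

Because the division is integer division, any millisecond value below 1000 becomes 0, which is why the agent later replaces timeouts smaller than 1 with a default of 5 seconds.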
diff --git a/extra/resources/pingd b/extra/resources/pingd
index add152642e..6003c02fe0 100644
--- a/extra/resources/pingd
+++ b/extra/resources/pingd
@@ -1,200 +1,200 @@
#!/bin/sh
#
#
# pingd OCF Resource Agent
# Records (in the CIB) the current number of ping nodes a
# cluster node can connect to.
#
# Copyright (c) 2006 Andrew Beekhof
# All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
: ${OCF_RESKEY_name:="pingd"}
: ${OCF_RESKEY_interval:="1"}
: ${OCF_RESKEY_CRM_meta_interval:=0}
upgrade1="This agent (ocf:pacemaker:pingd) has been replaced by the more reliable ocf:pacemaker:ping."
upgrade2="Attempting automated conversion, run 'crm ra info ocf:pacemaker:ping' for all configuration options"
upgrade3="You will need to remove the existing resource and replace it with one that uses 'ocf:pacemaker:ping' directly"
case $__OCF_ACTION in
start|monitor)
if [ "x" = "x$OCF_RESKEY_host_list" ]; then
ocf_log err "$upgrade1"
ocf_log err "$upgrade2"
ocf_log err "Automatic conversion to ocf:pacemaker:ping failed: no hosts were configured to check for connectivity"
ocf_log err "$upgrade3"
exit $OCF_ERR_ARGS
fi
recurring=`crm configure show $OCF_RESOURCE_INSTANCE | grep "op monitor.*interval=\"[1-9]" | sed s/.*interval=// | awk -F\" '{print $2}' | sort | head -n 1`
if [ -z $recurring ]; then
ocf_log err "$upgrade1"
ocf_log err "$upgrade2"
ocf_log err "Automatic conversion to ocf:pacemaker:ping failed: no monitor operation configured"
ocf_log err "Without an explicit monitor operation for '$OCF_RESOURCE_INSTANCE', connectivity changes will not be noticed"
ocf_log err "Preventing startup to ensure the issue is addressed before it matters"
exit $OCF_ERR_ARGS
fi
if [ $OCF_RESKEY_CRM_meta_interval = 0 ]; then
ocf_log warn "$upgrade1"
ocf_log warn "$upgrade2"
if [ $recurring != $OCF_RESKEY_interval ]; then
ocf_log warn "Your monitor operation happens every $recurring, which means that the $OCF_RESKEY_name attribute will be updated with a different frequency than the previously configured ( $OCF_RESKEY_interval )"
ocf_log warn "Either change the monitor interval to match or, ideally, switch to the ocf:pacemaker:ping agent and avoid all this compatibility nonsense."
fi
fi
;;
meta-data)
cat <
-
+
1.0
This agent (ocf:pacemaker:pingd) has been replaced by the more reliable ocf:pacemaker:ping.
It records (in the CIB) the current number of ping nodes (specified in the 'host_list' parameter) a cluster node can connect to.
pingd resource agent
PID file
PID file
The user we want to run pingd as
The user we want to run pingd as
The time to wait (dampening) for further changes to occur
Dampening interval
The name of the instance_attributes set to place the value in. Rarely needs to be specified.
Set name
The name of the attributes to set. This is the name to be used in the constraints.
Attribute name
The section to place the value in. Rarely needs to be specified.
Section name
The number by which to multiply the number of connected ping nodes
Value multiplier
The list of ping nodes to count. Defaults to all configured ping nodes. Rarely needs to be specified.
Host list
How often, in seconds, to check for node liveness
ping interval in seconds
Number of ping attempts, per host, before declaring it dead
no. of ping attempts
How long, in seconds, to wait before declaring a ping lost
ping timeout in seconds
A catch all for any other options that need to be passed to pingd.
Extra Options
END
exit $OCF_SUCCESS
;;
esac
${OCF_ROOT}/resource.d/pacemaker/ping $1
exit $?
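The pingd wrapper above picks a recurring monitor interval out of the resource's configuration text. The extraction step can be tried on sample input as sketched below. The input format (one `op monitor` clause per line, with quoted values) is an assumption about the `crm configure show` output, and `smallest_interval` is a hypothetical helper. The agent pipes through plain `sort`, which compares text; this sketch swaps in `sort -n` so that `5s` sorts before `10s`.

```shell
#!/bin/sh
# smallest_interval  (hypothetical helper for illustration only)
# Reads configuration text on stdin and prints the smallest nonzero
# monitor interval found, or nothing if there is none.
smallest_interval() {
    grep 'op monitor.*interval="[1-9]' |
        sed 's/.*interval=//' |      # drop everything up to the value
        awk -F'"' '{print $2}' |     # keep the text between the first pair of quotes
        sort -n |                    # numeric sort on the leading digits
        head -n 1
}

# Example usage on made-up input
printf '%s\n' \
    'op monitor interval="10s" timeout="20s"' \
    'op monitor interval="5s"' \
    'op start interval="0"' | smallest_interval
```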
diff --git a/extra/resources/remote b/extra/resources/remote
index 447267e778..afd8c79973 100644
--- a/extra/resources/remote
+++ b/extra/resources/remote
@@ -1,125 +1,125 @@
#!/bin/sh
#
#
# remote OCF RA. This script provides metadata for the internal
# pacemaker remote lrmd connection agent. Outside of acting
# as a place holder so the remote ra script can be indexed and
# providing metadata, this script should never be invoked. The
# actual functionality behind the remote lrmd connection lives
# within pacemaker's crmd component.
#
# Copyright (c) 2013 David Vossel
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}
#######################################################################
meta_data() {
cat <
-0.1
+1.0
remote resource agent
Server location to connect to. This can be an IP address or hostname.
Server location
TCP port to connect to.
TCP port
- Time in seconds to wait before attempting to reconnect to a remote node
- after an active connection to the remote node has been severed. This wait
- is recurring. If reconnect fails after the wait period, a new reconnect
- attempt will be made after observing the wait time. When this option is
- in use, pacemaker will keep attempting to reach out and connect to the
- remote node indefinitely after each wait interval.
+ Interval in seconds at which Pacemaker will attempt to reconnect to a
+ remote node after an active connection to the remote node has been
+ severed. When this value is nonzero, Pacemaker will retry the connection
+ indefinitely, at the specified interval. As with any time-based actions,
+ this is not guaranteed to be checked more frequently than the value of
+ the cluster-recheck-interval cluster option.
reconnect interval
END
}
#######################################################################
remote_usage() {
cat <
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with this library; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include
#include
#include
#include
#include
#include
/*
* \internal
* \brief Get process ID and name associated with a /proc directory entry
*
* \param[in] entry Directory entry (must be result of readdir() on /proc)
* \param[out] name If not NULL, a char[64] to hold the process name
* \param[out] pid If not NULL, will be set to process ID of entry
*
* \return 0 on success, -1 if entry is not for a process or info not found
*
* \note This should be called only on Linux systems, as not all systems that
* support /proc store process names and IDs in the same way.
*/
int
crm_procfs_process_info(struct dirent *entry, char *name, int *pid)
{
int fd, local_pid;
FILE *file;
struct stat statbuf;
char key[16] = { 0 }, procpath[128] = { 0 };
/* We're only interested in entries whose name is a PID,
* so skip anything non-numeric or that is too long.
*
* 114 = 128 - strlen("/proc/") - strlen("/status") - 1
*/
local_pid = atoi(entry->d_name);
if ((local_pid <= 0) || (strlen(entry->d_name) > 114)) {
return -1;
}
if (pid) {
*pid = local_pid;
}
/* Get this entry's file information */
strcpy(procpath, "/proc/");
strcat(procpath, entry->d_name);
fd = open(procpath, O_RDONLY);
if (fd < 0 ) {
return -1;
}
if (fstat(fd, &statbuf) < 0) {
close(fd);
return -1;
}
close(fd);
/* We're only interested in subdirectories */
if (!S_ISDIR(statbuf.st_mode)) {
return -1;
}
/* Read the first entry ("Name:") from the process's status file.
* We could handle the valgrind case if we parsed the cmdline file
* instead, but that's more of a pain than it's worth.
*/
if (name != NULL) {
strcat(procpath, "/status");
file = fopen(procpath, "r");
if (!file) {
return -1;
}
if ((fscanf(file, "%15s%63s", key, name) != 2)
|| safe_str_neq(key, "Name:")) {
fclose(file);
return -1;
}
fclose(file);
}
return 0;
}
/*
* \internal
* \brief Return process ID of a named process
*
* \param[in] name Process name (as used in /proc/.../status)
*
* \return Process ID of named process if running, 0 otherwise
*
* \note This will return 0 if the process is being run via valgrind.
* This should be called only on Linux systems.
*/
int
crm_procfs_pid_of(const char *name)
{
DIR *dp;
struct dirent *entry;
int pid = 0;
char entry_name[64] = { 0 };
dp = opendir("/proc");
if (dp == NULL) {
crm_notice("Can not read /proc directory to track existing components");
return 0;
}
while ((entry = readdir(dp)) != NULL) {
if ((crm_procfs_process_info(entry, entry_name, &pid) == 0)
&& safe_str_eq(entry_name, name)
&& (crm_pid_active(pid, NULL) == 1)) {
crm_info("Found %s active as process %d", name, pid);
break;
}
+ pid = 0;
}
closedir(dp);
return pid;
}
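The lookup performed by `crm_procfs_pid_of()` (scan `/proc`, read the `Name:` line of each process's `status` file, return the first match or 0) can be approximated in shell for quick experimentation. This is only an analogue of the C code above, not a replacement for it: `pid_of_name` is a hypothetical helper, it is Linux-specific, and it does not perform the extra liveness check that `crm_pid_active()` provides.

```shell
#!/bin/sh
# pid_of_name NAME  (hypothetical helper for illustration only)
# Prints the PID of the first process whose /proc/<pid>/status Name matches,
# or 0 if none is found.
pid_of_name() {
    wanted=$1
    for status in /proc/[0-9]*/status; do
        [ -r "$status" ] || continue
        # The first line looks like "Name:<TAB>process-name"; the process
        # may exit while we scan, so tolerate a failed read
        read -r key name 2>/dev/null < "$status" || continue
        if [ "$key" = "Name:" ] && [ "$name" = "$wanted" ]; then
            pid=${status#/proc/}
            echo "${pid%/status}"
            return 0
        fi
    done
    # Mirror the C function: 0 means "not found"
    echo 0
}
```

As in the C version, a process started under valgrind will not be found under its own name, because the `Name:` field reports the valgrind executable instead.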