diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst
index 91862cd48d..d4f24fa18f 100644
--- a/doc/sphinx/Pacemaker_Development/components.rst
+++ b/doc/sphinx/Pacemaker_Development/components.rst
@@ -1,489 +1,489 @@
Coding Particular Pacemaker Components
--------------------------------------
The Pacemaker code can be intricate and difficult to follow. This chapter has
some high-level descriptions of how individual components work.
.. index::
single: controller
single: pacemaker-controld
Controller
##########
``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons
to orchestrate actions that need to be taken in the cluster. It receives CIB
change notifications from the CIB manager, passes the new CIB to the scheduler
to determine whether anything needs to be done, uses the executor and fencer to
execute any actions required, and sets failure counts (among other things) via
the attribute manager.
As might be expected, it has the most code of any of the daemons.
.. index::
single: join
Join sequence
_____________
Most daemons track their cluster peers using Corosync's membership and CPG
only. The controller additionally requires peers to `join`, which ensures they
are ready to be assigned tasks. Joining proceeds through a series of phases
referred to as the `join sequence` or `join process`.
A node's current join phase is tracked by the ``join`` member of ``crm_node_t``
(used in the peer cache). It is an ``enum crm_join_phase`` that (ideally)
progresses from the DC's point of view as follows:
* The node initially starts at ``crm_join_none``
* The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node
proceeds to ``crm_join_welcomed``. This can happen in three ways:
* The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at
its controller startup, and the DC will reply to that with a join offer.
* When the DC's peer status callback notices that the node has joined the
messaging layer, it registers ``I_NODE_JOIN`` (which leads to
``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` ->
``join_make_offer()``).
* After certain events (notably a new DC being elected), the DC will send all
nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` -> ``do_dc_join_offer_all()``).
These can overlap. The DC can send a join offer and the node can send a join
announce at nearly the same time, so the node responds to the original join
offer while the DC responds to the join announce with a new join offer. The
situation resolves itself after looping a bit.
* The node responds to join offers with a `join request`
(``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and
``join_query_callback()``). When the DC receives the request, the
node proceeds to ``crm_join_integrated`` (via ``do_dc_join_filter_offer()``).
* As each node is integrated, the current best CIB is sync'ed to each
integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB
sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``)
and the node proceeds to ``crm_join_finalized`` (via
``finalize_sync_callback()`` + ``finalize_join_for()``).
* Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via
``do_cl_join_finalize_respond()``), including its current resource operation
history (via ``controld_query_executor_state()``). Once the DC receives this
confirmation, the node proceeds to ``crm_join_confirmed`` via
``do_dc_join_ack()``.
Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks
for quorum and responds appropriately.
When peers are lost, their join phase is reset to none (in various places).
``crm_update_peer_join()`` updates a node's join phase.
The DC increments the global ``current_join_id`` for each joining round, and
rejects any (older) replies that don't match.
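The phase progression and the join ID check described above can be sketched in
C as follows. This is an illustration only: every ``sketch_`` name and every
numeric value is hypothetical and is not copied from the Pacemaker source. Only
the phase names come from the text.

.. code-block:: c

   #include <stdbool.h>

   /* Hypothetical mirror of the phases named above; values are illustrative */
   enum sketch_join_phase {
       sketch_join_none,
       sketch_join_welcomed,
       sketch_join_integrated,
       sketch_join_finalized,
       sketch_join_confirmed
   };

   /* Return the phase that ideally follows the given one (DC's point of view);
    * a confirmed node stays confirmed
    */
   static enum sketch_join_phase
   sketch_next_phase(enum sketch_join_phase phase)
   {
       if (phase < sketch_join_confirmed) {
           return (enum sketch_join_phase) (phase + 1);
       }
       return phase;
   }

   /* Accept a reply only if it carries the join ID of the current round */
   static bool
   sketch_reply_is_current(int current_join_id, int reply_join_id)
   {
       return reply_join_id == current_join_id;
   }

A caller would advance a peer with ``sketch_next_phase()`` only after
``sketch_reply_is_current()`` returns true for the message that triggered the
change.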
.. index::
single: fencer
single: pacemaker-fenced
Fencer
######
``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
the broadest terms, fencing works like this:
#. The initiator (an external program such as ``stonith_admin``, or the cluster
itself via the controller) asks the local fencer, "Hey, could you please
fence this node?"
#. The local fencer asks all the fencers in the cluster (including itself),
"Hey, what fencing devices do you have access to that can fence this node?"
#. Each fencer in the cluster replies with a list of available devices that
it knows about.
#. Once the original fencer gets all the replies, it asks the most
appropriate fencer peer to actually carry out the fencing. It may send
out more than one such request if the target node must be fenced with
multiple devices.
#. The chosen fencer(s) call the appropriate fencing resource agent(s) to
do the fencing, then reply to the original fencer with the result.
#. The original fencer broadcasts the result to all fencers.
#. Each fencer sends the result to each of its local clients (including, at
some point, the initiator).
A more detailed description follows.
.. index::
single: libstonithd
Initiating a fencing request
____________________________
A fencing request can be initiated by the cluster or externally, using the
libstonithd API.
* The cluster always initiates fencing via
``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the
``fence()`` API method). This occurs when a transition graph synapse contains
a ``CRM_OP_FENCE`` XML operation.
* The main external clients are ``stonith_admin`` and ``cts-fence-helper``.
The ``DLM`` project also uses Pacemaker for fencing.
Highlights of the fencing API:
* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
``cmds`` member has methods for connect, disconnect, fence, etc.
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request with
the desired action and target node. Callers do not have to choose or even
have any knowledge about particular fencing devices.
Fencing queries
_______________
The function calls for a fencing request go something like this:
The local fencer receives the client's request via an IPC or messaging
layer callback, which calls
* ``stonith_command()``, which (for requests) calls
* ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls
* ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML
request with the target, desired action, timeout, etc. then broadcasts
the operation to the cluster group (i.e. all fencer instances) and
starts a timer. The query is broadcast because (1) location constraints
might prevent the local node from accessing the stonith device directly,
and (2) even if the local node does have direct access, another node
might be preferred to carry out the fencing.
Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
request via IPC or messaging layer callback, which calls:
* ``stonith_command()``, which (for requests) calls
* ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls
* ``stonith_query()``, which calls
* ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to add
device information to an XML reply and send it. (A message is
considered a reply if it contains ``T_STONITH_REPLY``, which is only
set by fencer peers, not clients.)
The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC
or messaging layer callback, which calls:
* ``stonith_command()``, which (for replies) calls
* ``handle_reply()`` which (for ``STONITH_OP_QUERY``) calls
* ``process_remote_stonith_query()``, which allocates a new query result
structure, parses device information into it, and adds it to the
operation object. It increments the number of replies received for this
operation, and compares it against the expected number of replies (i.e.
the number of active peers), and if this is the last expected reply,
calls
* ``request_peer_fencing()``, which calculates the timeout and sends
``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target
node has a fencing "topology" (which allows specifications such as
"this node can be fenced either with device A, or devices B and C in
combination"), it will choose the device(s), and send out as many
requests as needed. If it chooses a device, it will choose the peer; a
peer is preferred if it has "verified" access to the desired device,
meaning that it has the device "running" on it and thus has a monitor
operation ensuring reachability.
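The topology example above ("device A, or devices B and C in combination") can
be modeled as a list of levels, where a level succeeds only if all of its
devices succeed. The sketch below is a simplified model under that assumption,
not the fencer's implementation: the ``sketch_`` names are hypothetical, the
per-device results are given as plain booleans, and trying levels in list order
is an assumption of the sketch.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   #define SKETCH_MAX_DEVICES 4

   /* Hypothetical topology level: all of its devices must succeed together */
   struct sketch_level {
       size_t n_devices;
       bool device_ok[SKETCH_MAX_DEVICES];  /* stand-in for each device's result */
   };

   /* A level succeeds if it has at least one device and none of them failed */
   static bool
   sketch_level_succeeds(const struct sketch_level *level)
   {
       for (size_t i = 0; i < level->n_devices; i++) {
           if (!level->device_ok[i]) {
               return false;
           }
       }
       return level->n_devices > 0;
   }

   /* Return the index of the first level that succeeds, or -1 if none does */
   static int
   sketch_first_successful_level(const struct sketch_level *levels,
                                 size_t n_levels)
   {
       for (size_t i = 0; i < n_levels; i++) {
           if (sketch_level_succeeds(&levels[i])) {
               return (int) i;
           }
       }
       return -1;
   }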
Fencing operations
__________________
Each ``STONITH_OP_FENCE`` request goes something like this:
The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
messaging layer callback, which calls:
* ``stonith_command()``, which (for requests) calls
* ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls
* ``stonith_fence()``, which calls
* ``schedule_stonith_command()`` (using supplied device if
``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
device obtained via ``get_capable_devices()`` with
``stonith_fence_get_devices_cb()``), which adds the operation to the
device's pending operations list and triggers processing.
The chosen peer fencer's mainloop is triggered and calls
* ``stonith_device_dispatch()``, which calls
* ``stonith_device_execute()``, which pops off the next item from the device's
pending operations list. If acting as the (internally implemented) watchdog
agent, it panics the node, otherwise it calls
* ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
call the fencing agent.
The chosen peer fencer's mainloop is triggered again once the fencing agent
returns, and calls
* ``stonith_action_async_done()`` which adds the results to an action object
then calls its
* done callback (``st_child_done()``), which calls ``schedule_stonith_command()``
for a new device if there are further required actions to execute or if the
original action failed, then builds and sends an XML reply to the original
fencer (via ``send_async_reply()``), then checks whether any
pending actions are the same as the one just executed and merges them if so.
Fencing replies
_______________
The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
messaging layer callback, which calls:
* ``stonith_command()``, which (for replies) calls
* ``handle_reply()``, which calls
* ``fenced_process_fencing_reply()``, which calls either
``request_peer_fencing()`` (to retry a failed operation, or try the next
device in a topology if appropriate, which issues a new
``STONITH_OP_FENCE`` request, proceeding as before) or
``finalize_op()`` (if the operation is definitively failed or
successful).
* ``finalize_op()`` broadcasts the result to all peers.
Finally, all peers receive the broadcast result and call
* ``finalize_op()``, which sends the result to all local clients.
.. index::
single: fence history
Fencing History
_______________
The fencer keeps a running history of all fencing operations. The bulk of the
relevant code is in ``fenced_history.c`` and ensures the history is synchronized
across all nodes even if a node leaves and rejoins the cluster.
In libstonithd, this information is represented by ``stonith_history_t`` and is
queryable by the ``stonith_api_operations_t:history()`` method. ``crm_mon`` and
``stonith_admin`` use this API to display the history.
.. index::
single: scheduler
single: pacemaker-schedulerd
single: libpe_status
single: libpe_rules
single: libpacemaker
Scheduler
#########
``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
scheduler for the controller, but "the scheduler" in general refers to related
library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and
some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).
The purpose of the scheduler is to take a CIB as input and generate a
transition graph (list of actions that need to be taken) as output.
The controller invokes the scheduler by contacting the scheduler daemon via
local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
can also invoke the scheduler, but do so by calling the library functions
directly. This allows them to run using a ``CIB_file`` without the cluster
needing to be active.
The main entry point for the scheduler code is
-``lib/pacemaker/pcmk_sched_allocate.c:pcmk__schedule_actions()``. It sets
+``lib/pacemaker/pcmk_scheduler.c:pcmk__schedule_actions()``. It sets
defaults and calls a series of functions for the scheduling. Some key steps:
* ``unpack_cib()`` parses most of the CIB XML into data structures, and
determines the current cluster status.
* ``apply_node_criteria()`` applies factors that make resources prefer certain
nodes, such as shutdown locks, location constraints, and stickiness.
* ``pcmk__create_internal_constraints()`` creates internal constraints, such as
the implicit ordering for group members, or start actions being implicitly
ordered before promote actions.
* ``pcmk__handle_rsc_config_changes()`` processes resource history entries in
the CIB status section. This is used to decide whether certain
actions need to be done, such as deleting orphan resources, forcing a restart
when a resource definition changes, etc.
* ``assign_resources()`` assigns resources to nodes.
* ``schedule_resource_actions()`` schedules resource-specific actions (which
might or might not end up in the final graph).
* ``pcmk__apply_orderings()`` processes ordering constraints in order to modify
action attributes such as optional or required.
* ``pcmk__create_graph()`` creates the transition graph.
Challenges
__________
Working with the scheduler is difficult. Challenges include:
* It is far too much code to keep more than a small portion in your head at one
time.
* Small changes can have large (and unexpected) effects. This is why we have a
large number of regression tests (``cts/cts-scheduler``), which should be run
after making code changes.
* It produces an insane amount of log messages at debug and trace levels.
You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to
enable trace-level messages only when related to specific resources.
* Different parts of the main ``pe_working_set_t`` structure are finalized at
different points in the scheduling process, so you have to keep in mind
whether information you're using at one point of the code can possibly change
later. For example, data unpacked from the CIB can safely be used anytime
after ``unpack_cib()``, but actions may become optional or required anytime
before ``pcmk__create_graph()``. There's no easy way to deal with this.
* Many names of struct members, functions, etc., are suboptimal, but are part
of the public API and cannot be changed until an API backward compatibility
break.
.. index::
single: pe_working_set_t
Cluster Working Set
___________________
The main data object for the scheduler is ``pe_working_set_t``, which contains
all information needed about nodes, resources, constraints, etc., both as the
raw CIB XML and parsed into more usable data structures, plus the resulting
transition graph XML. The variable name is usually ``data_set``.
.. index::
single: pe_resource_t
Resources
_________
``pe_resource_t`` is the data object representing cluster resources. A resource
has a variant: primitive (a.k.a. native), group, clone, or bundle.
The resource object has members for two sets of methods,
``resource_object_functions_t`` from the ``libpe_status`` public API, and
``resource_alloc_functions_t`` whose implementation is internal to
``libpacemaker``. The actual functions vary by variant.
The object functions have basic capabilities such as unpacking the resource
XML, and determining the current or planned location of the resource.
-The allocation functions have more obscure capabilities needed for scheduling,
+The assignment functions have more obscure capabilities needed for scheduling,
such as processing location and ordering constraints. For example,
``pcmk__create_internal_constraints()`` simply calls the
``internal_constraints()`` method for each top-level resource in the cluster.
.. index::
single: pe_node_t
Nodes
_____
-Allocation of resources to nodes is done by choosing the node with the highest
+Assignment of resources to nodes is done by choosing the node with the highest
score for a given resource. The scheduler does a bunch of processing to
-generate the scores, then the actual allocation is straightforward.
+generate the scores, then the actual assignment is straightforward.
Node lists are frequently used. For example, ``pe_working_set_t`` has a
``nodes`` member which is a list of all nodes in the cluster, and
``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on
which the resource is (or might be) active. These are lists of ``pe_node_t``
objects.
The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details`` member
-with all node information that is independent of resource allocation (the node
+with all node information that is independent of resource assignment (the node
name, etc.).
The working set's ``nodes`` member contains the original of this information.
All other node lists contain copies of ``pe_node_t`` where only the ``details``
member points to the originals in the working set's ``nodes`` list. In this
way, the other members of ``pe_node_t`` (such as ``weight``, which is the node
score) may vary by node list, while the common details are shared.
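The sharing arrangement described above, and the final "highest score wins"
choice, can be sketched as follows. The struct and function names are
hypothetical stand-ins rather than the real ``pe_node_t`` definitions, and
resolving ties in favor of the earliest list entry is a simplification assumed
only for this sketch.

.. code-block:: c

   #include <stddef.h>

   /* Hypothetical stand-in for the shared, assignment-independent details */
   struct sketch_node_details {
       const char *name;
   };

   /* Hypothetical stand-in for a per-list node object */
   struct sketch_node {
       int weight;                           /* node score; may differ per list */
       struct sketch_node_details *details;  /* shared with the original node */
   };

   /* Copy a node for another list: the score is copied by value, while the
    * details pointer still refers to the original's details
    */
   static struct sketch_node
   sketch_copy_node(const struct sketch_node *original)
   {
       struct sketch_node copy = { original->weight, original->details };

       return copy;
   }

   /* Return the highest-scoring node in a list, or NULL if the list is empty */
   static const struct sketch_node *
   sketch_best_node(const struct sketch_node *nodes, size_t n_nodes)
   {
       const struct sketch_node *best = NULL;

       for (size_t i = 0; i < n_nodes; i++) {
           if ((best == NULL) || (nodes[i].weight > best->weight)) {
               best = &nodes[i];
           }
       }
       return best;
   }

Changing ``weight`` in a copy leaves the original's score untouched, while both
objects still reach the same details through their ``details`` pointers.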
.. index::
single: pe_action_t
single: pe_action_flags
Actions
_______
``pe_action_t`` is the data object representing actions that might need to be
taken. These could be resource actions, cluster-wide actions such as fencing a
node, or "pseudo-actions" which are abstractions used as convenient points for
ordering other actions against.
It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
most important of these are ``pe_action_runnable`` (if not set, the action is
"blocked" and cannot be added to the transition graph) and
``pe_action_optional`` (actions with this set will not be added to the
transition graph; actions often start out as optional, and may become required
later).
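Read together, the two flags mean that an action reaches the transition graph
only if it is runnable and not optional. A minimal sketch of that test follows.
The flag names and bit positions are hypothetical and are not the real values
of ``enum pe_action_flags``.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical flag bits; the real enum's values may differ */
   enum sketch_action_flags {
       sketch_action_optional = (1 << 0),
       sketch_action_runnable = (1 << 1)
   };

   /* An action belongs in the graph only if it is runnable and required */
   static bool
   sketch_belongs_in_graph(uint32_t flags)
   {
       bool runnable = (flags & sketch_action_runnable) != 0;
       bool optional = (flags & sketch_action_optional) != 0;

       return runnable && !optional;
   }

An action that starts out optional and later becomes required would have its
optional bit cleared, after which the same test starts returning true.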
.. index::
single: pcmk__colocation_t
Colocations
___________
``pcmk__colocation_t`` is the data object representing colocations.
Colocation constraints come into play in these parts of the scheduler code:
* When sorting resources for assignment, so resources with highest node score
are assigned first (see ``cmp_resources()``)
* When updating node scores for resource assignment or promotion priority
* When assigning resources, so any resources to be colocated with can be
assigned first, and so colocations affect where the resource is assigned
* When choosing roles for promotable clone instances, so colocations involving
a specific role can affect which instances are promoted
-The resource allocation functions have several methods related to colocations:
+The resource assignment functions have several methods related to colocations:
* ``apply_coloc_score()``: This applies a colocation's score to either the
dependent's allowed node scores (if called while resources are being
assigned) or the dependent's priority (if called while choosing promotable
instance roles). It can behave differently depending on whether it is being
called as the primary's method or as the dependent's method.
* ``add_colocated_node_scores()``: This updates a table of nodes for a given
colocation attribute and score. It goes through colocations involving a given
resource, and updates the scores of the nodes in the table with the best
scores of nodes that match up according to the colocation criteria.
* ``colocated_resources()``: This generates a list of all resources involved
in mandatory colocations (directly or indirectly via colocation chains) with
a given resource.
.. index::
single: pe__ordering_t
single: pe_ordering
Orderings
_________
Ordering constraints are simple in concept, but they are one of the most
important, powerful, and difficult to follow aspects of the scheduler code.
``pe__ordering_t`` is the data object representing an ordering, better thought
of as a relationship between two actions, since the relation can be more
complex than just "this one runs after that one".
For an ordering "A then B", the code generally refers to A as "first" or
"before", and B as "then" or "after".
Much of the power comes from ``enum pe_ordering``, which are flags that
determine how an ordering behaves. There are many obscure flags with big
effects. A few examples:
* ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0,
meaning no flags set, so it must be compared with equality rather than
``pcmk_is_set()``.
* ``pe_order_optional`` means the ordering does not make either action
required, so it only applies if they both become required for other reasons.
* ``pe_order_implies_first`` means that if action B becomes required for any
reason, then action A will become required as well.
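The note about ``pe_order_none`` is easiest to see with a small sketch. It
assumes a helper that reports whether every bit of a flag value is set in a
flag group, which is how the text implies ``pcmk_is_set()`` behaves. All
``sketch_`` names and bit values are hypothetical.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical ordering flags; only the zero value mirrors the text */
   enum sketch_ordering {
       sketch_order_none     = 0,        /* no flags set */
       sketch_order_optional = (1 << 0)  /* illustrative bit position */
   };

   /* Hypothetical helper: true if every bit in "flags" is set in "group" */
   static bool
   sketch_all_set(uint32_t group, uint32_t flags)
   {
       return (group & flags) == flags;
   }

   /* Correct test for a disabled ordering: compare with equality */
   static bool
   sketch_ordering_disabled(uint32_t ordering_flags)
   {
       return ordering_flags == sketch_order_none;
   }

Because a bit test against 0 is true for any flag group,
``sketch_all_set(flags, sketch_order_none)`` cannot distinguish a disabled
ordering from an enabled one, while the equality test can.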
diff --git a/doc/sphinx/Pacemaker_Explained/advanced-resources.rst b/doc/sphinx/Pacemaker_Explained/advanced-resources.rst
index a61b76db2f..07583507a4 100644
--- a/doc/sphinx/Pacemaker_Explained/advanced-resources.rst
+++ b/doc/sphinx/Pacemaker_Explained/advanced-resources.rst
@@ -1,1629 +1,1629 @@
Advanced Resource Types
-----------------------
.. index::
single: group resource
single: resource; group
.. _group-resources:
Groups - A Syntactic Shortcut
#############################
One of the most common elements of a cluster is a set of resources
that need to be located together, start sequentially, and stop in the
reverse order. To simplify this configuration, we support the concept
of groups.
.. topic:: A group of two primitive resources
.. code-block:: xml
Although the example above contains only two resources, there is no
limit to the number of resources a group can contain. The example is
also sufficient to explain the fundamental properties of a group:
* Resources are started in the order they appear in (**Public-IP** first,
then **Email**)
* Resources are stopped in the reverse of the order in which they appear
(**Email** first, then **Public-IP**)
If a resource in the group can't run anywhere, then nothing after it
is allowed to run, either.
* If **Public-IP** can't run anywhere, neither can **Email**;
* but if **Email** can't run anywhere, this does not affect **Public-IP**
in any way
The group above is logically equivalent to writing:
.. topic:: How the cluster sees a group resource
.. code-block:: xml
Obviously as the group grows bigger, the reduced configuration effort
can become significant.
Another (typical) example of a group is a DRBD volume, the filesystem
mount, an IP address, and an application that uses them.
.. index::
pair: XML element; group
Group Properties
________________
.. table:: **Properties of a Group Resource**
:widths: 1 4
+-------------+------------------------------------------------------------------+
| Field | Description |
+=============+==================================================================+
| id | .. index:: |
| | single: group; property, id |
| | single: property; id (group) |
| | single: id; group property |
| | |
| | A unique name for the group |
+-------------+------------------------------------------------------------------+
| description | .. index:: |
| | single: group; attribute, description |
| | single: attribute; description (group) |
| | single: description; group attribute |
| | |
| | An optional description of the group, for the user's own |
| | purposes. |
| | E.g. ``resources needed for website`` |
+-------------+------------------------------------------------------------------+
Group Options
_____________
Groups inherit the ``priority``, ``target-role``, and ``is-managed`` properties
from primitive resources. See :ref:`resource_options` for information about
those properties.
Group Instance Attributes
_________________________
Groups have no instance attributes. However, any that are set for the group
object will be inherited by the group's children.
Group Contents
______________
Groups may only contain a collection of cluster resources (see
:ref:`primitive-resource`). To refer to a child of a group resource, just use
the child's ``id`` instead of the group's.
Group Constraints
_________________
Although it is possible to reference a group's children in
constraints, it is usually preferable to reference the group itself.
.. topic:: Some constraints involving groups
.. code-block:: xml
.. index::
pair: resource-stickiness; group
Group Stickiness
________________
Stickiness, the measure of how much a resource wants to stay where it
is, is additive in groups. Every active resource of the group will
contribute its stickiness value to the group's total. So if the
default ``resource-stickiness`` is 100, and a group has seven members,
five of which are active, then the group as a whole will prefer its
current location with a score of 500.
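The arithmetic in this example can be written out as a short C sketch. It
assumes every member uses the same stickiness value, as in the example. The
function name is hypothetical and this is not how the scheduler itself is
written.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   /* Sum the stickiness contributed by each active member of a group */
   static int
   sketch_group_stickiness(const bool *member_active, size_t n_members,
                           int stickiness)
   {
       int total = 0;

       for (size_t i = 0; i < n_members; i++) {
           if (member_active[i]) {
               total += stickiness;  /* inactive members contribute nothing */
           }
       }
       return total;
   }

With seven members of which five are active and a stickiness of 100, the
function returns 500, matching the figure above.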
.. index::
single: clone
single: resource; clone
.. _s-resource-clone:
Clones - Resources That Can Have Multiple Active Instances
##########################################################
*Clone* resources are resources that can have more than one copy active at the
same time. This allows you, for example, to run a copy of a daemon on every
node. You can clone any primitive or group resource [#]_.
Anonymous versus Unique Clones
______________________________
A clone resource is configured to be either *anonymous* or *globally unique*.
Anonymous clones are the simplest. These behave completely identically
everywhere they are running. Because of this, there can be only one instance of
an anonymous clone active per node.
The instances of globally unique clones are distinct entities. All instances
are launched identically, but one instance of the clone is not identical to any
other instance, whether running on the same node or a different node. As an
example, a cloned IP address can use special kernel functionality such that
each instance handles a subset of requests for the same IP address.
.. index::
single: promotable clone
single: resource; promotable
.. _s-resource-promotable:
Promotable clones
_________________
If a clone is *promotable*, its instances can perform a special role that
Pacemaker will manage via the ``promote`` and ``demote`` actions of the resource
agent.
Services that support such a special role have various terms for the special
role and the default role: primary and secondary, master and replica,
controller and worker, etc. Pacemaker uses the terms *promoted* and
*unpromoted* to be agnostic to what the service calls them or what they do.
All that Pacemaker cares about is that an instance comes up in the unpromoted role
when started, and the resource agent supports the ``promote`` and ``demote`` actions
to manage entering and exiting the promoted role.
.. index::
pair: XML element; clone
Clone Properties
________________
.. table:: **Properties of a Clone Resource**
:widths: 1 4
+-------------+------------------------------------------------------------------+
| Field | Description |
+=============+==================================================================+
| id | .. index:: |
| | single: clone; property, id |
| | single: property; id (clone) |
| | single: id; clone property |
| | |
| | A unique name for the clone |
+-------------+------------------------------------------------------------------+
| description | .. index:: |
| | single: clone; attribute, description |
| | single: attribute; description (clone) |
| | single: description; clone attribute |
| | |
| | An optional description of the clone, for the user's own |
| | purposes. |
| | E.g. ``IP address for website`` |
+-------------+------------------------------------------------------------------+
.. index::
pair: options; clone
Clone Options
_____________
:ref:`Options <resource_options>` inherited from primitive resources:
``priority, target-role, is-managed``
.. table:: **Clone-specific configuration options**
:class: longtable
:widths: 1 1 3
+-------------------+-----------------+-------------------------------------------------------+
| Field | Default | Description |
+===================+=================+=======================================================+
| globally-unique | false | .. index:: |
| | | single: clone; option, globally-unique |
| | | single: option; globally-unique (clone) |
| | | single: globally-unique; clone option |
| | | |
| | | If **true**, each clone instance performs a |
| | | distinct function |
+-------------------+-----------------+-------------------------------------------------------+
| clone-max | 0 | .. index:: |
| | | single: clone; option, clone-max |
| | | single: option; clone-max (clone) |
| | | single: clone-max; clone option |
| | | |
| | | The maximum number of clone instances that can |
| | | be started across the entire cluster. If 0, the |
| | | number of nodes in the cluster will be used. |
+-------------------+-----------------+-------------------------------------------------------+
| clone-node-max | 1 | .. index:: |
| | | single: clone; option, clone-node-max |
| | | single: option; clone-node-max (clone) |
| | | single: clone-node-max; clone option |
| | | |
| | | If ``globally-unique`` is **true**, the maximum |
| | | number of clone instances that can be started |
| | | on a single node |
+-------------------+-----------------+-------------------------------------------------------+
| clone-min | 0 | .. index:: |
| | | single: clone; option, clone-min |
| | | single: option; clone-min (clone) |
| | | single: clone-min; clone option |
| | | |
| | | Require at least this number of clone instances |
| | | to be runnable before allowing resources |
| | | depending on the clone to be runnable. A value |
| | | of 0 means require all clone instances to be |
| | | runnable. |
+-------------------+-----------------+-------------------------------------------------------+
| notify | false | .. index:: |
| | | single: clone; option, notify |
| | | single: option; notify (clone) |
| | | single: notify; clone option |
| | | |
| | | Call the resource agent's **notify** action for |
| | | all active instances, before and after starting |
| | | or stopping any clone instance. The resource |
| | | agent must support this action. |
| | | Allowed values: **false**, **true** |
+-------------------+-----------------+-------------------------------------------------------+
| ordered | false | .. index:: |
| | | single: clone; option, ordered |
| | | single: option; ordered (clone) |
| | | single: ordered; clone option |
| | | |
| | | If **true**, clone instances must be started |
| | | sequentially instead of in parallel. |
| | | Allowed values: **false**, **true** |
+-------------------+-----------------+-------------------------------------------------------+
| interleave | false | .. index:: |
| | | single: clone; option, interleave |
| | | single: option; interleave (clone) |
| | | single: interleave; clone option |
| | | |
| | | When this clone is ordered relative to another |
| | | clone, if this option is **false** (the default), |
| | | the ordering is relative to *all* instances of |
| | | the other clone, whereas if this option is |
| | | **true**, the ordering is relative only to |
| | | instances on the same node. |
| | | Allowed values: **false**, **true** |
+-------------------+-----------------+-------------------------------------------------------+
| promotable | false | .. index:: |
| | | single: clone; option, promotable |
| | | single: option; promotable (clone) |
| | | single: promotable; clone option |
| | | |
| | | If **true**, clone instances can perform a |
| | | special role that Pacemaker will manage via the |
| | | resource agent's **promote** and **demote** |
| | | actions. The resource agent must support these |
| | | actions. |
| | | Allowed values: **false**, **true** |
+-------------------+-----------------+-------------------------------------------------------+
| promoted-max | 1 | .. index:: |
| | | single: clone; option, promoted-max |
| | | single: option; promoted-max (clone) |
| | | single: promoted-max; clone option |
| | | |
| | | If ``promotable`` is **true**, the number of |
| | | instances that can be promoted at one time |
| | | across the entire cluster |
+-------------------+-----------------+-------------------------------------------------------+
| promoted-node-max | 1 | .. index:: |
| | | single: clone; option, promoted-node-max |
| | | single: option; promoted-node-max (clone) |
| | | single: promoted-node-max; clone option |
| | | |
| | | If ``promotable`` is **true** and ``globally-unique`` |
| | | is **false**, the number of clone instances that can be |
| | | promoted at one time on a single node |
+-------------------+-----------------+-------------------------------------------------------+
.. note:: **Deprecated Terminology**
In older documentation and online examples, you may see promotable clones
referred to as *multi-state*, *stateful*, or *master/slave*; these mean the
same thing as *promotable*. Certain syntax is supported for backward
compatibility, but is deprecated and will be removed in a future version:
* Using a ``master`` tag, instead of a ``clone`` tag with the ``promotable``
meta-attribute set to ``true``
* Using the ``master-max`` meta-attribute instead of ``promoted-max``
* Using the ``master-node-max`` meta-attribute instead of
``promoted-node-max``
* Using ``Master`` as a role name instead of ``Promoted``
* Using ``Slave`` as a role name instead of ``Unpromoted``
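
As a minimal sketch of how the options in the table above are set, the
following clone is marked promotable through meta-attributes. The ids and the
choice of the ``ocf:pacemaker:Stateful`` agent are assumptions made purely for
illustration; the values shown for ``promoted-max`` and ``promoted-node-max``
are simply the documented defaults written out explicitly.

.. code-block:: xml

   <clone id="stateful-clone">
       <meta_attributes id="stateful-clone-meta">
           <nvpair id="stateful-clone-promotable"
                   name="promotable" value="true"/>
           <!-- At most one promoted instance in the cluster and per node -->
           <nvpair id="stateful-clone-promoted-max"
                   name="promoted-max" value="1"/>
           <nvpair id="stateful-clone-promoted-node-max"
                   name="promoted-node-max" value="1"/>
       </meta_attributes>
       <!-- Hypothetical child resource; any promotable-capable agent would do -->
       <primitive id="stateful" class="ocf" provider="pacemaker" type="Stateful"/>
   </clone>
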
Clone Contents
______________
Clones must contain exactly one primitive or group resource.
.. topic:: A clone that runs a web server on all nodes
.. code-block:: xml
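
   <!-- Illustrative sketch only: the ids, the lsb-class "apache" agent, and
        the 30-second monitor interval are assumptions chosen to match the
        resource names used elsewhere in this section. -->
   <clone id="apache-clone">
       <primitive id="apache" class="lsb" type="apache">
           <operations>
               <op id="apache-monitor" name="monitor" interval="30"/>
           </operations>
       </primitive>
   </clone>
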
.. warning::
You should never reference the name of a clone's child (the primitive or group
resource being cloned). If you think you need to do this, you probably need to
re-evaluate your design.
Clone Instance Attributes
_________________________
Clones have no instance attributes; however, any that are set here will be
inherited by the clone's child.
.. index::
single: clone; constraint
Clone Constraints
_________________
In most cases, a clone will have a single instance on each active cluster
node. If this is not the case, you can use resource location constraints to
indicate which nodes the cluster should prefer when assigning instances.
These constraints are written no differently from those for primitive
resources, except that the clone's **id** is used.
.. topic:: Some constraints involving clones
.. code-block:: xml
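
   <!-- Illustrative sketch only: the constraint ids, node1, and the score of
        500 are assumptions; apache-clone and apache-stats are the resources
        discussed in the surrounding text. -->
   <constraints>
       <rsc_location id="clone-prefers-node1" rsc="apache-clone"
                     node="node1" score="500"/>
       <rsc_colocation id="stats-with-clone" rsc="apache-stats"
                       with-rsc="apache-clone" score="INFINITY"/>
       <rsc_order id="start-clone-then-stats" first="apache-clone"
                  then="apache-stats"/>
   </constraints>
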
Ordering constraints behave slightly differently for clones. In the
example above, ``apache-stats`` will wait until all copies of ``apache-clone``
that need to be started have done so before being started itself.
Only if *no* copies can be started will ``apache-stats`` be prevented
from being active. Additionally, the clone will wait for
``apache-stats`` to be stopped before stopping itself.
Colocation of a primitive or group resource with a clone means that
the resource can run on any node with an active instance of the clone.
The cluster will choose an instance based on where the clone is running and
the resource's own location preferences.
Colocation between clones is also possible. If one clone **A** is colocated
with another clone **B**, the set of allowed locations for **A** is limited to
nodes on which **B** is (or will be) active. Placement is then performed
normally.
.. index::
single: promotable clone; constraint
.. _promotable-clone-constraints:
Promotable Clone Constraints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For promotable clone resources, the ``first-action`` and/or ``then-action`` fields
for ordering constraints may be set to ``promote`` or ``demote`` to constrain the
promoted role, and colocation constraints may contain ``rsc-role`` and/or
``with-rsc-role`` fields.
.. topic:: Constraints involving promotable clone resources
.. code-block:: xml
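
   <!-- Illustrative sketch only: the constraint ids are assumptions; database
        is taken to be a promotable clone and myApp a primitive, matching the
        names used in the surrounding text. -->
   <constraints>
       <rsc_colocation id="myapp-with-db-promoted" rsc="myApp"
                       with-rsc="database" with-rsc-role="Promoted"
                       score="INFINITY"/>
       <rsc_order id="promote-db-then-app" first="database"
                  first-action="promote" then="myApp" then-action="start"/>
   </constraints>
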
In the example above, **myApp** will wait until one of the database
copies has been started and promoted before being started
itself on the same node. Only if no copies can be promoted will **myApp** be
prevented from being active. Additionally, the cluster will wait for
**myApp** to be stopped before demoting the database.
Colocation of a primitive or group resource with a promotable clone
resource means that it can run on any node with an active instance of
the promotable clone resource that has the specified role (``Promoted`` or
``Unpromoted``). In the example above, the cluster will choose a location
based on where **database** is running in the promoted role, and if there are
multiple promoted instances it will also factor in **myApp**'s own location
preferences when deciding which location to choose.
Colocation with regular clones and other promotable clone resources is also
possible. In such cases, the set of allowed locations for the ``rsc``
clone is (after role filtering) limited to nodes on which the
``with-rsc`` promotable clone resource is (or will be) in the specified role.
Placement is then performed as normal.
Using Promotable Clone Resources in Colocation Sets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a promotable clone is used in a :ref:`resource set `
inside a colocation constraint, the resource set may take a ``role`` attribute.
In the following example, an instance of **B** may be promoted only on a node
where **A** is in the promoted role. Additionally, resources **C** and **D**
must be located on a node where both **A** and **B** are promoted.
.. topic:: Colocate C and D with A's and B's promoted instances
.. code-block:: xml
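
   <!-- Illustrative sketch only: the constraint and set ids are assumptions;
        A and B are taken to be promotable clones, C and D ordinary
        resources. -->
   <constraints>
       <rsc_colocation id="coloc-1" score="INFINITY">
           <resource_set id="colocated-set-example-1" sequential="true"
                         role="Promoted">
               <resource_ref id="A"/>
               <resource_ref id="B"/>
           </resource_set>
           <resource_set id="colocated-set-example-2" sequential="true">
               <resource_ref id="C"/>
               <resource_ref id="D"/>
           </resource_set>
       </rsc_colocation>
   </constraints>
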
Using Promotable Clone Resources in Ordered Sets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a promotable clone is used in a :ref:`resource set `
inside an ordering constraint, the resource set may take an ``action``
attribute.
.. topic:: Start C and D after first promoting A and B
.. code-block:: xml
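
   <!-- Illustrative sketch only: the constraint and set ids are assumptions;
        A and B are taken to be promotable clones, C and D ordinary
        resources. -->
   <constraints>
       <rsc_order id="order-1" kind="Mandatory">
           <resource_set id="ordered-set-1" sequential="true" action="promote">
               <resource_ref id="A"/>
               <resource_ref id="B"/>
           </resource_set>
           <resource_set id="ordered-set-2" sequential="true" action="start">
               <resource_ref id="C"/>
               <resource_ref id="D"/>
           </resource_set>
       </rsc_order>
   </constraints>
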
In the above example, **B** cannot be promoted until **A** has been promoted.
Additionally, resources **C** and **D** must wait until **A** and **B** have
been promoted before they can start.
.. index::
pair: resource-stickiness; clone
.. _s-clone-stickiness:
Clone Stickiness
________________
-To achieve a stable allocation pattern, clones are slightly sticky by
-default. If no value for ``resource-stickiness`` is provided, the clone
-will use a value of 1. Being a small value, it causes minimal
-disturbance to the score calculations of other resources but is enough
-to prevent Pacemaker from needlessly moving copies around the cluster.
+To achieve stable assignments, clones are slightly sticky by default. If no
+value for ``resource-stickiness`` is provided, the clone will use a value of 1.
+Being a small value, it causes minimal disturbance to the score calculations of
+other resources but is enough to prevent Pacemaker from needlessly moving
+instances around the cluster.
.. note::
For globally unique clones, this may result in multiple instances of the
clone staying on a single node, even after another eligible node becomes
active (for example, after being put into standby mode then made active again).
If you do not want this behavior, specify a ``resource-stickiness`` of 0
for the clone temporarily and let the cluster adjust, then set it back
to 1 if you want the default behavior to apply again.
.. important::
If ``resource-stickiness`` is set in the ``rsc_defaults`` section, it will
apply to clone instances as well. This means an explicit ``resource-stickiness``
of 0 in ``rsc_defaults`` works differently from the implicit default used when
``resource-stickiness`` is not specified.
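
As a minimal sketch of the temporary override described in the note above,
``resource-stickiness`` can be set explicitly as a meta-attribute of the clone
itself. The ids and the ``apache`` primitive are assumptions made for
illustration only.

.. code-block:: xml

   <clone id="apache-clone">
       <meta_attributes id="apache-clone-meta">
           <!-- Temporarily 0 so the cluster may redistribute instances;
                set it back to 1 afterward to restore the default behavior -->
           <nvpair id="apache-clone-stickiness"
                   name="resource-stickiness" value="0"/>
       </meta_attributes>
       <!-- Hypothetical child resource -->
       <primitive id="apache" class="lsb" type="apache"/>
   </clone>

Because the value is set on the clone rather than in ``rsc_defaults``, other
resources keep whatever stickiness they already had.
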
Clone Resource Agent Requirements
_________________________________
Any resource can be used as an anonymous clone, as it requires no
additional support from the resource agent. Whether it makes sense to
do so depends on your resource and its resource agent.
Resource Agent Requirements for Globally Unique Clones
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Globally unique clones require additional support in the resource agent. In
particular, it must respond with ``${OCF_SUCCESS}`` only if the node has that
exact instance active. All other probes for instances of the clone should
result in ``${OCF_NOT_RUNNING}`` (or one of the other OCF error codes if
they have failed).
Individual instances of a clone are identified by appending a colon and a
numerical offset, e.g. **apache:2**.
A resource agent can find out how many copies there are by examining
the ``OCF_RESKEY_CRM_meta_clone_max`` environment variable, and which
instance it is managing by examining ``OCF_RESKEY_CRM_meta_clone``.
The resource agent must not make any assumptions (based on
``OCF_RESKEY_CRM_meta_clone``) about which numerical instances are active. In
particular, the list of active copies will not always be an unbroken
sequence, nor always start at 0.
Resource Agent Requirements for Promotable Clones
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Promotable clone resources require two extra actions, ``demote`` and ``promote``,
which are responsible for changing the state of the resource. Like **start** and
**stop**, they should return ``${OCF_SUCCESS}`` if they completed successfully or
a relevant error code if they did not.
The states can mean whatever you wish, but when the resource is
started, it must come up in the unpromoted role. From there, the
cluster will decide which instances to promote.
In addition to the clone requirements for monitor actions, agents must
also *accurately* report which state they are in. The cluster relies
on the agent to report its status (including role) accurately and does
not indicate to the agent what role it currently believes it to be in.
.. table:: **Role implications of OCF return codes**
:widths: 1 3
+----------------------+--------------------------------------------------+
| Monitor Return Code | Description |
+======================+==================================================+
| OCF_NOT_RUNNING | .. index:: |
| | single: OCF_NOT_RUNNING |
| | single: OCF return code; OCF_NOT_RUNNING |
| | |
| | Stopped |
+----------------------+--------------------------------------------------+
| OCF_SUCCESS | .. index:: |
| | single: OCF_SUCCESS |
| | single: OCF return code; OCF_SUCCESS |
| | |
| | Running (Unpromoted) |
+----------------------+--------------------------------------------------+
| OCF_RUNNING_PROMOTED | .. index:: |
| | single: OCF_RUNNING_PROMOTED |
| | single: OCF return code; OCF_RUNNING_PROMOTED |
| | |
| | Running (Promoted) |
+----------------------+--------------------------------------------------+
| OCF_FAILED_PROMOTED | .. index:: |
| | single: OCF_FAILED_PROMOTED |
| | single: OCF return code; OCF_FAILED_PROMOTED |
| | |
| | Failed (Promoted) |
+----------------------+--------------------------------------------------+
| Other | .. index:: |
| | single: return code |
| | |
| | Failed (Unpromoted) |
+----------------------+--------------------------------------------------+
Clone Notifications
~~~~~~~~~~~~~~~~~~~
If the clone has the ``notify`` meta-attribute set to **true**, and the resource
agent supports the ``notify`` action, Pacemaker will call the action when
appropriate, passing a number of extra variables which, when combined with
additional context, can be used to calculate the current state of the cluster
and what is about to happen to it.
.. index::
single: clone; environment variables
single: notify; environment variables
.. table:: **Environment variables supplied with Clone notify actions**
:widths: 1 1
+----------------------------------------------+-------------------------------------------------------------------------------+
| Variable | Description |
+==============================================+===============================================================================+
| OCF_RESKEY_CRM_meta_notify_type | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_type |
| | single: OCF_RESKEY_CRM_meta_notify_type |
| | |
| | Allowed values: **pre**, **post** |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_operation | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_operation |
| | single: OCF_RESKEY_CRM_meta_notify_operation |
| | |
| | Allowed values: **start**, **stop** |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_start_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_start_resource |
| | single: OCF_RESKEY_CRM_meta_notify_start_resource |
| | |
| | Resources to be started |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_stop_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_stop_resource |
| | single: OCF_RESKEY_CRM_meta_notify_stop_resource |
| | |
| | Resources to be stopped |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_active_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_active_resource |
| | single: OCF_RESKEY_CRM_meta_notify_active_resource |
| | |
| | Resources that are running |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_inactive_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_inactive_resource |
| | single: OCF_RESKEY_CRM_meta_notify_inactive_resource |
| | |
| | Resources that are not running |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_start_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_start_uname |
| | single: OCF_RESKEY_CRM_meta_notify_start_uname |
| | |
| | Nodes on which resources will be started |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_stop_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_stop_uname |
| | single: OCF_RESKEY_CRM_meta_notify_stop_uname |
| | |
| | Nodes on which resources will be stopped |
+----------------------------------------------+-------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_active_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_active_uname |
| | single: OCF_RESKEY_CRM_meta_notify_active_uname |
| | |
| | Nodes on which resources are running |
+----------------------------------------------+-------------------------------------------------------------------------------+
The variables come in pairs, such as
``OCF_RESKEY_CRM_meta_notify_start_resource`` and
``OCF_RESKEY_CRM_meta_notify_start_uname``, and should be treated as an
array of whitespace-separated elements.
``OCF_RESKEY_CRM_meta_notify_inactive_resource`` is an exception, as the
matching **uname** variable does not exist since inactive resources
are not running on any node.
Thus, in order to indicate that **clone:0** will be started on **sles-1**,
**clone:2** will be started on **sles-3**, and **clone:3** will be started
on **sles-2**, the cluster would set:
.. topic:: Notification variables
.. code-block:: none
OCF_RESKEY_CRM_meta_notify_start_resource="clone:0 clone:2 clone:3"
OCF_RESKEY_CRM_meta_notify_start_uname="sles-1 sles-3 sles-2"
.. note::
Pacemaker will log but otherwise ignore failures of notify actions.
Interpretation of Notification Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Pre-notification (stop):**
* Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
**Post-notification (stop) / Pre-notification (start):**
* Active resources:
* ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Inactive resources:
* ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
**Post-notification (start):**
* Active resources:
* ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Inactive resources:
* ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
Extra Notifications for Promotable Clones
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. index::
single: clone; environment variables
single: promotable; environment variables
.. table:: **Extra environment variables supplied for promotable clones**
:widths: 1 1
+------------------------------------------------+---------------------------------------------------------------------------------+
| Variable | Description |
+================================================+=================================================================================+
| OCF_RESKEY_CRM_meta_notify_promoted_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_promoted_resource |
| | single: OCF_RESKEY_CRM_meta_notify_promoted_resource |
| | |
| | Resources that are running in the promoted role |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_unpromoted_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_unpromoted_resource |
| | single: OCF_RESKEY_CRM_meta_notify_unpromoted_resource |
| | |
| | Resources that are running in the unpromoted role |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_promote_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_promote_resource |
| | single: OCF_RESKEY_CRM_meta_notify_promote_resource |
| | |
| | Resources to be promoted |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_demote_resource | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_demote_resource |
| | single: OCF_RESKEY_CRM_meta_notify_demote_resource |
| | |
| | Resources to be demoted |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_promote_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_promote_uname |
| | single: OCF_RESKEY_CRM_meta_notify_promote_uname |
| | |
| | Nodes on which resources will be promoted |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_demote_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_demote_uname |
| | single: OCF_RESKEY_CRM_meta_notify_demote_uname |
| | |
| | Nodes on which resources will be demoted |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_promoted_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_promoted_uname |
| | single: OCF_RESKEY_CRM_meta_notify_promoted_uname |
| | |
| | Nodes on which resources are running in the promoted role |
+------------------------------------------------+---------------------------------------------------------------------------------+
| OCF_RESKEY_CRM_meta_notify_unpromoted_uname | .. index:: |
| | single: environment variable; OCF_RESKEY_CRM_meta_notify_unpromoted_uname |
| | single: OCF_RESKEY_CRM_meta_notify_unpromoted_uname |
| | |
| | Nodes on which resources are running in the unpromoted role |
+------------------------------------------------+---------------------------------------------------------------------------------+
Interpretation of Promotable Notification Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Pre-notification (demote):**
* Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* Promoted resources: ``$OCF_RESKEY_CRM_meta_notify_promoted_resource``
* Unpromoted resources: ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource``
* Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
**Post-notification (demote) / Pre-notification (stop):**
* Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* Promoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_promoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Unpromoted resources: ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource``
* Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
**Post-notification (stop) / Pre-notification (start):**
* Active resources:
* ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Promoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_promoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Unpromoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Inactive resources:
* ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
**Post-notification (start) / Pre-notification (promote):**
* Active resources:
* ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Promoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_promoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Unpromoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Inactive resources:
* ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
**Post-notification (promote):**
* Active resources:
* ``$OCF_RESKEY_CRM_meta_notify_active_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Promoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_promoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Unpromoted resources:
* ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Inactive resources:
* ``$OCF_RESKEY_CRM_meta_notify_inactive_resource``
* plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* minus ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
* Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource``
* Resources that were promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource``
* Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource``
* Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource``
Monitoring Promotable Clone Resources
_____________________________________
The usual monitor actions are insufficient to monitor a promotable clone
resource, because Pacemaker needs to verify not only that the resource is
active, but also that its actual role matches its intended one.
Define two monitoring actions: the usual one will cover the unpromoted role,
and an additional one with ``role="Promoted"`` will cover the promoted role.
.. topic:: Monitoring both states of a promotable clone resource
.. code-block:: xml
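
   <!-- Illustrative sketch only: the resource id, the ocf:myCorp:myApp agent,
        and the op ids are hypothetical. The two monitors deliberately use
        different intervals: 60 for the unpromoted role and 61 for the
        promoted role. -->
   <primitive id="myRsc" class="ocf" provider="myCorp" type="myApp">
       <operations>
           <op id="myRsc-unpromoted-check" name="monitor" interval="60"/>
           <op id="myRsc-promoted-check" name="monitor" interval="61"
               role="Promoted"/>
       </operations>
   </primitive>
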
.. important::
It is crucial that *every* monitor operation has a different interval!
Pacemaker currently differentiates between operations
only by resource and interval; so if (for example) a promotable clone resource
had the same monitor interval for both roles, Pacemaker would ignore the
role when checking the status -- which would cause unexpected return
codes, and therefore unnecessary complications.
.. _s-promotion-scores:
Determining Which Instance is Promoted
______________________________________
Pacemaker can choose a promotable clone instance to be promoted in one of two
ways:
* Promotion scores: These are node attributes set via the ``crm_attribute``
command using the ``--promotion`` option, which generally would be called by
the resource agent's start action if it supports promotable clones. This tool
automatically detects both the resource and host, and should be used to set a
preference for being promoted. Based on this, ``promoted-max``, and
``promoted-node-max``, the instance(s) with the highest preference will be
promoted.
* Constraints: Location constraints can indicate which nodes are most preferred
to be promoted.
.. topic:: Explicitly preferring node1 to be promoted
.. code-block:: xml
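
   <!-- Illustrative sketch only: the constraint id, the resource name
        myPromotableRsc, and the score of 100 are assumptions. The role
        attribute limits the preference to the promoted role. -->
   <rsc_location id="promoted-prefers-node1" rsc="myPromotableRsc"
                 role="Promoted" node="node1" score="100"/>
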
.. index::
single: bundle
single: resource; bundle
pair: container; Docker
pair: container; podman
pair: container; rkt
.. _s-resource-bundle:
Bundles - Containerized Resources
#################################
Pacemaker supports a special syntax for launching a service inside a
`container `_
with any infrastructure it requires: the *bundle*.
Pacemaker bundles support `Docker `_,
`podman `_ *(since 2.0.1)*, and
`rkt `_ container technologies. [#]_
.. topic:: A bundle for a containerized web server
.. code-block:: xml
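
   <!-- Illustrative sketch only: the bundle id, the podman image name, and the
        replica count are assumptions. Only the single container element that
        every bundle requires is shown; any other child elements are
        omitted. -->
   <bundle id="httpd-bundle">
       <podman image="pcmk:http" replicas="3"/>
   </bundle>
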
Bundle Prerequisites
____________________
Before configuring a bundle in Pacemaker, the user must install the appropriate
container launch technology (Docker, podman, or rkt), and supply a fully
configured container image, on every node allowed to run the bundle.
Pacemaker will create an implicit resource of type **ocf:heartbeat:docker**,
**ocf:heartbeat:podman**, or **ocf:heartbeat:rkt** to manage a bundle's
container. The user must ensure that the appropriate resource agent is
installed on every node allowed to run the bundle.
.. index::
pair: XML element; bundle
Bundle Properties
_________________
.. table:: **XML Attributes of a bundle Element**
:widths: 1 4
+-------------+------------------------------------------------------------------+
| Field | Description |
+=============+==================================================================+
| id | .. index:: |
| | single: bundle; attribute, id |
| | single: attribute; id (bundle) |
| | single: id; bundle attribute |
| | |
| | A unique name for the bundle (required) |
+-------------+------------------------------------------------------------------+
| description | .. index:: |
| | single: bundle; attribute, description |
| | single: attribute; description (bundle) |
| | single: description; bundle attribute |
| | |
| | | An optional description of the bundle, for the user's own |
| | purposes. |
| | E.g. ``manages the container that runs the service`` |
+-------------+------------------------------------------------------------------+
A bundle must contain exactly one ``docker``, ``podman``, or ``rkt`` element.
.. index::
pair: XML element; docker
pair: XML element; podman
pair: XML element; rkt
Bundle Container Properties
___________________________
.. table:: **XML attributes of a docker, podman, or rkt Element**
:class: longtable
:widths: 2 3 4
+-------------------+------------------------------------+---------------------------------------------------+
| Attribute | Default | Description |
+===================+====================================+===================================================+
| image | | .. index:: |
| | | single: docker; attribute, image |
| | | single: attribute; image (docker) |
| | | single: image; docker attribute |
| | | single: podman; attribute, image |
| | | single: attribute; image (podman) |
| | | single: image; podman attribute |
| | | single: rkt; attribute, image |
| | | single: attribute; image (rkt) |
| | | single: image; rkt attribute |
| | | |
| | | Container image tag (required) |
+-------------------+------------------------------------+---------------------------------------------------+
| replicas | Value of ``promoted-max`` | .. index:: |
| | if that is positive, else 1 | single: docker; attribute, replicas |
| | | single: attribute; replicas (docker) |
| | | single: replicas; docker attribute |
| | | single: podman; attribute, replicas |
| | | single: attribute; replicas (podman) |
| | | single: replicas; podman attribute |
| | | single: rkt; attribute, replicas |
| | | single: attribute; replicas (rkt) |
| | | single: replicas; rkt attribute |
| | | |
| | | A positive integer specifying the number of |
| | | container instances to launch |
+-------------------+------------------------------------+---------------------------------------------------+
| replicas-per-host | 1 | .. index:: |
| | | single: docker; attribute, replicas-per-host |
| | | single: attribute; replicas-per-host (docker) |
| | | single: replicas-per-host; docker attribute |
| | | single: podman; attribute, replicas-per-host |
| | | single: attribute; replicas-per-host (podman) |
| | | single: replicas-per-host; podman attribute |
| | | single: rkt; attribute, replicas-per-host |
| | | single: attribute; replicas-per-host (rkt) |
| | | single: replicas-per-host; rkt attribute |
| | | |
| | | A positive integer specifying the number of |
| | | container instances allowed to run on a |
| | | single node |
+-------------------+------------------------------------+---------------------------------------------------+
| promoted-max | 0 | .. index:: |
| | | single: docker; attribute, promoted-max |
| | | single: attribute; promoted-max (docker) |
| | | single: promoted-max; docker attribute |
| | | single: podman; attribute, promoted-max |
| | | single: attribute; promoted-max (podman) |
| | | single: promoted-max; podman attribute |
| | | single: rkt; attribute, promoted-max |
| | | single: attribute; promoted-max (rkt) |
| | | single: promoted-max; rkt attribute |
| | | |
| | | A non-negative integer that, if positive, |
| | | indicates that the containerized service |
| | | should be treated as a promotable service, |
| | | with this many replicas allowed to run the |
| | | service in the promoted role |
+-------------------+------------------------------------+---------------------------------------------------+
| network | | .. index:: |
| | | single: docker; attribute, network |
| | | single: attribute; network (docker) |
| | | single: network; docker attribute |
| | | single: podman; attribute, network |
| | | single: attribute; network (podman) |
| | | single: network; podman attribute |
| | | single: rkt; attribute, network |
| | | single: attribute; network (rkt) |
| | | single: network; rkt attribute |
| | | |
| | | If specified, this will be passed to the |
| | | ``docker run``, ``podman run``, or |
| | | ``rkt run`` command as the network setting |
| | | for the container. |
+-------------------+------------------------------------+---------------------------------------------------+
| run-command | ``/usr/sbin/pacemaker-remoted`` if | .. index:: |
| | bundle contains a **primitive**, | single: docker; attribute, run-command |
| | otherwise none | single: attribute; run-command (docker) |
| | | single: run-command; docker attribute |
| | | single: podman; attribute, run-command |
| | | single: attribute; run-command (podman) |
| | | single: run-command; podman attribute |
| | | single: rkt; attribute, run-command |
| | | single: attribute; run-command (rkt) |
| | | single: run-command; rkt attribute |
| | | |
| | | This command will be run inside the container |
| | | when launching it ("PID 1"). If the bundle |
| | | contains a **primitive**, this command *must* |
| | | start ``pacemaker-remoted`` (but could, for |
| | | example, be a script that does other stuff, too). |
+-------------------+------------------------------------+---------------------------------------------------+
| options | | .. index:: |
| | | single: docker; attribute, options |
| | | single: attribute; options (docker) |
| | | single: options; docker attribute |
| | | single: podman; attribute, options |
| | | single: attribute; options (podman) |
| | | single: options; podman attribute |
| | | single: rkt; attribute, options |
| | | single: attribute; options (rkt) |
| | | single: options; rkt attribute |
| | | |
| | | Extra command-line options to pass to the |
| | | ``docker run``, ``podman run``, or ``rkt run`` |
| | | command |
+-------------------+------------------------------------+---------------------------------------------------+
.. note::
Considerations when using cluster configurations or container images from
Pacemaker 1.1:
* If the container image has a pre-2.0.0 version of Pacemaker, set ``run-command``
to ``/usr/sbin/pacemaker_remoted`` (note the underbar instead of dash).
* ``masters`` is accepted as an alias for ``promoted-max``, but has been deprecated since
2.0.0, and support for it will be removed in a future version.
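As an illustration of the attributes above, a minimal container element might
look like the following sketch. The bundle id ``httpd-bundle`` and the image tag
``pcmk:http`` are placeholders chosen for this example, not names that Pacemaker
defines.

.. code-block:: xml

   <bundle id="httpd-bundle">
     <podman image="pcmk:http" replicas="3" replicas-per-host="1"/>
   </bundle>

With these values, up to three container instances would be launched, at most
one per node. Because ``promoted-max`` is left at its default of 0, the service
is not treated as promotable.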
Bundle Network Properties
_________________________
A bundle may optionally contain one ``<network>`` element.
.. index::
pair: XML element; network
single: bundle; network
.. table:: **XML attributes of a network Element**
:widths: 2 1 5
+----------------+---------+------------------------------------------------------------+
| Attribute | Default | Description |
+================+=========+============================================================+
| add-host | TRUE | .. index:: |
| | | single: network; attribute, add-host |
| | | single: attribute; add-host (network) |
| | | single: add-host; network attribute |
| | | |
| | | If TRUE, and ``ip-range-start`` is used, Pacemaker will |
| | | automatically ensure that ``/etc/hosts`` inside the |
| | | containers has entries for each |
| | | :ref:`replica name <s-resource-bundle-note-replica-names>` |
| | | and its assigned IP. |
+----------------+---------+------------------------------------------------------------+
| ip-range-start | | .. index:: |
| | | single: network; attribute, ip-range-start |
| | | single: attribute; ip-range-start (network) |
| | | single: ip-range-start; network attribute |
| | | |
| | | If specified, Pacemaker will create an implicit |
| | | ``ocf:heartbeat:IPaddr2`` resource for each container |
| | | instance, starting with this IP address, using up to |
| | | ``replicas`` sequential addresses. These addresses can be |
| | | used from the host's network to reach the service inside |
| | | the container, though they are not visible within the |
| | | container itself. Only IPv4 addresses are currently |
| | | supported. |
+----------------+---------+------------------------------------------------------------+
| host-netmask | 32 | .. index:: |
| | | single: network; attribute, host-netmask |
| | | single: attribute; host-netmask (network) |
| | | single: host-netmask; network attribute |
| | | |
| | | If ``ip-range-start`` is specified, the IP addresses |
| | | are created with this CIDR netmask (as a number of bits). |
+----------------+---------+------------------------------------------------------------+
| host-interface | | .. index:: |
| | | single: network; attribute, host-interface |
| | | single: attribute; host-interface (network) |
| | | single: host-interface; network attribute |
| | | |
| | | If ``ip-range-start`` is specified, the IP addresses are |
| | | created on this host interface (by default, it will be |
| | | determined from the IP address). |
+----------------+---------+------------------------------------------------------------+
| control-port | 3121 | .. index:: |
| | | single: network; attribute, control-port |
| | | single: attribute; control-port (network) |
| | | single: control-port; network attribute |
| | | |
| | | If the bundle contains a ``primitive``, the cluster will |
| | | use this integer TCP port for communication with |
| | | Pacemaker Remote inside the container. Changing this is |
| | | useful when the container is unable to listen on the |
| | | default port, for example, when the container uses the |
| | | host's network rather than ``ip-range-start`` (in which |
| | | case ``replicas-per-host`` must be 1), or when the bundle |
| | | may run on a Pacemaker Remote node that is already |
| | | listening on the default port. Any ``PCMK_remote_port`` |
| | | environment variable set on the host or in the container |
| | | is ignored for bundle connections. |
+----------------+---------+------------------------------------------------------------+
.. _s-resource-bundle-note-replica-names:
.. note::
Replicas are named by the bundle id plus a dash and an integer counter starting
with zero. For example, if a bundle named **httpd-bundle** has **replicas=2**, its
containers will be named **httpd-bundle-0** and **httpd-bundle-1**.
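A sketch of a ``network`` element using the attributes above follows. The
address, netmask, and interface name are assumptions made for the example.

.. code-block:: xml

   <network ip-range-start="192.168.122.131" host-netmask="24" host-interface="eth0"/>

Assuming the bundle is named **httpd-bundle** and has **replicas=2**, Pacemaker
would use the two sequential addresses 192.168.122.131 and 192.168.122.132,
presumably in that order, for **httpd-bundle-0** and **httpd-bundle-1**. Because
``add-host`` defaults to TRUE, ``/etc/hosts`` inside the containers would get an
entry for each replica name and its address.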
.. index::
pair: XML element; port-mapping
Additionally, a ``network`` element may optionally contain one or more
``port-mapping`` elements.
.. table:: **Attributes of a port-mapping Element**
:widths: 2 1 5
+---------------+-------------------+------------------------------------------------------+
| Attribute | Default | Description |
+===============+===================+======================================================+
| id | | .. index:: |
| | | single: port-mapping; attribute, id |
| | | single: attribute; id (port-mapping) |
| | | single: id; port-mapping attribute |
| | | |
| | | A unique name for the port mapping (required) |
+---------------+-------------------+------------------------------------------------------+
| port | | .. index:: |
| | | single: port-mapping; attribute, port |
| | | single: attribute; port (port-mapping) |
| | | single: port; port-mapping attribute |
| | | |
| | | If this is specified, connections to this TCP port |
| | | number on the host network (on the container's |
| | | assigned IP address, if ``ip-range-start`` is |
| | | specified) will be forwarded to the container |
| | | network. Exactly one of ``port`` or ``range`` |
| | | must be specified in a ``port-mapping``. |
+---------------+-------------------+------------------------------------------------------+
| internal-port | value of ``port`` | .. index:: |
| | | single: port-mapping; attribute, internal-port |
| | | single: attribute; internal-port (port-mapping) |
| | | single: internal-port; port-mapping attribute |
| | | |
| | | If ``port`` and this are specified, connections |
| | | to ``port`` on the host's network will be |
| | | forwarded to this port on the container network. |
+---------------+-------------------+------------------------------------------------------+
| range | | .. index:: |
| | | single: port-mapping; attribute, range |
| | | single: attribute; range (port-mapping) |
| | | single: range; port-mapping attribute |
| | | |
| | | If this is specified, connections to these TCP |
| | | port numbers (expressed as *first_port*-*last_port*) |
| | | on the host network (on the container's assigned IP |
| | | address, if ``ip-range-start`` is specified) will |
| | | be forwarded to the same ports in the container |
| | | network. Exactly one of ``port`` or ``range`` |
| | | must be specified in a ``port-mapping``. |
+---------------+-------------------+------------------------------------------------------+
.. note::
If the bundle contains a ``primitive``, Pacemaker will automatically map the
``control-port``, so it is not necessary to specify that port in a
``port-mapping``.
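The three ways of describing a mapping can be sketched as follows. The ids and
port numbers are hypothetical and chosen only to show one mapping of each kind.

.. code-block:: xml

   <network ip-range-start="192.168.122.131">
     <!-- Forward host port 80 to the same port in the container -->
     <port-mapping id="httpd-port" port="80"/>
     <!-- Forward host port 8443 to port 443 in the container -->
     <port-mapping id="httpd-tls-port" port="8443" internal-port="443"/>
     <!-- Forward host ports 9000 through 9010 to the same ports in the container -->
     <port-mapping id="httpd-port-range" range="9000-9010"/>
   </network>

Each ``port-mapping`` here sets exactly one of ``port`` or ``range``, as the
table requires. Since ``ip-range-start`` is set, the forwarded ports are those
on each container's assigned IP address.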
.. index::
pair: XML element; storage
pair: XML element; storage-mapping
single: bundle; storage
.. _s-bundle-storage:
Bundle Storage Properties
_________________________
A bundle may optionally contain one ``storage`` element. A ``storage`` element
has no properties of its own, but may contain one or more ``storage-mapping``
elements.
.. table:: **Attributes of a storage-mapping Element**
:widths: 2 1 5
+-----------------+---------+-------------------------------------------------------------+
| Attribute | Default | Description |
+=================+=========+=============================================================+
| id | | .. index:: |
| | | single: storage-mapping; attribute, id |
| | | single: attribute; id (storage-mapping) |
| | | single: id; storage-mapping attribute |
| | | |
| | | A unique name for the storage mapping (required) |
+-----------------+---------+-------------------------------------------------------------+
| source-dir | | .. index:: |
| | | single: storage-mapping; attribute, source-dir |
| | | single: attribute; source-dir (storage-mapping) |
| | | single: source-dir; storage-mapping attribute |
| | | |
| | | The absolute path on the host's filesystem that will be |
| | | mapped into the container. Exactly one of ``source-dir`` |
| | | and ``source-dir-root`` must be specified in a |
| | | ``storage-mapping``. |
+-----------------+---------+-------------------------------------------------------------+
| source-dir-root | | .. index:: |
| | | single: storage-mapping; attribute, source-dir-root |
| | | single: attribute; source-dir-root (storage-mapping) |
| | | single: source-dir-root; storage-mapping attribute |
| | | |
| | | The start of a path on the host's filesystem that will |
| | | be mapped into the container, using a different |
| | | subdirectory on the host for each container instance. |
| | | The subdirectory will be named the same as the |
| | | :ref:`replica name <s-resource-bundle-note-replica-names>`. |
| | | Exactly one of ``source-dir`` and ``source-dir-root`` |
| | | must be specified in a ``storage-mapping``. |
+-----------------+---------+-------------------------------------------------------------+
| target-dir | | .. index:: |
| | | single: storage-mapping; attribute, target-dir |
| | | single: attribute; target-dir (storage-mapping) |
| | | single: target-dir; storage-mapping attribute |
| | | |
| | | The path name within the container where the host |
| | | storage will be mapped (required) |
+-----------------+---------+-------------------------------------------------------------+
| options | | .. index:: |
| | | single: storage-mapping; attribute, options |
| | | single: attribute; options (storage-mapping) |
| | | single: options; storage-mapping attribute |
| | | |
| | | A comma-separated list of file system mount |
| | | options to use when mapping the storage |
+-----------------+---------+-------------------------------------------------------------+
.. note::
Pacemaker does not define the behavior if the source directory does not already
exist on the host. However, it is expected that the container technology and/or
its resource agent will create the source directory in that case.
.. note::
If the bundle contains a ``primitive``,
Pacemaker will automatically map the equivalent of
``source-dir=/etc/pacemaker/authkey target-dir=/etc/pacemaker/authkey``
and ``source-dir-root=/var/log/pacemaker/bundles target-dir=/var/log`` into the
container, so it is not necessary to specify those paths in a
``storage-mapping``.
.. important::
The ``PCMK_authkey_location`` environment variable must not be set to anything
other than the default of ``/etc/pacemaker/authkey`` on any node in the cluster.
.. important::
If SELinux is used in enforcing mode on the host, you must ensure the container
is allowed to use any storage you mount into it. For Docker and podman bundles,
adding "Z" to the mount options will create a container-specific label for the
mount that allows the container access.
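A ``storage`` element combining both kinds of source might be sketched as
follows. The ids and directory paths are assumptions for illustration, and the
bundle is assumed to be named **httpd-bundle**.

.. code-block:: xml

   <storage>
     <!-- The same host directory is mapped into every replica -->
     <storage-mapping id="httpd-root" source-dir="/srv/html" target-dir="/var/www/html" options="rw,Z"/>
     <!-- Each replica gets its own host subdirectory, named after the replica -->
     <storage-mapping id="httpd-logs" source-dir-root="/var/log/pacemaker/bundles" target-dir="/etc/httpd/logs" options="rw,Z"/>
   </storage>

With the second mapping, replica **httpd-bundle-0** would see the host directory
``/var/log/pacemaker/bundles/httpd-bundle-0`` as ``/etc/httpd/logs``. The "Z"
mount option requests the container-specific SELinux label described above.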
.. index::
single: bundle; primitive
Bundle Primitive
________________
A bundle may optionally contain one :ref:`primitive `
resource. The primitive may have operations, instance attributes, and
meta-attributes defined, as usual.
If a bundle contains a primitive resource, the container image must include
the Pacemaker Remote daemon, and at least one of ``ip-range-start`` or
``control-port`` must be configured in the bundle. Pacemaker will create an
implicit **ocf:pacemaker:remote** resource for the connection, launch
Pacemaker Remote within the container, and monitor and manage the primitive
resource via Pacemaker Remote.
If the bundle has more than one container instance (replica), the primitive
resource will function as an implicit :ref:`clone ` -- a
:ref:`promotable clone ` if the bundle has ``promoted-max``
greater than zero.
.. note::
If you want to pass environment variables to a bundle's Pacemaker Remote
connection or primitive, you have two options:
* Environment variables whose value is the same regardless of the underlying host
may be set using the container element's ``options`` attribute.
* If you want variables to have host-specific values, you can use the
:ref:`storage-mapping <s-bundle-storage>` element to map a file on the host as
``/etc/pacemaker/pcmk-init.env`` in the container *(since 2.0.3)*.
Pacemaker Remote will parse this file as a shell-like format, with
variables set as NAME=VALUE, ignoring blank lines and comments starting
with "#".
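Both options can be sketched in one bundle fragment. The variable name
``HTTPD_MODE``, its value, and the host file path are hypothetical. The example
also assumes that the container technology accepts ``--env`` on its ``run``
command, as ``podman run`` does.

.. code-block:: xml

   <bundle id="httpd-bundle">
     <!-- Same value on every host: pass it on the container command line -->
     <podman image="pcmk:http" options="--env HTTPD_MODE=production"/>
     <storage>
       <!-- Host-specific values: map a per-host file to the path Pacemaker Remote reads -->
       <storage-mapping id="httpd-env" source-dir="/etc/pacemaker/httpd-bundle.env" target-dir="/etc/pacemaker/pcmk-init.env" options="ro"/>
     </storage>
   </bundle>

The mapped file on each host would then hold one NAME=VALUE assignment per line.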
.. important::
When a bundle has a ``primitive``, Pacemaker on all cluster nodes must be able to
contact Pacemaker Remote inside the bundle's containers.
* The containers must have an accessible network (for example, ``network`` should
not be set to "none" with a ``primitive``).
* The default, using a distinct network space inside the container, works in
combination with ``ip-range-start``. Any firewall must allow access from all
cluster nodes to the ``control-port`` on the container IPs.
* If the container shares the host's network space (for example, by setting
``network`` to "host"), a unique ``control-port`` should be specified for each
bundle. Any firewall must allow access from all cluster nodes to the
``control-port`` on all cluster and remote node IPs.
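Putting the pieces together, a bundle with a primitive might look like the
following sketch. The image tag, addresses, and ids are placeholders, and the
image is assumed to include both the Pacemaker Remote daemon and the service
being managed.

.. code-block:: xml

   <bundle id="httpd-bundle">
     <podman image="pcmk:http" replicas="3"/>
     <network ip-range-start="192.168.122.131" host-netmask="24" host-interface="eth0">
       <port-mapping id="httpd-port" port="80"/>
     </network>
     <primitive class="ocf" id="httpd" provider="heartbeat" type="apache"/>
   </bundle>

Here ``ip-range-start`` satisfies the requirement that at least one of
``ip-range-start`` or ``control-port`` be configured. With three replicas, the
``httpd`` primitive would behave as an implicit clone. Any firewall would need to
allow all cluster nodes to reach the default ``control-port`` (3121) on the three
container addresses.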
.. index::
single: bundle; node attributes
.. _s-bundle-attributes:
Bundle Node Attributes
______________________
If the bundle has a ``primitive``, the primitive's resource agent may want to set
node attributes such as :ref:`promotion scores `. However, with
containers, it is not apparent which node should get the attribute.
If the container uses shared storage that is the same no matter which node the
container is hosted on, then it is appropriate to use the promotion score on the
bundle node itself.
On the other hand, if the container uses storage exported from the underlying host,
then it may be more appropriate to use the promotion score on the underlying host.
Since this depends on the particular situation, the
``container-attribute-target`` resource meta-attribute allows the user to specify
which approach to use. If it is set to ``host``, then user-defined node attributes
will be checked on the underlying host. If it is anything else, the local node
(in this case the bundle node) is used as usual.
This only applies to user-defined attributes; the cluster will always check the
local node for cluster-defined attributes such as ``#uname``.
If ``container-attribute-target`` is ``host``, the cluster will pass additional
environment variables to the primitive's resource agent that allow it to set
node attributes appropriately: ``CRM_meta_container_attribute_target`` (identical
to the meta-attribute value) and ``CRM_meta_physical_host`` (the name of the
underlying host).
.. note::
When called by a resource agent, the ``attrd_updater`` and ``crm_attribute``
commands will automatically check those environment variables and set
attributes appropriately.
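As a sketch, the meta-attribute could be set on the bundle's primitive as shown
below. The ids are placeholders for this example.

.. code-block:: xml

   <primitive class="ocf" id="httpd" provider="heartbeat" type="apache">
     <meta_attributes id="httpd-meta">
       <nvpair id="httpd-meta-attribute-target" name="container-attribute-target" value="host"/>
     </meta_attributes>
   </primitive>

With this setting, user-defined node attributes such as promotion scores would be
checked on the underlying host rather than on the bundle node, and the resource
agent would receive ``CRM_meta_container_attribute_target`` and
``CRM_meta_physical_host`` in its environment.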
.. index::
single: bundle; meta-attributes
Bundle Meta-Attributes
______________________
Any meta-attribute set on a bundle will be inherited by the bundle's
primitive and any resources implicitly created by Pacemaker for the bundle.
This includes options such as ``priority``, ``target-role``, and ``is-managed``. See
:ref:`resource_options` for more information.
Bundles support clone meta-attributes including ``notify``, ``ordered``, and
``interleave``.
Limitations of Bundles
______________________
Restarting Pacemaker while a bundle is unmanaged or the cluster is in
maintenance mode may cause the bundle to fail.
Bundles may not be explicitly cloned or included in groups. This includes the
bundle's primitive and any resources implicitly created by Pacemaker for the
bundle. (If ``replicas`` is greater than 1, the bundle will behave like a clone
implicitly.)
Bundles do not have instance attributes, utilization attributes, or operations,
though a bundle's primitive may have them.
A bundle with a primitive can run on a Pacemaker Remote node only if the bundle
uses a distinct ``control-port``.
.. [#] Of course, the service must support running multiple instances.
.. [#] Docker is a trademark of Docker, Inc. No endorsement by or association with
Docker, Inc. is implied.
diff --git a/doc/sphinx/Pacemaker_Explained/options.rst b/doc/sphinx/Pacemaker_Explained/options.rst
index 5d95e4c867..ca7ea2a8a3 100644
--- a/doc/sphinx/Pacemaker_Explained/options.rst
+++ b/doc/sphinx/Pacemaker_Explained/options.rst
@@ -1,631 +1,631 @@
Cluster-Wide Configuration
--------------------------
.. index::
pair: XML element; cib
pair: XML element; configuration
Configuration Layout
####################
The cluster is defined by the Cluster Information Base (CIB), which uses XML
notation. The simplest CIB, an empty one, looks like this:
.. topic:: An empty configuration
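.. code-block:: xml

   <cib>
     <configuration>
       <crm_config/>
       <nodes/>
       <resources/>
       <constraints/>
     </configuration>
     <status/>
   </cib>
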
The empty configuration above contains the major sections that make up a CIB:
* ``cib``: The entire CIB is enclosed with a ``cib`` element. Certain
fundamental settings are defined as attributes of this element.
* ``configuration``: This section -- the primary focus of this document --
contains traditional configuration information such as what resources the
cluster serves and the relationships among them.
* ``crm_config``: cluster-wide configuration options
* ``nodes``: the machines that host the cluster
* ``resources``: the services run by the cluster
* ``constraints``: indications of how resources should be placed
* ``status``: This section contains the history of each resource on each
node. Based on this data, the cluster can construct the complete current
state of the cluster. The authoritative source for this section is the
local executor (pacemaker-execd process) on each cluster node, and the
cluster will occasionally repopulate the entire section. For this reason,
it is never written to disk, and administrators are advised against
modifying it in any way.
In this document, configuration settings will be described as properties or
options based on how they are defined in the CIB:
* Properties are XML attributes of an XML element.
* Options are name-value pairs expressed as ``nvpair`` child elements of an XML
element.
Normally, you will use command-line tools that abstract the XML, so the
distinction will be unimportant; both properties and options are cluster
settings you can tweak.
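The distinction can be seen in the following fragment, which is a sketch rather
than a complete, schema-valid CIB. The ids and values are chosen for illustration
only.

.. code-block:: xml

   <!-- admin_epoch, epoch, and num_updates are properties: XML attributes of cib -->
   <cib admin_epoch="1" epoch="5" num_updates="0">
     <configuration>
       <crm_config>
         <cluster_property_set id="cib-bootstrap-options">
           <!-- no-quorum-policy is an option: an nvpair child element -->
           <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
         </cluster_property_set>
       </crm_config>
     </configuration>
   </cib>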
CIB Properties
##############
Certain settings are defined by CIB properties (that is, attributes of the
``cib`` tag) rather than with the rest of the cluster configuration in the
``configuration`` section.
The reason is simply a matter of parsing. These options are used by the
configuration database which is, by design, mostly ignorant of the content it
holds. So the decision was made to place them in an easy-to-find location.
.. table:: **CIB Properties**
:class: longtable
:widths: 1 3
+------------------+-----------------------------------------------------------+
| Attribute | Description |
+==================+===========================================================+
| admin_epoch | .. index:: |
| | pair: admin_epoch; cib |
| | |
| | When a node joins the cluster, the cluster performs a |
| | check to see which node has the best configuration. It |
| | asks the node with the highest (``admin_epoch``, |
| | ``epoch``, ``num_updates``) tuple to replace the |
| | configuration on all the nodes -- which makes setting |
| | them, and setting them correctly, very important. |
| | ``admin_epoch`` is never modified by the cluster; you can |
| | use this to make the configurations on any inactive nodes |
| | obsolete. |
| | |
| | **Warning:** Never set this value to zero. In such cases, |
| | the cluster cannot tell the difference between your |
| | configuration and the "empty" one used when nothing is |
| | found on disk. |
+------------------+-----------------------------------------------------------+
| epoch | .. index:: |
| | pair: epoch; cib |
| | |
| | The cluster increments this every time the configuration |
| | is updated (usually by the administrator). |
+------------------+-----------------------------------------------------------+
| num_updates | .. index:: |
| | pair: num_updates; cib |
| | |
| | The cluster increments this every time the configuration |
| | or status is updated (usually by the cluster) and resets |
| | it to 0 when epoch changes. |
+------------------+-----------------------------------------------------------+
| validate-with | .. index:: |
| | pair: validate-with; cib |
| | |
| | Determines the type of XML validation that will be done |
| | on the configuration. If set to ``none``, the cluster |
| | will not verify that updates conform to the schema (nor |
| | reject ones that don't). |
+------------------+-----------------------------------------------------------+
| cib-last-written | .. index:: |
| | pair: cib-last-written; cib |
| | |
| | Indicates when the configuration was last written to |
| | disk. Maintained by the cluster; for informational |
| | purposes only. |
+------------------+-----------------------------------------------------------+
| have-quorum | .. index:: |
| | pair: have-quorum; cib |
| | |
| | Indicates if the cluster has quorum. If false, this may |
| | mean that the cluster cannot start resources or fence |
| | other nodes (see ``no-quorum-policy`` below). Maintained |
| | by the cluster. |
+------------------+-----------------------------------------------------------+
| dc-uuid | .. index:: |
| | pair: dc-uuid; cib |
| | |
| | Indicates which cluster node is the current leader. Used |
| | by the cluster when placing resources and determining the |
| | order of some events. Maintained by the cluster. |
+------------------+-----------------------------------------------------------+
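As a worked example of the version comparison, consider two nodes whose
configurations carry the hypothetical values below. The example assumes the
tuple is compared element by element from left to right.

.. code-block:: xml

   <!-- Node A's configuration (other attributes and content omitted) -->
   <cib admin_epoch="1" epoch="5" num_updates="3"/>

   <!-- Node B's configuration (other attributes and content omitted) -->
   <cib admin_epoch="1" epoch="6" num_updates="0"/>

The ``admin_epoch`` values are equal, so ``epoch`` decides. Node B's tuple
(1, 6, 0) is higher than node A's (1, 5, 3), and node B's configuration would
replace node A's even though node A has the larger ``num_updates``. If an
administrator raised ``admin_epoch`` to 2 on node A, node A's configuration would
win regardless of the other two values.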
.. _cluster_options:
Cluster Options
###############
Cluster options, as you might expect, control how the cluster behaves when
confronted with various situations.
They are grouped into sets within the ``crm_config`` section. In advanced
configurations, there may be more than one set. (This will be described later
in the chapter on :ref:`rules` where we will show how to have the cluster use
different sets of options during working hours than during weekends.) For now,
we will describe the simple case where each option is present at most once.
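In that simple case, the ``crm_config`` section holds a single set of name-value
pairs, as in the sketch below. The ids and the chosen values are illustrative,
not recommendations.

.. code-block:: xml

   <crm_config>
     <cluster_property_set id="cib-bootstrap-options">
       <nvpair id="opt-stonith-enabled" name="stonith-enabled" value="true"/>
       <nvpair id="opt-no-quorum-policy" name="no-quorum-policy" value="freeze"/>
       <nvpair id="opt-batch-limit" name="batch-limit" value="30"/>
     </cluster_property_set>
   </crm_config>

Any option not listed in the set keeps its default value.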
You can obtain an up-to-date list of cluster options, including their default
values, by running the ``man pacemaker-schedulerd`` and
``man pacemaker-controld`` commands.
.. table:: **Cluster Options**
:class: longtable
:widths: 2 1 4
+---------------------------+---------+----------------------------------------------------+
| Option | Default | Description |
+===========================+=========+====================================================+
| cluster-name | | .. index:: |
| | | pair: cluster option; cluster-name |
| | | |
| | | An (optional) name for the cluster as a whole. |
| | | This is mostly for users' convenience for use |
| | | as desired in administration, but this can be |
| | | used in the Pacemaker configuration in |
| | | :ref:`rules` (as the ``#cluster-name`` |
| | | :ref:`node attribute |
| | | `). It may |
| | | also be used by higher-level tools when |
| | | displaying cluster information, and by |
| | | certain resource agents (for example, the |
| | | ``ocf:heartbeat:GFS2`` agent stores the |
| | | cluster name in filesystem meta-data). |
+---------------------------+---------+----------------------------------------------------+
| dc-version | | .. index:: |
| | | pair: cluster option; dc-version |
| | | |
| | | Version of Pacemaker on the cluster's DC. |
| | | Determined automatically by the cluster. Often |
| | | includes the hash which identifies the exact |
| | | Git changeset it was built from. Used for |
| | | diagnostic purposes. |
+---------------------------+---------+----------------------------------------------------+
| cluster-infrastructure | | .. index:: |
| | | pair: cluster option; cluster-infrastructure |
| | | |
| | | The messaging stack on which Pacemaker is |
| | | currently running. Determined automatically by |
| | | the cluster. Used for informational and |
| | | diagnostic purposes. |
+---------------------------+---------+----------------------------------------------------+
| no-quorum-policy | stop | .. index:: |
| | | pair: cluster option; no-quorum-policy |
| | | |
| | | What to do when the cluster does not have |
| | | quorum. Allowed values: |
| | | |
| | | * ``ignore:`` continue all resource management |
| | | * ``freeze:`` continue resource management, but |
| | | don't recover resources from nodes not in the |
| | | affected partition |
| | | * ``stop:`` stop all resources in the affected |
| | | cluster partition |
| | | * ``demote:`` demote promotable resources and |
| | | stop all other resources in the affected |
| | | cluster partition *(since 2.0.5)* |
| | | * ``suicide:`` fence all nodes in the affected |
| | | cluster partition |
+---------------------------+---------+----------------------------------------------------+
| batch-limit | 0 | .. index:: |
| | | pair: cluster option; batch-limit |
| | | |
| | | The maximum number of actions that the cluster |
| | | may execute in parallel across all nodes. The |
| | | "correct" value will depend on the speed and |
| | | load of your network and cluster nodes. If zero, |
| | | the cluster will impose a dynamically calculated |
| | | limit only when any node has high load. If -1, the |
| | | cluster will not impose any limit. |
+---------------------------+---------+----------------------------------------------------+
| migration-limit | -1 | .. index:: |
| | | pair: cluster option; migration-limit |
| | | |
| | | The number of |
| | | :ref:`live migration ` actions |
| | | that the cluster is allowed to execute in |
| | | parallel on a node. A value of -1 means |
| | | unlimited. |
+---------------------------+---------+----------------------------------------------------+
| symmetric-cluster | true | .. index:: |
| | | pair: cluster option; symmetric-cluster |
| | | |
| | | Whether resources can run on any node by default |
| | | (if false, a resource is allowed to run on a |
| | | node only if a |
| | | :ref:`location constraint ` |
| | | enables it) |
+---------------------------+---------+----------------------------------------------------+
| stop-all-resources | false | .. index:: |
| | | pair: cluster option; stop-all-resources |
| | | |
| | | Whether all resources should be disallowed from |
| | | running (can be useful during maintenance) |
+---------------------------+---------+----------------------------------------------------+
| stop-orphan-resources | true | .. index:: |
| | | pair: cluster option; stop-orphan-resources |
| | | |
| | | Whether resources that have been deleted from |
| | | the configuration should be stopped. This value |
| | | takes precedence over ``is-managed`` (that is, |
| | | even unmanaged resources will be stopped when |
| | | orphaned if this value is ``true``) |
+---------------------------+---------+----------------------------------------------------+
| stop-orphan-actions | true | .. index:: |
| | | pair: cluster option; stop-orphan-actions |
| | | |
| | | Whether recurring :ref:`operations ` |
| | | that have been deleted from the configuration |
| | | should be cancelled |
+---------------------------+---------+----------------------------------------------------+
| start-failure-is-fatal | true | .. index:: |
| | | pair: cluster option; start-failure-is-fatal |
| | | |
| | | Whether a failure to start a resource on a |
| | | particular node prevents further start attempts |
| | | on that node. If ``false``, the cluster will |
| | | decide whether the node is still eligible based |
| | | on the resource's current failure count and |
| | | :ref:`migration-threshold `. |
+---------------------------+---------+----------------------------------------------------+
| enable-startup-probes | true | .. index:: |
| | | pair: cluster option; enable-startup-probes |
| | | |
| | | Whether the cluster should check the |
| | | pre-existing state of resources when the cluster |
| | | starts |
+---------------------------+---------+----------------------------------------------------+
| maintenance-mode | false | .. index:: |
| | | pair: cluster option; maintenance-mode |
| | | |
| | | Whether the cluster should refrain from |
| | | monitoring, starting and stopping resources |
+---------------------------+---------+----------------------------------------------------+
| stonith-enabled | true | .. index:: |
| | | pair: cluster option; stonith-enabled |
| | | |
| | | Whether the cluster is allowed to fence nodes |
| | | (for example, failed nodes and nodes with |
| | | resources that can't be stopped). |
| | | |
| | | If true, at least one fence device must be |
| | | configured before resources are allowed to run. |
| | | |
| | | If false, unresponsive nodes are immediately |
| | | assumed to be running no resources, and resource |
| | | recovery on online nodes starts without any |
| | | further protection (which can mean *data loss* |
| | | if the unresponsive node still accesses shared |
| | | storage, for example). See also the |
| | | :ref:`requires ` resource |
| | | meta-attribute. |
+---------------------------+---------+----------------------------------------------------+
| stonith-action | reboot | .. index:: |
| | | pair: cluster option; stonith-action |
| | | |
| | | Action the cluster should send to the fence agent |
| | | when a node must be fenced. Allowed values are |
| | | ``reboot``, ``off``, and (for legacy agents only) |
| | | ``poweroff``. |
+---------------------------+---------+----------------------------------------------------+
| stonith-timeout | 60s | .. index:: |
| | | pair: cluster option; stonith-timeout |
| | | |
| | | How long to wait for ``on``, ``off``, and |
| | | ``reboot`` fence actions to complete by default. |
+---------------------------+---------+----------------------------------------------------+
| stonith-max-attempts | 10 | .. index:: |
| | | pair: cluster option; stonith-max-attempts |
| | | |
| | | How many times fencing can fail for a target |
| | | before the cluster will no longer immediately |
| | | re-attempt it. |
+---------------------------+---------+----------------------------------------------------+
| stonith-watchdog-timeout | 0 | .. index:: |
| | | pair: cluster option; stonith-watchdog-timeout |
| | | |
| | | If nonzero, and the cluster detects |
| | | ``have-watchdog`` as ``true``, then watchdog-based |
| | | self-fencing will be performed via SBD when |
| | | fencing is required, without requiring an |
| | | explicitly configured fencing resource. |
| | | |
| | | If this is set to a positive value, unseen nodes |
| | | are assumed to self-fence within this much time. |
| | | |
| | | **Warning:** It must be ensured that this value is |
| | | larger than the ``SBD_WATCHDOG_TIMEOUT`` |
| | | environment variable on all nodes. Pacemaker |
| | | verifies the settings individually on all nodes |
| | | and prevents startup (or shuts down, if the value |
| | | is misconfigured at runtime). It is recommended |
| | | that ``SBD_WATCHDOG_TIMEOUT`` be set to the same |
| | | value on all nodes. |
| | | |
| | | If this is set to a negative value, and |
| | | ``SBD_WATCHDOG_TIMEOUT`` is set, twice that value |
| | | will be used. |
| | | |
| | | **Warning:** In this case, it is essential (and |
| | | currently not verified by pacemaker) that |
| | | ``SBD_WATCHDOG_TIMEOUT`` is set to the same |
| | | value on all nodes. |
+---------------------------+---------+----------------------------------------------------+
| concurrent-fencing | false | .. index:: |
| | | pair: cluster option; concurrent-fencing |
| | | |
| | | Whether the cluster is allowed to initiate |
| | | multiple fence actions concurrently. Fence actions |
| | | initiated externally, such as via the |
| | | ``stonith_admin`` tool or an application such as |
| | | DLM, or by the fencer itself such as recurring |
| | | device monitors and ``status`` and ``list`` |
| | | commands, are not limited by this option. |
+---------------------------+---------+----------------------------------------------------+
| fence-reaction | stop | .. index:: |
| | | pair: cluster option; fence-reaction |
| | | |
| | | How should a cluster node react if notified of its |
| | | own fencing? A cluster node may receive |
| | | notification of its own fencing if fencing is |
| | | misconfigured, or if fabric fencing is in use that |
| | | doesn't cut cluster communication. Allowed values |
| | | are ``stop`` to attempt to immediately stop |
| | | pacemaker and stay stopped, or ``panic`` to |
| | | attempt to immediately reboot the local node, |
| | | falling back to stop on failure. The default is |
| | | likely to be changed to ``panic`` in a future |
| | | release. *(since 2.0.3)* |
+---------------------------+---------+----------------------------------------------------+
| priority-fencing-delay | 0 | .. index:: |
| | | pair: cluster option; priority-fencing-delay |
| | | |
| | | Apply this delay to any fencing targeting the lost |
| | | nodes with the highest total resource priority in |
| | | case we don't have the majority of the nodes in |
| | | our cluster partition, so that the more |
| | | significant nodes potentially win any fencing |
| | | match (especially meaningful in a split-brain of a |
| | | 2-node cluster). A promoted resource instance |
| | | takes the resource's priority plus 1 if the |
| | | resource's priority is not 0. Any static or random |
| | | delays introduced by ``pcmk_delay_base`` and |
| | | ``pcmk_delay_max`` configured for the |
| | | corresponding fencing resources will be added to |
| | | this delay. This delay should be significantly |
| | | greater than (safely twice) the maximum delay from |
| | | those parameters. *(since 2.0.4)* |
+---------------------------+---------+----------------------------------------------------+
| node-pending-timeout | 10min | .. index:: |
| | | pair: cluster option; node-pending-timeout |
| | | |
| | | A node that has joined the cluster can be pending |
| | | on joining the process group. We wait up to this |
| | | much time for it. If it times out, fencing |
| | | targeting the node will be issued if enabled. |
| | | *(since 2.1.7)* |
+---------------------------+---------+----------------------------------------------------+
| cluster-delay | 60s | .. index:: |
| | | pair: cluster option; cluster-delay |
| | | |
| | | Estimated maximum round-trip delay over the |
| | | network (excluding action execution). If the DC |
| | | requires an action to be executed on another node, |
| | | it will consider the action failed if it does not |
| | | get a response from the other node in this time |
| | | (after considering the action's own timeout). The |
| | | "correct" value will depend on the speed and load |
| | | of your network and cluster nodes. |
+---------------------------+---------+----------------------------------------------------+
| dc-deadtime | 20s | .. index:: |
| | | pair: cluster option; dc-deadtime |
| | | |
| | | How long to wait for a response from other nodes |
| | | during startup. The "correct" value will depend on |
| | | the speed/load of your network and the type of |
| | | switches used. |
+---------------------------+---------+----------------------------------------------------+
| cluster-ipc-limit | 500 | .. index:: |
| | | pair: cluster option; cluster-ipc-limit |
| | | |
| | | The maximum IPC message backlog before one cluster |
| | | daemon will disconnect another. This is of use in |
| | | large clusters, for which a good value is the |
| | | number of resources in the cluster multiplied by |
| | | the number of nodes. The default of 500 is also |
| | | the minimum. Raise this if you see |
| | | "Evicting client" messages for cluster daemon PIDs |
| | | in the logs. |
+---------------------------+---------+----------------------------------------------------+
| pe-error-series-max | -1 | .. index:: |
| | | pair: cluster option; pe-error-series-max |
| | | |
| | | The number of scheduler inputs resulting in errors |
| | | to save. Used when reporting problems. A value of |
| | | -1 means unlimited (report all), and 0 means none. |
+---------------------------+---------+----------------------------------------------------+
| pe-warn-series-max | 5000 | .. index:: |
| | | pair: cluster option; pe-warn-series-max |
| | | |
| | | The number of scheduler inputs resulting in |
| | | warnings to save. Used when reporting problems. A |
| | | value of -1 means unlimited (report all), and 0 |
| | | means none. |
+---------------------------+---------+----------------------------------------------------+
| pe-input-series-max | 4000 | .. index:: |
| | | pair: cluster option; pe-input-series-max |
| | | |
| | | The number of "normal" scheduler inputs to save. |
| | | Used when reporting problems. A value of -1 means |
| | | unlimited (report all), and 0 means none. |
+---------------------------+---------+----------------------------------------------------+
| enable-acl | false | .. index:: |
| | | pair: cluster option; enable-acl |
| | | |
| | | Whether :ref:`acl` should be used to authorize |
| | | modifications to the CIB |
+---------------------------+---------+----------------------------------------------------+
| placement-strategy | default | .. index:: |
| | | pair: cluster option; placement-strategy |
| | | |
- | | | How the cluster should allocate resources to nodes |
+ | | | How the cluster should assign resources to nodes |
| | | (see :ref:`utilization`). Allowed values are |
| | | ``default``, ``utilization``, ``balanced``, and |
| | | ``minimal``. |
+---------------------------+---------+----------------------------------------------------+
| node-health-strategy | none | .. index:: |
| | | pair: cluster option; node-health-strategy |
| | | |
| | | How the cluster should react to node health |
| | | attributes (see :ref:`node-health`). Allowed values|
| | | are ``none``, ``migrate-on-red``, ``only-green``, |
| | | ``progressive``, and ``custom``. |
+---------------------------+---------+----------------------------------------------------+
| node-health-base | 0 | .. index:: |
| | | pair: cluster option; node-health-base |
| | | |
| | | The base health score assigned to a node. Only |
| | | used when ``node-health-strategy`` is |
| | | ``progressive``. |
+---------------------------+---------+----------------------------------------------------+
| node-health-green | 0 | .. index:: |
| | | pair: cluster option; node-health-green |
| | | |
| | | The score to use for a node health attribute whose |
| | | value is ``green``. Only used when |
| | | ``node-health-strategy`` is ``progressive`` or |
| | | ``custom``. |
+---------------------------+---------+----------------------------------------------------+
| node-health-yellow | 0 | .. index:: |
| | | pair: cluster option; node-health-yellow |
| | | |
| | | The score to use for a node health attribute whose |
| | | value is ``yellow``. Only used when |
| | | ``node-health-strategy`` is ``progressive`` or |
| | | ``custom``. |
+---------------------------+---------+----------------------------------------------------+
| node-health-red | 0 | .. index:: |
| | | pair: cluster option; node-health-red |
| | | |
| | | The score to use for a node health attribute whose |
| | | value is ``red``. Only used when |
| | | ``node-health-strategy`` is ``progressive`` or |
| | | ``custom``. |
+---------------------------+---------+----------------------------------------------------+
| cluster-recheck-interval | 15min | .. index:: |
| | | pair: cluster option; cluster-recheck-interval |
| | | |
| | | Pacemaker is primarily event-driven, and looks |
| | | ahead to know when to recheck the cluster for |
| | | failure timeouts and most time-based rules |
| | | *(since 2.0.3)*. However, it will also recheck the |
| | | cluster after this amount of inactivity. This has |
| | | two goals: rules with ``date_spec`` are only |
| | | guaranteed to be checked this often, and it also |
| | | serves as a fail-safe for some kinds of scheduler |
| | | bugs. A value of 0 disables this polling; positive |
| | | values are a time interval. |
+---------------------------+---------+----------------------------------------------------+
| shutdown-lock | false | .. index:: |
| | | pair: cluster option; shutdown-lock |
| | | |
| | | The default of false allows active resources to be |
| | | recovered elsewhere when their node is cleanly |
| | | shut down, which is what the vast majority of |
| | | users will want. However, some users prefer to |
| | | make resources highly available only for failures, |
| | | with no recovery for clean shutdowns. If this |
| | | option is true, resources active on a node when it |
| | | is cleanly shut down are kept "locked" to that |
| | | node (not allowed to run elsewhere) until they |
| | | start again on that node after it rejoins (or for |
| | | at most ``shutdown-lock-limit``, if set). Stonith |
| | | resources and Pacemaker Remote connections are |
| | | never locked. Clone and bundle instances and the |
| | | promoted role of promotable clones are currently |
| | | never locked, though support could be added in a |
| | | future release. Locks may be manually cleared |
| | | using the ``--refresh`` option of ``crm_resource`` |
| | | (both the resource and node must be specified; |
| | | this works with remote nodes if their connection |
| | | resource's ``target-role`` is set to ``Stopped``, |
| | | but not if Pacemaker Remote is stopped on the |
| | | remote node without disabling the connection |
| | | resource). *(since 2.0.4)* |
+---------------------------+---------+----------------------------------------------------+
| shutdown-lock-limit | 0 | .. index:: |
| | | pair: cluster option; shutdown-lock-limit |
| | | |
| | | If ``shutdown-lock`` is true, and this is set to a |
| | | nonzero time duration, locked resources will be |
| | | allowed to start after this much time has passed |
| | | since the node shutdown was initiated, even if the |
| | | node has not rejoined. (This works with remote |
| | | nodes only if their connection resource's |
| | | ``target-role`` is set to ``Stopped``.) |
| | | *(since 2.0.4)* |
+---------------------------+---------+----------------------------------------------------+
| remove-after-stop | false | .. index:: |
| | | pair: cluster option; remove-after-stop |
| | | |
| | | *Deprecated* Should the cluster remove |
| | | resources from Pacemaker's executor after they are |
| | | stopped? Values other than the default are, at |
| | | best, poorly tested and potentially dangerous. |
| | | This option is deprecated and will be removed in a |
| | | future release. |
+---------------------------+---------+----------------------------------------------------+
| startup-fencing | true | .. index:: |
| | | pair: cluster option; startup-fencing |
| | | |
| | | *Advanced Use Only:* Should the cluster fence |
| | | unseen nodes at start-up? Setting this to false is |
| | | unsafe, because the unseen nodes could be active |
| | | and running resources but unreachable. |
+---------------------------+---------+----------------------------------------------------+
| election-timeout | 2min | .. index:: |
| | | pair: cluster option; election-timeout |
| | | |
| | | *Advanced Use Only:* If you need to adjust this |
| | | value, it probably indicates the presence of a bug.|
+---------------------------+---------+----------------------------------------------------+
| shutdown-escalation | 20min | .. index:: |
| | | pair: cluster option; shutdown-escalation |
| | | |
| | | *Advanced Use Only:* If you need to adjust this |
| | | value, it probably indicates the presence of a bug.|
+---------------------------+---------+----------------------------------------------------+
| join-integration-timeout | 3min | .. index:: |
| | | pair: cluster option; join-integration-timeout |
| | | |
| | | *Advanced Use Only:* If you need to adjust this |
| | | value, it probably indicates the presence of a bug.|
+---------------------------+---------+----------------------------------------------------+
| join-finalization-timeout | 30min | .. index:: |
| | | pair: cluster option; join-finalization-timeout |
| | | |
| | | *Advanced Use Only:* If you need to adjust this |
| | | value, it probably indicates the presence of a bug.|
+---------------------------+---------+----------------------------------------------------+
| transition-delay | 0s | .. index:: |
| | | pair: cluster option; transition-delay |
| | | |
| | | *Advanced Use Only:* Delay cluster recovery for |
| | | the configured interval to allow for additional or |
| | | related events to occur. This can be useful if |
| | | your configuration is sensitive to the order in |
| | | which ping updates arrive. Enabling this option |
| | | will slow down cluster recovery under all |
| | | conditions. |
+---------------------------+---------+----------------------------------------------------+
diff --git a/doc/sphinx/Pacemaker_Explained/utilization.rst b/doc/sphinx/Pacemaker_Explained/utilization.rst
index 93c67cdf31..87eef6021e 100644
--- a/doc/sphinx/Pacemaker_Explained/utilization.rst
+++ b/doc/sphinx/Pacemaker_Explained/utilization.rst
@@ -1,264 +1,264 @@
.. _utilization:
Utilization and Placement Strategy
----------------------------------
Pacemaker decides where to place a resource according to the resource
-allocation scores on every node. The resource will be allocated to the
+assignment scores on every node. The resource will be assigned to the
node where the resource has the highest score.
-If the resource allocation scores on all the nodes are equal, by the default
+If the resource assignment scores on all the nodes are equal, by the default
placement strategy, Pacemaker will choose a node with the least number of
-allocated resources for balancing the load. If the number of resources on each
+assigned resources for balancing the load. If the number of resources on each
node is equal, the first eligible node listed in the CIB will be chosen to run
the resource.
Often, in real-world situations, different resources use significantly
different proportions of a node's capacities (memory, I/O, etc.).
We cannot balance the load ideally just according to the number of resources
-allocated to a node. Besides, if resources are placed such that their combined
+assigned to a node. Besides, if resources are placed such that their combined
requirements exceed the provided capacity, they may fail to start completely or
run with degraded performance.
To take these factors into account, Pacemaker allows you to configure:
#. The capacity a certain node provides.
#. The capacity a certain resource requires.
#. An overall strategy for placement of resources.
Utilization attributes
######################
To configure the capacity that a node provides or a resource requires,
you can use *utilization attributes* in ``node`` and ``resource`` objects.
You can name utilization attributes according to your preferences and define as
many name/value pairs as your configuration needs. However, the attributes'
values must be integers.
.. topic:: Specifying CPU and RAM capacities of two nodes
.. code-block:: xml
.. topic:: Specifying CPU and RAM consumed by several resources
.. code-block:: xml
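As a minimal sketch of both cases, the CIB excerpt below gives two nodes and
three resources ``cpu`` and ``memory`` utilization attributes. The attribute
names, IDs, numeric values, and the ``ocf:pacemaker:Dummy`` agent are
illustrative assumptions rather than values from a real cluster; only the
``utilization``/``nvpair`` layout matters.

.. code-block:: xml

   <!-- Assumed example values: capacity that each node provides -->
   <nodes>
     <node id="1" uname="node1">
       <utilization id="node1-utilization">
         <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
         <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
       </utilization>
     </node>
     <node id="2" uname="node2">
       <utilization id="node2-utilization">
         <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
         <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
       </utilization>
     </node>
   </nodes>

   <!-- Assumed example values: capacity that each resource requires -->
   <resources>
     <primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
       <utilization id="rsc-small-utilization">
         <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
         <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
       </utilization>
     </primitive>
     <primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
       <utilization id="rsc-medium-utilization">
         <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
         <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
       </utilization>
     </primitive>
     <primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
       <utilization id="rsc-large-utilization">
         <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
         <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
       </utilization>
     </primitive>
   </resources>

Because the values must be integers, a unit (such as MB for ``memory``) is a
convention you choose and apply consistently to nodes and resources alike.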
A node is considered eligible for a resource if it has sufficient free
capacity to satisfy the resource's requirements. The nature of the required
or provided capacities is completely irrelevant to Pacemaker -- it just makes
sure that all capacity requirements of a resource are satisfied before placing
the resource on a node.
Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*.
These attributes are added to a ``transient_attributes`` section for the node
and are forgotten by the cluster when the node goes offline. The ``attrd_updater``
tool can be used to set these attributes.
.. topic:: Transient utilization attribute for node cluster-1
.. code-block:: xml
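A transient utilization attribute might appear in the node's status entry
roughly as sketched below. The IDs, the ``cpu`` attribute name, and its value
are assumptions for illustration; the cluster generates the actual IDs.

.. code-block:: xml

   <!-- Sketch only: IDs and the cpu value are assumed, not taken from a live CIB -->
   <transient_attributes id="cluster-1">
     <utilization id="status-cluster-1">
       <nvpair id="status-cluster-1-cpu" name="cpu" value="1"/>
     </utilization>
   </transient_attributes>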
.. note::
Utilization is supported for bundles *(since 2.1.3)*, but only for bundles
with an inner primitive. Any resource utilization values should be specified
for the inner primitive, but any priority meta-attribute should be specified
for the outer bundle.
Placement Strategy
##################
After you have configured the capacities your nodes provide and the
capacities your resources require, you need to set the ``placement-strategy``
in the global cluster options, otherwise the capacity configurations have
*no effect*.
Four values are available for the ``placement-strategy``:
* **default**
Utilization values are not taken into account at all.
- Resources are allocated according to allocation scores. If scores are equal,
+ Resources are assigned according to assignment scores. If scores are equal,
resources are evenly distributed across nodes.
* **utilization**
Utilization values are taken into account *only* when deciding whether a node
is considered eligible (i.e. whether it has sufficient free capacity to satisfy
the resource's requirements). Load-balancing is still done based on the
- number of resources allocated to a node.
+ number of resources assigned to a node.
* **balanced**
Utilization values are taken into account when deciding whether a node
is eligible to serve a resource *and* when load-balancing, so an attempt is
made to spread the resources in a way that optimizes resource performance.
* **minimal**
Utilization values are taken into account *only* when deciding whether a node
is eligible to serve a resource. For load-balancing, an attempt is made to
concentrate the resources on as few nodes as possible, thereby enabling
possible power savings on the remaining nodes.
Set ``placement-strategy`` with ``crm_attribute``:
.. code-block:: none
# crm_attribute --name placement-strategy --update balanced
Now Pacemaker will ensure the load from your resources will be distributed
evenly throughout the cluster, without the need for convoluted sets of
colocation constraints.
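For reference, the resulting cluster option would be stored in the CIB's
``crm_config`` section along the lines of the sketch below. The property set
ID ``cib-bootstrap-options`` and the ``nvpair`` ID are assumptions; they may
differ in your configuration.

.. code-block:: xml

   <crm_config>
     <!-- Assumed IDs; only the name/value pair is significant -->
     <cluster_property_set id="cib-bootstrap-options">
       <nvpair id="cib-bootstrap-options-placement-strategy"
               name="placement-strategy" value="balanced"/>
     </cluster_property_set>
   </crm_config>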
-Allocation Details
+Assignment Details
##################
-Which node is preferred to get consumed first when allocating resources?
-________________________________________________________________________
+Which node is preferred to get consumed first when assigning resources?
+_______________________________________________________________________
* The node with the highest node weight gets consumed first. Node weight
is a score maintained by the cluster to represent node health.
* If multiple nodes have the same node weight:
* If ``placement-strategy`` is ``default`` or ``utilization``,
- the node that has the least number of allocated resources gets consumed first.
+ the node that has the least number of assigned resources gets consumed first.
- * If their numbers of allocated resources are equal,
+ * If their numbers of assigned resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If ``placement-strategy`` is ``balanced``,
the node that has the most free capacity gets consumed first.
* If the free capacities of the nodes are equal,
- the node that has the least number of allocated resources gets consumed first.
+ the node that has the least number of assigned resources gets consumed first.
- * If their numbers of allocated resources are equal,
+ * If their numbers of assigned resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If ``placement-strategy`` is ``minimal``,
the first eligible node listed in the CIB gets consumed first.
Which node has more free capacity?
__________________________________
If only one type of utilization attribute has been defined, free capacity
is a simple numeric comparison.
If multiple types of utilization attributes have been defined, then
the node that is numerically highest in the most attribute types
has the most free capacity. For example:
* If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``,
then their free capacities are equal.
* If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory``
and ``storage``, then ``nodeB`` has more free capacity.
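To make the second case concrete, assume two nodes with no resources assigned
yet, so that free capacity equals configured capacity. All values and IDs
below are hypothetical. ``nodeA`` is numerically higher only for ``cpus``,
while ``nodeB`` is higher for both ``memory`` and ``storage``, so ``nodeB``
would be treated as having more free capacity.

.. code-block:: xml

   <nodes>
     <!-- Hypothetical values: nodeA leads in 1 of 3 attribute types -->
     <node id="1" uname="nodeA">
       <utilization id="nodeA-utilization">
         <nvpair id="nodeA-utilization-cpus" name="cpus" value="8"/>
         <nvpair id="nodeA-utilization-memory" name="memory" value="4096"/>
         <nvpair id="nodeA-utilization-storage" name="storage" value="100"/>
       </utilization>
     </node>
     <!-- Hypothetical values: nodeB leads in 2 of 3 attribute types -->
     <node id="2" uname="nodeB">
       <utilization id="nodeB-utilization">
         <nvpair id="nodeB-utilization-cpus" name="cpus" value="4"/>
         <nvpair id="nodeB-utilization-memory" name="memory" value="8192"/>
         <nvpair id="nodeB-utilization-storage" name="storage" value="200"/>
       </utilization>
     </node>
   </nodes>

Note that the size of each difference plays no role in this comparison; only
the number of attribute types in which a node is ahead counts.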
Which resource is preferred to be assigned first?
_________________________________________________
* The resource that has the highest ``priority`` (see :ref:`resource_options`) gets
- allocated first.
+ assigned first.
* If their priorities are equal, check whether they are already running. The
- resource that has the highest score on the node where it's running gets allocated
+ resource that has the highest score on the node where it's running gets assigned
first, to prevent resource shuffling.
* If the scores above are equal or the resources are not running, the resource that has
- the highest score on the preferred node gets allocated first.
+ the highest score on the preferred node gets assigned first.
* If the scores above are equal, the first runnable resource listed in the CIB
- gets allocated first.
+ gets assigned first.
Limitations and Workarounds
###########################
The type of problem Pacemaker is dealing with here is known as the
`knapsack problem `_ and falls into
the `NP-complete `_ category of computer
science problems -- a fancy way of saying "it takes a really long time
to solve".
Clearly, in an HA cluster, it's not acceptable to spend minutes, let alone hours
or days, finding an optimal solution while services remain unavailable.
So instead of trying to solve the problem completely, Pacemaker uses a
*best effort* algorithm for determining which node should host a particular
service. This means it arrives at a solution much faster than traditional
linear programming algorithms, but at the price of potentially leaving some
services stopped.
In the contrived example at the start of this chapter:
-* ``rsc-small`` would be allocated to ``node1``
+* ``rsc-small`` would be assigned to ``node1``
-* ``rsc-medium`` would be allocated to ``node2``
+* ``rsc-medium`` would be assigned to ``node2``
* ``rsc-large`` would remain inactive
This outcome is not ideal.
There are various approaches to dealing with the limitations of
pacemaker's placement strategy:
* **Ensure you have sufficient physical capacity.**
It might sound obvious, but if the physical capacity of your nodes is (close to)
maxed out by the cluster under normal conditions, then failover isn't going to
go well. Even without the utilization feature, you'll start hitting timeouts and
getting secondary failures.
* **Build some buffer into the capabilities advertised by the nodes.**
Advertise slightly more resources than we physically have, on the (usually valid)
assumption that a resource will not use 100% of the configured amount of
CPU, memory and so forth *all* the time. This practice is sometimes called *overcommit*.
* **Specify resource priorities.**
If the cluster is going to sacrifice services, it should be the ones you care
about (comparatively) the least. Ensure that resource priorities are properly set
so that your most important resources are scheduled first.
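As a minimal sketch of the last approach, a resource's ``priority``
meta-attribute could be raised as shown below. The IDs, the
``ocf:pacemaker:Dummy`` agent, and the value ``10`` are illustrative
assumptions; choose values that reflect the relative importance of your own
services.

.. code-block:: xml

   <!-- Assumed example: a higher priority than other resources means this one is assigned first -->
   <primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
     <meta_attributes id="rsc-large-meta_attributes">
       <nvpair id="rsc-large-meta_attributes-priority" name="priority" value="10"/>
     </meta_attributes>
   </primitive>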