diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst index 91862cd48d..d4f24fa18f 100644 --- a/doc/sphinx/Pacemaker_Development/components.rst +++ b/doc/sphinx/Pacemaker_Development/components.rst @@ -1,489 +1,489 @@ Coding Particular Pacemaker Components -------------------------------------- The Pacemaker code can be intricate and difficult to follow. This chapter has some high-level descriptions of how individual components work. .. index:: single: controller single: pacemaker-controld Controller ########## ``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons to orchestrate actions that need to be taken in the cluster. It receives CIB change notifications from the CIB manager, passes the new CIB to the scheduler to determine whether anything needs to be done, uses the executor and fencer to execute any actions required, and sets failure counts (among other things) via the attribute manager. As might be expected, it has the most code of any of the daemons. .. index:: single: join Join sequence _____________ Most daemons track their cluster peers using Corosync's membership and CPG only. The controller additionally requires peers to `join`, which ensures they are ready to be assigned tasks. Joining proceeds through a series of phases referred to as the `join sequence` or `join process`. A node's current join phase is tracked by the ``join`` member of ``crm_node_t`` (used in the peer cache). It is an ``enum crm_join_phase`` that (ideally) progresses from the DC's point of view as follows: * The node initially starts at ``crm_join_none`` * The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node proceeds to ``crm_join_welcomed``. This can happen in three ways: * The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at its controller startup, and the DC will reply to that with a join offer. 
* When the DC's peer status callback notices that the node has joined the messaging layer, it registers ``I_NODE_JOIN`` (which leads to ``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` -> ``join_make_offer()``). * After certain events (notably a new DC being elected), the DC will send all nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` -> ``do_dc_join_offer_all()``). These can overlap. The DC can send a join offer and the node can send a join announce at nearly the same time, so the node responds to the original join offer while the DC responds to the join announce with a new join offer. The situation resolves itself after looping a bit. * The node responds to join offers with a `join request` (``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and ``join_query_callback()``). When the DC receives the request, the node proceeds to ``crm_join_integrated`` (via ``do_dc_join_filter_offer()``). * As each node is integrated, the current best CIB is sync'ed to each integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``) and the node proceeds to ``crm_join_finalized`` (via ``finalize_sync_callback()`` + ``finalize_join_for()``). * Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via ``do_cl_join_finalize_respond()``), including its current resource operation history (via ``controld_query_executor_state()``). Once the DC receives this confirmation, the node proceeds to ``crm_join_confirmed`` via ``do_dc_join_ack()``. Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks for quorum and responds appropriately. When peers are lost, their join phase is reset to none (in various places). ``crm_update_peer_join()`` updates a node's join phase. The DC increments the global ``current_join_id`` for each joining round, and rejects any (older) replies that don't match. ..
index:: single: fencer single: pacemaker-fenced Fencer ###### ``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In the broadest terms, fencing works like this: #. The initiator (an external program such as ``stonith_admin``, or the cluster itself via the controller) asks the local fencer, "Hey, could you please fence this node?" #. The local fencer asks all the fencers in the cluster (including itself), "Hey, what fencing devices do you have access to that can fence this node?" #. Each fencer in the cluster replies with a list of available devices that it knows about. #. Once the original fencer gets all the replies, it asks the most appropriate fencer peer to actually carry out the fencing. It may send out more than one such request if the target node must be fenced with multiple devices. #. The chosen fencer(s) call the appropriate fencing resource agent(s) to do the fencing, then reply to the original fencer with the result. #. The original fencer broadcasts the result to all fencers. #. Each fencer sends the result to each of its local clients (including, at some point, the initiator). A more detailed description follows. .. index:: single: libstonithd Initiating a fencing request ____________________________ A fencing request can be initiated by the cluster or externally, using the libstonithd API. * The cluster always initiates fencing via ``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the ``fence()`` API method). This occurs when a transition graph synapse contains a ``CRM_OP_FENCE`` XML operation. * The main external clients are ``stonith_admin`` and ``cts-fence-helper``. The ``DLM`` project also uses Pacemaker for fencing. Highlights of the fencing API: * ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose ``cmds`` member has methods for connect, disconnect, fence, etc. 
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request with the desired action and target node. Callers do not have to choose or even have any knowledge about particular fencing devices. Fencing queries _______________ The function calls for a fencing request go something like this: The local fencer receives the client's request via an IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML request with the target, desired action, timeout, etc., then broadcasts the operation to the cluster group (i.e. all fencer instances) and starts a timer. The query is broadcast because (1) location constraints might prevent the local node from accessing the stonith device directly, and (2) even if the local node does have direct access, another node might be preferred to carry out the fencing. Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast request via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls * ``stonith_query()``, which calls * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to add device information to an XML reply and send it. (A message is considered a reply if it contains ``T_STONITH_REPLY``, which is only set by fencer peers, not clients.) The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for replies) calls * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls * ``process_remote_stonith_query()``, which allocates a new query result structure, parses device information into it, and adds it to the operation object.
It increments the number of replies received for this operation, and compares it against the expected number of replies (i.e. the number of active peers), and if this is the last expected reply, calls * ``request_peer_fencing()``, which calculates the timeout and sends ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target node has a fencing "topology" (which allows specifications such as "this node can be fenced either with device A, or devices B and C in combination"), it will choose the device(s), and send out as many requests as needed. If it chooses a device, it will choose the peer; a peer is preferred if it has "verified" access to the desired device, meaning that it has the device "running" on it and thus has a monitor operation ensuring reachability. Fencing operations __________________ Each ``STONITH_OP_FENCE`` request goes something like this: The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls * ``stonith_fence()``, which calls * ``schedule_stonith_command()`` (using supplied device if ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable device obtained via ``get_capable_devices()`` with ``stonith_fence_get_devices_cb()``), which adds the operation to the device's pending operations list and triggers processing. The chosen peer fencer's mainloop is triggered and calls * ``stonith_device_dispatch()``, which calls * ``stonith_device_execute()``, which pops off the next item from the device's pending operations list. If acting as the (internally implemented) watchdog agent, it panics the node, otherwise it calls * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to call the fencing agent. 
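The scheduling and dispatch steps above amount to a per-device FIFO of pending commands. The following is a minimal C sketch of that idea, not the fencer's actual implementation. Every name in it (``sketch_cmd_t``, ``sketch_device_t``, ``sketch_schedule()``, ``sketch_dispatch()``, ``sketch_done()``) is hypothetical, it uses a hand-rolled linked list instead of the fencer's own data structures, and the rule that a device runs only one command at a time is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued fencing command (not a Pacemaker type) */
typedef struct sketch_cmd {
    const char *action;          /* e.g. "reboot" or "off" */
    const char *target;          /* node to be fenced */
    struct sketch_cmd *next;
} sketch_cmd_t;

/* Hypothetical stand-in for a fencing device with a FIFO of pending commands */
typedef struct {
    const char *id;
    sketch_cmd_t *head;          /* next command to execute */
    sketch_cmd_t *tail;          /* most recently scheduled command */
    int active;                  /* nonzero while an agent call is in flight */
} sketch_device_t;

/* Append a command to the device's pending list (the "schedule" step).
 * Returns 0 on success, -1 if memory allocation fails. */
static int
sketch_schedule(sketch_device_t *dev, const char *action, const char *target)
{
    sketch_cmd_t *cmd = calloc(1, sizeof(sketch_cmd_t));

    if (cmd == NULL) {
        return -1;
    }
    cmd->action = action;
    cmd->target = target;
    if (dev->tail == NULL) {
        dev->head = cmd;
    } else {
        dev->tail->next = cmd;
    }
    dev->tail = cmd;
    return 0;
}

/* Pop the oldest pending command (the "dispatch" step). Returns NULL if the
 * device is busy or has nothing queued. The caller passes the result to
 * sketch_done() once the (assumed) agent call has finished. */
static sketch_cmd_t *
sketch_dispatch(sketch_device_t *dev)
{
    sketch_cmd_t *cmd = dev->head;

    if (dev->active || (cmd == NULL)) {
        return NULL;
    }
    dev->head = cmd->next;
    if (dev->head == NULL) {
        dev->tail = NULL;
    }
    cmd->next = NULL;
    dev->active = 1;
    return cmd;
}

/* Release a finished command so the next one can be dispatched */
static void
sketch_done(sketch_device_t *dev, sketch_cmd_t *cmd)
{
    assert(dev->active);
    free(cmd);
    dev->active = 0;
}
```

With two commands scheduled, the first ``sketch_dispatch()`` call returns the older one, a second call returns ``NULL`` until ``sketch_done()`` has been called for the first, and only then is the next command handed out.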
The chosen peer fencer's mainloop is triggered again once the fencing agent returns, and calls * ``stonith_action_async_done()`` which adds the results to an action object then calls its * done callback (``st_child_done()``), which calls ``schedule_stonith_command()`` for a new device if there are further required actions to execute or if the original action failed, then builds and sends an XML reply to the original fencer (via ``send_async_reply()``), then checks whether any pending actions are the same as the one just executed and merges them if so. Fencing replies _______________ The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for replies) calls * ``handle_reply()``, which calls * ``fenced_process_fencing_reply()``, which calls either ``request_peer_fencing()`` (to retry a failed operation, or try the next device in a topology if appropriate, which issues a new ``STONITH_OP_FENCE`` request, proceeding as before) or ``finalize_op()`` (if the operation is definitively failed or successful). * ``finalize_op()`` broadcasts the result to all peers. Finally, all peers receive the broadcast result and call * ``finalize_op()``, which sends the result to all local clients. .. index:: single: fence history Fencing History _______________ The fencer keeps a running history of all fencing operations. The bulk of the relevant code is in ``fenced_history.c`` and ensures the history is synchronized across all nodes even if a node leaves and rejoins the cluster. In libstonithd, this information is represented by ``stonith_history_t`` and is queryable by the ``stonith_api_operations_t:history()`` method. ``crm_mon`` and ``stonith_admin`` use this API to display the history. ..
index:: single: scheduler single: pacemaker-schedulerd single: libpe_status single: libpe_rules single: libpacemaker Scheduler ######### ``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker scheduler for the controller, but "the scheduler" in general refers to related library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``). The purpose of the scheduler is to take a CIB as input and generate a transition graph (list of actions that need to be taken) as output. The controller invokes the scheduler by contacting the scheduler daemon via local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource`` can also invoke the scheduler, but do so by calling the library functions directly. This allows them to run using a ``CIB_file`` without the cluster needing to be active. The main entry point for the scheduler code is -``lib/pacemaker/pcmk_sched_allocate.c:pcmk__schedule_actions()``. It sets +``lib/pacemaker/pcmk_scheduler.c:pcmk__schedule_actions()``. It sets defaults and calls a series of functions for the scheduling. Some key steps: * ``unpack_cib()`` parses most of the CIB XML into data structures, and determines the current cluster status. * ``apply_node_criteria()`` applies factors that make resources prefer certain nodes, such as shutdown locks, location constraints, and stickiness. * ``pcmk__create_internal_constraints()`` creates internal constraints, such as the implicit ordering for group members, or start actions being implicitly ordered before promote actions. * ``pcmk__handle_rsc_config_changes()`` processes resource history entries in the CIB status section. This is used to decide whether certain actions need to be done, such as deleting orphan resources, forcing a restart when a resource definition changes, etc. * ``assign_resources()`` assigns resources to nodes. 
* ``schedule_resource_actions()`` schedules resource-specific actions (which might or might not end up in the final graph). * ``pcmk__apply_orderings()`` processes ordering constraints in order to modify action attributes such as optional or required. * ``pcmk__create_graph()`` creates the transition graph. Challenges __________ Working with the scheduler is difficult. Challenges include: * It is far too much code to keep more than a small portion in your head at one time. * Small changes can have large (and unexpected) effects. This is why we have a large number of regression tests (``cts/cts-scheduler``), which should be run after making code changes. * It produces an enormous number of log messages at debug and trace levels. You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to enable trace-level messages only when related to specific resources. * Different parts of the main ``pe_working_set_t`` structure are finalized at different points in the scheduling process, so you have to keep in mind whether information you're using at one point of the code can possibly change later. For example, data unpacked from the CIB can safely be used anytime after ``unpack_cib()``, but actions may become optional or required anytime before ``pcmk__create_graph()``. There's no easy way to deal with this. * Many names of struct members, functions, etc., are suboptimal, but are part of the public API and cannot be changed until an API backward compatibility break. .. index:: single: pe_working_set_t Cluster Working Set ___________________ The main data object for the scheduler is ``pe_working_set_t``, which contains all information needed about nodes, resources, constraints, etc., both as the raw CIB XML and parsed into more usable data structures, plus the resulting transition graph XML. The variable name is usually ``data_set``. .. index:: single: pe_resource_t Resources _________ ``pe_resource_t`` is the data object representing cluster resources.
A resource has a variant: primitive (a.k.a. native), group, clone, or bundle. The resource object has members for two sets of methods, ``resource_object_functions_t`` from the ``libpe_status`` public API, and ``resource_alloc_functions_t`` whose implementation is internal to ``libpacemaker``. The actual functions vary by variant. The object functions have basic capabilities such as unpacking the resource XML, and determining the current or planned location of the resource. -The allocation functions have more obscure capabilities needed for scheduling, +The assignment functions have more obscure capabilities needed for scheduling, such as processing location and ordering constraints. For example, ``pcmk__create_internal_constraints()`` simply calls the ``internal_constraints()`` method for each top-level resource in the cluster. .. index:: single: pe_node_t Nodes _____ -Allocation of resources to nodes is done by choosing the node with the highest +Assignment of resources to nodes is done by choosing the node with the highest score for a given resource. The scheduler does a bunch of processing to -generate the scores, then the actual allocation is straightforward. +generate the scores, then the actual assignment is straightforward. Node lists are frequently used. For example, ``pe_working_set_t`` has a ``nodes`` member which is a list of all nodes in the cluster, and ``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on which the resource is (or might be) active. These are lists of ``pe_node_t`` objects. The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details`` member -with all node information that is independent of resource allocation (the node +with all node information that is independent of resource assignment (the node name, etc.). The working set's ``nodes`` member contains the original of this information. 
All other node lists contain copies of ``pe_node_t`` where only the ``details`` member points to the originals in the working set's ``nodes`` list. In this way, the other members of ``pe_node_t`` (such as ``weight``, which is the node score) may vary by node list, while the common details are shared. .. index:: single: pe_action_t single: pe_action_flags Actions _______ ``pe_action_t`` is the data object representing actions that might need to be taken. These could be resource actions, cluster-wide actions such as fencing a node, or "pseudo-actions" which are abstractions used as convenient points for ordering other actions against. It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The most important of these are ``pe_action_runnable`` (if not set, the action is "blocked" and cannot be added to the transition graph) and ``pe_action_optional`` (actions with this set will not be added to the transition graph; actions often start out as optional, and may become required later). .. index:: single: pcmk__colocation_t Colocations ___________ ``pcmk__colocation_t`` is the data object representing colocations.
Colocation constraints come into play in these parts of the scheduler code: * When sorting resources for assignment, so resources with the highest node score are assigned first (see ``cmp_resources()``) * When updating node scores for resource assignment or promotion priority * When assigning resources, so any resources to be colocated with can be assigned first, and so colocations affect where the resource is assigned * When choosing roles for promotable clone instances, so colocations involving a specific role can affect which instances are promoted -The resource allocation functions have several methods related to colocations: +The resource assignment functions have several methods related to colocations: * ``apply_coloc_score()``: This applies a colocation's score to either the dependent's allowed node scores (if called while resources are being assigned) or the dependent's priority (if called while choosing promotable instance roles). It can behave differently depending on whether it is being called as the primary's method or as the dependent's method. * ``add_colocated_node_scores()``: This updates a table of nodes for a given colocation attribute and score. It goes through colocations involving a given resource, and updates the scores of the nodes in the table with the best scores of nodes that match up according to the colocation criteria. * ``colocated_resources()``: This generates a list of all resources involved in mandatory colocations (directly or indirectly via colocation chains) with a given resource. .. index:: single: pe__ordering_t single: pe_ordering Orderings _________ Ordering constraints are simple in concept, but they are one of the most important, powerful, and difficult-to-follow aspects of the scheduler code. ``pe__ordering_t`` is the data object representing an ordering, better thought of as a relationship between two actions, since the relation can be more complex than just "this one runs after that one".
For an ordering "A then B", the code generally refers to A as "first" or "before", and B as "then" or "after". Much of the power comes from ``enum pe_ordering``, whose values are flags that determine how an ordering behaves. There are many obscure flags with big effects. A few examples: * ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0, meaning no flags set, so it must be compared with equality rather than ``pcmk_is_set()``. * ``pe_order_optional`` means the ordering does not make either action required, so it only applies if they both become required for other reasons. * ``pe_order_implies_first`` means that if action B becomes required for any reason, then action A will become required as well. diff --git a/doc/sphinx/Pacemaker_Explained/advanced-resources.rst b/doc/sphinx/Pacemaker_Explained/advanced-resources.rst index a61b76db2f..07583507a4 100644 --- a/doc/sphinx/Pacemaker_Explained/advanced-resources.rst +++ b/doc/sphinx/Pacemaker_Explained/advanced-resources.rst @@ -1,1629 +1,1629 @@ Advanced Resource Types ----------------------- .. index:: single: group resource single: resource; group .. _group-resources: Groups - A Syntactic Shortcut ############################# One of the most common elements of a cluster is a set of resources that need to be located together, start sequentially, and stop in the reverse order. To simplify this configuration, we support the concept of groups. .. topic:: A group of two primitive resources .. code-block:: xml Although the example above contains only two resources, there is no limit to the number of resources a group can contain. The example is also sufficient to explain the fundamental properties of a group: * Resources are started in the order they appear in (**Public-IP** first, then **Email**) * Resources are stopped in the reverse of the order in which they appear (**Email** first, then **Public-IP**) If a resource in the group can't run anywhere, then nothing after it is allowed to run either.
* If **Public-IP** can't run anywhere, neither can **Email**; * but if **Email** can't run anywhere, this does not affect **Public-IP** in any way The group above is logically equivalent to writing: .. topic:: How the cluster sees a group resource .. code-block:: xml Obviously as the group grows bigger, the reduced configuration effort can become significant. Another (typical) example of a group is a DRBD volume, the filesystem mount, an IP address, and an application that uses them. .. index:: pair: XML element; group Group Properties ________________ .. table:: **Properties of a Group Resource** :widths: 1 4 +-------------+------------------------------------------------------------------+ | Field | Description | +=============+==================================================================+ | id | .. index:: | | | single: group; property, id | | | single: property; id (group) | | | single: id; group property | | | | | | A unique name for the group | +-------------+------------------------------------------------------------------+ | description | .. index:: | | | single: group; attribute, description | | | single: attribute; description (group) | | | single: description; group attribute | | | | | | An optional description of the group, for the user's own | | | purposes. | | | E.g. ``resources needed for website`` | +-------------+------------------------------------------------------------------+ Group Options _____________ Groups inherit the ``priority``, ``target-role``, and ``is-managed`` properties from primitive resources. See :ref:`resource_options` for information about those properties. Group Instance Attributes _________________________ Groups have no instance attributes. However, any that are set for the group object will be inherited by the group's children. Group Contents ______________ Groups may only contain a collection of cluster resources (see :ref:`primitive-resource`). 
To refer to a child of a group resource, just use the child's ``id`` instead of the group's. Group Constraints _________________ Although it is possible to reference a group's children in constraints, it is usually preferable to reference the group itself. .. topic:: Some constraints involving groups .. code-block:: xml .. index:: pair: resource-stickiness; group Group Stickiness ________________ Stickiness, the measure of how much a resource wants to stay where it is, is additive in groups. Every active resource of the group will contribute its stickiness value to the group's total. So if the default ``resource-stickiness`` is 100, and a group has seven members, five of which are active, then the group as a whole will prefer its current location with a score of 500. .. index:: single: clone single: resource; clone .. _s-resource-clone: Clones - Resources That Can Have Multiple Active Instances ########################################################## *Clone* resources are resources that can have more than one copy active at the same time. This allows you, for example, to run a copy of a daemon on every node. You can clone any primitive or group resource [#]_. Anonymous versus Unique Clones ______________________________ A clone resource is configured to be either *anonymous* or *globally unique*. Anonymous clones are the simplest. These behave completely identically everywhere they are running. Because of this, there can be only one instance of an anonymous clone active per node. The instances of globally unique clones are distinct entities. All instances are launched identically, but one instance of the clone is not identical to any other instance, whether running on the same node or a different node. As an example, a cloned IP address can use special kernel functionality such that each instance handles a subset of requests for the same IP address. .. index:: single: promotable clone single: resource; promotable .. 
_s-resource-promotable: Promotable clones _________________ If a clone is *promotable*, its instances can perform a special role that Pacemaker will manage via the ``promote`` and ``demote`` actions of the resource agent. Services that support such a special role have various terms for the special role and the default role: primary and secondary, master and replica, controller and worker, etc. Pacemaker uses the terms *promoted* and *unpromoted* to be agnostic to what the service calls them or what they do. All that Pacemaker cares about is that an instance comes up in the unpromoted role when started, and the resource agent supports the ``promote`` and ``demote`` actions to manage entering and exiting the promoted role. .. index:: pair: XML element; clone Clone Properties ________________ .. table:: **Properties of a Clone Resource** :widths: 1 4 +-------------+------------------------------------------------------------------+ | Field | Description | +=============+==================================================================+ | id | .. index:: | | | single: clone; property, id | | | single: property; id (clone) | | | single: id; clone property | | | | | | A unique name for the clone | +-------------+------------------------------------------------------------------+ | description | .. index:: | | | single: clone; attribute, description | | | single: attribute; description (clone) | | | single: description; clone attribute | | | | | | An optional description of the clone, for the user's own | | | purposes. | | | E.g. ``IP address for website`` | +-------------+------------------------------------------------------------------+ .. index:: pair: options; clone Clone Options _____________ :ref:`Options ` inherited from primitive resources: ``priority, target-role, is-managed`` .. 
table:: **Clone-specific configuration options** :class: longtable :widths: 1 1 3 +-------------------+-----------------+-------------------------------------------------------+ | Field | Default | Description | +===================+=================+=======================================================+ | globally-unique | false | .. index:: | | | | single: clone; option, globally-unique | | | | single: option; globally-unique (clone) | | | | single: globally-unique; clone option | | | | | | | | If **true**, each clone instance performs a | | | | distinct function | +-------------------+-----------------+-------------------------------------------------------+ | clone-max | 0 | .. index:: | | | | single: clone; option, clone-max | | | | single: option; clone-max (clone) | | | | single: clone-max; clone option | | | | | | | | The maximum number of clone instances that can | | | | be started across the entire cluster. If 0, the | | | | number of nodes in the cluster will be used. | +-------------------+-----------------+-------------------------------------------------------+ | clone-node-max | 1 | .. index:: | | | | single: clone; option, clone-node-max | | | | single: option; clone-node-max (clone) | | | | single: clone-node-max; clone option | | | | | | | | If ``globally-unique`` is **true**, the maximum | | | | number of clone instances that can be started | | | | on a single node | +-------------------+-----------------+-------------------------------------------------------+ | clone-min | 0 | .. index:: | | | | single: clone; option, clone-min | | | | single: option; clone-min (clone) | | | | single: clone-min; clone option | | | | | | | | Require at least this number of clone instances | | | | to be runnable before allowing resources | | | | depending on the clone to be runnable. A value | | | | of 0 means require all clone instances to be | | | | runnable. 
| +-------------------+-----------------+-------------------------------------------------------+ | notify | false | .. index:: | | | | single: clone; option, notify | | | | single: option; notify (clone) | | | | single: notify; clone option | | | | | | | | Call the resource agent's **notify** action for | | | | all active instances, before and after starting | | | | or stopping any clone instance. The resource | | | | agent must support this action. | | | | Allowed values: **false**, **true** | +-------------------+-----------------+-------------------------------------------------------+ | ordered | false | .. index:: | | | | single: clone; option, ordered | | | | single: option; ordered (clone) | | | | single: ordered; clone option | | | | | | | | If **true**, clone instances must be started | | | | sequentially instead of in parallel. | | | | Allowed values: **false**, **true** | +-------------------+-----------------+-------------------------------------------------------+ | interleave | false | .. index:: | | | | single: clone; option, interleave | | | | single: option; interleave (clone) | | | | single: interleave; clone option | | | | | | | | When this clone is ordered relative to another | | | | clone, if this option is **false** (the default), | | | | the ordering is relative to *all* instances of | | | | the other clone, whereas if this option is | | | | **true**, the ordering is relative only to | | | | instances on the same node. | | | | Allowed values: **false**, **true** | +-------------------+-----------------+-------------------------------------------------------+ | promotable | false | .. index:: | | | | single: clone; option, promotable | | | | single: option; promotable (clone) | | | | single: promotable; clone option | | | | | | | | If **true**, clone instances can perform a | | | | special role that Pacemaker will manage via the | | | | resource agent's **promote** and **demote** | | | | actions. 
The resource agent must support these | | | | actions. | | | | Allowed values: **false**, **true** | +-------------------+-----------------+-------------------------------------------------------+ | promoted-max | 1 | .. index:: | | | | single: clone; option, promoted-max | | | | single: option; promoted-max (clone) | | | | single: promoted-max; clone option | | | | | | | | If ``promotable`` is **true**, the number of | | | | instances that can be promoted at one time | | | | across the entire cluster | +-------------------+-----------------+-------------------------------------------------------+ | promoted-node-max | 1 | .. index:: | | | | single: clone; option, promoted-node-max | | | | single: option; promoted-node-max (clone) | | | | single: promoted-node-max; clone option | | | | | | | | If ``promotable`` is **true** and ``globally-unique`` | | | | is **false**, the number of clone instances that | | | | can be promoted at one time on a single node | +-------------------+-----------------+-------------------------------------------------------+ .. note:: **Deprecated Terminology** In older documentation and online examples, you may see promotable clones referred to as *multi-state*, *stateful*, or *master/slave*; these mean the same thing as *promotable*. Certain syntax is supported for backward compatibility, but is deprecated and will be removed in a future version: * Using a ``master`` tag, instead of a ``clone`` tag with the ``promotable`` meta-attribute set to ``true`` * Using the ``master-max`` meta-attribute instead of ``promoted-max`` * Using the ``master-node-max`` meta-attribute instead of ``promoted-node-max`` * Using ``Master`` as a role name instead of ``Promoted`` * Using ``Slave`` as a role name instead of ``Unpromoted`` Clone Contents ______________ Clones must contain exactly one primitive or group resource. .. topic:: A clone that runs a web server on all nodes .. code-block:: xml ..
warning:: You should never reference the name of a clone's child (the primitive or group resource being cloned). If you think you need to do this, you probably need to re-evaluate your design. Clone Instance Attribute ________________________ Clones have no instance attributes; however, any that are set here will be inherited by the clone's child. .. index:: single: clone; constraint Clone Constraints _________________ In most cases, a clone will have a single instance on each active cluster node. If this is not the case, you can indicate which nodes the cluster should preferentially assign copies to with resource location constraints. These constraints are written no differently from those for primitive resources except that the clone's **id** is used. .. topic:: Some constraints involving clones .. code-block:: xml Ordering constraints behave slightly differently for clones. In the example above, ``apache-stats`` will wait until all copies of ``apache-clone`` that need to be started have done so before being started itself. Only if *no* copies can be started will ``apache-stats`` be prevented from being active. Additionally, the clone will wait for ``apache-stats`` to be stopped before stopping itself. Colocation of a primitive or group resource with a clone means that the resource can run on any node with an active instance of the clone. The cluster will choose an instance based on where the clone is running and the resource's own location preferences. Colocation between clones is also possible. If one clone **A** is colocated with another clone **B**, the set of allowed locations for **A** is limited to nodes on which **B** is (or will be) active. Placement is then performed normally. .. index:: single: promotable clone; constraint .. 
_promotable-clone-constraints: Promotable Clone Constraints ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For promotable clone resources, the ``first-action`` and/or ``then-action`` fields for ordering constraints may be set to ``promote`` or ``demote`` to constrain the promoted role, and colocation constraints may contain ``rsc-role`` and/or ``with-rsc-role`` fields. .. topic:: Constraints involving promotable clone resources .. code-block:: xml In the example above, **myApp** will wait until one of the database copies has been started and promoted before being started itself on the same node. Only if no copies can be promoted will **myApp** be prevented from being active. Additionally, the cluster will wait for **myApp** to be stopped before demoting the database. Colocation of a primitive or group resource with a promotable clone resource means that it can run on any node with an active instance of the promotable clone resource that has the specified role (``Promoted`` or ``Unpromoted``). In the example above, the cluster will choose a location based on where database is running in the promoted role, and if there are multiple promoted instances it will also factor in **myApp**'s own location preferences when deciding which location to choose. Colocation with regular clones and other promotable clone resources is also possible. In such cases, the set of allowed locations for the **rsc** clone is (after role filtering) limited to nodes on which the ``with-rsc`` promotable clone resource is (or will be) in the specified role. Placement is then performed as normal. Using Promotable Clone Resources in Colocation Sets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a promotable clone is used in a :ref:`resource set ` inside a colocation constraint, the resource set may take a ``role`` attribute. In the following example, an instance of **B** may be promoted only on a node where **A** is in the promoted role. 
Additionally, resources **C** and **D** must be located on a node where both **A** and **B** are promoted. .. topic:: Colocate C and D with A's and B's promoted instances .. code-block:: xml Using Promotable Clone Resources in Ordered Sets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a promotable clone is used in a :ref:`resource set ` inside an ordering constraint, the resource set may take an ``action`` attribute. .. topic:: Start C and D after first promoting A and B .. code-block:: xml In the above example, **B** cannot be promoted until **A** has been promoted. Additionally, resources **C** and **D** must wait until **A** and **B** have been promoted before they can start. .. index:: pair: resource-stickiness; clone .. _s-clone-stickiness: Clone Stickiness ________________ -To achieve a stable allocation pattern, clones are slightly sticky by -default. If no value for ``resource-stickiness`` is provided, the clone -will use a value of 1. Being a small value, it causes minimal -disturbance to the score calculations of other resources but is enough -to prevent Pacemaker from needlessly moving copies around the cluster. +To achieve stable assignments, clones are slightly sticky by default. If no +value for ``resource-stickiness`` is provided, the clone will use a value of 1. +Being a small value, it causes minimal disturbance to the score calculations of +other resources but is enough to prevent Pacemaker from needlessly moving +instances around the cluster. .. note:: For globally unique clones, this may result in multiple instances of the clone staying on a single node, even after another eligible node becomes active (for example, after being put into standby mode then made active again). If you do not want this behavior, specify a ``resource-stickiness`` of 0 for the clone temporarily and let the cluster adjust, then set it back to 1 if you want the default behavior to apply again. .. 
important:: If ``resource-stickiness`` is set in the ``rsc_defaults`` section, it will apply to clone instances as well. This means an explicit ``resource-stickiness`` of 0 in ``rsc_defaults`` works differently from the implicit default used when ``resource-stickiness`` is not specified. Clone Resource Agent Requirements _________________________________ Any resource can be used as an anonymous clone, as it requires no additional support from the resource agent. Whether it makes sense to do so depends on your resource and its resource agent. Resource Agent Requirements for Globally Unique Clones ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Globally unique clones require additional support in the resource agent. In particular, it must only respond with ``${OCF_SUCCESS}`` if the node has that exact instance active. All other probes for instances of the clone should result in ``${OCF_NOT_RUNNING}`` (or one of the other OCF error codes if they are failed). Individual instances of a clone are identified by appending a colon and a numerical offset, e.g. **apache:2**. Resource agents can find out how many copies there are by examining the ``OCF_RESKEY_CRM_meta_clone_max`` environment variable and which instance it is by examining ``OCF_RESKEY_CRM_meta_clone``. The resource agent must not make any assumptions (based on ``OCF_RESKEY_CRM_meta_clone``) about which numerical instances are active. In particular, the list of active copies will not always be an unbroken sequence, nor always start at 0. Resource Agent Requirements for Promotable Clones ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Promotable clone resources require two extra actions, ``demote`` and ``promote``, which are responsible for changing the state of the resource. Like **start** and **stop**, they should return ``${OCF_SUCCESS}`` if they completed successfully or a relevant error code if they did not. 
The states can mean whatever you wish, but when the resource is started, it must come up in the unpromoted role. From there, the cluster will decide which instances to promote. In addition to the clone requirements for monitor actions, agents must also *accurately* report which state they are in. The cluster relies on the agent to report its status (including role) accurately and does not indicate to the agent what role it currently believes it to be in. .. table:: **Role implications of OCF return codes** :widths: 1 3 +----------------------+--------------------------------------------------+ | Monitor Return Code | Description | +======================+==================================================+ | OCF_NOT_RUNNING | .. index:: | | | single: OCF_NOT_RUNNING | | | single: OCF return code; OCF_NOT_RUNNING | | | | | | Stopped | +----------------------+--------------------------------------------------+ | OCF_SUCCESS | .. index:: | | | single: OCF_SUCCESS | | | single: OCF return code; OCF_SUCCESS | | | | | | Running (Unpromoted) | +----------------------+--------------------------------------------------+ | OCF_RUNNING_PROMOTED | .. index:: | | | single: OCF_RUNNING_PROMOTED | | | single: OCF return code; OCF_RUNNING_PROMOTED | | | | | | Running (Promoted) | +----------------------+--------------------------------------------------+ | OCF_FAILED_PROMOTED | .. index:: | | | single: OCF_FAILED_PROMOTED | | | single: OCF return code; OCF_FAILED_PROMOTED | | | | | | Failed (Promoted) | +----------------------+--------------------------------------------------+ | Other | .. 
index:: | | | single: return code | | | | | | Failed (Unpromoted) | +----------------------+--------------------------------------------------+ Clone Notifications ~~~~~~~~~~~~~~~~~~~ If the clone has the ``notify`` meta-attribute set to **true**, and the resource agent supports the ``notify`` action, Pacemaker will call the action when appropriate, passing a number of extra variables which, when combined with additional context, can be used to calculate the current state of the cluster and what is about to happen to it. .. index:: single: clone; environment variables single: notify; environment variables .. table:: **Environment variables supplied with Clone notify actions** :widths: 1 1 +----------------------------------------------+-------------------------------------------------------------------------------+ | Variable | Description | +==============================================+===============================================================================+ | OCF_RESKEY_CRM_meta_notify_type | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_type | | | single: OCF_RESKEY_CRM_meta_notify_type | | | | | | Allowed values: **pre**, **post** | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_operation | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_operation | | | single: OCF_RESKEY_CRM_meta_notify_operation | | | | | | Allowed values: **start**, **stop** | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_start_resource | .. 
index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_start_resource | | | single: OCF_RESKEY_CRM_meta_notify_start_resource | | | | | | Resources to be started | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_stop_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_stop_resource | | | single: OCF_RESKEY_CRM_meta_notify_stop_resource | | | | | | Resources to be stopped | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_active_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_active_resource | | | single: OCF_RESKEY_CRM_meta_notify_active_resource | | | | | | Resources that are running | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_inactive_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_inactive_resource | | | single: OCF_RESKEY_CRM_meta_notify_inactive_resource | | | | | | Resources that are not running | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_start_uname | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_start_uname | | | single: OCF_RESKEY_CRM_meta_notify_start_uname | | | | | | Nodes on which resources will be started | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_stop_uname | .. 
index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_stop_uname | | | single: OCF_RESKEY_CRM_meta_notify_stop_uname | | | | | | Nodes on which resources will be stopped | +----------------------------------------------+-------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_active_uname | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_active_uname | | | single: OCF_RESKEY_CRM_meta_notify_active_uname | | | | | | Nodes on which resources are running | +----------------------------------------------+-------------------------------------------------------------------------------+ The variables come in pairs, such as ``OCF_RESKEY_CRM_meta_notify_start_resource`` and ``OCF_RESKEY_CRM_meta_notify_start_uname``, and should be treated as an array of whitespace-separated elements. ``OCF_RESKEY_CRM_meta_notify_inactive_resource`` is an exception, as the matching **uname** variable does not exist since inactive resources are not running on any node. Thus, in order to indicate that **clone:0** will be started on **sles-1**, **clone:2** will be started on **sles-3**, and **clone:3** will be started on **sles-2**, the cluster would set: .. topic:: Notification variables .. code-block:: none OCF_RESKEY_CRM_meta_notify_start_resource="clone:0 clone:2 clone:3" OCF_RESKEY_CRM_meta_notify_start_uname="sles-1 sles-3 sles-2" .. note:: Pacemaker will log but otherwise ignore failures of notify actions. 
Interpretation of Notification Variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Pre-notification (stop):** * Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` **Post-notification (stop) / Pre-notification (start):** * Active resources * ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Inactive resources * ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` **Post-notification (start):** * Active resources: * ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Inactive resources: * ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` Extra Notifications for Promotable Clones ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. index:: single: clone; environment variables single: promotable; environment variables .. 
table:: **Extra environment variables supplied for promotable clones** :widths: 1 1 +------------------------------------------------+---------------------------------------------------------------------------------+ | Variable | Description | +================================================+=================================================================================+ | OCF_RESKEY_CRM_meta_notify_promoted_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_promoted_resource | | | single: OCF_RESKEY_CRM_meta_notify_promoted_resource | | | | | | Resources that are running in the promoted role | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_unpromoted_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_unpromoted_resource | | | single: OCF_RESKEY_CRM_meta_notify_unpromoted_resource | | | | | | Resources that are running in the unpromoted role | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_promote_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_promote_resource | | | single: OCF_RESKEY_CRM_meta_notify_promote_resource | | | | | | Resources to be promoted | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_demote_resource | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_demote_resource | | | single: OCF_RESKEY_CRM_meta_notify_demote_resource | | | | | | Resources to be demoted | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_promote_uname | .. 
index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_promote_uname | | | single: OCF_RESKEY_CRM_meta_notify_promote_uname | | | | | | Nodes on which resources will be promoted | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_demote_uname | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_demote_uname | | | single: OCF_RESKEY_CRM_meta_notify_demote_uname | | | | | | Nodes on which resources will be demoted | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_promoted_uname | .. index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_promoted_uname | | | single: OCF_RESKEY_CRM_meta_notify_promoted_uname | | | | | | Nodes on which resources are running in the promoted role | +------------------------------------------------+---------------------------------------------------------------------------------+ | OCF_RESKEY_CRM_meta_notify_unpromoted_uname | .. 
index:: | | | single: environment variable; OCF_RESKEY_CRM_meta_notify_unpromoted_uname | | | single: OCF_RESKEY_CRM_meta_notify_unpromoted_uname | | | | | | Nodes on which resources are running in the unpromoted role | +------------------------------------------------+---------------------------------------------------------------------------------+ Interpretation of Promotable Notification Variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Pre-notification (demote):** * Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * Promoted resources: ``$OCF_RESKEY_CRM_meta_notify_promoted_resource`` * Unpromoted resources: ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource`` * Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` **Post-notification (demote) / Pre-notification (stop):** * Active resources: ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * Promoted resources: * ``$OCF_RESKEY_CRM_meta_notify_promoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Unpromoted resources: ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource`` * Inactive resources: ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` **Post-notification (stop) / Pre-notification (start)** * Active resources: * ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * minus 
``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Promoted resources: * ``$OCF_RESKEY_CRM_meta_notify_promoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Unpromoted resources: * ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Inactive resources: * ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` **Post-notification (start) / Pre-notification (promote)** * Active resources: * ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Promoted resources: * ``$OCF_RESKEY_CRM_meta_notify_promoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Unpromoted resources: * ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Inactive resources: * ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources that were started: 
``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` **Post-notification (promote)** * Active resources: * ``$OCF_RESKEY_CRM_meta_notify_active_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Promoted resources: * ``$OCF_RESKEY_CRM_meta_notify_promoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Unpromoted resources: * ``$OCF_RESKEY_CRM_meta_notify_unpromoted_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Inactive resources: * ``$OCF_RESKEY_CRM_meta_notify_inactive_resource`` * plus ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * minus ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources to be promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources to be demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources to be stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` * Resources that were started: ``$OCF_RESKEY_CRM_meta_notify_start_resource`` * Resources that were promoted: ``$OCF_RESKEY_CRM_meta_notify_promote_resource`` * Resources that were demoted: ``$OCF_RESKEY_CRM_meta_notify_demote_resource`` * Resources that were stopped: ``$OCF_RESKEY_CRM_meta_notify_stop_resource`` Monitoring Promotable Clone Resources _____________________________________ The usual monitor actions are insufficient to monitor a promotable clone resource, because Pacemaker needs to verify not only that the resource is active, but also that its actual role matches its intended one. 
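One way to picture this check is to translate the monitor return code into an
observed role, following the return-code table earlier in this chapter, and
compare it with the intended role. The sketch below is illustrative Python, not
Pacemaker's implementation; ``observed_state()`` and ``matches_intended()`` are
hypothetical names, and the numeric values are those defined for the OCF
resource agent API rather than anything stated in this section:

.. code-block:: python

   # Numeric values as defined by the OCF resource agent API (assumed here).
   OCF_SUCCESS = 0
   OCF_NOT_RUNNING = 7
   OCF_RUNNING_PROMOTED = 8
   OCF_FAILED_PROMOTED = 9


   def observed_state(rc):
       """Map a monitor return code to a (role, failed) pair."""
       if rc == OCF_NOT_RUNNING:
           return ("Stopped", False)
       if rc == OCF_SUCCESS:
           return ("Unpromoted", False)
       if rc == OCF_RUNNING_PROMOTED:
           return ("Promoted", False)
       if rc == OCF_FAILED_PROMOTED:
           return ("Promoted", True)
       # Any other return code is treated as a failure in the unpromoted role.
       return ("Unpromoted", True)


   def matches_intended(rc, intended_role):
       """True if the instance is healthy and in the role the cluster intends."""
       role, failed = observed_state(rc)
       return not failed and role == intended_role


   print(matches_intended(OCF_RUNNING_PROMOTED, "Promoted"))
   print(matches_intended(OCF_SUCCESS, "Promoted"))

In this sketch, an instance that is running but unpromoted does not satisfy an
intended role of ``Promoted``, which is the kind of mismatch a role-specific
monitor is meant to catch.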
Define two monitoring actions: the usual one will cover the unpromoted role, and an additional one with ``role="Promoted"`` will cover the promoted role. .. topic:: Monitoring both states of a promotable clone resource .. code-block:: xml .. important:: It is crucial that *every* monitor operation has a different interval! Pacemaker currently differentiates between operations only by resource and interval; so if (for example) a promotable clone resource had the same monitor interval for both roles, Pacemaker would ignore the role when checking the status -- which would cause unexpected return codes, and therefore unnecessary complications. .. _s-promotion-scores: Determining Which Instance is Promoted ______________________________________ Pacemaker can choose a promotable clone instance to be promoted in one of two ways: * Promotion scores: These are node attributes set via the ``crm_attribute`` command using the ``--promotion`` option, which generally would be called by the resource agent's start action if it supports promotable clones. This tool automatically detects both the resource and host, and should be used to set a preference for being promoted. Based on this, ``promoted-max``, and ``promoted-node-max``, the instance(s) with the highest preference will be promoted. * Constraints: Location constraints can indicate which nodes are most preferred to be promoted. .. topic:: Explicitly preferring node1 to be promoted .. code-block:: xml .. index:: single: bundle single: resource; bundle pair: container; Docker pair: container; podman pair: container; rkt .. _s-resource-bundle: Bundles - Containerized Resources ################################# Pacemaker supports a special syntax for launching a service inside a `container `_ with any infrastructure it requires: the *bundle*. Pacemaker bundles support `Docker `_, `podman `_ *(since 2.0.1)*, and `rkt `_ container technologies. [#]_ .. topic:: A bundle for a containerized web server ..
code-block:: xml Bundle Prerequisites ____________________ Before configuring a bundle in Pacemaker, the user must install the appropriate container launch technology (Docker, podman, or rkt), and supply a fully configured container image, on every node allowed to run the bundle. Pacemaker will create an implicit resource of type **ocf:heartbeat:docker**, **ocf:heartbeat:podman**, or **ocf:heartbeat:rkt** to manage a bundle's container. The user must ensure that the appropriate resource agent is installed on every node allowed to run the bundle. .. index:: pair: XML element; bundle Bundle Properties _________________ .. table:: **XML Attributes of a bundle Element** :widths: 1 4 +-------------+------------------------------------------------------------------+ | Field | Description | +=============+==================================================================+ | id | .. index:: | | | single: bundle; attribute, id | | | single: attribute; id (bundle) | | | single: id; bundle attribute | | | | | | A unique name for the bundle (required) | +-------------+------------------------------------------------------------------+ | description | .. index:: | | | single: bundle; attribute, description | | | single: attribute; description (bundle) | | | single: description; bundle attribute | | | | | | An optional description of the bundle, for the user's own | | | purposes. | | | E.g. ``manages the container that runs the service`` | +-------------+------------------------------------------------------------------+ A bundle must contain exactly one ``docker``, ``podman``, or ``rkt`` element. .. index:: pair: XML element; docker pair: XML element; podman pair: XML element; rkt Bundle Container Properties ___________________________ ..
table:: **XML attributes of a docker, podman, or rkt Element** :class: longtable :widths: 2 3 4 +-------------------+------------------------------------+---------------------------------------------------+ | Attribute | Default | Description | +===================+====================================+===================================================+ | image | | .. index:: | | | | single: docker; attribute, image | | | | single: attribute; image (docker) | | | | single: image; docker attribute | | | | single: podman; attribute, image | | | | single: attribute; image (podman) | | | | single: image; podman attribute | | | | single: rkt; attribute, image | | | | single: attribute; image (rkt) | | | | single: image; rkt attribute | | | | | | | | Container image tag (required) | +-------------------+------------------------------------+---------------------------------------------------+ | replicas | Value of ``promoted-max`` | .. index:: | | | if that is positive, else 1 | single: docker; attribute, replicas | | | | single: attribute; replicas (docker) | | | | single: replicas; docker attribute | | | | single: podman; attribute, replicas | | | | single: attribute; replicas (podman) | | | | single: replicas; podman attribute | | | | single: rkt; attribute, replicas | | | | single: attribute; replicas (rkt) | | | | single: replicas; rkt attribute | | | | | | | | A positive integer specifying the number of | | | | container instances to launch | +-------------------+------------------------------------+---------------------------------------------------+ | replicas-per-host | 1 | .. 
index:: | | | | single: docker; attribute, replicas-per-host | | | | single: attribute; replicas-per-host (docker) | | | | single: replicas-per-host; docker attribute | | | | single: podman; attribute, replicas-per-host | | | | single: attribute; replicas-per-host (podman) | | | | single: replicas-per-host; podman attribute | | | | single: rkt; attribute, replicas-per-host | | | | single: attribute; replicas-per-host (rkt) | | | | single: replicas-per-host; rkt attribute | | | | | | | | A positive integer specifying the number of | | | | container instances allowed to run on a | | | | single node | +-------------------+------------------------------------+---------------------------------------------------+ | promoted-max | 0 | .. index:: | | | | single: docker; attribute, promoted-max | | | | single: attribute; promoted-max (docker) | | | | single: promoted-max; docker attribute | | | | single: podman; attribute, promoted-max | | | | single: attribute; promoted-max (podman) | | | | single: promoted-max; podman attribute | | | | single: rkt; attribute, promoted-max | | | | single: attribute; promoted-max (rkt) | | | | single: promoted-max; rkt attribute | | | | | | | | A non-negative integer that, if positive, | | | | indicates that the containerized service | | | | should be treated as a promotable service, | | | | with this many replicas allowed to run the | | | | service in the promoted role | +-------------------+------------------------------------+---------------------------------------------------+ | network | | .. 
index:: | | | | single: docker; attribute, network | | | | single: attribute; network (docker) | | | | single: network; docker attribute | | | | single: podman; attribute, network | | | | single: attribute; network (podman) | | | | single: network; podman attribute | | | | single: rkt; attribute, network | | | | single: attribute; network (rkt) | | | | single: network; rkt attribute | | | | | | | | If specified, this will be passed to the | | | | ``docker run``, ``podman run``, or | | | | ``rkt run`` command as the network setting | | | | for the container. | +-------------------+------------------------------------+---------------------------------------------------+ | run-command | ``/usr/sbin/pacemaker-remoted`` if | .. index:: | | | bundle contains a **primitive**, | single: docker; attribute, run-command | | | otherwise none | single: attribute; run-command (docker) | | | | single: run-command; docker attribute | | | | single: podman; attribute, run-command | | | | single: attribute; run-command (podman) | | | | single: run-command; podman attribute | | | | single: rkt; attribute, run-command | | | | single: attribute; run-command (rkt) | | | | single: run-command; rkt attribute | | | | | | | | This command will be run inside the container | | | | when launching it ("PID 1"). If the bundle | | | | contains a **primitive**, this command *must* | | | | start ``pacemaker-remoted`` (but could, for | | | | example, be a script that does other stuff, too). | +-------------------+------------------------------------+---------------------------------------------------+ | options | | .. 
index:: | | | | single: docker; attribute, options | | | | single: attribute; options (docker) | | | | single: options; docker attribute | | | | single: podman; attribute, options | | | | single: attribute; options (podman) | | | | single: options; podman attribute | | | | single: rkt; attribute, options | | | | single: attribute; options (rkt) | | | | single: options; rkt attribute | | | | | | | | Extra command-line options to pass to the | | | | ``docker run``, ``podman run``, or ``rkt run`` | | | | command | +-------------------+------------------------------------+---------------------------------------------------+ .. note:: Considerations when using cluster configurations or container images from Pacemaker 1.1: * If the container image has a pre-2.0.0 version of Pacemaker, set ``run-command`` to ``/usr/sbin/pacemaker_remoted`` (note the underscore instead of dash). * ``masters`` is accepted as an alias for ``promoted-max``, but has been deprecated since 2.0.0, and support for it will be removed in a future version. Bundle Network Properties _________________________ A bundle may optionally contain one ``network`` element. .. index:: pair: XML element; network single: bundle; network .. table:: **XML attributes of a network Element** :widths: 2 1 5 +----------------+---------+------------------------------------------------------------+ | Attribute | Default | Description | +================+=========+============================================================+ | add-host | TRUE | .. index:: | | | | single: network; attribute, add-host | | | | single: attribute; add-host (network) | | | | single: add-host; network attribute | | | | | | | | If TRUE, and ``ip-range-start`` is used, Pacemaker will | | | | automatically ensure that ``/etc/hosts`` inside the | | | | containers has entries for each | | | | :ref:`replica name <s-resource-bundle-note-replica-names>` | | | | and its assigned IP. | +----------------+---------+------------------------------------------------------------+ | ip-range-start | | .. 
index:: | | | | single: network; attribute, ip-range-start | | | | single: attribute; ip-range-start (network) | | | | single: ip-range-start; network attribute | | | | | | | | If specified, Pacemaker will create an implicit | | | | ``ocf:heartbeat:IPaddr2`` resource for each container | | | | instance, starting with this IP address, using up to | | | | ``replicas`` sequential addresses. These addresses can be | | | | used from the host's network to reach the service inside | | | | the container, though they are not visible within the | | | | container itself. Only IPv4 addresses are currently | | | | supported. | +----------------+---------+------------------------------------------------------------+ | host-netmask | 32 | .. index:: | | | | single: network; attribute, host-netmask | | | | single: attribute; host-netmask (network) | | | | single: host-netmask; network attribute | | | | | | | | If ``ip-range-start`` is specified, the IP addresses | | | | are created with this CIDR netmask (as a number of bits). | +----------------+---------+------------------------------------------------------------+ | host-interface | | .. index:: | | | | single: network; attribute, host-interface | | | | single: attribute; host-interface (network) | | | | single: host-interface; network attribute | | | | | | | | If ``ip-range-start`` is specified, the IP addresses are | | | | created on this host interface (by default, it will be | | | | determined from the IP address). | +----------------+---------+------------------------------------------------------------+ | control-port | 3121 | .. index:: | | | | single: network; attribute, control-port | | | | single: attribute; control-port (network) | | | | single: control-port; network attribute | | | | | | | | If the bundle contains a ``primitive``, the cluster will | | | | use this integer TCP port for communication with | | | | Pacemaker Remote inside the container. 
Changing this is | | | | useful when the container is unable to listen on the | | | | default port, for example, when the container uses the | | | | host's network rather than ``ip-range-start`` (in which | | | | case ``replicas-per-host`` must be 1), or when the bundle | | | | may run on a Pacemaker Remote node that is already | | | | listening on the default port. Any ``PCMK_remote_port`` | | | | environment variable set on the host or in the container | | | | is ignored for bundle connections. | +----------------+---------+------------------------------------------------------------+ .. _s-resource-bundle-note-replica-names: .. note:: Replicas are named by the bundle id plus a dash and an integer counter starting with zero. For example, if a bundle named **httpd-bundle** has **replicas=2**, its containers will be named **httpd-bundle-0** and **httpd-bundle-1**. .. index:: pair: XML element; port-mapping Additionally, a ``network`` element may optionally contain one or more ``port-mapping`` elements. .. table:: **Attributes of a port-mapping Element** :widths: 2 1 5 +---------------+-------------------+------------------------------------------------------+ | Attribute | Default | Description | +===============+===================+======================================================+ | id | | .. index:: | | | | single: port-mapping; attribute, id | | | | single: attribute; id (port-mapping) | | | | single: id; port-mapping attribute | | | | | | | | A unique name for the port mapping (required) | +---------------+-------------------+------------------------------------------------------+ | port | | .. 
index:: | | | | single: port-mapping; attribute, port | | | | single: attribute; port (port-mapping) | | | | single: port; port-mapping attribute | | | | | | | | If this is specified, connections to this TCP port | | | | number on the host network (on the container's | | | | assigned IP address, if ``ip-range-start`` is | | | | specified) will be forwarded to the container | | | | network. Exactly one of ``port`` or ``range`` | | | | must be specified in a ``port-mapping``. | +---------------+-------------------+------------------------------------------------------+ | internal-port | value of ``port`` | .. index:: | | | | single: port-mapping; attribute, internal-port | | | | single: attribute; internal-port (port-mapping) | | | | single: internal-port; port-mapping attribute | | | | | | | | If ``port`` and this are specified, connections | | | | to ``port`` on the host's network will be | | | | forwarded to this port on the container network. | +---------------+-------------------+------------------------------------------------------+ | range | | .. index:: | | | | single: port-mapping; attribute, range | | | | single: attribute; range (port-mapping) | | | | single: range; port-mapping attribute | | | | | | | | If this is specified, connections to these TCP | | | | port numbers (expressed as *first_port*-*last_port*) | | | | on the host network (on the container's assigned IP | | | | address, if ``ip-range-start`` is specified) will | | | | be forwarded to the same ports in the container | | | | network. Exactly one of ``port`` or ``range`` | | | | must be specified in a ``port-mapping``. | +---------------+-------------------+------------------------------------------------------+ .. note:: If the bundle contains a ``primitive``, Pacemaker will automatically map the ``control-port``, so it is not necessary to specify that port in a ``port-mapping``. .. index:: pair: XML element; storage pair: XML element; storage-mapping single: bundle; storage .. 
_s-bundle-storage: Bundle Storage Properties _________________________ A bundle may optionally contain one ``storage`` element. A ``storage`` element has no properties of its own, but may contain one or more ``storage-mapping`` elements. .. table:: **Attributes of a storage-mapping Element** :widths: 2 1 5 +-----------------+---------+-------------------------------------------------------------+ | Attribute | Default | Description | +=================+=========+=============================================================+ | id | | .. index:: | | | | single: storage-mapping; attribute, id | | | | single: attribute; id (storage-mapping) | | | | single: id; storage-mapping attribute | | | | | | | | A unique name for the storage mapping (required) | +-----------------+---------+-------------------------------------------------------------+ | source-dir | | .. index:: | | | | single: storage-mapping; attribute, source-dir | | | | single: attribute; source-dir (storage-mapping) | | | | single: source-dir; storage-mapping attribute | | | | | | | | The absolute path on the host's filesystem that will be | | | | mapped into the container. Exactly one of ``source-dir`` | | | | and ``source-dir-root`` must be specified in a | | | | ``storage-mapping``. | +-----------------+---------+-------------------------------------------------------------+ | source-dir-root | | .. index:: | | | | single: storage-mapping; attribute, source-dir-root | | | | single: attribute; source-dir-root (storage-mapping) | | | | single: source-dir-root; storage-mapping attribute | | | | | | | | The start of a path on the host's filesystem that will | | | | be mapped into the container, using a different | | | | subdirectory on the host for each container instance. | | | | The subdirectory will be named the same as the | | | | :ref:`replica name <s-resource-bundle-note-replica-names>`. | | | | Exactly one of ``source-dir`` and ``source-dir-root`` | | | | must be specified in a ``storage-mapping``. 
| +-----------------+---------+-------------------------------------------------------------+ | target-dir | | .. index:: | | | | single: storage-mapping; attribute, target-dir | | | | single: attribute; target-dir (storage-mapping) | | | | single: target-dir; storage-mapping attribute | | | | | | | | The path name within the container where the host | | | | storage will be mapped (required) | +-----------------+---------+-------------------------------------------------------------+ | options | | .. index:: | | | | single: storage-mapping; attribute, options | | | | single: attribute; options (storage-mapping) | | | | single: options; storage-mapping attribute | | | | | | | | A comma-separated list of file system mount | | | | options to use when mapping the storage | +-----------------+---------+-------------------------------------------------------------+ .. note:: Pacemaker does not define the behavior if the source directory does not already exist on the host. However, it is expected that the container technology and/or its resource agent will create the source directory in that case. .. note:: If the bundle contains a ``primitive``, Pacemaker will automatically map the equivalent of ``source-dir=/etc/pacemaker/authkey target-dir=/etc/pacemaker/authkey`` and ``source-dir-root=/var/log/pacemaker/bundles target-dir=/var/log`` into the container, so it is not necessary to specify those paths in a ``storage-mapping``. .. important:: The ``PCMK_authkey_location`` environment variable must not be set to anything other than the default of ``/etc/pacemaker/authkey`` on any node in the cluster. .. important:: If SELinux is used in enforcing mode on the host, you must ensure the container is allowed to use any storage you mount into it. For Docker and podman bundles, adding "Z" to the mount options will create a container-specific label for the mount that allows the container access. .. 
index:: single: bundle; primitive Bundle Primitive ________________ A bundle may optionally contain one :ref:`primitive ` resource. The primitive may have operations, instance attributes, and meta-attributes defined, as usual. If a bundle contains a primitive resource, the container image must include the Pacemaker Remote daemon, and at least one of ``ip-range-start`` or ``control-port`` must be configured in the bundle. Pacemaker will create an implicit **ocf:pacemaker:remote** resource for the connection, launch Pacemaker Remote within the container, and monitor and manage the primitive resource via Pacemaker Remote. If the bundle has more than one container instance (replica), the primitive resource will function as an implicit :ref:`clone ` -- a :ref:`promotable clone ` if the bundle has ``promoted-max`` greater than zero. .. note:: If you want to pass environment variables to a bundle's Pacemaker Remote connection or primitive, you have two options: * Environment variables whose value is the same regardless of the underlying host may be set using the container element's ``options`` attribute. * If you want variables to have host-specific values, you can use the :ref:`storage-mapping <s-bundle-storage>` element to map a file on the host as ``/etc/pacemaker/pcmk-init.env`` in the container *(since 2.0.3)*. Pacemaker Remote will parse this file in a shell-like format, with variables set as NAME=VALUE, ignoring blank lines and comments starting with "#". .. important:: When a bundle has a ``primitive``, Pacemaker on all cluster nodes must be able to contact Pacemaker Remote inside the bundle's containers. * The containers must have an accessible network (for example, ``network`` should not be set to "none" with a ``primitive``). * The default, using a distinct network space inside the container, works in combination with ``ip-range-start``. Any firewall must allow access from all cluster nodes to the ``control-port`` on the container IPs. 
* If the container shares the host's network space (for example, by setting ``network`` to "host"), a unique ``control-port`` should be specified for each bundle. Any firewall must allow access from all cluster nodes to the ``control-port`` on all cluster and remote node IPs. .. index:: single: bundle; node attributes .. _s-bundle-attributes: Bundle Node Attributes ______________________ If the bundle has a ``primitive``, the primitive's resource agent may want to set node attributes such as :ref:`promotion scores `. However, with containers, it is not apparent which node should get the attribute. If the container uses shared storage that is the same no matter which node the container is hosted on, then it is appropriate to use the promotion score on the bundle node itself. On the other hand, if the container uses storage exported from the underlying host, then it may be more appropriate to use the promotion score on the underlying host. Since this depends on the particular situation, the ``container-attribute-target`` resource meta-attribute allows the user to specify which approach to use. If it is set to ``host``, then user-defined node attributes will be checked on the underlying host. If it is anything else, the local node (in this case the bundle node) is used as usual. This only applies to user-defined attributes; the cluster will always check the local node for cluster-defined attributes such as ``#uname``. If ``container-attribute-target`` is ``host``, the cluster will pass additional environment variables to the primitive's resource agent that allow it to set node attributes appropriately: ``CRM_meta_container_attribute_target`` (identical to the meta-attribute value) and ``CRM_meta_physical_host`` (the name of the underlying host). .. note:: When called by a resource agent, the ``attrd_updater`` and ``crm_attribute`` commands will automatically check those environment variables and set attributes appropriately. .. 
index:: single: bundle; meta-attributes Bundle Meta-Attributes ______________________ Any meta-attribute set on a bundle will be inherited by the bundle's primitive and any resources implicitly created by Pacemaker for the bundle. This includes options such as ``priority``, ``target-role``, and ``is-managed``. See :ref:`resource_options` for more information. Bundles support clone meta-attributes including ``notify``, ``ordered``, and ``interleave``. Limitations of Bundles ______________________ Restarting pacemaker while a bundle is unmanaged or the cluster is in maintenance mode may cause the bundle to fail. Bundles may not be explicitly cloned or included in groups. This includes the bundle's primitive and any resources implicitly created by Pacemaker for the bundle. (If ``replicas`` is greater than 1, the bundle will behave like a clone implicitly.) Bundles do not have instance attributes, utilization attributes, or operations, though a bundle's primitive may have them. A bundle with a primitive can run on a Pacemaker Remote node only if the bundle uses a distinct ``control-port``. .. [#] Of course, the service must support running multiple instances. .. [#] Docker is a trademark of Docker, Inc. No endorsement by or association with Docker, Inc. is implied. diff --git a/doc/sphinx/Pacemaker_Explained/options.rst b/doc/sphinx/Pacemaker_Explained/options.rst index 5d95e4c867..ca7ea2a8a3 100644 --- a/doc/sphinx/Pacemaker_Explained/options.rst +++ b/doc/sphinx/Pacemaker_Explained/options.rst @@ -1,631 +1,631 @@ Cluster-Wide Configuration -------------------------- .. index:: pair: XML element; cib pair: XML element; configuration Configuration Layout #################### The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this: .. topic:: An empty configuration .. 
code-block:: xml The empty configuration above contains the major sections that make up a CIB: * ``cib``: The entire CIB is enclosed with a ``cib`` element. Certain fundamental settings are defined as attributes of this element. * ``configuration``: This section -- the primary focus of this document -- contains traditional configuration information such as what resources the cluster serves and the relationships among them. * ``crm_config``: cluster-wide configuration options * ``nodes``: the machines that host the cluster * ``resources``: the services run by the cluster * ``constraints``: indications of how resources should be placed * ``status``: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way. In this document, configuration settings will be described as properties or options based on how they are defined in the CIB: * Properties are XML attributes of an XML element. * Options are name-value pairs expressed as ``nvpair`` child elements of an XML element. Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak. CIB Properties ############## Certain settings are defined by CIB properties (that is, attributes of the ``cib`` tag) rather than with the rest of the cluster configuration in the ``configuration`` section. The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location. .. 
table:: **CIB Properties** :class: longtable :widths: 1 3 +------------------+-----------------------------------------------------------+ | Attribute | Description | +==================+===========================================================+ | admin_epoch | .. index:: | | | pair: admin_epoch; cib | | | | | | When a node joins the cluster, the cluster performs a | | | check to see which node has the best configuration. It | | | asks the node with the highest (``admin_epoch``, | | | ``epoch``, ``num_updates``) tuple to replace the | | | configuration on all the nodes -- which makes setting | | | them, and setting them correctly, very important. | | | ``admin_epoch`` is never modified by the cluster; you can | | | use this to make the configurations on any inactive nodes | | | obsolete. | | | | | | **Warning:** Never set this value to zero. In such cases, | | | the cluster cannot tell the difference between your | | | configuration and the "empty" one used when nothing is | | | found on disk. | +------------------+-----------------------------------------------------------+ | epoch | .. index:: | | | pair: epoch; cib | | | | | | The cluster increments this every time the configuration | | | is updated (usually by the administrator). | +------------------+-----------------------------------------------------------+ | num_updates | .. index:: | | | pair: num_updates; cib | | | | | | The cluster increments this every time the configuration | | | or status is updated (usually by the cluster) and resets | | | it to 0 when epoch changes. | +------------------+-----------------------------------------------------------+ | validate-with | .. index:: | | | pair: validate-with; cib | | | | | | Determines the type of XML validation that will be done | | | on the configuration. If set to ``none``, the cluster | | | will not verify that updates conform to the DTD (nor | | | reject ones that don't). 
| +------------------+-----------------------------------------------------------+ | cib-last-written | .. index:: | | | pair: cib-last-written; cib | | | | | | Indicates when the configuration was last written to | | | disk. Maintained by the cluster; for informational | | | purposes only. | +------------------+-----------------------------------------------------------+ | have-quorum | .. index:: | | | pair: have-quorum; cib | | | | | | Indicates if the cluster has quorum. If false, this may | | | mean that the cluster cannot start resources or fence | | | other nodes (see ``no-quorum-policy`` below). Maintained | | | by the cluster. | +------------------+-----------------------------------------------------------+ | dc-uuid | .. index:: | | | pair: dc-uuid; cib | | | | | | Indicates which cluster node is the current leader. Used | | | by the cluster when placing resources and determining the | | | order of some events. Maintained by the cluster. | +------------------+-----------------------------------------------------------+ .. _cluster_options: Cluster Options ############### Cluster options, as you might expect, control how the cluster behaves when confronted with various situations. They are grouped into sets within the ``crm_config`` section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on :ref:`rules` where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once. You can obtain an up-to-date list of cluster options, including their default values, by running the ``man pacemaker-schedulerd`` and ``man pacemaker-controld`` commands. .. 
table:: **Cluster Options** :class: longtable :widths: 2 1 4 +---------------------------+---------+----------------------------------------------------+ | Option | Default | Description | +===========================+=========+====================================================+ | cluster-name | | .. index:: | | | | pair: cluster option; cluster-name | | | | | | | | An (optional) name for the cluster as a whole. | | | | This is mostly for users' convenience in | | | | administration, but it can also be | | | | used in the Pacemaker configuration in | | | | :ref:`rules` (as the ``#cluster-name`` | | | | :ref:`node attribute | | | | `). It may | | | | also be used by higher-level tools when | | | | displaying cluster information, and by | | | | certain resource agents (for example, the | | | | ``ocf:heartbeat:GFS2`` agent stores the | | | | cluster name in filesystem meta-data). | +---------------------------+---------+----------------------------------------------------+ | dc-version | | .. index:: | | | | pair: cluster option; dc-version | | | | | | | | Version of Pacemaker on the cluster's DC. | | | | Determined automatically by the cluster. Often | | | | includes the hash which identifies the exact | | | | Git changeset it was built from. Used for | | | | diagnostic purposes. | +---------------------------+---------+----------------------------------------------------+ | cluster-infrastructure | | .. index:: | | | | pair: cluster option; cluster-infrastructure | | | | | | | | The messaging stack on which Pacemaker is | | | | currently running. Determined automatically by | | | | the cluster. Used for informational and | | | | diagnostic purposes. | +---------------------------+---------+----------------------------------------------------+ | no-quorum-policy | stop | .. index:: | | | | pair: cluster option; no-quorum-policy | | | | | | | | What to do when the cluster does not have | | | | quorum. 
Allowed values: | | | | | | | | * ``ignore:`` continue all resource management | | | | * ``freeze:`` continue resource management, but | | | | don't recover resources from nodes not in the | | | | affected partition | | | | * ``stop:`` stop all resources in the affected | | | | cluster partition | | | | * ``demote:`` demote promotable resources and | | | | stop all other resources in the affected | | | | cluster partition *(since 2.0.5)* | | | | * ``suicide:`` fence all nodes in the affected | | | | cluster partition | +---------------------------+---------+----------------------------------------------------+ | batch-limit | 0 | .. index:: | | | | pair: cluster option; batch-limit | | | | | | | | The maximum number of actions that the cluster | | | | may execute in parallel across all nodes. The | | | | "correct" value will depend on the speed and | | | | load of your network and cluster nodes. If zero, | | | | the cluster will impose a dynamically calculated | | | | limit only when any node has high load. If -1, the | | | | cluster will not impose any limit. | +---------------------------+---------+----------------------------------------------------+ | migration-limit | -1 | .. index:: | | | | pair: cluster option; migration-limit | | | | | | | | The number of | | | | :ref:`live migration ` actions | | | | that the cluster is allowed to execute in | | | | parallel on a node. A value of -1 means | | | | unlimited. | +---------------------------+---------+----------------------------------------------------+ | symmetric-cluster | true | .. index:: | | | | pair: cluster option; symmetric-cluster | | | | | | | | Whether resources can run on any node by default | | | | (if false, a resource is allowed to run on a | | | | node only if a | | | | :ref:`location constraint ` | | | | enables it) | +---------------------------+---------+----------------------------------------------------+ | stop-all-resources | false | .. 
index:: | | | | pair: cluster option; stop-all-resources | | | | | | | | Whether all resources should be disallowed from | | | | running (can be useful during maintenance) | +---------------------------+---------+----------------------------------------------------+ | stop-orphan-resources | true | .. index:: | | | | pair: cluster option; stop-orphan-resources | | | | | | | | Whether resources that have been deleted from | | | | the configuration should be stopped. This value | | | | takes precedence over ``is-managed`` (that is, | | | | even unmanaged resources will be stopped when | | | | orphaned if this value is ``true``). | +---------------------------+---------+----------------------------------------------------+ | stop-orphan-actions | true | .. index:: | | | | pair: cluster option; stop-orphan-actions | | | | | | | | Whether recurring :ref:`operations ` | | | | that have been deleted from the configuration | | | | should be cancelled | +---------------------------+---------+----------------------------------------------------+ | start-failure-is-fatal | true | .. index:: | | | | pair: cluster option; start-failure-is-fatal | | | | | | | | Whether a failure to start a resource on a | | | | particular node prevents further start attempts | | | | on that node. If ``false``, the cluster will | | | | decide whether the node is still eligible based | | | | on the resource's current failure count and | | | | :ref:`migration-threshold `. | +---------------------------+---------+----------------------------------------------------+ | enable-startup-probes | true | .. index:: | | | | pair: cluster option; enable-startup-probes | | | | | | | | Whether the cluster should check the | | | | pre-existing state of resources when the cluster | | | | starts | +---------------------------+---------+----------------------------------------------------+ | maintenance-mode | false | .. 
index:: | | | | pair: cluster option; maintenance-mode | | | | | | | | Whether the cluster should refrain from | | | | monitoring, starting and stopping resources | +---------------------------+---------+----------------------------------------------------+ | stonith-enabled | true | .. index:: | | | | pair: cluster option; stonith-enabled | | | | | | | | Whether the cluster is allowed to fence nodes | | | | (for example, failed nodes and nodes with | | | | resources that can't be stopped). | | | | | | | | If true, at least one fence device must be | | | | configured before resources are allowed to run. | | | | | | | | If false, unresponsive nodes are immediately | | | | assumed to be running no resources, and resource | | | | recovery on online nodes starts without any | | | | further protection (which can mean *data loss* | | | | if the unresponsive node still accesses shared | | | | storage, for example). See also the | | | | :ref:`requires ` resource | | | | meta-attribute. | +---------------------------+---------+----------------------------------------------------+ | stonith-action | reboot | .. index:: | | | | pair: cluster option; stonith-action | | | | | | | | Action the cluster should send to the fence agent | | | | when a node must be fenced. Allowed values are | | | | ``reboot``, ``off``, and (for legacy agents only) | | | | ``poweroff``. | +---------------------------+---------+----------------------------------------------------+ | stonith-timeout | 60s | .. index:: | | | | pair: cluster option; stonith-timeout | | | | | | | | How long to wait for ``on``, ``off``, and | | | | ``reboot`` fence actions to complete by default. | +---------------------------+---------+----------------------------------------------------+ | stonith-max-attempts | 10 | .. index:: | | | | pair: cluster option; stonith-max-attempts | | | | | | | | How many times fencing can fail for a target | | | | before the cluster will no longer immediately | | | | re-attempt it. 
| +---------------------------+---------+----------------------------------------------------+ | stonith-watchdog-timeout | 0 | .. index:: | | | | pair: cluster option; stonith-watchdog-timeout | | | | | | | | If nonzero, and the cluster detects | | | | ``have-watchdog`` as ``true``, then watchdog-based | | | | self-fencing will be performed via SBD when | | | | fencing is required, without requiring a fencing | | | | resource explicitly configured. | | | | | | | | If this is set to a positive value, unseen nodes | | | | are assumed to self-fence within this much time. | | | | | | | | **Warning:** It must be ensured that this value is | | | | larger than the ``SBD_WATCHDOG_TIMEOUT`` | | | | environment variable on all nodes. Pacemaker | | | | verifies the settings individually on all nodes | | | | and prevents startup or shuts down if configured | | | | wrongly on the fly. It is strongly recommended | | | | that ``SBD_WATCHDOG_TIMEOUT`` be set to the same | | | | value on all nodes. | | | | | | | | If this is set to a negative value, and | | | | ``SBD_WATCHDOG_TIMEOUT`` is set, twice that value | | | | will be used. | | | | | | | | **Warning:** In this case, it is essential (and | | | | currently not verified by pacemaker) that | | | | ``SBD_WATCHDOG_TIMEOUT`` is set to the same | | | | value on all nodes. | +---------------------------+---------+----------------------------------------------------+ | concurrent-fencing | false | .. index:: | | | | pair: cluster option; concurrent-fencing | | | | | | | | Whether the cluster is allowed to initiate | | | | multiple fence actions concurrently. Fence actions | | | | initiated externally, such as via the | | | | ``stonith_admin`` tool or an application such as | | | | DLM, or by the fencer itself such as recurring | | | | device monitors and ``status`` and ``list`` | | | | commands, are not limited by this option. 
| +---------------------------+---------+----------------------------------------------------+ | fence-reaction | stop | .. index:: | | | | pair: cluster option; fence-reaction | | | | | | | | How should a cluster node react if notified of its | | | | own fencing? A cluster node may receive | | | | notification of its own fencing if fencing is | | | | misconfigured, or if fabric fencing is in use that | | | | doesn't cut cluster communication. Allowed values | | | | are ``stop`` to attempt to immediately stop | | | | pacemaker and stay stopped, or ``panic`` to | | | | attempt to immediately reboot the local node, | | | | falling back to stop on failure. The default is | | | | likely to be changed to ``panic`` in a future | | | | release. *(since 2.0.3)* | +---------------------------+---------+----------------------------------------------------+ | priority-fencing-delay | 0 | .. index:: | | | | pair: cluster option; priority-fencing-delay | | | | | | | | Apply this delay to any fencing targeting the lost | | | | nodes with the highest total resource priority in | | | | case we don't have the majority of the nodes in | | | | our cluster partition, so that the more | | | | significant nodes potentially win any fencing | | | | match (especially meaningful in a split-brain of a | | | | 2-node cluster). A promoted resource instance | | | | takes the resource's priority plus 1 if the | | | | resource's priority is not 0. Any static or random | | | | delays introduced by ``pcmk_delay_base`` and | | | | ``pcmk_delay_max`` configured for the | | | | corresponding fencing resources will be added to | | | | this delay. This delay should be significantly | | | | greater than (safely twice) the maximum delay from | | | | those parameters. *(since 2.0.4)* | +---------------------------+---------+----------------------------------------------------+ | node-pending-timeout | 10min | .. 
index:: | | | | pair: cluster option; node-pending-timeout | | | | | | | | A node that has joined the cluster can be pending | | | | on joining the process group. We wait up to this | | | | much time for it. If it times out, fencing | | | | targeting the node will be issued if enabled. | | | | *(since 2.1.7)* | +---------------------------+---------+----------------------------------------------------+ | cluster-delay | 60s | .. index:: | | | | pair: cluster option; cluster-delay | | | | | | | | Estimated maximum round-trip delay over the | | | | network (excluding action execution). If the DC | | | | requires an action to be executed on another node, | | | | it will consider the action failed if it does not | | | | get a response from the other node in this time | | | | (after considering the action's own timeout). The | | | | "correct" value will depend on the speed and load | | | | of your network and cluster nodes. | +---------------------------+---------+----------------------------------------------------+ | dc-deadtime | 20s | .. index:: | | | | pair: cluster option; dc-deadtime | | | | | | | | How long to wait for a response from other nodes | | | | during startup. The "correct" value will depend on | | | | the speed/load of your network and the type of | | | | switches used. | +---------------------------+---------+----------------------------------------------------+ | cluster-ipc-limit | 500 | .. index:: | | | | pair: cluster option; cluster-ipc-limit | | | | | | | | The maximum IPC message backlog before one cluster | | | | daemon will disconnect another. This is of use in | | | | large clusters, for which a good value is the | | | | number of resources in the cluster multiplied by | | | | the number of nodes. The default of 500 is also | | | | the minimum. Raise this if you see | | | | "Evicting client" messages for cluster daemon PIDs | | | | in the logs. 
| +---------------------------+---------+----------------------------------------------------+ | pe-error-series-max | -1 | .. index:: | | | | pair: cluster option; pe-error-series-max | | | | | | | | The number of scheduler inputs resulting in errors | | | | to save. Used when reporting problems. A value of | | | | -1 means unlimited (report all), and 0 means none. | +---------------------------+---------+----------------------------------------------------+ | pe-warn-series-max | 5000 | .. index:: | | | | pair: cluster option; pe-warn-series-max | | | | | | | | The number of scheduler inputs resulting in | | | | warnings to save. Used when reporting problems. A | | | | value of -1 means unlimited (report all), and 0 | | | | means none. | +---------------------------+---------+----------------------------------------------------+ | pe-input-series-max | 4000 | .. index:: | | | | pair: cluster option; pe-input-series-max | | | | | | | | The number of "normal" scheduler inputs to save. | | | | Used when reporting problems. A value of -1 means | | | | unlimited (report all), and 0 means none. | +---------------------------+---------+----------------------------------------------------+ | enable-acl | false | .. index:: | | | | pair: cluster option; enable-acl | | | | | | | | Whether :ref:`acl` should be used to authorize | | | | modifications to the CIB | +---------------------------+---------+----------------------------------------------------+ | placement-strategy | default | .. index:: | | | | pair: cluster option; placement-strategy | | | | | - | | | How the cluster should allocate resources to nodes | + | | | How the cluster should assign resources to nodes | | | | (see :ref:`utilization`). Allowed values are | | | | ``default``, ``utilization``, ``balanced``, and | | | | ``minimal``. | +---------------------------+---------+----------------------------------------------------+ | node-health-strategy | none | .. 
index:: | | | | pair: cluster option; node-health-strategy | | | | | | | | How the cluster should react to node health | | | | attributes (see :ref:`node-health`). Allowed values| | | | are ``none``, ``migrate-on-red``, ``only-green``, | | | | ``progressive``, and ``custom``. | +---------------------------+---------+----------------------------------------------------+ | node-health-base | 0 | .. index:: | | | | pair: cluster option; node-health-base | | | | | | | | The base health score assigned to a node. Only | | | | used when ``node-health-strategy`` is | | | | ``progressive``. | +---------------------------+---------+----------------------------------------------------+ | node-health-green | 0 | .. index:: | | | | pair: cluster option; node-health-green | | | | | | | | The score to use for a node health attribute whose | | | | value is ``green``. Only used when | | | | ``node-health-strategy`` is ``progressive`` or | | | | ``custom``. | +---------------------------+---------+----------------------------------------------------+ | node-health-yellow | 0 | .. index:: | | | | pair: cluster option; node-health-yellow | | | | | | | | The score to use for a node health attribute whose | | | | value is ``yellow``. Only used when | | | | ``node-health-strategy`` is ``progressive`` or | | | | ``custom``. | +---------------------------+---------+----------------------------------------------------+ | node-health-red | 0 | .. index:: | | | | pair: cluster option; node-health-red | | | | | | | | The score to use for a node health attribute whose | | | | value is ``red``. Only used when | | | | ``node-health-strategy`` is ``progressive`` or | | | | ``custom``. | +---------------------------+---------+----------------------------------------------------+ | cluster-recheck-interval | 15min | .. 
index:: | | | | pair: cluster option; cluster-recheck-interval | | | | | | | | Pacemaker is primarily event-driven, and looks | | | | ahead to know when to recheck the cluster for | | | | failure timeouts and most time-based rules | | | | *(since 2.0.3)*. However, it will also recheck the | | | | cluster after this amount of inactivity. This has | | | | two goals: rules with ``date_spec`` are only | | | | guaranteed to be checked this often, and it also | | | | serves as a fail-safe for some kinds of scheduler | | | | bugs. A value of 0 disables this polling; positive | | | | values are a time interval. | +---------------------------+---------+----------------------------------------------------+ | shutdown-lock | false | .. index:: | | | | pair: cluster option; shutdown-lock | | | | | | | | The default of false allows active resources to be | | | | recovered elsewhere when their node is cleanly | | | | shut down, which is what the vast majority of | | | | users will want. However, some users prefer to | | | | make resources highly available only for failures, | | | | with no recovery for clean shutdowns. If this | | | | option is true, resources active on a node when it | | | | is cleanly shut down are kept "locked" to that | | | | node (not allowed to run elsewhere) until they | | | | start again on that node after it rejoins (or for | | | | at most ``shutdown-lock-limit``, if set). Stonith | | | | resources and Pacemaker Remote connections are | | | | never locked. Clone and bundle instances and the | | | | promoted role of promotable clones are currently | | | | never locked, though support could be added in a | | | | future release. 
Locks may be manually cleared | | | | using the ``--refresh`` option of ``crm_resource`` | | | | (both the resource and node must be specified; | | | | this works with remote nodes if their connection | | | | resource's ``target-role`` is set to ``Stopped``, | | | | but not if Pacemaker Remote is stopped on the | | | | remote node without disabling the connection | | | | resource). *(since 2.0.4)* | +---------------------------+---------+----------------------------------------------------+ | shutdown-lock-limit | 0 | .. index:: | | | | pair: cluster option; shutdown-lock-limit | | | | | | | | If ``shutdown-lock`` is true, and this is set to a | | | | nonzero time duration, locked resources will be | | | | allowed to start after this much time has passed | | | | since the node shutdown was initiated, even if the | | | | node has not rejoined. (This works with remote | | | | nodes only if their connection resource's | | | | ``target-role`` is set to ``Stopped``.) | | | | *(since 2.0.4)* | +---------------------------+---------+----------------------------------------------------+ | remove-after-stop | false | .. index:: | | | | pair: cluster option; remove-after-stop | | | | | | | | *Deprecated* Should the cluster remove | | | | resources from Pacemaker's executor after they are | | | | stopped? Values other than the default are, at | | | | best, poorly tested and potentially dangerous. | | | | This option is deprecated and will be removed in a | | | | future release. | +---------------------------+---------+----------------------------------------------------+ | startup-fencing | true | .. index:: | | | | pair: cluster option; startup-fencing | | | | | | | | *Advanced Use Only:* Should the cluster fence | | | | unseen nodes at start-up? Setting this to false is | | | | unsafe, because the unseen nodes could be active | | | | and running resources but unreachable. 
| +---------------------------+---------+----------------------------------------------------+ | election-timeout | 2min | .. index:: | | | | pair: cluster option; election-timeout | | | | | | | | *Advanced Use Only:* If you need to adjust this | | | | value, it probably indicates the presence of a bug.| +---------------------------+---------+----------------------------------------------------+ | shutdown-escalation | 20min | .. index:: | | | | pair: cluster option; shutdown-escalation | | | | | | | | *Advanced Use Only:* If you need to adjust this | | | | value, it probably indicates the presence of a bug.| +---------------------------+---------+----------------------------------------------------+ | join-integration-timeout | 3min | .. index:: | | | | pair: cluster option; join-integration-timeout | | | | | | | | *Advanced Use Only:* If you need to adjust this | | | | value, it probably indicates the presence of a bug.| +---------------------------+---------+----------------------------------------------------+ | join-finalization-timeout | 30min | .. index:: | | | | pair: cluster option; join-finalization-timeout | | | | | | | | *Advanced Use Only:* If you need to adjust this | | | | value, it probably indicates the presence of a bug.| +---------------------------+---------+----------------------------------------------------+ | transition-delay | 0s | .. index:: | | | | pair: cluster option; transition-delay | | | | | | | | *Advanced Use Only:* Delay cluster recovery for | | | | the configured interval to allow for additional or | | | | related events to occur. This can be useful if | | | | your configuration is sensitive to the order in | | | | which ping updates arrive. Enabling this option | | | | will slow down cluster recovery under all | | | | conditions. 
| +---------------------------+---------+----------------------------------------------------+ diff --git a/doc/sphinx/Pacemaker_Explained/utilization.rst b/doc/sphinx/Pacemaker_Explained/utilization.rst index 93c67cdf31..87eef6021e 100644 --- a/doc/sphinx/Pacemaker_Explained/utilization.rst +++ b/doc/sphinx/Pacemaker_Explained/utilization.rst @@ -1,264 +1,264 @@ .. _utilization: Utilization and Placement Strategy ---------------------------------- Pacemaker decides where to place a resource according to the resource -allocation scores on every node. The resource will be allocated to the +assignment scores on every node. The resource will be assigned to the node where the resource has the highest score. -If the resource allocation scores on all the nodes are equal, by the default +If the resource assignment scores on all the nodes are equal, by the default placement strategy, Pacemaker will choose a node with the least number of -allocated resources for balancing the load. If the number of resources on each +assigned resources for balancing the load. If the number of resources on each node is equal, the first eligible node listed in the CIB will be chosen to run the resource. Often, in real-world situations, different resources use significantly different proportions of a node's capacities (memory, I/O, etc.). We cannot balance the load ideally just according to the number of resources -allocated to a node. Besides, if resources are placed such that their combined +assigned to a node. Besides, if resources are placed such that their combined requirements exceed the provided capacity, they may fail to start completely or run with degraded performance. To take these factors into account, Pacemaker allows you to configure: #. The capacity a certain node provides. #. The capacity a certain resource requires. #. An overall strategy for placement of resources. 
Utilization attributes ###################### To configure the capacity that a node provides or a resource requires, you can use *utilization attributes* in ``node`` and ``resource`` objects. You can name utilization attributes according to your preferences and define as many name/value pairs as your configuration needs. However, the attributes' values must be integers. .. topic:: Specifying CPU and RAM capacities of two nodes .. code-block:: xml .. topic:: Specifying CPU and RAM consumed by several resources .. code-block:: xml A node is considered eligible for a resource if it has sufficient free capacity to satisfy the resource's requirements. The nature of the required or provided capacities is completely irrelevant to Pacemaker -- it just makes sure that all capacity requirements of a resource are satisfied before placing a resource to a node. Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*. These attributes are added to a ``transient_attributes`` section for the node and are forgotten by the cluster when the node goes offline. The ``attrd_updater`` tool can be used to set these attributes. .. topic:: Transient utilization attribute for node cluster-1 .. code-block:: xml .. note:: Utilization is supported for bundles *(since 2.1.3)*, but only for bundles with an inner primitive. Any resource utilization values should be specified for the inner primitive, but any priority meta-attribute should be specified for the outer bundle. Placement Strategy ################## After you have configured the capacities your nodes provide and the capacities your resources require, you need to set the ``placement-strategy`` in the global cluster options, otherwise the capacity configurations have *no effect*. Four values are available for the ``placement-strategy``: * **default** Utilization values are not taken into account at all. - Resources are allocated according to allocation scores. 
If scores are equal, + Resources are assigned according to assignment scores. If scores are equal, resources are evenly distributed across nodes. * **utilization** Utilization values are taken into account *only* when deciding whether a node is considered eligible (i.e. whether it has sufficient free capacity to satisfy the resource's requirements). Load-balancing is still done based on the - number of resources allocated to a node. + number of resources assigned to a node. * **balanced** Utilization values are taken into account when deciding whether a node is eligible to serve a resource *and* when load-balancing, so an attempt is made to spread the resources in a way that optimizes resource performance. * **minimal** Utilization values are taken into account *only* when deciding whether a node is eligible to serve a resource. For load-balancing, an attempt is made to concentrate the resources on as few nodes as possible, thereby enabling possible power savings on the remaining nodes. Set ``placement-strategy`` with ``crm_attribute``: .. code-block:: none # crm_attribute --name placement-strategy --update balanced Now Pacemaker will ensure the load from your resources will be distributed evenly throughout the cluster, without the need for convoluted sets of colocation constraints. -Allocation Details +Assignment Details ################## -Which node is preferred to get consumed first when allocating resources? -________________________________________________________________________ +Which node is preferred to get consumed first when assigning resources? +_______________________________________________________________________ * The node with the highest node weight gets consumed first. Node weight is a score maintained by the cluster to represent node health. * If multiple nodes have the same node weight: * If ``placement-strategy`` is ``default`` or ``utilization``, - the node that has the least number of allocated resources gets consumed first. 
+ the node that has the least number of assigned resources gets consumed first. - * If their numbers of allocated resources are equal, + * If their numbers of assigned resources are equal, the first eligible node listed in the CIB gets consumed first. * If ``placement-strategy`` is ``balanced``, the node that has the most free capacity gets consumed first. * If the free capacities of the nodes are equal, - the node that has the least number of allocated resources gets consumed first. + the node that has the least number of assigned resources gets consumed first. - * If their numbers of allocated resources are equal, + * If their numbers of assigned resources are equal, the first eligible node listed in the CIB gets consumed first. * If ``placement-strategy`` is ``minimal``, the first eligible node listed in the CIB gets consumed first. Which node has more free capacity? __________________________________ If only one type of utilization attribute has been defined, free capacity is a simple numeric comparison. If multiple types of utilization attributes have been defined, then the node that is numerically highest in the most attribute types has the most free capacity. For example: * If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``, then their free capacities are equal. * If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory`` and ``storage``, then ``nodeB`` has more free capacity. Which resource is preferred to be assigned first? _________________________________________________ * The resource that has the highest ``priority`` (see :ref:`resource_options`) gets - allocated first. + assigned first. * If their priorities are equal, check whether they are already running. The - resource that has the highest score on the node where it's running gets allocated + resource that has the highest score on the node where it's running gets assigned first, to prevent resource shuffling.
* If the scores above are equal or the resources are not running, the resource that has - the highest score on the preferred node gets allocated first. + the highest score on the preferred node gets assigned first. * If the scores above are equal, the first runnable resource listed in the CIB - gets allocated first. + gets assigned first. Limitations and Workarounds ########################### The type of problem Pacemaker is dealing with here is known as the `knapsack problem `_ and falls into the `NP-complete `_ category of computer science problems -- a fancy way of saying "it takes a really long time to solve". Clearly, in an HA cluster, it's not acceptable to spend minutes, let alone hours or days, finding an optimal solution while services remain unavailable. So instead of trying to solve the problem completely, Pacemaker uses a *best effort* algorithm for determining which node should host a particular service. This means it arrives at a solution much faster than traditional linear programming algorithms, but at the price of leaving some services stopped. In the contrived example at the start of this chapter: -* ``rsc-small`` would be allocated to ``node1`` +* ``rsc-small`` would be assigned to ``node1`` -* ``rsc-medium`` would be allocated to ``node2`` +* ``rsc-medium`` would be assigned to ``node2`` * ``rsc-large`` would remain inactive Which is not ideal. There are various approaches to dealing with the limitations of Pacemaker's placement strategy: * **Ensure you have sufficient physical capacity.** It might sound obvious, but if the physical capacity of your nodes is (close to) maxed out by the cluster under normal conditions, then failover isn't going to go well. Even without the utilization feature, you'll start hitting timeouts and getting secondary failures.
* **Build some buffer into the capabilities advertised by the nodes.** Advertise slightly more resources than we physically have, on the (usually valid) assumption that a resource will not use 100% of the configured amount of CPU, memory and so forth *all* the time. This practice is sometimes called *overcommit*. * **Specify resource priorities.** If the cluster is going to sacrifice services, it should be the ones you care about (comparatively) the least. Ensure that resource priorities are properly set so that your most important resources are scheduled first.
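
When sizing nodes or choosing priorities, it can help to reason through the
free-capacity comparison described under "Which node has more free capacity?"
above. The following is a minimal Python sketch of that rule as worded in this
chapter; it is not Pacemaker's implementation. The function name
``compare_free_capacity`` and the dictionary layout are hypothetical, and
treating an attribute that a node does not define as 0 is an assumption made
only for this illustration.

```python
def compare_free_capacity(free_a, free_b):
    """Compare two nodes' free capacities, attribute type by attribute type.

    Each argument maps a utilization attribute name to the node's free
    (integer) capacity for it. Returns a positive number if node A has more
    free capacity, a negative number if node B does, and 0 if they are equal.
    """
    wins_a = 0
    wins_b = 0
    for name in set(free_a) | set(free_b):
        # Assumption for this sketch: an undefined attribute counts as 0
        a = free_a.get(name, 0)
        b = free_b.get(name, 0)
        if a > b:
            wins_a += 1
        elif b > a:
            wins_b += 1
    # The node that is numerically highest in the most attribute types wins
    return wins_a - wins_b


# nodeA has more free cpus, nodeB has more free memory: equal
tie = compare_free_capacity({"cpus": 4, "memory": 2048},
                            {"cpus": 2, "memory": 4096})

# nodeB additionally has more free storage: nodeB has more free capacity
b_wins = compare_free_capacity({"cpus": 4, "memory": 2048, "storage": 100},
                               {"cpus": 2, "memory": 4096, "storage": 200})
```

Here ``tie`` is 0 and ``b_wins`` is negative, matching the two ``nodeA`` and
``nodeB`` examples given earlier. Note that the comparison counts attribute
types rather than summing values, so a large surplus in one attribute does not
outweigh small deficits in two others.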