diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst index f886eb2568..3d1579ee76 100644 --- a/doc/sphinx/Pacemaker_Development/components.rst +++ b/doc/sphinx/Pacemaker_Development/components.rst @@ -1,514 +1,514 @@ Coding Particular Pacemaker Components -------------------------------------- The Pacemaker code can be intricate and difficult to follow. This chapter has some high-level descriptions of how individual components work. .. index:: single: controller single: pacemaker-controld Controller ########## ``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons to orchestrate actions that need to be taken in the cluster. It receives CIB change notifications from the CIB manager, passes the new CIB to the scheduler to determine whether anything needs to be done, uses the executor and fencer to execute any actions required, and sets failure counts (among other things) via the attribute manager. As might be expected, it has the most code of any of the daemons. .. index:: single: join Join sequence _____________ Most daemons track their cluster peers using Corosync's membership and :term:`CPG` only. The controller additionally requires peers to `join`, which ensures they are ready to be assigned tasks. Joining proceeds through a series of phases referred to as the `join sequence` or `join process`. A node's current join phase is tracked by the ``user_data`` member of ``pcmk__node_status_t`` (used in the peer cache). It is an ``enum controld_join_phase`` that (ideally) progresses from the DC's point of view as follows: * The node initially starts at ``controld_join_none`` * The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node proceeds to ``controld_join_welcomed``. This can happen in three ways: * The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at its controller startup, and the DC will reply to that with a join offer. 
* When the DC's peer status callback notices that the node has joined the messaging layer, it registers ``I_NODE_JOIN`` (which leads to ``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` -> ``join_make_offer()``). * After certain events (notably a new DC being elected), the DC will send all nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` -> ``do_dc_join_offer_all()``). These can overlap. The DC can send a join offer and the node can send a join announce at nearly the same time, so the node responds to the original join offer while the DC responds to the join announce with a new join offer. The situation resolves itself after looping a bit. * The node responds to join offers with a `join request` (``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and ``join_query_callback()``). When the DC receives the request, the node proceeds to ``controld_join_integrated`` (via ``do_dc_join_filter_offer()``). * As each node is integrated, the current best CIB is synced to each integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``) and the node proceeds to ``controld_join_finalized`` (via ``finalize_sync_callback()`` + ``finalize_join_for()``). * Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via ``do_cl_join_finalize_respond()``), including its current resource operation history (via ``controld_query_executor_state()``). Once the DC receives this confirmation, the node proceeds to ``controld_join_confirmed`` via ``do_dc_join_ack()``. Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks for quorum and responds appropriately. When peers are lost, their join phase is reset to none (in various places). ``crm_update_peer_join()`` updates a node's join phase. The DC increments the global ``current_join_id`` for each joining round, and rejects any (older) replies that don't match. ..
index:: single: fencer single: pacemaker-fenced Fencer ###### ``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In the broadest terms, fencing works like this: #. The initiator (an external program such as ``stonith_admin``, or the cluster itself via the controller) asks the local fencer, "Hey, could you please fence this node?" #. The local fencer asks all the fencers in the cluster (including itself), "Hey, what fencing devices do you have access to that can fence this node?" #. Each fencer in the cluster replies with a list of available devices that it knows about. #. Once the original fencer gets all the replies, it asks the most appropriate fencer peer to actually carry out the fencing. It may send out more than one such request if the target node must be fenced with multiple devices. #. The chosen fencer(s) call the appropriate fencing resource agent(s) to do the fencing, then reply to the original fencer with the result. #. The original fencer broadcasts the result to all fencers. #. Each fencer sends the result to each of its local clients (including, at some point, the initiator). A more detailed description follows. .. index:: single: libstonithd Initiating a fencing request ____________________________ A fencing request can be initiated by the cluster or externally, using the libstonithd API. * The cluster always initiates fencing via ``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the ``fence()`` API method). This occurs when a transition graph synapse contains a ``CRM_OP_FENCE`` XML operation. * The main external clients are ``stonith_admin`` and ``cts-fence-helper``. The ``DLM`` project also uses Pacemaker for fencing. Highlights of the fencing API: * ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose ``cmds`` member has methods for connect, disconnect, fence, etc. 
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request with the desired action and target node. Callers do not have to choose or even have any knowledge about particular fencing devices. Fencing queries _______________ The function calls for a fencing request go something like this: The local fencer receives the client's request via an :term:`IPC` or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML request with the target, desired action, timeout, etc., then broadcasts the operation to the cluster group (i.e. all fencer instances) and starts a timer. The query is broadcast because (1) location constraints might prevent the local node from accessing the stonith device directly, and (2) even if the local node does have direct access, another node might be preferred to carry out the fencing. Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast request via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls * ``stonith_query()``, which calls * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to add device information to an XML reply and send it. (A message is considered a reply if it contains ``T_STONITH_REPLY``, which is only set by fencer peers, not clients.) The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC or messaging layer callback, which calls: * ``stonith_command()``, which (for replies) calls * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls * ``process_remote_stonith_query()``, which allocates a new query result structure, parses device information into it, and adds it to the operation object.
It increments the number of replies received for this operation, and compares it against the expected number of replies (i.e. the number of active peers), and if this is the last expected reply, calls * ``request_peer_fencing()``, which calculates the timeout and sends ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target node has a fencing "topology" (which allows specifications such as "this node can be fenced either with device A, or devices B and C in combination"), it will choose the device(s), and send out as many requests as needed. If it chooses a device, it will choose the peer; a peer is preferred if it has "verified" access to the desired device, meaning that it has the device "running" on it and thus has a monitor operation ensuring reachability. Fencing operations __________________ Each ``STONITH_OP_FENCE`` request goes something like this: The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via :term:`IPC` or messaging layer callback, which calls: * ``stonith_command()``, which (for requests) calls * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls * ``stonith_fence()``, which calls * ``schedule_stonith_command()`` (using supplied device if ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable device obtained via ``get_capable_devices()`` with ``stonith_fence_get_devices_cb()``), which adds the operation to the device's pending operations list and triggers processing. The chosen peer fencer's mainloop is triggered and calls * ``stonith_device_dispatch()``, which calls * ``stonith_device_execute()``, which pops off the next item from the device's pending operations list. If acting as the (internally implemented) watchdog agent, it panics the node, otherwise it calls * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to call the fencing agent. 
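The queue-and-dispatch pattern described above (an operation is added to a device's pending list, and the mainloop later pops and executes it) can be modeled roughly as follows. This is a simplified Python sketch for illustration only, not the fencer's C implementation: the ``Device`` class, its ``schedule()`` and ``dispatch()`` methods, and the ``fake_agent()`` callable are all hypothetical names, and first-in-first-out ordering is an assumption of the sketch.

```python
from collections import deque


class Device:
    """Toy model of a fencing device with a pending-operations list."""

    def __init__(self, name, run_agent):
        self.name = name
        self.pending = deque()      # operations waiting to be executed
        self.run_agent = run_agent  # hypothetical stand-in for the fence agent

    def schedule(self, action, target):
        """Queue an operation (loosely analogous to scheduling a command)."""
        self.pending.append((action, target))

    def dispatch(self):
        """Pop the next pending operation, if any, and run the agent for it."""
        if not self.pending:
            return None
        action, target = self.pending.popleft()
        return self.run_agent(self.name, action, target)


def fake_agent(device, action, target):
    """Pretend to run a fence agent and report what was requested."""
    return f"{device}: {action} {target}"


dev = Device("fence_dev_1", fake_agent)
dev.schedule("reboot", "node2")
dev.schedule("off", "node3")

# Each dispatch handles exactly one queued operation; an empty queue is a no-op.
results = [dev.dispatch(), dev.dispatch(), dev.dispatch()]
```

In this model, each trigger of the mainloop corresponds to one ``dispatch()`` call, so operations queued against the same device are executed one at a time rather than concurrently.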
The chosen peer fencer's mainloop is triggered again once the fencing agent returns, and calls * ``stonith_action_async_done()``, which adds the results to an action object then calls its * done callback (``st_child_done()``), which calls ``schedule_stonith_command()`` for a new device if there are further required actions to execute or if the original action failed, then builds and sends an XML reply to the original fencer (via ``send_async_reply()``), then checks whether any pending actions are the same as the one just executed and merges them if so. Fencing replies _______________ The original fencer receives the ``STONITH_OP_FENCE`` reply via :term:`IPC` or messaging layer callback, which calls: * ``stonith_command()``, which (for replies) calls * ``handle_reply()``, which calls * ``fenced_process_fencing_reply()``, which calls either ``request_peer_fencing()`` (to retry a failed operation, or try the next device in a topology if appropriate, which issues a new ``STONITH_OP_FENCE`` request, proceeding as before) or ``finalize_op()`` (if the operation is definitively failed or successful). * ``finalize_op()`` broadcasts the result to all peers. Finally, all peers receive the broadcast result and call * ``finalize_op()``, which sends the result to all local clients. .. index:: single: fence history Fencing History _______________ The fencer keeps a running history of all fencing operations. The bulk of the relevant code is in ``fenced_history.c`` and ensures the history is synchronized across all nodes even if a node leaves and rejoins the cluster. In libstonithd, this information is represented by ``stonith_history_t`` and is queryable by the ``stonith_api_operations_t:history()`` method. ``crm_mon`` and ``stonith_admin`` use this API to display the history. ..
index:: single: scheduler single: pacemaker-schedulerd single: libcrmcommon single: libpe_status single: libpacemaker Scheduler ######### ``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker scheduler for the controller, but "the scheduler" in general refers to related library code in various files in ``libcrmcommon``, ``libpe_status``, and ``libpacemaker``. The purpose of the scheduler is to take a CIB as input and generate a transition graph (list of actions that need to be taken) as output. The controller invokes the scheduler by contacting the scheduler daemon via local :term:`IPC`. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource`` can also invoke the scheduler, but do so by calling the library functions directly. This allows them to run using a ``CIB_file`` without the cluster needing to be active. The main entry point for the scheduler code is ``lib/pacemaker/pcmk_scheduler.c:pcmk__schedule_actions()``. It sets defaults and calls a series of functions for the scheduling. Some key steps: * ``unpack_cib()`` parses most of the CIB XML into data structures, and determines the current cluster status. * ``apply_node_criteria()`` applies factors that make resources prefer certain nodes, such as shutdown locks, location constraints, and stickiness. * ``pcmk__create_internal_constraints()`` creates internal constraints, such as the implicit ordering for group members, or start actions being implicitly ordered before promote actions. * ``pcmk__handle_rsc_config_changes()`` processes resource history entries in the CIB status section. This is used to decide whether certain - actions need to be done, such as deleting orphan resources, forcing a restart + actions need to be done, such as deleting removed resources, forcing a restart when a resource definition changes, etc. * ``assign_resources()`` :term:`assigns ` resources to nodes. 
* ``schedule_resource_actions()`` schedules resource-specific actions (which might or might not end up in the final graph). * ``pcmk__apply_orderings()`` processes ordering constraints in order to modify action attributes such as optional or required. * ``pcmk__create_graph()`` creates the transition graph. Challenges __________ Working with the scheduler is difficult. Challenges include: * It is far too much code to keep more than a small portion in your head at one time. * Small changes can have large (and unexpected) effects. This is why we have a large number of regression tests (``cts/cts-scheduler``), which should be run after making code changes. * It produces an insane amount of log messages at debug and trace levels. You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to enable trace-level messages only when related to specific resources. * Different parts of the main ``pcmk_scheduler_t`` structure are finalized at different points in the scheduling process, so you have to keep in mind whether information you're using at one point of the code can possibly change later. For example, data unpacked from the CIB can safely be used anytime after ``unpack_cib()``, but actions may become optional or required anytime before ``pcmk__create_graph()``. There's no easy way to deal with this. .. index:: single: pcmk_scheduler_t The Scheduler Object ____________________ The main data object for the scheduler is ``pcmk_scheduler_t``, which contains all information needed about nodes, resources, constraints, etc., both as the raw CIB XML and parsed into more usable data structures, plus the resulting transition graph XML. The variable name is usually ``scheduler``. .. index:: single: pcmk_resource_t Resources _________ ``pcmk_resource_t`` is the data object representing cluster resources. It has a couple of public members for backward compatibility reasons, but most of the implementation is in the internal ``pcmk__resource_private_t`` type.
A resource has a variant: :term:`primitive`, group, clone, or :term:`bundle`. The private resource object has members for two sets of methods, ``pcmk__rsc_methods_t`` from ``libcrmcommon``, and ``pcmk__assignment_methods_t`` whose implementation is internal to ``libpacemaker``. The actual functions vary by variant. The resource methods have basic capabilities such as unpacking the resource XML, and determining the current or planned location of the resource. The :term:`assignment ` methods have more obscure capabilities needed for scheduling, such as processing location and ordering constraints. For example, ``pcmk__create_internal_constraints()`` simply calls the ``internal_constraints()`` method for each top-level resource in the cluster. .. index:: single: pcmk_node_t Nodes _____ :term:`Assignment ` of resources to nodes is done by choosing the node with the highest :term:`score` for a given resource. The scheduler does a bunch of processing to generate the scores, then the actual assignment is straightforward. The scheduler node implementation is a little confusing. ``pcmk_node_t`` (``struct pcmk__scored_node``) is the primary object used. It contains two sub-structs, ``pcmk__node_private_t *priv`` (which is internal) and ``struct pcmk__node_details *details`` (which is public for backward compatibility reasons), that contain all node information that is independent of resource assignment (the node name, etc.). It contains one other (internal) sub-struct, ``struct pcmk__node_assignment *assign``, which contains information particular to a specific resource being assigned. Node lists are frequently used. For example, ``pcmk_scheduler_t`` has a ``nodes`` member which is a list of all nodes in the cluster, and the internal resource object has an ``active_nodes`` member which is a list of all nodes on which the resource is (or might be) active. Only the scheduler's ``nodes`` list has the full, original node instances. 
All other node lists have shallow copies created by ``pe__copy_node()``, which share ``details`` and ``priv`` from the main list (but can differ in their ``assign`` member). .. index:: single: pcmk_action_t single: pcmk__action_flags Actions _______ ``pcmk_action_t`` is the data object representing actions that might need to be taken. These could be resource actions, cluster-wide actions such as fencing a node, or "pseudo-actions" which are abstractions used as convenient points for ordering other actions against. Its (internal) implementation has a ``flags`` member which is a bitmask of ``enum pcmk__action_flags``. The most important of these are ``pcmk__action_runnable`` (if not set, the action is "blocked" and cannot be added to the transition graph) and ``pcmk__action_optional`` (actions with this set will not be added to the transition graph; actions often start out as optional, and may become required later). .. index:: single: pcmk__colocation_t Colocations ___________ ``pcmk__colocation_t`` is the data object representing colocations. Colocation constraints come into play in these parts of the scheduler code: * When sorting resources for :term:`assignment `, so resources with highest node :term:`score` are assigned first (see ``cmp_resources()``) * When updating node scores for resource assignment or promotion priority * When assigning resources, so any resources to be colocated with can be assigned first, and so colocations affect where the resource is assigned * When choosing roles for promotable clone instances, so colocations involving a specific role can affect which instances are promoted The resource assignment functions have several methods related to colocations: * ``apply_coloc_score():`` This applies a colocation's score to either the dependent's allowed node scores (if called while resources are being assigned) or the dependent's priority (if called while choosing promotable instance roles).
It can behave differently depending on whether it is being called as the :term:`primary's ` method or as the :term:`dependent's ` method. * ``add_colocated_node_scores():`` This updates a table of nodes for a given colocation attribute and score. It goes through colocations involving a given resource, and updates the scores of the nodes in the table with the best scores of nodes that match up according to the colocation criteria. * ``colocated_resources():`` This generates a list of all resources involved in mandatory colocations (directly or indirectly via colocation chains) with a given resource. .. index:: single: pcmk__action_relation_t single: action; relation Action Relations ________________ Ordering constraints are simple in concept, but they are one of the most important, powerful, and difficult to follow aspects of the scheduler code. ``pcmk__action_relation_t`` is the data object representing an ordering, better thought of as a relationship between two actions, since the relation can be more complex than just "this one runs after that one". For a relation "A then B", the code generally refers to A as "first" or "before", and B as "then" or "after". Much of the power comes from ``enum pcmk__action_relation_flags``, which are flags that determine how a relation behaves. There are many obscure flags with big effects. A few examples: * ``pcmk__ar_none`` means the relation is disabled and will be ignored. The value is 0, meaning no flags set, so it must be compared with equality rather than ``pcmk_is_set()``. * ``pcmk__ar_ordered`` without any other flags set means the relation does not make either action required, so it applies only if they both become required for other reasons. * ``pcmk__ar_then_implies_first`` means that if action B becomes required for any reason, then action A will become required as well. Adding a New Scheduler Regression Test ______________________________________ #. Choose a test name. #. 
Copy the uncompressed input CIB to cts/scheduler/xml/TESTNAME.xml. It's helpful to add an XML comment at the top describing the essential features of the test (which configuration and status scenarios are being tested). #. Edit ``cts/cts-scheduler.in`` and add the test name and description to the ``TESTS`` array. #. Run ``cts/cts-scheduler --update --run TESTNAME`` to generate the expected transition graph, scores, etc. Look over the generated files to make sure they are as expected. #. Commit your changes. diff --git a/doc/sphinx/Pacemaker_Explained/cluster-options.rst b/doc/sphinx/Pacemaker_Explained/cluster-options.rst index 22329142e5..83f8a05a1c 100644 --- a/doc/sphinx/Pacemaker_Explained/cluster-options.rst +++ b/doc/sphinx/Pacemaker_Explained/cluster-options.rst @@ -1,936 +1,936 @@ Cluster-Wide Configuration -------------------------- .. index:: pair: XML element; cib pair: XML element; configuration Configuration Layout #################### The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this: .. topic:: An empty configuration .. code-block:: xml The empty configuration above contains the major sections that make up a CIB: * ``cib``: The entire CIB is enclosed with a ``cib`` element. Certain fundamental settings are defined as attributes of this element. * ``configuration``: This section -- the primary focus of this document -- contains traditional configuration information such as what resources the cluster serves and the relationships among them. * ``crm_config``: cluster-wide configuration options * ``nodes``: the machines that host the cluster * ``resources``: the services run by the cluster * ``constraints``: indications of how resources should be placed * ``status``: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. 
The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way. In this document, configuration settings will be described as properties or options based on how they are defined in the CIB: * Properties are XML attributes of an XML element. * Options are name-value pairs expressed as ``nvpair`` child elements of an XML element. Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak. Options can appear within four types of enclosing elements: * ``cluster_property_set`` * ``instance_attributes`` * ``meta_attributes`` * ``utilization`` We will refer to a set of options and its enclosing element as a *block*. .. list-table:: **Properties of an Option Block's Enclosing Element** :class: longtable :widths: 15 15 15 55 :header-rows: 1 * - Name - Type - Default - Description * - .. _option_block_id: .. index:: pair: id; cluster_property_set pair: id; instance_attributes pair: id; meta_attributes pair: id; utilization single: attribute; id (cluster_property_set) single: attribute; id (instance_attributes) single: attribute; id (meta_attributes) single: attribute; id (utilization) id - :ref:`id ` - - A unique name for the block (required) * - .. _option_block_score: .. index:: pair: score; cluster_property_set pair: score; instance_attributes pair: score; meta_attributes pair: score; utilization single: attribute; score (cluster_property_set) single: attribute; score (instance_attributes) single: attribute; score (meta_attributes) single: attribute; score (utilization) score - :ref:`score ` - 0 - Priority with which to process the block Each block may optionally contain a :ref:`rule `. .. 
_option_precedence: Option Precedence ################# This subsection describes the precedence of options within a set of blocks and within a single block. Options are processed as follows: * All option blocks of a given type are processed in order of their ``score`` attribute, from highest to lowest. For ``cluster_property_set``, if there is a block whose enclosing element has ``id="cib-bootstrap-options"``, then that block is always processed first regardless of score. * If a block contains a rule that evaluates to false, that block is skipped. * Within a block, options are processed in order from first to last. * The first value found for a given option is applied, and the rest are ignored. Note that this means it is pointless to configure the same option twice in a single block, because occurrences after the first one would be ignored. For example, in the following configuration snippet, the ``no-quorum-policy`` value ``demote`` is applied. ``property-set2`` has a higher score than ``property-set1``, so it's processed first. There are no rules in this snippet, so both sets are processed. Within ``property-set2``, the value ``demote`` appears first, so the later value ``freeze`` is ignored. We've already found a value for ``no-quorum-policy`` before we begin processing ``property-set1``, so its value ``stop`` is ignored. .. code-block:: xml CIB Properties ############## Certain settings are defined by CIB properties (that is, attributes of the ``cib`` tag) rather than with the rest of the cluster configuration in the ``configuration`` section. The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location. .. list-table:: **CIB Properties** :class: longtable :widths: 20 15 10 55 :header-rows: 1 * - Name - Type - Default - Description * - .. _admin_epoch: .. 
index:: pair: admin_epoch; cib admin_epoch - :ref:`nonnegative integer ` - 0 - When a node joins the cluster, the cluster asks the node with the highest (``admin_epoch``, ``epoch``, ``num_updates``) tuple to replace the configuration on all the nodes -- which makes setting them correctly very important. ``admin_epoch`` is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete. * - .. _epoch: .. index:: pair: epoch; cib epoch - :ref:`nonnegative integer ` - 0 - The cluster increments this every time the CIB's configuration section is updated. * - .. _num_updates: .. index:: pair: num_updates; cib num_updates - :ref:`nonnegative integer ` - 0 - The cluster increments this every time the CIB's configuration or status sections are updated, and resets it to 0 when epoch changes. * - .. _validate_with: .. index:: pair: validate-with; cib validate-with - :ref:`enumeration ` - - Determines the type of XML validation that will be done on the configuration. Allowed values are ``none`` (in which case the cluster will not require that updates conform to expected syntax) and the base names of schema files installed on the local machine (for example, "pacemaker-3.9") * - .. _remote_tls_port: .. index:: pair: remote-tls-port; cib remote-tls-port - :ref:`port ` - - If set, the CIB manager will listen for anonymously encrypted remote connections on this port, to allow CIB administration from hosts not in the cluster. No key is used, so this should be used only on a protected network where man-in-the-middle attacks can be avoided. * - .. _remote_clear_port: .. index:: pair: remote-clear-port; cib remote-clear-port - :ref:`port ` - - If set to a TCP port number, the CIB manager will listen for remote connections on this port, to allow for CIB administration from hosts not in the cluster. No encryption is used, so this should be used only on a protected network. * - .. _cib_last_written: .. 
index:: pair: cib-last-written; cib cib-last-written - :ref:`date/time ` - - Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only. * - .. _have_quorum: .. index:: pair: have-quorum; cib have-quorum - :ref:`boolean ` - - Indicates whether the cluster has quorum. If false, the cluster's response is determined by ``no-quorum-policy`` (see below). Maintained by the cluster. * - .. _dc_uuid: .. index:: pair: dc-uuid; cib dc-uuid - :ref:`text ` - - Node ID of the cluster's current designated controller (DC). Used and maintained by the cluster. * - .. _execution_date: .. index:: pair: execution-date; cib execution-date - :ref:`epoch time ` - - Time to use when evaluating rules. .. _cluster_options: Cluster Options ############### Cluster options, as you might expect, control how the cluster behaves when confronted with various situations. They are grouped into sets within the ``crm_config`` section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on :ref:`rules` where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once. You can obtain an up-to-date list of cluster options, including their default values, by running the ``man pacemaker-schedulerd`` and ``man pacemaker-controld`` commands. .. list-table:: **Cluster Options** :class: longtable :widths: 25 13 12 50 :header-rows: 1 * - Name - Type - Default - Description * - .. _cluster_name: .. index:: pair: cluster option; cluster-name cluster-name - :ref:`text ` - - An (optional) name for the cluster as a whole. This is mostly for users' convenience for use as desired in administration, but can be used in the Pacemaker configuration in :ref:`rules` (as the ``#cluster-name`` :ref:`node attribute `). 
It may also be used by higher-level tools when displaying cluster information, and by certain resource agents (for example, the ``ocf:heartbeat:GFS2`` agent stores the cluster name in filesystem meta-data). * - .. _dc_version: .. index:: pair: cluster option; dc-version dc-version - :ref:`version ` - *detected* - Version of Pacemaker on the cluster's designated controller (DC). Maintained by the cluster, and intended for diagnostic purposes. * - .. _cluster_infrastructure: .. index:: pair: cluster option; cluster-infrastructure cluster-infrastructure - :ref:`text ` - *detected* - The messaging layer with which Pacemaker is currently running. Maintained by the cluster, and intended for informational and diagnostic purposes. * - .. _no_quorum_policy: .. index:: pair: cluster option; no-quorum-policy no-quorum-policy - :ref:`enumeration ` - stop - What to do when the cluster does not have quorum. Allowed values: * ``ignore:`` continue all resource management * ``freeze:`` continue resource management, but don't recover resources from nodes not in the affected partition * ``stop:`` stop all resources in the affected cluster partition * ``demote:`` demote promotable resources and stop all other resources in the affected cluster partition *(since 2.0.5)* * ``fence:`` fence all nodes in the affected cluster partition *(since 2.1.9)* * ``suicide:`` same as ``fence`` *(deprecated since 2.1.9)* * - .. _batch_limit: .. index:: pair: cluster option; batch-limit batch-limit - :ref:`integer ` - 0 - The maximum number of actions that the cluster may execute in parallel across all nodes. The ideal value will depend on the speed and load of your network and cluster nodes. If zero, the cluster will impose a dynamically calculated limit only when any node has high load. If -1, the cluster will not impose any limit. * - .. _migration_limit: .. 
index:: pair: cluster option; migration-limit migration-limit - :ref:`integer ` - -1 - The number of :ref:`live migration ` actions that the cluster is allowed to execute in parallel on a node. A value of -1 means unlimited. * - .. _load_threshold: .. index:: pair: cluster option; load-threshold load-threshold - :ref:`percentage ` - 80% - Maximum amount of system load that should be used by cluster nodes. The cluster will slow down its recovery process when the amount of system resources used (currently CPU) approaches this limit. * - .. _node_action_limit: .. index:: pair: cluster option; node-action-limit node-action-limit - :ref:`integer ` - 0 - Maximum number of jobs that can be scheduled per node. If nonpositive or invalid, double the number of cores is used as the maximum number of jobs per node. :ref:`PCMK_node_action_limit ` overrides this option on a per-node basis. * - .. _symmetric_cluster: .. index:: pair: cluster option; symmetric-cluster symmetric-cluster - :ref:`boolean ` - true - If true, resources can run on any node by default. If false, a resource is allowed to run on a node only if a :ref:`location constraint ` enables it. * - .. _stop_all_resources: .. index:: pair: cluster option; stop-all-resources stop-all-resources - :ref:`boolean ` - false - Whether all resources should be disallowed from running (can be useful during maintenance or troubleshooting) - * - .. _stop_orphan_resources: + * - .. _stop_removed_resources: .. index:: - pair: cluster option; stop-orphan-resources + pair: cluster option; stop-removed-resources - stop-orphan-resources + stop-removed-resources - :ref:`boolean ` - true - Whether resources that have been deleted from the configuration should be stopped. This value takes precedence over :ref:`is-managed ` (that is, even unmanaged resources will - be stopped when orphaned if this value is ``true``). - * - .. _stop_orphan_actions: + be stopped when removed if this value is ``true``). + * - .. _stop_removed_actions: .. 
index:: - pair: cluster option; stop-orphan-actions + pair: cluster option; stop-removed-actions - stop-orphan-actions + stop-removed-actions - :ref:`boolean ` - true - Whether recurring :ref:`operations ` that have been deleted from the configuration should be cancelled * - .. _start_failure_is_fatal: .. index:: pair: cluster option; start-failure-is-fatal start-failure-is-fatal - :ref:`boolean ` - true - Whether a failure to start a resource on a particular node prevents further start attempts on that node. If ``false``, the cluster will decide whether the node is still eligible based on the resource's current failure count and ``migration-threshold``. * - .. _enable_startup_probes: .. index:: pair: cluster option; enable-startup-probes enable-startup-probes - :ref:`boolean ` - true - Whether the cluster should check the pre-existing state of resources when the cluster starts * - .. _maintenance_mode: .. index:: pair: cluster option; maintenance-mode maintenance-mode - :ref:`boolean ` - false - If true, the cluster will not start or stop any resource in the cluster, and any recurring operations (except those specifying ``role`` as ``Stopped``) will be paused. If true, this overrides the :ref:`maintenance ` node attribute, :ref:`is-managed ` and :ref:`maintenance ` resource meta-attributes, and :ref:`enabled ` operation meta-attribute. * - .. _stonith_enabled: .. index:: pair: cluster option; stonith-enabled stonith-enabled - :ref:`boolean ` - true - Whether the cluster is allowed to fence nodes (for example, failed nodes and nodes with resources that can't be stopped). If true, at least one fence device must be configured before resources are allowed to run. If false, unresponsive nodes are immediately assumed to be running no resources, and resource recovery on online nodes starts without any further protection (which can mean *data loss* if the unresponsive node still accesses shared storage, for example). See also the :ref:`requires ` resource meta-attribute.
This option applies only to fencing scheduled by the cluster, not to requests initiated externally (such as with the ``stonith_admin`` command-line tool). * - .. _stonith_action: .. index:: pair: cluster option; stonith-action stonith-action - :ref:`enumeration ` - reboot - Action the cluster should send to the fence agent when a node must be fenced. Allowed values are ``reboot`` and ``off``. * - .. _stonith_timeout: .. index:: pair: cluster option; stonith-timeout stonith-timeout - :ref:`duration ` - 60s - How long to wait for ``on``, ``off``, and ``reboot`` fence actions to complete by default. * - .. _stonith_max_attempts: .. index:: pair: cluster option; stonith-max-attempts stonith-max-attempts - :ref:`score ` - 10 - How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it. Any value below 1 will be ignored, and the default will be used instead. * - .. _have_watchdog: .. index:: pair: cluster option; have-watchdog have-watchdog - :ref:`boolean ` - *detected* - Whether watchdog integration is enabled. This is set automatically by the cluster according to whether SBD is detected to be in use. User-configured values are ignored. The value ``true`` is meaningful if diskless SBD is used and :ref:`stonith-watchdog-timeout ` is nonzero. In that case, if fencing is required, watchdog-based self-fencing will be performed via SBD without requiring an explicitly configured fencing resource. * - .. _stonith_watchdog_timeout: .. index:: pair: cluster option; stonith-watchdog-timeout stonith-watchdog-timeout - :ref:`timeout ` - 0 - If nonzero, and the cluster detects ``have-watchdog`` as ``true``, then watchdog-based self-fencing will be performed via SBD when fencing is required. If this is set to a positive value, lost nodes are assumed to achieve self-fencing within this much time.
This does not require a fencing resource to be explicitly configured, though a fence_watchdog resource can be configured, to limit use to specific nodes. If this is set to 0 (the default), the cluster will never assume watchdog-based self-fencing. If this is set to a negative value, the cluster will use twice the local value of the ``SBD_WATCHDOG_TIMEOUT`` environment variable if that is positive, or otherwise treat this as 0. **Warning:** When used, this timeout must be larger than ``SBD_WATCHDOG_TIMEOUT`` on all nodes that use watchdog-based SBD, and Pacemaker will refuse to start on any of those nodes where this is not true for the local value or SBD is not active. When this is set to a negative value, ``SBD_WATCHDOG_TIMEOUT`` must be set to the same value on all nodes that use SBD, otherwise data corruption or loss could occur. * - .. _concurrent-fencing: .. index:: pair: cluster option; concurrent-fencing concurrent-fencing - :ref:`boolean ` - false - Whether the cluster is allowed to initiate multiple fence actions concurrently. Fence actions initiated externally, such as via the ``stonith_admin`` tool or an application such as DLM, or by the fencer itself such as recurring device monitors and ``status`` and ``list`` commands, are not limited by this option. * - .. _fence_reaction: .. index:: pair: cluster option; fence-reaction fence-reaction - :ref:`enumeration ` - stop - How should a cluster node react if notified of its own fencing? A cluster node may receive notification of a "succeeded" fencing that targeted it if fencing is misconfigured, or if fabric fencing is in use that doesn't cut cluster communication. Allowed values are ``stop`` to attempt to immediately stop Pacemaker and stay stopped, or ``panic`` to attempt to immediately reboot the local node, falling back to stop on failure. The default is likely to be changed to ``panic`` in a future release. *(since 2.0.3)* * - .. _priority_fencing_delay: .. 
index:: pair: cluster option; priority-fencing-delay priority-fencing-delay - :ref:`duration ` - 0 - Apply this delay to any fencing that targets the lost nodes with the highest total resource priority when our cluster partition does not have a majority of the nodes, so that the more significant nodes are more likely to win any fencing match (especially meaningful in a split-brain of a 2-node cluster). A promoted resource instance takes the resource's priority plus 1 if the resource's priority is not 0. Any static or random delays introduced by ``pcmk_delay_base`` and ``pcmk_delay_max`` configured for the corresponding fencing resources will be added to this delay. This delay should be significantly greater than (safely twice) the maximum delay from those parameters. *(since 2.0.4)* * - .. _node_pending_timeout: .. index:: pair: cluster option; node-pending-timeout node-pending-timeout - :ref:`duration ` - 0 - Fence nodes that do not join the controller process group within this much time after joining the cluster, to allow the cluster to continue managing resources. A value of 0 means never fence pending nodes. For example, a value of ``2h`` means pending nodes are fenced after 2 hours. *(since 2.1.7)* * - .. _cluster_delay: .. index:: pair: cluster option; cluster-delay cluster-delay - :ref:`duration ` - 60s - If the DC requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node within this time (beyond the action's own timeout). The ideal value will depend on the speed and load of your network and cluster nodes. * - .. _dc_deadtime: .. index:: pair: cluster option; dc-deadtime dc-deadtime - :ref:`duration ` - 20s - How long to wait for a response from other nodes when electing a DC. The ideal value will depend on the speed and load of your network and cluster nodes. * - .. _cluster_ipc_limit: ..
index:: pair: cluster option; cluster-ipc-limit cluster-ipc-limit - :ref:`nonnegative integer ` - 500 - The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see "Evicting client" log messages for cluster daemon process IDs. * - .. _pe_error_series_max: .. index:: pair: cluster option; pe-error-series-max pe-error-series-max - :ref:`integer ` - -1 - The number of scheduler inputs resulting in errors to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. * - .. _pe_warn_series_max: .. index:: pair: cluster option; pe-warn-series-max pe-warn-series-max - :ref:`integer ` - 5000 - The number of scheduler inputs resulting in warnings to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. * - .. _pe_input_series_max: .. index:: pair: cluster option; pe-input-series-max pe-input-series-max - :ref:`integer ` - 4000 - The number of "normal" scheduler inputs to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none. * - .. _enable_acl: .. index:: pair: cluster option; enable-acl enable-acl - :ref:`boolean ` - false - Whether :ref:`access control lists ` should be used to authorize CIB modifications * - .. _placement_strategy: .. index:: pair: cluster option; placement-strategy placement-strategy - :ref:`enumeration ` - default - How the cluster should assign resources to nodes (see :ref:`utilization`). Allowed values are ``default``, ``utilization``, ``balanced``, and ``minimal``. * - .. _node_health_strategy: .. 
index:: pair: cluster option; node-health-strategy node-health-strategy - :ref:`enumeration ` - none - How the cluster should react to :ref:`node health ` attributes. Allowed values are ``none``, ``migrate-on-red``, ``only-green``, ``progressive``, and ``custom``. * - .. _node_health_base: .. index:: pair: cluster option; node-health-base node-health-base - :ref:`score ` - 0 - The base health score assigned to a node. Only used when ``node-health-strategy`` is ``progressive``. * - .. _node_health_green: .. index:: pair: cluster option; node-health-green node-health-green - :ref:`score ` - 0 - The score to use for a node health attribute whose value is ``green``. Only used when ``node-health-strategy`` is ``progressive`` or ``custom``. * - .. _node_health_yellow: .. index:: pair: cluster option; node-health-yellow node-health-yellow - :ref:`score ` - 0 - The score to use for a node health attribute whose value is ``yellow``. Only used when ``node-health-strategy`` is ``progressive`` or ``custom``. * - .. _node_health_red: .. index:: pair: cluster option; node-health-red node-health-red - :ref:`score ` - -INFINITY - The score to use for a node health attribute whose value is ``red``. Only used when ``node-health-strategy`` is ``progressive`` or ``custom``. * - .. _cluster_recheck_interval: .. index:: pair: cluster option; cluster-recheck-interval cluster-recheck-interval - :ref:`duration ` - 15min - Pacemaker is primarily event-driven, and looks ahead to know when to recheck the cluster for failure-timeout settings and most time-based rules *(since 2.0.3)*. However, it will also recheck the cluster after this amount of inactivity. This has three main effects: * :ref:`Rules ` using ``date_spec`` are guaranteed to be checked only this often. * If :ref:`fencing ` fails enough to reach :ref:`stonith-max-attempts `, attempts will begin again after at most this time. * It serves as a fail-safe in case of certain scheduler bugs. 
If the scheduler incorrectly determines only some of the actions needed to react to a particular event, it will often correctly determine the rest after at most this time. A value of 0 disables this polling. * - .. _shutdown_lock: .. index:: pair: cluster option; shutdown-lock shutdown-lock - :ref:`boolean ` - false - The default of false allows active resources to be recovered elsewhere when their node is cleanly shut down, which is what the vast majority of users will want. However, some users prefer to make resources highly available only for failures, with no recovery for clean shutdowns. If this option is true, resources active on a node when it is cleanly shut down are kept "locked" to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most ``shutdown-lock-limit``, if set). Stonith resources and Pacemaker Remote connections are never locked. Clone and bundle instances and the promoted role of promotable clones are currently never locked, though support could be added in a future release. Locks may be manually cleared using the ``--refresh`` option of ``crm_resource`` (both the resource and node must be specified; this works with remote nodes if their connection resource's ``target-role`` is set to ``Stopped``, but not if Pacemaker Remote is stopped on the remote node without disabling the connection resource). *(since 2.0.4)* * - .. _shutdown_lock_limit: .. index:: pair: cluster option; shutdown-lock-limit shutdown-lock-limit - :ref:`duration ` - 0 - If ``shutdown-lock`` is true, and this is set to a nonzero time duration, locked resources will be allowed to start after this much time has passed since the node shutdown was initiated, even if the node has not rejoined. (This works with remote nodes only if their connection resource's ``target-role`` is set to ``Stopped``.) *(since 2.0.4)* * - .. _startup_fencing: .. 
index:: pair: cluster option; startup-fencing startup-fencing - :ref:`boolean ` - true - *Advanced Use Only:* Whether the cluster should fence unseen nodes at start-up. Setting this to false is unsafe, because the unseen nodes could be active and running resources but unreachable. ``dc-deadtime`` acts as a grace period before this fencing, since a DC must be elected to schedule fencing. * - .. _election_timeout: .. index:: pair: cluster option; election-timeout election-timeout - :ref:`duration ` - 2min - *Advanced Use Only:* If a winner is not declared within this much time of starting an election, the node that initiated the election will declare itself the winner. * - .. _shutdown_escalation: .. index:: pair: cluster option; shutdown-escalation shutdown-escalation - :ref:`duration ` - 20min - *Advanced Use Only:* The controller will exit immediately if a shutdown does not complete within this much time. * - .. _join_integration_timeout: .. index:: pair: cluster option; join-integration-timeout join-integration-timeout - :ref:`duration ` - 3min - *Advanced Use Only:* If you need to adjust this value, it probably indicates the presence of a bug. * - .. _join_finalization_timeout: .. index:: pair: cluster option; join-finalization-timeout join-finalization-timeout - :ref:`duration ` - 30min - *Advanced Use Only:* If you need to adjust this value, it probably indicates the presence of a bug. * - .. _transition_delay: .. index:: pair: cluster option; transition-delay transition-delay - :ref:`duration ` - 0s - *Advanced Use Only:* Delay cluster recovery for the configured interval to allow for additional or related events to occur. This can be useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions.
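
As a rough illustration of the simple case described at the start of this
section (one set of options within ``crm_config``, each option present at
most once), a configuration might look something like the sketch below. The
``id`` values are illustrative assumptions and can be any unique identifiers.
The option names and values are taken from the table above. Options that are
not listed keep their defaults.

.. code-block:: xml

   <crm_config>
     <!-- A single set of cluster options; the id values are illustrative -->
     <cluster_property_set id="cib-bootstrap-options">
       <!-- Keep managing resources without quorum, but do not recover
            resources from nodes outside this partition -->
       <nvpair id="opt-no-quorum-policy"
               name="no-quorum-policy" value="freeze"/>
       <!-- Allow the cluster to fence nodes (the default) -->
       <nvpair id="opt-stonith-enabled"
               name="stonith-enabled" value="true"/>
       <!-- Recheck the cluster after 15 minutes of inactivity (the default) -->
       <nvpair id="opt-cluster-recheck-interval"
               name="cluster-recheck-interval" value="15min"/>
     </cluster_property_set>
   </crm_config>

Each ``nvpair`` sets one option by ``name``. Options that the table describes
as maintained or detected by the cluster, such as ``dc-version`` and
``have-watchdog``, are not meant to be set by hand.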