diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst
index 5086fa85d5..ce6b36ba52 100644
--- a/doc/sphinx/Pacemaker_Development/components.rst
+++ b/doc/sphinx/Pacemaker_Development/components.rst
@@ -1,489 +1,491 @@
Coding Particular Pacemaker Components
--------------------------------------

The Pacemaker code can be intricate and difficult to follow. This chapter has
some high-level descriptions of how individual components work.

.. index::
   single: controller
   single: pacemaker-controld

Controller
##########

``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons
to orchestrate actions that need to be taken in the cluster. It receives CIB
change notifications from the CIB manager, passes the new CIB to the scheduler
to determine whether anything needs to be done, uses the executor and fencer to
execute any actions required, and sets failure counts (among other things) via
the attribute manager.

As might be expected, it has the most code of any of the daemons.

.. index::
   single: join

Join sequence
_____________

-Most daemons track their cluster peers using Corosync's membership and CPG
-only. The controller additionally requires peers to `join`, which ensures they
-are ready to be assigned tasks. Joining proceeds through a series of phases
-referred to as the `join sequence` or `join process`.
+Most daemons track their cluster peers using Corosync's membership and
+:term:`CPG` only. The controller additionally requires peers to `join`, which
+ensures they are ready to be assigned tasks. Joining proceeds through a series
+of phases referred to as the `join sequence` or `join process`.

A node's current join phase is tracked by the ``join`` member of ``crm_node_t``
(used in the peer cache). It is an ``enum crm_join_phase`` that (ideally)
progresses from the DC's point of view as follows:

* The node initially starts at ``crm_join_none``

* The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node
  proceeds to ``crm_join_welcomed``. This can happen in three ways:

  * The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at
    its controller startup, and the DC will reply to that with a join offer.

  * When the DC's peer status callback notices that the node has joined the
    messaging layer, it registers ``I_NODE_JOIN`` (which leads to
    ``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` ->
    ``join_make_offer()``).

  * After certain events (notably a new DC being elected), the DC will send all
    nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` ->
    ``do_dc_join_offer_all()``).

  These can overlap. The DC can send a join offer and the node can send a join
  announce at nearly the same time, so the node responds to the original join
  offer while the DC responds to the join announce with a new join offer. The
  situation resolves itself after looping a bit.

* The node responds to join offers with a `join request`
  (``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and
  ``join_query_callback()``). When the DC receives the request, the node
  proceeds to ``crm_join_integrated`` (via ``do_dc_join_filter_offer()``).

* As each node is integrated, the current best CIB is sync'ed to each
  integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB
  sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``)
  and the node proceeds to ``crm_join_finalized`` (via
  ``finalize_sync_callback()`` + ``finalize_join_for()``).

* Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via
  ``do_cl_join_finalize_respond()``), including its current resource operation
  history (via ``controld_query_executor_state()``). Once the DC receives this
  confirmation, the node proceeds to ``crm_join_confirmed`` via
  ``do_dc_join_ack()``.

Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks
for quorum and responds appropriately.

When peers are lost, their join phase is reset to none (in various places).

``crm_update_peer_join()`` updates a node's join phase.

The DC increments the global ``current_join_id`` for each joining round, and
rejects any (older) replies that don't match.
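For quick reference, the phases named above suggest an enumeration along these
lines. This is an illustrative sketch only; the authoritative definition of
``enum crm_join_phase`` lives in the Pacemaker headers and may include
additional values (such as a negative rejection phase):

.. code-block:: c

   /* Illustrative sketch of the join phases described above; the real
    * enum crm_join_phase in the Pacemaker headers is authoritative. */
   enum crm_join_phase {
       crm_join_none       = 0,    /* not yet joined */
       crm_join_welcomed   = 1,    /* join offer received */
       crm_join_integrated = 2,    /* join request accepted by the DC */
       crm_join_finalized  = 3,    /* CIB synced, join request acked */
       crm_join_confirmed  = 4,    /* operation history confirmed */
   };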
.. index::
   single: fencer
   single: pacemaker-fenced

Fencer
######

``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
the broadest terms, fencing works like this:

#. The initiator (an external program such as ``stonith_admin``, or the cluster
   itself via the controller) asks the local fencer, "Hey, could you please
   fence this node?"

#. The local fencer asks all the fencers in the cluster (including itself),
   "Hey, what fencing devices do you have access to that can fence this node?"

#. Each fencer in the cluster replies with a list of available devices that it
   knows about.

#. Once the original fencer gets all the replies, it asks the most appropriate
   fencer peer to actually carry out the fencing. It may send out more than one
   such request if the target node must be fenced with multiple devices.

#. The chosen fencer(s) call the appropriate fencing resource agent(s) to do
   the fencing, then reply to the original fencer with the result.

#. The original fencer broadcasts the result to all fencers.

#. Each fencer sends the result to each of its local clients (including, at
   some point, the initiator).

A more detailed description follows.

.. index::
   single: libstonithd

Initiating a fencing request
____________________________

A fencing request can be initiated by the cluster or externally, using the
libstonithd API.

* The cluster always initiates fencing via
  ``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the
  ``fence()`` API method). This occurs when a transition graph synapse contains
  a ``CRM_OP_FENCE`` XML operation.

* The main external clients are ``stonith_admin`` and ``cts-fence-helper``. The
  ``DLM`` project also uses Pacemaker for fencing.

Highlights of the fencing API (a usage sketch follows this list):

* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
  ``cmds`` member has methods for connect, disconnect, fence, etc.

* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request
  with the desired action and target node. Callers do not have to choose or
  even have any knowledge about particular fencing devices.
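As a hedged sketch, a minimal external client might look like the following.
The exact signatures (particularly the ``fence()`` argument list) are
approximated here; the installed ``crm/stonith-ng.h`` header is authoritative:

.. code-block:: c

   /* Minimal libstonithd client sketch; signatures are approximate. */
   #include <crm/stonith-ng.h>

   int
   fence_example(void)
   {
       stonith_t *st = stonith_api_new();
       int rc;

       if (st == NULL) {
           return -1;
       }
       rc = st->cmds->connect(st, "example-client", NULL);
       if (rc == pcmk_ok) {
           /* Ask for node1 to be rebooted, with a 120-second timeout. The
            * fencer chooses the device(s); the caller never has to. */
           rc = st->cmds->fence(st, st_opt_sync_call, "node1", "reboot",
                                120, 0);
           st->cmds->disconnect(st);
       }
       stonith_api_delete(st);
       return rc;
   }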
Fencing queries
_______________

The function calls for a fencing request go something like this:

-The local fencer receives the client's request via an IPC or messaging
+The local fencer receives the client's request via an :term:`IPC` or messaging
layer callback, which calls

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls

    * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY``
      XML request with the target, desired action, timeout, etc., then
      broadcasts the operation to the cluster group (i.e. all fencer
      instances) and starts a timer.

The query is broadcast because (1) location constraints might prevent the local
node from accessing the stonith device directly, and (2) even if the local node
does have direct access, another node might be preferred to carry out the
fencing.

Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
request via IPC or messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls

    * ``stonith_query()``, which calls

      * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to
        add device information to an XML reply and send it. (A message is
        considered a reply if it contains ``T_STONITH_REPLY``, which is only
        set by fencer peers, not clients.)

The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls

    * ``process_remote_stonith_query()``, which allocates a new query result
      structure, parses device information into it, and adds it to the
      operation object. It increments the number of replies received for this
      operation, and compares it against the expected number of replies (i.e.
      the number of active peers), and if this is the last expected reply,
      calls

      * ``request_peer_fencing()``, which calculates the timeout and sends
        ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target
        node has a fencing "topology" (which allows specifications such as
        "this node can be fenced either with device A, or devices B and C in
        combination"), it will choose the device(s), and send out as many
        requests as needed. If it chooses a device, it will choose the peer; a
        peer is preferred if it has "verified" access to the desired device,
        meaning that it has the device "running" on it and thus has a monitor
        operation ensuring reachability.

Fencing operations
__________________

Each ``STONITH_OP_FENCE`` request goes something like this:

-The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
-messaging layer callback, which calls:
+The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via
+:term:`IPC` or messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls

    * ``stonith_fence()``, which calls

      * ``schedule_stonith_command()`` (using supplied device if
        ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
        device obtained via ``get_capable_devices()`` with
        ``stonith_fence_get_devices_cb()``), which adds the operation to the
        device's pending operations list and triggers processing.

The chosen peer fencer's mainloop is triggered and calls

* ``stonith_device_dispatch()``, which calls

  * ``stonith_device_execute()``, which pops off the next item from the
    device's pending operations list. If acting as the (internally implemented)
    watchdog agent, it panics the node, otherwise it calls

    * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
      call the fencing agent.
The chosen peer fencer's mainloop is triggered again once the fencing agent
returns, and calls

* ``stonith_action_async_done()``, which adds the results to an action object,
  then calls its

  * done callback (``st_child_done()``), which calls
    ``schedule_stonith_command()`` for a new device if there are further
    required actions to execute or if the original action failed, then builds
    and sends an XML reply to the original fencer (via ``send_async_reply()``),
    then checks whether any pending actions are the same as the one just
    executed and merges them if so.

Fencing replies
_______________

-The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
+The original fencer receives the ``STONITH_OP_FENCE`` reply via :term:`IPC` or
messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which calls

    * ``fenced_process_fencing_reply()``, which calls either
      ``request_peer_fencing()`` (to retry a failed operation, or try the next
      device in a topology if appropriate, which issues a new
      ``STONITH_OP_FENCE`` request, proceeding as before) or ``finalize_op()``
      (if the operation is definitively failed or successful).

      * ``finalize_op()`` broadcasts the result to all peers.

Finally, all peers receive the broadcast result and call

* ``finalize_op()``, which sends the result to all local clients.

.. index::
   single: fence history

Fencing History
_______________

The fencer keeps a running history of all fencing operations. The bulk of the
relevant code is in ``fenced_history.c`` and ensures the history is
synchronized across all nodes even if a node leaves and rejoins the cluster.

In libstonithd, this information is represented by ``stonith_history_t`` and is
queryable by the ``stonith_api_operations_t:history()`` method. ``crm_mon`` and
``stonith_admin`` use this API to display the history.

.. index::
   single: scheduler
   single: pacemaker-schedulerd
   single: libpe_status
   single: libpe_rules
   single: libpacemaker

Scheduler
#########

``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
scheduler for the controller, but "the scheduler" in general refers to related
library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and
some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).

The purpose of the scheduler is to take a CIB as input and generate a
transition graph (list of actions that need to be taken) as output.

The controller invokes the scheduler by contacting the scheduler daemon via
-local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
-can also invoke the scheduler, but do so by calling the library functions
-directly. This allows them to run using a ``CIB_file`` without the cluster
-needing to be active.
+local :term:`IPC`. Tools such as ``crm_simulate``, ``crm_mon``, and
+``crm_resource`` can also invoke the scheduler, but do so by calling the
+library functions directly. This allows them to run using a ``CIB_file``
+without the cluster needing to be active.

The main entry point for the scheduler code is
``lib/pacemaker/pcmk_scheduler.c:pcmk__schedule_actions()``. It sets defaults
and calls a series of functions for the scheduling. Some key steps (outlined in
the sketch after this list):

* ``unpack_cib()`` parses most of the CIB XML into data structures, and
  determines the current cluster status.

* ``apply_node_criteria()`` applies factors that make resources prefer certain
  nodes, such as shutdown locks, location constraints, and stickiness.

* ``pcmk__create_internal_constraints()`` creates internal constraints, such as
  the implicit ordering for group members, or start actions being implicitly
  ordered before promote actions.

* ``pcmk__handle_rsc_config_changes()`` processes resource history entries in
  the CIB status section. This is used to decide whether certain actions need
  to be done, such as deleting orphan resources, forcing a restart when a
  resource definition changes, etc.

-* ``assign_resources()`` assigns resources to nodes.
+* ``assign_resources()`` :term:`assigns <assign>` resources to nodes.

* ``schedule_resource_actions()`` schedules resource-specific actions (which
  might or might not end up in the final graph).

* ``pcmk__apply_orderings()`` processes ordering constraints in order to modify
  action attributes such as optional or required.

* ``pcmk__create_graph()`` creates the transition graph.
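Taken together, the entry point behaves roughly like the outline below. This is
a simplification for orientation only; the real function takes arguments,
returns results, and performs additional bookkeeping between steps:

.. code-block:: c

   /* Simplified outline of pcmk__schedule_actions(); argument lists are
    * deliberately elided, and intermediate bookkeeping is omitted. */
   void
   pcmk__schedule_actions(void)
   {
       unpack_cib();                        /* parse CIB, determine status */
       apply_node_criteria();               /* locks, location, stickiness */
       pcmk__create_internal_constraints();
       pcmk__handle_rsc_config_changes();   /* react to history entries    */
       assign_resources();                  /* choose a node per resource  */
       schedule_resource_actions();
       pcmk__apply_orderings();             /* mark optional vs. required  */
       pcmk__create_graph();                /* emit the transition graph   */
   }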
Challenges
__________

Working with the scheduler is difficult. Challenges include:

* It is far too much code to keep more than a small portion in your head at one
  time.

* Small changes can have large (and unexpected) effects. This is why we have a
  large number of regression tests (``cts/cts-scheduler``), which should be run
  after making code changes.

* It produces an insane number of log messages at debug and trace levels. You
  can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to
  enable trace-level messages only when related to specific resources.

* Different parts of the main ``pcmk_scheduler_t`` structure are finalized at
  different points in the scheduling process, so you have to keep in mind
  whether information you're using at one point of the code can possibly change
  later. For example, data unpacked from the CIB can safely be used anytime
  after ``unpack_cib()``, but actions may become optional or required anytime
  before ``pcmk__create_graph()``. There's no easy way to deal with this.

* Many names of struct members, functions, etc., are suboptimal, but are part
  of the public API and cannot be changed until an API backward compatibility
  break.

.. index::
   single: pcmk_scheduler_t

Cluster Working Set
___________________

The main data object for the scheduler is ``pcmk_scheduler_t``, which contains
all information needed about nodes, resources, constraints, etc., both as the
raw CIB XML and parsed into more usable data structures, plus the resulting
transition graph XML. The variable name is usually ``scheduler``.

.. index::
   single: pcmk_resource_t

Resources
_________

``pcmk_resource_t`` is the data object representing cluster resources. A
-resource has a variant: primitive (a.k.a. native), group, clone, or bundle.
+resource has a variant: :term:`primitive`, group, clone, or :term:`bundle`.

The resource object has members for two sets of methods,
``resource_object_functions_t`` from the ``libpe_status`` public API, and
``resource_alloc_functions_t`` whose implementation is internal to
``libpacemaker``. The actual functions vary by variant.

The object functions have basic capabilities such as unpacking the resource
XML, and determining the current or planned location of the resource.

-The assignment functions have more obscure capabilities needed for scheduling,
-such as processing location and ordering constraints. For example,
-``pcmk__create_internal_constraints()`` simply calls the
+The :term:`assignment <assign>` functions have more obscure capabilities
+needed for scheduling, such as processing location and ordering constraints.
+For example, ``pcmk__create_internal_constraints()`` simply calls the
``internal_constraints()`` method for each top-level resource in the cluster.
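As a hedged illustration of that dispatch pattern (the member names and method
signature here are simplified assumptions, not the exact current API):

.. code-block:: c

   /* Sketch only: dispatching a per-variant method for each top-level
    * resource; the method varies by primitive/group/clone/bundle. */
   for (GList *iter = scheduler->resources; iter != NULL; iter = iter->next) {
       pcmk_resource_t *rsc = iter->data;

       rsc->cmds->internal_constraints(rsc);
   }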
.. index::
   single: pcmk_node_t

Nodes
_____

-Assignment of resources to nodes is done by choosing the node with the highest
-score for a given resource. The scheduler does a bunch of processing to
-generate the scores, then the actual assignment is straightforward.
+:term:`Assignment <assign>` of resources to nodes is done by choosing the node
+with the highest :term:`score` for a given resource. The scheduler does a bunch
+of processing to generate the scores, then the actual assignment is
+straightforward.

Node lists are frequently used. For example, ``pcmk_scheduler_t`` has a
``nodes`` member which is a list of all nodes in the cluster, and
``pcmk_resource_t`` has a ``running_on`` member which is a list of all nodes on
which the resource is (or might be) active. These are lists of ``pcmk_node_t``
objects.

The ``pcmk_node_t`` object contains a ``struct pe_node_shared_s *details``
member with all node information that is independent of resource assignment
(the node name, etc.).

The working set's ``nodes`` member contains the original of this information.
All other node lists contain copies of ``pcmk_node_t`` where only the
``details`` member points to the originals in the working set's ``nodes`` list.
In this way, the other members of ``pcmk_node_t`` (such as ``weight``, which is
the node score) may vary by node list, while the common details are shared.

.. index::
   single: pcmk_action_t
   single: pe_action_flags

Actions
_______

``pcmk_action_t`` is the data object representing actions that might need to be
taken. These could be resource actions, cluster-wide actions such as fencing a
node, or "pseudo-actions" which are abstractions used as convenient points for
ordering other actions against.

It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
most important of these are ``pe_action_runnable`` (if not set, the action is
"blocked" and cannot be added to the transition graph) and
``pe_action_optional`` (actions with this set will not be added to the
transition graph; actions often start out as optional, and may become required
later).

.. index::
   single: pcmk__colocation_t

Colocations
___________

``pcmk__colocation_t`` is the data object representing colocations.

Colocation constraints come into play in these parts of the scheduler code:

-* When sorting resources for assignment, so resources with highest node score
-  are assigned first (see ``cmp_resources()``)
+* When sorting resources for :term:`assignment <assign>`, so resources with
+  highest node :term:`score` are assigned first (see ``cmp_resources()``)

* When updating node scores for resource assignment or promotion priority

* When assigning resources, so any resources to be colocated with can be
  assigned first, and so colocations affect where the resource is assigned

* When choosing roles for promotable clone instances, so colocations involving
  a specific role can affect which instances are promoted

The resource assignment functions have several methods related to colocations
(a brief sketch follows this list):

* ``apply_coloc_score():`` This applies a colocation's score to either the
  dependent's allowed node scores (if called while resources are being
  assigned) or the dependent's priority (if called while choosing promotable
  instance roles). It can behave differently depending on whether it is being
-  called as the primary's method or as the dependent's method.
+  called as the :term:`primary's <primary>` method or as the :term:`dependent's
+  <dependent>` method.

* ``add_colocated_node_scores():`` This updates a table of nodes for a given
  colocation attribute and score. It goes through colocations involving a given
  resource, and updates the scores of the nodes in the table with the best
  scores of nodes that match up according to the colocation criteria.

* ``colocated_resources():`` This generates a list of all resources involved in
  mandatory colocations (directly or indirectly via colocation chains) with a
  given resource.
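As a hedged sketch of what assignment-time score application might look like
(``node_matches_primary()`` is a hypothetical helper standing in for the real
matching logic, and the member names are simplified assumptions; the real code
also handles infinities, roles, and influence):

.. code-block:: c

   /* Illustrative only: fold a colocation's score into the dependent's
    * allowed-node scores during assignment. */
   GHashTableIter iter;
   pcmk_node_t *node = NULL;

   g_hash_table_iter_init(&iter, dependent->allowed_nodes);
   while (g_hash_table_iter_next(&iter, NULL, (gpointer *) &node)) {
       if (node_matches_primary(node, primary)) {    /* hypothetical helper */
           node->weight = pcmk__add_scores(node->weight, colocation->score);
       }
   }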
+ called as the :term:`primary's ` method or as the :term:`dependent's + ` method. * ``add_colocated_node_scores():`` This updates a table of nodes for a given colocation attribute and score. It goes through colocations involving a given resource, and updates the scores of the nodes in the table with the best scores of nodes that match up according to the colocation criteria. * ``colocated_resources():`` This generates a list of all resources involved in mandatory colocations (directly or indirectly via colocation chains) with a given resource. .. index:: single: pe__ordering_t single: pe_ordering Orderings _________ Ordering constraints are simple in concept, but they are one of the most important, powerful, and difficult to follow aspects of the scheduler code. ``pe__ordering_t`` is the data object representing an ordering, better thought of as a relationship between two actions, since the relation can be more complex than just "this one runs after that one". For an ordering "A then B", the code generally refers to A as "first" or "before", and B as "then" or "after". Much of the power comes from ``enum pe_ordering``, which are flags that determine how an ordering behaves. There are many obscure flags with big effects. A few examples: * ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0, meaning no flags set, so it must be compared with equality rather than ``pcmk_is_set()``. * ``pe_order_optional`` means the ordering does not make either action required, so it only applies if they both become required for other reasons. * ``pe_order_implies_first`` means that if action B becomes required for any reason, then action A will become required as well. diff --git a/doc/sphinx/Pacemaker_Development/general.rst b/doc/sphinx/Pacemaker_Development/general.rst index 9d9dcec1cf..94015c9b8f 100644 --- a/doc/sphinx/Pacemaker_Development/general.rst +++ b/doc/sphinx/Pacemaker_Development/general.rst @@ -1,40 +1,50 @@ .. index:: single: guidelines; all languages General Guidelines for All Languages ------------------------------------ .. index:: copyright Copyright ######### When copyright notices are added to a file, they should look like this: .. note:: **Copyright Notice Format** | Copyright *YYYY[-YYYY]* the Pacemaker project contributors | | The version control history for this file may have further details. The first *YYYY* is the year the file was *originally* published. The original date is important for two reasons: when two entities claim copyright ownership of the same work, the earlier claim generally prevails; and copyright expiration is generally calculated from the original publication date. [1]_ If the file is modified in later years, add *-YYYY* with the most recent year of modification. Even though Pacemaker is an ongoing project, copyright notices are about the years of *publication* of specific content. Copyright notices are intended to indicate, but do not affect, copyright *ownership*, which is determined by applicable laws and regulations. Authors may put more specific copyright notices in their commit messages if desired. .. rubric:: Footnotes .. [1] See the U.S. Copyright Office's `"Compendium of U.S. Copyright Office Practices" `_, particularly "Chapter 2200: Notice of Copyright", sections 2205.1(A) and 2205.1(F), or `"Updating Copyright Notices" `_ for a more readable summary. 
diff --git a/doc/sphinx/Pacemaker_Development/general.rst b/doc/sphinx/Pacemaker_Development/general.rst
index 9d9dcec1cf..94015c9b8f 100644
--- a/doc/sphinx/Pacemaker_Development/general.rst
+++ b/doc/sphinx/Pacemaker_Development/general.rst
@@ -1,40 +1,50 @@
.. index::
   single: guidelines; all languages

General Guidelines for All Languages
------------------------------------

.. index:: copyright

Copyright
#########

When copyright notices are added to a file, they should look like this:

.. note:: **Copyright Notice Format**

   | Copyright *YYYY[-YYYY]* the Pacemaker project contributors
   |
   | The version control history for this file may have further details.

The first *YYYY* is the year the file was *originally* published. The original
date is important for two reasons: when two entities claim copyright ownership
of the same work, the earlier claim generally prevails; and copyright
expiration is generally calculated from the original publication date. [1]_

If the file is modified in later years, add *-YYYY* with the most recent year
of modification. Even though Pacemaker is an ongoing project, copyright notices
are about the years of *publication* of specific content.

Copyright notices are intended to indicate, but do not affect, copyright
*ownership*, which is determined by applicable laws and regulations. Authors
may put more specific copyright notices in their commit messages if desired.

.. rubric:: Footnotes

.. [1] See the U.S. Copyright Office's `"Compendium of U.S. Copyright Office
       Practices" `_, particularly "Chapter 2200: Notice of Copyright",
       sections 2205.1(A) and 2205.1(F), or `"Updating Copyright Notices" `_
       for a more readable summary.

+
+Terminology
+###########
+
+Pacemaker is extremely complex, and it helps to use terminology consistently
+throughout documentation, symbol names and comments in code, and so forth. It
+also helps to use natural language when practical instead of technical jargon
+and acronyms.
+
+For specific recommendations, see the :ref:`glossary`.
diff --git a/doc/sphinx/Pacemaker_Development/glossary.rst b/doc/sphinx/Pacemaker_Development/glossary.rst
new file mode 100644
index 0000000000..6f73e961f7
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Development/glossary.rst
@@ -0,0 +1,84 @@
+.. index::
+   single: glossary
+
+.. _glossary:
+
+Glossary
+--------
+
+.. glossary::
+
+   assign
+      In the scheduler, this refers to associating a resource with a node. Do
+      not use *allocate* for this purpose.
+
+   bundle
+      The collective resource type associating instances of a container with
+      storage and networking. Do not use :term:`container` when referring to
+      the bundle as a whole.
+
+   cluster layer
+      The layer of the :term:`cluster stack` that provides membership and
+      messaging capabilities (such as Corosync).
+
+   cluster stack
+      The core components of a high-availability cluster: the
+      :term:`cluster layer` at the "bottom" of the stack, then Pacemaker, then
+      resource agents, and then the actual services managed by the cluster at
+      the "top" of the stack. Do not use *stack* for the cluster layer alone.
+
+   CPG
+      Corosync Process Group. This is the messaging layer in a Corosync-based
+      cluster. Pacemaker daemons use CPG to communicate with their
+      counterparts on other nodes.
+
+   container
+      This can mean either a container in the usual sense (whether as a
+      standalone resource or as part of a bundle), or the *container* resource
+      meta-attribute (which does not necessarily reference a container in the
+      usual sense).
+
+   dangling migration
+      Live migration of a resource consists of a **migrate_to** action on the
+      source node, followed by a **migrate_from** on the target node, followed
+      by a **stop** on the source node. If the **migrate_to** and
+      **migrate_from** have completed successfully, but the **stop** has not
+      yet been done, the migration is considered to be *dangling*.
+
+   dependent
+      In colocation constraints, this refers to the resource located relative
+      to the :term:`primary` resource. Do not use *rh* or *right-hand* for
+      this purpose.
+
+   IPC
+      Inter-process communication. In Pacemaker, clients send requests to
+      daemons using libqb IPC.
+
+   message
+      This can refer to log messages, custom messages defined for a
+      **pcmk_output_t** object, or XML messages sent via :term:`CPG` or
+      :term:`IPC`.
+
+   metadata
+      In the context of options and resource agents, this refers to OCF-style
+      metadata. Do not use a hyphen except when referring to the OCF-defined
+      action name *meta-data*.
+
+   primary
+      In colocation constraints, this refers to the resource that the
+      :term:`dependent` resource is located relative to. Do not use *lh* or
+      *left-hand* for this purpose.
+
+   primitive
+      The fundamental resource type in Pacemaker. Do not use *native* for this
+      purpose.
+
+   score
+      An integer value constrained between **-PCMK_SCORE_INFINITY** and
+      **+PCMK_SCORE_INFINITY**. Certain strings (such as
+      **PCMK_VALUE_INFINITY**) parse as particular score values. Do not use
+      *weight* for this purpose.
+
+   self-fencing
+      When a node is chosen to execute its own fencing. Do not use *suicide*
+      for this purpose.
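To make the *score* entry above concrete: scores are kept within the infinity
bounds rather than being allowed to overflow. A hedged sketch (illustrative
only; Pacemaker's own score helpers in ``lib/common/scores.c`` are
authoritative):

.. code-block:: c

   /* Illustrative sketch: keep a computed score within the documented
    * bounds, saturating at plus/minus PCMK_SCORE_INFINITY. */
   static int
   clamp_score(long long value)
   {
       if (value >= PCMK_SCORE_INFINITY) {
           return PCMK_SCORE_INFINITY;
       }
       if (value <= -PCMK_SCORE_INFINITY) {
           return -PCMK_SCORE_INFINITY;
       }
       return (int) value;
   }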
diff --git a/doc/sphinx/Pacemaker_Development/index.rst b/doc/sphinx/Pacemaker_Development/index.rst index 1e80df9cfa..a3f624f65b 100644 --- a/doc/sphinx/Pacemaker_Development/index.rst +++ b/doc/sphinx/Pacemaker_Development/index.rst @@ -1,34 +1,35 @@ Pacemaker Development ===================== *Working with the Pacemaker Code Base* Abstract -------- This document has guidelines and tips for developers interested in editing Pacemaker source code and submitting changes for inclusion in the project. Start with the FAQ; the rest is optional detail. Table of Contents ----------------- .. toctree:: :maxdepth: 3 :numbered: faq general documentation python c components helpers evolution + glossary Index ----- * :ref:`genindex` * :ref:`search`