diff --git a/doc/sphinx/Pacemaker_Explained/utilization.rst b/doc/sphinx/Pacemaker_Explained/utilization.rst index 87eef6021e..9f3b640eec 100644 --- a/doc/sphinx/Pacemaker_Explained/utilization.rst +++ b/doc/sphinx/Pacemaker_Explained/utilization.rst @@ -1,264 +1,226 @@ .. _utilization: Utilization and Placement Strategy ---------------------------------- -Pacemaker decides where to place a resource according to the resource -assignment scores on every node. The resource will be assigned to the -node where the resource has the highest score. +Pacemaker decides where a resource should run by assigning a score to every +node, considering factors such as the resource's constraints and stickiness, +then assigning the resource to the node with the highest score. -If the resource assignment scores on all the nodes are equal, by the default -placement strategy, Pacemaker will choose a node with the least number of -assigned resources for balancing the load. If the number of resources on each -node is equal, the first eligible node listed in the CIB will be chosen to run -the resource. +If more than one node has the highest score, Pacemaker by default chooses +the one with the least number of assigned resources, or if that is also the +same, the one listed first in the CIB. This results in simple load balancing. -Often, in real-world situations, different resources use significantly -different proportions of a node's capacities (memory, I/O, etc.). -We cannot balance the load ideally just according to the number of resources -assigned to a node. Besides, if resources are placed such that their combined -requirements exceed the provided capacity, they may fail to start completely or -run with degraded performance. +Sometimes, simple load balancing is insufficient. Different resources can use +significantly different amounts of a node's memory, CPU, and other capacities. +Some combinations of resources may strain a node's capacity, causing them to +fail or have degraded performance. Or, an administrator may prefer to +concentrate resources rather than balance them, to minimize energy consumption +by spare nodes. -To take these factors into account, Pacemaker allows you to configure: - -#. The capacity a certain node provides. - -#. The capacity a certain resource requires. - -#. An overall strategy for placement of resources. +Pacemaker offers flexibility by allowing you to configure *utilization +attributes* specifying capacities that each node provides and each resource +requires, as well as a *placement strategy*. Utilization attributes ###################### -To configure the capacity that a node provides or a resource requires, -you can use *utilization attributes* in ``node`` and ``resource`` objects. -You can name utilization attributes according to your preferences and define as -many name/value pairs as your configuration needs. However, the attributes' -values must be integers. +You can define any number of utilization attributes to represent capacities of +interest (CPU, memory, I/O bandwidth, etc.). Their values must be integers. + +The nature and units of the capacities are irrelevant to Pacemaker. It just +makes sure that each node has sufficient capacity to run the resources assigned +to it. .. topic:: Specifying CPU and RAM capacities of two nodes .. code-block:: xml .. topic:: Specifying CPU and RAM consumed by several resources .. code-block:: xml -A node is considered eligible for a resource if it has sufficient free -capacity to satisfy the resource's requirements. The nature of the required -or provided capacities is completely irrelevant to Pacemaker -- it just makes -sure that all capacity requirements of a resource are satisfied before placing -a resource to a node. - -Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*. -These attributes are added to a ``transient_attributes`` section for the node -and are forgotten by the cluster when the node goes offline. The ``attrd_updater`` -tool can be used to set these attributes. +Utilization attributes for a node may be permanent or *(since 2.1.6)* +transient. Permanent attributes persist after Pacemaker is restarted, while +transient attributes do not. .. topic:: Transient utilization attribute for node cluster-1 .. code-block:: xml +Utilization attributes may be configured only on primitive resources. Pacemaker +will consider a collective resource's utilization based on the primitives it +contains. + .. note:: Utilization is supported for bundles *(since 2.1.3)*, but only for bundles - with an inner primitive. Any resource utilization values should be specified - for the inner primitive, but any priority meta-attribute should be specified - for the outer bundle. + with an inner primitive. Placement Strategy ################## -After you have configured the capacities your nodes provide and the -capacities your resources require, you need to set the ``placement-strategy`` -in the global cluster options, otherwise the capacity configurations have -*no effect*. +The ``placement-strategy`` cluster option determines how utilization attributes +are used. Its allowed values are: -Four values are available for the ``placement-strategy``: +* ``default``: The cluster ignores utilization values, and places resources + according to (from highest to lowest precedence) assignment scores, the + number of resources already assigned to each node, and the order nodes are + listed in the CIB. -* **default** +* ``utilization``: The cluster uses the same method as the default strategy to + assign a resource to a node, but only nodes with sufficient free capacity to + meet the resource's requirements are eligible. - Utilization values are not taken into account at all. - Resources are assigned according to assignment scores. If scores are equal, - resources are evenly distributed across nodes. +* ``balanced``: Only nodes with sufficient free capacity are eligible to run a + resource, and the cluster load-balances based on the sum of resource + utilization values rather than the number of resources. -* **utilization** +* ``minimal``: Only nodes with sufficient free capacity are eligible to run a + resource, and the cluster concentrates resources on as few nodes as possible. - Utilization values are taken into account *only* when deciding whether a node - is considered eligible (i.e. whether it has sufficient free capacity to satisfy - the resource's requirements). Load-balancing is still done based on the - number of resources assigned to a node. -* **balanced** +To look at it another way, when deciding where to run a resource, the cluster +starts by considering all nodes, then applies these criteria one by one until +a single node remains: - Utilization values are taken into account when deciding whether a node - is eligible to serve a resource *and* when load-balancing, so an attempt is - made to spread the resources in a way that optimizes resource performance. +* If ``placement-strategy`` is ``utilization``, ``balanced``, or ``minimal``, + consider only nodes that have sufficient spare capacities to meet the + resource's requirements. -* **minimal** +* Consider only nodes with the highest score for the resource. Scores take into + account factors such as the node's health; the resource's stickiness, failure + count on the node, and migration threshold; and constraints. - Utilization values are taken into account *only* when deciding whether a node - is eligible to serve a resource. For load-balancing, an attempt is made to - concentrate the resources on as few nodes as possible, thereby enabling - possible power savings on the remaining nodes. - -Set ``placement-strategy`` with ``crm_attribute``: - - .. code-block:: none - - # crm_attribute --name placement-strategy --update balanced - -Now Pacemaker will ensure the load from your resources will be distributed -evenly throughout the cluster, without the need for convoluted sets of -colocation constraints. - -Assignment Details -################## +* If ``placement-strategy`` is ``balanced``, consider only nodes with the most + free capacity. -Which node is preferred to get consumed first when assigning resources? -_______________________________________________________________________ +* If ``placement-strategy`` is ``default``, ``utilization``, or ``balanced``, + consider only nodes with the least number of assigned resources. -* The node with the highest node weight gets consumed first. Node weight - is a score maintained by the cluster to represent node health. +* If more than one node is eligible after considering all other criteria, + choose the one listed first in the CIB. -* If multiple nodes have the same node weight: +How Multiple Capacities Combine +############################### - * If ``placement-strategy`` is ``default`` or ``utilization``, - the node that has the least number of assigned resources gets consumed first. +If only one type of utilization attribute has been defined, free capacity is a +simple numeric comparison. - * If their numbers of assigned resources are equal, - the first eligible node listed in the CIB gets consumed first. +If multiple utilization attributes have been defined, then the node that has +the highest value in the most attribute types has the most free capacity. - * If ``placement-strategy`` is ``balanced``, - the node that has the most free capacity gets consumed first. - - * If the free capacities of the nodes are equal, - the node that has the least number of assigned resources gets consumed first. - - * If their numbers of assigned resources are equal, - the first eligible node listed in the CIB gets consumed first. - - * If ``placement-strategy`` is ``minimal``, - the first eligible node listed in the CIB gets consumed first. - -Which node has more free capacity? -__________________________________ - -If only one type of utilization attribute has been defined, free capacity -is a simple numeric comparison. - -If multiple types of utilization attributes have been defined, then -the node that is numerically highest in the the most attribute types -has the most free capacity. For example: +For example: * If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``, then their free capacities are equal. * If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory`` and ``storage``, then ``nodeB`` has more free capacity. -Which resource is preferred to be assigned first? -_________________________________________________ +Order of Resource Assignment +############################ -* The resource that has the highest ``priority`` (see :ref:`resource_options`) gets - assigned first. +When assigning resources to nodes, the cluster chooses the next one to assign +by considering the following criteria one by one until a single resource is +selected: -* If their priorities are equal, check whether they are already running. The - resource that has the highest score on the node where it's running gets assigned - first, to prevent resource shuffling. +* Assign the resource with the highest :ref:`priority `. -* If the scores above are equal or the resources are not running, the resource has - the highest score on the preferred node gets assigned first. +* If any resources are already active, assign the one with the highest score on + its current node. This avoids unnecessary resource shuffling. -* If the scores above are equal, the first runnable resource listed in the CIB - gets assigned first. +* Assign the resource with the highest score on its preferred node. -Limitations and Workarounds -########################### +* If more than one resource remains after considering all other criteria, + assign the one of them that is listed first in the CIB. + +.. note:: + + For bundles, only the priority set for the bundle itself matters. If the + bundle contains a primitive, the primitive's priority is ignored. + +Limitations +########### The type of problem Pacemaker is dealing with here is known as the -`knapsack problem `_ and falls into -the `NP-complete `_ category of computer -science problems -- a fancy way of saying "it takes a really long time -to solve". +`knapsack problem `_ and falls +into the `NP-complete `_ +category of computer science problems -- a fancy way of saying "it takes a +really long time to solve". -Clearly in a HA cluster, it's not acceptable to spend minutes, let alone hours -or days, finding an optimal solution while services remain unavailable. +In a high-availability cluster, it is unacceptable to spend minutes, let alone +hours or days, finding an optimal solution while services are down. -So instead of trying to solve the problem completely, Pacemaker uses a -*best effort* algorithm for determining which node should host a particular -service. This means it arrives at a solution much faster than traditional -linear programming algorithms, but by doing so at the price of leaving some -services stopped. +Instead of trying to solve the problem completely, Pacemaker uses a "best +effort" algorithm. This arrives at a quick solution, but at the cost of +possibly leaving some resources stopped unnecessarily. -In the contrived example at the start of this chapter: +Using the example configuration at the start of this chapter, and the balanced +placement strategy: * ``rsc-small`` would be assigned to ``node1`` * ``rsc-medium`` would be assigned to ``node2`` * ``rsc-large`` would remain inactive -Which is not ideal. - -There are various approaches to dealing with the limitations of -pacemaker's placement strategy: +That is not ideal. There are various approaches to dealing with the limitations +of Pacemaker's placement strategy: * **Ensure you have sufficient physical capacity.** - It might sound obvious, but if the physical capacity of your nodes is (close to) - maxed out by the cluster under normal conditions, then failover isn't going to - go well. Even without the utilization feature, you'll start hitting timeouts and - getting secondary failures. + It might sound obvious, but if the physical capacity of your nodes is maxed + out even under normal conditions, failover isn't going to go well. Even + without the utilization feature, you'll start hitting timeouts and getting + secondary failures. -* **Build some buffer into the capabilities advertised by the nodes.** +* **Build some buffer into the capacities advertised by the nodes.** - Advertise slightly more resources than we physically have, on the (usually valid) - assumption that a resource will not use 100% of the configured amount of - CPU, memory and so forth *all* the time. This practice is sometimes called *overcommit*. + Advertise slightly more resources than we physically have, on the (usually + valid) assumption that resources will not always use 100% of their + configured utilization. This practice is sometimes called *overcommitting*. * **Specify resource priorities.** - If the cluster is going to sacrifice services, it should be the ones you care - about (comparatively) the least. Ensure that resource priorities are properly set - so that your most important resources are scheduled first. + If the cluster is going to sacrifice services, it should be the ones you + care about the least.