diff --git a/doc/Pacemaker_Explained/en-US/Ch-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
index c99f3d1c68..3dfd512745 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Options.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
@@ -1,447 +1,454 @@
= Cluster-Wide Configuration =
== CIB Properties ==
Certain settings are defined by CIB properties (that is, attributes of the
+cib+ tag) rather than with the rest of the cluster configuration in the
+configuration+ section.
The reason is simply a matter of parsing. These options are used by the
configuration database which is, by design, mostly ignorant of the content it
holds. So the decision was made to place them in an easy-to-find location.
.CIB Properties
[width="95%",cols="2m,5<",options="header",align="center"]
|=========================================================
|Field |Description
| admin_epoch |
indexterm:[Configuration Version,Cluster]
indexterm:[Cluster,Option,Configuration Version]
indexterm:[admin_epoch,Cluster Option]
indexterm:[Cluster,Option,admin_epoch]
When a node joins the cluster, the cluster performs a check to see
which node has the best configuration. It asks the node with the highest
(+admin_epoch+, +epoch+, +num_updates+) tuple to replace the configuration on
all the nodes -- which makes setting them, and setting them correctly, very
important. +admin_epoch+ is never modified by the cluster; you can use this
to make the configurations on any inactive nodes obsolete. _Never set this
value to zero_. If you do, the cluster cannot tell the difference between
your configuration and the "empty" one used when nothing is found on disk.
| epoch |
indexterm:[epoch,Cluster Option]
indexterm:[Cluster,Option,epoch]
The cluster increments this every time the configuration is updated (usually by
the administrator).
| num_updates |
indexterm:[num_updates,Cluster Option]
indexterm:[Cluster,Option,num_updates]
The cluster increments this every time the configuration or status is updated
(usually by the cluster) and resets it to 0 when +epoch+ changes.
| validate-with |
indexterm:[validate-with,Cluster Option]
indexterm:[Cluster,Option,validate-with]
Determines the type of XML validation that will be done on the configuration.
If set to +none+, the cluster will not verify that updates conform to the
DTD (nor reject ones that don't). This option can be useful when
operating a mixed-version cluster during an upgrade.
|cib-last-written |
indexterm:[cib-last-written,Cluster Property]
indexterm:[Cluster,Property,cib-last-written]
Indicates when the configuration was last written to disk. Maintained by the
cluster; for informational purposes only.
|have-quorum |
indexterm:[have-quorum,Cluster Property]
indexterm:[Cluster,Property,have-quorum]
Indicates if the cluster has quorum. If false, this may mean that the
cluster cannot start resources or fence other nodes (see
+no-quorum-policy+ below). Maintained by the cluster.
|dc-uuid |
indexterm:[dc-uuid,Cluster Property]
indexterm:[Cluster,Property,dc-uuid]
Indicates which cluster node is the current leader. Used by the
cluster when placing resources and determining the order of some
events. Maintained by the cluster.
|=========================================================
=== Working with CIB Properties ===
Although these fields can be written to by the user, in
most cases the cluster will overwrite any values specified by the
user with the "correct" ones.
To change the ones that can be specified by the user,
for example +admin_epoch+, one should use:
----
# cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
----
A complete set of CIB properties will look something like this:
.Attributes set for a cib object
======
[source,XML]
-------
<cib admin_epoch="42" epoch="116" num_updates="1"
     validate-with="pacemaker-1.2"
     cib-last-written="Mon Jan 12 15:46:39 2015"
     have-quorum="1" dc-uuid="1">
-------
======
[[s-cluster-options]]
== Cluster Options ==
Cluster options, as you might expect, control how the cluster behaves
when confronted with certain situations.
They are grouped into sets within the +crm_config+ section, and, in advanced
configurations, there may be more than one set. (This will be described later
in the section on <> where we will show how to have the cluster use
different sets of options during working hours than during weekends.) For now,
we will describe the simple case where each option is present at most once.
You can obtain an up-to-date list of cluster options, including
their default values, by running the `man pengine` and `man crmd` commands.
.Cluster Options
[width="95%",cols="5m,2,11>).
| enable-startup-probes | TRUE |
indexterm:[enable-startup-probes,Cluster Option]
indexterm:[Cluster,Option,enable-startup-probes]
Should the cluster check for active resources during startup?
| maintenance-mode | FALSE |
indexterm:[maintenance-mode,Cluster Option]
indexterm:[Cluster,Option,maintenance-mode]
Should the cluster refrain from monitoring, starting and stopping resources?
| stonith-enabled | TRUE |
indexterm:[stonith-enabled,Cluster Option]
indexterm:[Cluster,Option,stonith-enabled]
Should failed nodes and nodes with resources that can't be stopped be
shot? If you value your data, set up a STONITH device and enable this.
If true, or unset, the cluster will refuse to start resources unless
one or more STONITH resources have been configured.
If false, unresponsive nodes are immediately assumed to be running no
resources, and resource takeover to online nodes starts without any
further protection (which means _data loss_ if the unresponsive node
still accesses shared storage, for example). See also the +requires+
meta-attribute in <>.
| stonith-action | reboot |
indexterm:[stonith-action,Cluster Option]
indexterm:[Cluster,Option,stonith-action]
Action to send to STONITH device. Allowed values are +reboot+ and +off+.
The value +poweroff+ is also allowed, but is only used for
legacy devices.
| stonith-timeout | 60s |
indexterm:[stonith-timeout,Cluster Option]
indexterm:[Cluster,Option,stonith-timeout]
How long to wait for STONITH actions (reboot, on, off) to complete.
| stonith-max-attempts | 10 |
indexterm:[stonith-max-attempts,Cluster Option]
indexterm:[Cluster,Option,stonith-max-attempts]
How many times stonith can fail before it will no longer be attempted on a target.
Positive non-zero values are allowed. '(since 1.1.17)'
| concurrent-fencing | FALSE |
indexterm:[concurrent-fencing,Cluster Option]
indexterm:[Cluster,Option,concurrent-fencing]
Is the cluster allowed to initiate multiple fence actions concurrently?
| cluster-delay | 60s |
indexterm:[cluster-delay,Cluster Option]
indexterm:[Cluster,Option,cluster-delay]
Estimated maximum round-trip delay over the network (excluding action
execution). If the TE requires an action to be executed on another node,
it will consider the action failed if it does not get a response
from the other node in this time (after considering the action's
own timeout). The "correct" value will depend on the speed and load of your
network and cluster nodes.
| dc-deadtime | 20s |
indexterm:[dc-deadtime,Cluster Option]
indexterm:[Cluster,Option,dc-deadtime]
How long to wait for a response from other nodes during startup.
The "correct" value will depend on the speed/load of your network and the type of switches used.
| cluster-recheck-interval | 15min |
indexterm:[cluster-recheck-interval,Cluster Option]
indexterm:[Cluster,Option,cluster-recheck-interval]
Polling interval for time-based changes to options, resource parameters and constraints.
The cluster is primarily event-driven, but your configuration can have
elements that take effect based on the time of day. To ensure these changes
take effect, we can optionally poll the cluster's status for changes. A value
of 0 disables polling. Positive values are an interval (in seconds unless other
SI units are specified, e.g. 5min).
| pe-error-series-max | -1 |
indexterm:[pe-error-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-error-series-max]
The number of PE inputs resulting in ERRORs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
| pe-warn-series-max | -1 |
indexterm:[pe-warn-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-warn-series-max]
The number of PE inputs resulting in WARNINGs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
| pe-input-series-max | -1 |
indexterm:[pe-input-series-max,Cluster Option]
indexterm:[Cluster,Option,pe-input-series-max]
The number of "normal" PE inputs to save. Used when reporting problems.
A value of -1 means unlimited (report all).
+| placement-strategy | default |
+indexterm:[placement-strategy,Cluster Option]
+indexterm:[Cluster,Option,placement-strategy]
+ How the cluster should allocate resources to nodes (see <<s-utilization>>).
+ Allowed values are +default+, +utilization+, +balanced+, and +minimal+.
+ '(since 1.1.0)'
+
| node-health-strategy | none |
indexterm:[node-health-strategy,Cluster Option]
indexterm:[Cluster,Option,node-health-strategy]
How the cluster should react to node health attributes (see <>).
Allowed values are +none+, +migrate-on-red+, +only-green+, +progressive+, and
+custom+.
| node-health-base | 0 |
indexterm:[node-health-base,Cluster Option]
indexterm:[Cluster,Option,node-health-base]
The base health score assigned to a node. Only used when
+node-health-strategy+ is +progressive+. '(since 1.1.16)'
| node-health-green | 0 |
indexterm:[node-health-green,Cluster Option]
indexterm:[Cluster,Option,node-health-green]
The score to use for a node health attribute whose value is +green+.
Only used when +node-health-strategy+ is +progressive+ or +custom+.
| node-health-yellow | 0 |
indexterm:[node-health-yellow,Cluster Option]
indexterm:[Cluster,Option,node-health-yellow]
The score to use for a node health attribute whose value is +yellow+.
Only used when +node-health-strategy+ is +progressive+ or +custom+.
| node-health-red | 0 |
indexterm:[node-health-red,Cluster Option]
indexterm:[Cluster,Option,node-health-red]
The score to use for a node health attribute whose value is +red+.
Only used when +node-health-strategy+ is +progressive+ or +custom+.
| remove-after-stop | FALSE |
indexterm:[remove-after-stop,Cluster Option]
indexterm:[Cluster,Option,remove-after-stop]
_Advanced Use Only:_ Should the cluster remove resources from the LRM after
they are stopped? Values other than the default are, at best, poorly tested and
potentially dangerous.
| startup-fencing | TRUE |
indexterm:[startup-fencing,Cluster Option]
indexterm:[Cluster,Option,startup-fencing]
_Advanced Use Only:_ Should the cluster shoot unseen nodes?
Not using the default is very unsafe!
| election-timeout | 2min |
indexterm:[election-timeout,Cluster Option]
indexterm:[Cluster,Option,election-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| shutdown-escalation | 20min |
indexterm:[shutdown-escalation,Cluster Option]
indexterm:[Cluster,Option,shutdown-escalation]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-integration-timeout | 3min |
indexterm:[crmd-integration-timeout,Cluster Option]
indexterm:[Cluster,Option,crmd-integration-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-finalization-timeout | 30min |
indexterm:[crmd-finalization-timeout,Cluster Option]
indexterm:[Cluster,Option,crmd-finalization-timeout]
_Advanced Use Only:_ If you need to adjust this value, it probably indicates
the presence of a bug.
| crmd-transition-delay | 0s |
indexterm:[crmd-transition-delay,Cluster Option]
indexterm:[Cluster,Option,crmd-transition-delay]
_Advanced Use Only:_ Delay cluster recovery for the configured interval to
allow for additional/related events to occur. Useful if your configuration is
sensitive to the order in which ping updates arrive.
Enabling this option will slow down cluster recovery under
all conditions.
|default-resource-stickiness | 0 |
indexterm:[default-resource-stickiness,Cluster Option]
indexterm:[Cluster,Option,default-resource-stickiness]
_Deprecated:_ See <> instead
| is-managed-default | TRUE |
indexterm:[is-managed-default,Cluster Option]
indexterm:[Cluster,Option,is-managed-default]
_Deprecated:_ See <> instead
| default-action-timeout | 20s |
indexterm:[default-action-timeout,Cluster Option]
indexterm:[Cluster,Option,default-action-timeout]
_Deprecated:_ See <> instead
|=========================================================
=== Querying and Setting Cluster Options ===
indexterm:[Querying,Cluster Option]
indexterm:[Setting,Cluster Option]
indexterm:[Cluster,Querying Options]
indexterm:[Cluster,Setting Options]
Cluster options can be queried and modified using the `crm_attribute` tool. To
get the current value of +cluster-delay+, you can run:
----
# crm_attribute --query --name cluster-delay
----
which is more simply written as
----
# crm_attribute -G -n cluster-delay
----
If a value is found, you'll see a result like this:
----
# crm_attribute -G -n cluster-delay
scope=crm_config name=cluster-delay value=60s
----
If no value is found, the tool will display an error:
----
# crm_attribute -G -n clusta-deway
scope=crm_config name=clusta-deway value=(null)
Error performing operation: No such device or address
----
To use a different value (for example, 30 seconds), simply run:
----
# crm_attribute --name cluster-delay --update 30s
----
To go back to the cluster's default value, you can delete the value, for example:
----
# crm_attribute --name cluster-delay --delete
Deleted crm_config option: id=cib-bootstrap-options-cluster-delay name=cluster-delay
----
=== When Options are Listed More Than Once ===
If you ever see something like the following, it means that the option you're modifying is present more than once.
.Deleting an option that is listed twice
=======
------
# crm_attribute --name batch-limit --delete
Multiple attributes match name=batch-limit in crm_config:
Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit)
Value: 100 (set=custom, id=custom-batch-limit)
Please choose from one of the matches above and supply the 'id' with --id
------
=======
In such cases, follow the on-screen instructions to perform the
requested action. To determine which value is currently being used by
the cluster, refer to <>.
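
For example, a sketch of removing just the duplicate in the +custom+ set,
supplying the +id+ reported in the tool's own output above:

----
# crm_attribute --name batch-limit --delete --id custom-batch-limit
----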
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
index a7238dc07b..9fecf4c681 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
@@ -1,227 +1,229 @@
+[[s-utilization]]
+
= Utilization and Placement Strategy =
Pacemaker decides where to place a resource according to the resource
allocation scores on every node. The resource will be allocated to the
node where the resource has the highest score.
If the resource allocation scores on all the nodes are equal, then with the
default placement strategy, Pacemaker will choose the node with the fewest
allocated resources, to balance the load. If the number of resources on each
node is equal, the first eligible node listed in the CIB will be chosen to run
the resource.
Often, in real-world situations, different resources use significantly
different proportions of a node's capacities (memory, I/O, etc.).
We cannot balance the load ideally just according to the number of resources
allocated to a node. Besides, if resources are placed such that their combined
requirements exceed the provided capacity, they may fail to start or may run
with degraded performance.
To take these factors into account, Pacemaker allows you to configure:
. The capacity a certain node provides.
. The capacity a certain resource requires.
. An overall strategy for placement of resources.
== Utilization attributes ==
To configure the capacity that a node provides or a resource requires,
you can use 'utilization attributes' in +node+ and +resource+ objects.
You can name utilization attributes according to your preferences and define as
many name/value pairs as your configuration needs. However, the attributes'
values must be integers.
.Specifying CPU and RAM capacities of two nodes
====
[source,XML]
----
<node id="node1" type="normal" uname="node1">
  <utilization id="node1-utilization">
    <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
  </utilization>
</node>
<node id="node2" type="normal" uname="node2">
  <utilization id="node2-utilization">
    <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
    <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
  </utilization>
</node>
----
====
.Specifying CPU and RAM consumed by several resources
====
[source,XML]
----
<primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-small-utilization">
    <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
    <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
  </utilization>
</primitive>
<primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-medium-utilization">
    <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
  </utilization>
</primitive>
<primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-large-utilization">
    <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
    <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
  </utilization>
</primitive>
----
====
A node is considered eligible for a resource if it has sufficient free
capacity to satisfy the resource's requirements. The nature of the required
or provided capacities is completely irrelevant to Pacemaker -- it just makes
sure that all capacity requirements of a resource are satisfied before placing
a resource on a node.
== Placement Strategy ==
After you have configured the capacities your nodes provide and the
capacities your resources require, you need to set the +placement-strategy+
in the global cluster options, otherwise the capacity configurations have
'no effect'.
Four values are available for the +placement-strategy+:
+default+::
Utilization values are not taken into account at all.
Resources are allocated according to allocation scores. If scores are equal,
resources are evenly distributed across nodes.
+utilization+::
Utilization values are taken into account 'only' when deciding whether a node
is considered eligible (i.e. whether it has sufficient free capacity to satisfy
the resource's requirements). Load-balancing is still done based on the
number of resources allocated to a node.
+balanced+::
Utilization values are taken into account when deciding whether a node
is eligible to serve a resource 'and' when load-balancing, so an attempt is
made to spread the resources in a way that optimizes resource performance.
+minimal+::
Utilization values are taken into account 'only' when deciding whether a node
is eligible to serve a resource. For load-balancing, an attempt is made to
concentrate the resources on as few nodes as possible, thereby enabling
possible power savings on the remaining nodes.
Set +placement-strategy+ with `crm_attribute`:
----
# crm_attribute --name placement-strategy --update balanced
----
Now Pacemaker will ensure the load from your resources will be distributed
evenly throughout the cluster, without the need for convoluted sets of
colocation constraints.
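
In the CIB, this option is stored as an +nvpair+ within the +crm_config+
section. The following is a sketch of the result (the +id+ values are
illustrative):

[source,XML]
----
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <!-- illustrative ids; the cluster tools generate them for you -->
    <nvpair id="cib-bootstrap-options-placement-strategy"
            name="placement-strategy" value="balanced"/>
  </cluster_property_set>
</crm_config>
----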
== Allocation Details ==
=== Which node is preferred to get consumed first when allocating resources? ===
- The node with the highest node weight gets consumed first. Node weight
is a score maintained by the cluster to represent node health.
- If multiple nodes have the same node weight:
* If +placement-strategy+ is +default+ or +utilization+,
the node that has the least number of allocated resources gets consumed first.
** If their numbers of allocated resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If +placement-strategy+ is +balanced+,
the node that has the most free capacity gets consumed first.
** If the free capacities of the nodes are equal,
the node that has the least number of allocated resources gets consumed first.
*** If their numbers of allocated resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If +placement-strategy+ is +minimal+,
the first eligible node listed in the CIB gets consumed first.
=== Which node has more free capacity? ===
If only one type of utilization attribute has been defined, free capacity
is a simple numeric comparison.
If multiple types of utilization attributes have been defined, then
the node that is numerically highest in the most attribute types
has the most free capacity. For example:
- If +nodeA+ has more free +cpus+, and +nodeB+ has more free +memory+,
then their free capacities are equal.
- If +nodeA+ has more free +cpus+, while +nodeB+ has more free +memory+ and +storage+,
then +nodeB+ has more free capacity.
=== Which resource is preferred to be assigned first? ===
- The resource that has the highest +priority+ (see <>) gets allocated first.
- If their priorities are equal, check whether they are already running. The
resource that has the highest score on the node where it's running gets allocated
first, to prevent resource shuffling.
- If the scores above are equal or the resources are not running, the resource
that has the highest score on the preferred node gets allocated first.
- If the scores above are equal, the first runnable resource listed in the CIB
gets allocated first.
== Limitations and Workarounds ==
The type of problem Pacemaker is dealing with here is known as the
http://en.wikipedia.org/wiki/Knapsack_problem[knapsack problem] and falls into
the http://en.wikipedia.org/wiki/NP-complete[NP-complete] category of computer
science problems -- a fancy way of saying "it takes a really long time
to solve".
Clearly, in an HA cluster, it's not acceptable to spend minutes, let alone hours
or days, finding an optimal solution while services remain unavailable.
So instead of trying to solve the problem completely, Pacemaker uses a
'best effort' algorithm for determining which node should host a particular
service. This means it arrives at a solution much faster than traditional
linear programming algorithms, but at the price of potentially leaving some
services stopped.
In the contrived example at the start of this chapter:
- +rsc-small+ would be allocated to +node1+
- +rsc-medium+ would be allocated to +node2+
- +rsc-large+ would remain inactive
Which is not ideal.
There are various approaches to dealing with the limitations of
Pacemaker's placement strategy:
Ensure you have sufficient physical capacity.::
It might sound obvious, but if the physical capacity of your nodes is (close to)
maxed out by the cluster under normal conditions, then failover isn't going to
go well. Even without the utilization feature, you'll start hitting timeouts and
getting secondary failures.
Build some buffer into the capabilities advertised by the nodes.::
Advertise slightly more resources than you physically have, on the (usually valid)
assumption that a resource will not use 100% of the configured amount of
CPU, memory and so forth 'all' the time. This practice is sometimes called 'overcommit'.
Specify resource priorities.::
If the cluster is going to sacrifice services, it should be the ones you care
about (comparatively) the least. Ensure that resource priorities are properly set
so that your most important resources are scheduled first.
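
For example, one way to raise a resource's +priority+ meta-attribute is with
`crm_resource` (a sketch; +rsc-large+ is the example resource from earlier in
this chapter, and +10+ is an arbitrary illustrative value):

----
# crm_resource --resource rsc-large --meta --set-parameter priority --parameter-value 10
----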