diff --git a/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt b/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
index dc17610b49..8a3f9705a7 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
@@ -1,331 +1,331 @@
= Multi-Site Clusters and Tickets =
Apart from local clusters, Pacemaker also supports multi-site clusters.
That means you can have multiple, geographically dispersed sites, each with a
local cluster. Failover between these clusters can be coordinated
manually by the administrator, or automatically by a higher-level entity called
a 'Cluster Ticket Registry (CTR)'.
== Challenges for Multi-Site Clusters ==
Typically, multi-site environments are too far apart to support
synchronous communication and data replication between the sites.
That leads to significant challenges:
- How do we make sure that a cluster site is up and running?
- How do we make sure that resources are only started once?
- How do we make sure that quorum can be reached between the different
sites and a split-brain scenario avoided?
- How do we manage failover between sites?
- How do we deal with high latency in case of resources that need to be
stopped?
In the following sections, learn how to meet these challenges.
== Conceptual Overview ==
Multi-site clusters can be considered as “overlay” clusters where
each cluster site corresponds to a cluster node in a traditional cluster.
The overlay cluster can be managed by a CTR in order to
guarantee that the cluster resources will be highly
available across different cluster sites. This is achieved by using
'tickets' that are treated as a failover domain between cluster
sites, in case a site goes down.
The following sections explain the individual components and mechanisms
that were introduced for multi-site clusters in more detail.
=== Ticket ===
Tickets are, essentially, cluster-wide attributes. A ticket grants the
right to run certain resources on a specific cluster site. Resources can
be bound to a certain ticket by +rsc_ticket+ constraints. Only if the
ticket is available at a site can the respective resources be started there.
Vice versa, if the ticket is revoked, the resources depending on that
ticket must be stopped.
The ticket thus is similar to a 'site quorum', i.e. the permission to
manage/own resources associated with that site. (One can also think of the
current +have-quorum+ flag as a special, cluster-wide ticket that is granted in
case of node majority.)
Tickets can be granted and revoked either manually by administrators
(which could be the default for classic enterprise clusters), or via
the automated CTR mechanism described below.
A ticket can only be owned by one site at a time. Initially, none
of the sites has a ticket. Each ticket must be granted once by the cluster
administrator.
The presence or absence of tickets for a site is stored in the CIB as a
cluster status. With regard to a certain ticket, there are only two states
for a site: +true+ (the site has the ticket) or +false+ (the site does
not have the ticket). The absence of a certain ticket (during the initial
state of the multi-site cluster) is the same as the value +false+.
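
For illustration, a granted ticket might appear in the status section of the
CIB roughly as sketched here. The element and attribute names (+ticket_state+,
+granted+) are stated as an assumption about how Pacemaker records ticket
state and may differ between versions; query your own CIB (for example, with
`cibadmin --query`) to see the actual representation.

.Sketch of a ticket recorded in the CIB status section (assumed layout)
====
[source,XML]
-------
<status>
  <tickets>
    <!-- Assumed representation: ticketA is currently granted to this site -->
    <ticket_state id="ticketA" granted="true"/>
  </tickets>
</status>
-------
====
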
=== Dead Man Dependency ===
A site can only activate resources safely if it can be sure that the
other site has deactivated them. However, after a ticket is revoked, it can
take a long time until all resources depending on that ticket are stopped
"cleanly", especially in case of cascaded resources. To cut that process
short, the concept of a 'Dead Man Dependency' was introduced.
If a dead man dependency is in force and a ticket is revoked from a site, the
nodes that are hosting dependent resources are fenced. This considerably speeds
up the recovery process of the cluster and makes sure that resources can be
migrated more quickly.
This can be configured by specifying a +loss-policy="fence"+ in
+rsc_ticket+ constraints.
=== Cluster Ticket Registry ===
A CTR is a network daemon that automatically handles granting, revoking, and
timing out tickets (instead of the administrator revoking the ticket somewhere,
waiting for everything to stop, and then granting it on the desired site).
Pacemaker does not implement its own CTR, but interoperates with external
software designed for that purpose (similar to how resource and fencing agents
are not directly part of Pacemaker).
Participating clusters run the CTR daemons, which connect to each other, exchange
information about their connectivity, and vote on which site gets which
tickets.
A ticket is granted to a site only once the CTR is sure that the ticket
has been relinquished by the previous owner, implemented via a timer in most
scenarios. If a site loses connection to its peers, its tickets time out and
recovery occurs. After the connection timeout plus the recovery timeout has
passed, the other sites are allowed to re-acquire the ticket and start the
resources again.
This can also be thought of as a "quorum server", except that it is not
a single quorum ticket, but several.
=== Configuration Replication ===
As usual, the CIB is synchronized within each cluster, but it is 'not' synchronized
across cluster sites of a multi-site cluster. You have to configure the resources
that will be highly available across the multi-site cluster for every site
accordingly.
[[s-ticket-constraints]]
== Configuring Ticket Dependencies ==
The `rsc_ticket` constraint lets you specify the resources depending on a certain
ticket. Together with the constraint, you can set a `loss-policy` that defines
what should happen to the respective resources if the ticket is revoked.
The attribute `loss-policy` can have the following values:
* +fence:+ Fence the nodes that are running the relevant resources.
* +stop:+ Stop the relevant resources.
* +freeze:+ Do nothing to the relevant resources.
* +demote:+ Demote relevant resources that are running in master mode to slave mode.
.Constraint that fences node if +ticketA+ is revoked
====
[source,XML]
-------
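<!-- Sketch reconstructed from the description that follows this listing:
     the ID, resource, ticket, and loss policy are the ones named there. -->
<rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" ticket="ticketA" loss-policy="fence"/>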
-------
====
The example above creates a constraint with the ID +rsc1-req-ticketA+. It
defines that the resource +rsc1+ depends on +ticketA+ and that the node running
the resource should be fenced if +ticketA+ is revoked.
If resource +rsc1+ were a multi-state resource (i.e. it could run in master or
slave mode), you might want to configure that only master mode
depends on +ticketA+. With the following configuration, +rsc1+ will be
demoted to slave mode if +ticketA+ is revoked:
.Constraint that demotes +rsc1+ if +ticketA+ is revoked
====
[source,XML]
-------
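<!-- Sketch only: the constraint ID is an assumption. rsc-role limits the
     dependency to the master role, and loss-policy="demote" gives the
     behavior described in the preceding paragraph. -->
<rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" rsc-role="Master" ticket="ticketA" loss-policy="demote"/>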
-------
====
You can create multiple `rsc_ticket` constraints to let multiple resources
depend on the same ticket. However, `rsc_ticket` also supports resource sets,
so one can easily list all the resources in one `rsc_ticket` constraint instead.
.Ticket constraint for multiple resources
====
[source,XML]
-------
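<!-- Sketch only: all IDs and the resource names rsc1, group1, clone1, and ms1
     are illustrative assumptions. Two resource sets allow resources with
     different roles to share a single constraint on ticketA. -->
<rsc_ticket id="resources-dep-ticketA" ticket="ticketA" loss-policy="fence">
  <resource_set id="resources-dep-ticketA-0" role="Started">
    <resource_ref id="rsc1"/>
    <resource_ref id="group1"/>
    <resource_ref id="clone1"/>
  </resource_set>
  <resource_set id="resources-dep-ticketA-1" role="Master">
    <resource_ref id="ms1"/>
  </resource_set>
</rsc_ticket>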
-------
====
In the example above, there are two resource sets, so we can list resources
with different roles in a single +rsc_ticket+ constraint. There's no dependency
between the two resource sets, and there's no dependency among the
resources within a resource set. Each of the resources just depends on
+ticketA+.
Referencing resource templates in +rsc_ticket+ constraints, and even
referencing them within resource sets, is also supported.
If you want other resources to depend on further tickets, create as many
constraints as necessary with +rsc_ticket+.
== Managing Multi-Site Clusters ==
=== Granting and Revoking Tickets Manually ===
You can grant tickets to sites or revoke them from sites manually.
If you want to re-distribute a ticket, you should wait for
the dependent resources to stop cleanly at the previous site before you
grant the ticket to the new site.
Use the `crm_ticket` command line tool to grant and revoke tickets.
To grant a ticket to this site:
-------
# crm_ticket --ticket ticketA --grant
-------
To revoke a ticket from this site:
-------
# crm_ticket --ticket ticketA --revoke
-------
[IMPORTANT]
====
If you are managing tickets manually, use the `crm_ticket` command with
great care, because it cannot check whether the same ticket is already
granted elsewhere.
====
=== Granting and Revoking Tickets via a Cluster Ticket Registry ===
We will use https://github.com/ClusterLabs/booth[Booth] here as an example of
software that can be used with Pacemaker as a Cluster Ticket Registry. Booth
implements the
-http://en.wikipedia.org/wiki/Paxos_%28computer_science%29['Paxos'] lease
+http://en.wikipedia.org/wiki/Raft_%28computer_science%29[Raft]
algorithm to guarantee the distributed consensus among different
cluster sites, and manages the ticket distribution (and thus the failover
process between sites).
Each of the participating clusters and 'arbitrators' runs the Booth daemon
`boothd`.
An 'arbitrator' is the multi-site equivalent of a quorum-only node in a local
cluster. If you have a setup with an even number of sites,
you need an additional instance to reach consensus about decisions such
as failover of resources across sites. In this case, add one or more
arbitrators running at additional sites. Arbitrators are single machines
that run a booth instance in a special mode. An arbitrator is especially
important for a two-site scenario, otherwise there is no way for one site
to distinguish between a network failure between it and the other site, and
a failure of the other site.
The most common multi-site scenario is probably a multi-site cluster with two
sites and a single arbitrator on a third site. However, technically, there are
no limitations with regard to the number of sites and the number of
arbitrators involved.
The `boothd` daemon at each site connects to its peers running at the other
sites and exchanges connectivity details. Once a ticket is granted to a site,
the booth mechanism will manage the ticket automatically: If the site which
holds the ticket is out of service, the booth daemons will vote on which
of the other sites will get the ticket. To protect against brief
connection failures, sites that lose the vote (either explicitly or
implicitly by being disconnected from the voting body) need to
relinquish the ticket after a time-out. This ensures that a
ticket will only be re-distributed after it has been relinquished by the
previous site. The resources that depend on that ticket will fail over
to the new site holding the ticket. The nodes that have run the
resources before will be treated according to the `loss-policy` you set
within the `rsc_ticket` constraint.
Before Booth can manage a certain ticket within the multi-site cluster,
you initially need to grant it to a site manually via the `booth` command-line
tool. After you have initially granted a ticket to a site, `boothd`
will take over and manage the ticket automatically.
[IMPORTANT]
====
The `booth` command-line tool can be used to grant, list, or
revoke tickets and can be run on any machine where `boothd` is running.
If you are managing tickets via Booth, use only `booth` for manual
intervention, not `crm_ticket`. That ensures the same ticket
will only be owned by one cluster site at a time.
====
==== Booth Requirements ====
* All clusters that will be part of the multi-site cluster must be based on
Pacemaker.
* Booth must be installed on all cluster nodes and on all arbitrators that will
be part of the multi-site cluster.
* Nodes belonging to the same cluster site should be synchronized via NTP. However,
time synchronization is not required between the individual cluster sites.
=== General Management of Tickets ===
Display the information of tickets:
-------
# crm_ticket --info
-------
Or you can monitor them with:
-------
# crm_mon --tickets
-------
Display the +rsc_ticket+ constraints that apply to a ticket:
-------
# crm_ticket --ticket ticketA --constraints
-------
When you want to do maintenance or manual switch-over of a ticket,
revoking the ticket would trigger the loss policies. If
+loss-policy="fence"+, the dependent resources could not be gracefully
stopped/demoted, and other unrelated resources could even be affected.
The proper way is making the ticket 'standby' first with:
-------
# crm_ticket --ticket ticketA --standby
-------
Then the dependent resources will be stopped or demoted gracefully without
triggering the loss policies.
If you have finished the maintenance and want to activate the ticket again,
you can run:
-------
# crm_ticket --ticket ticketA --activate
-------
== For more information ==
-* http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html[SUSE's Multi-site Clusters guide]
+* https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html[SUSE's Geo Clustering quick start]
* https://github.com/ClusterLabs/booth[Booth]
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
index 07211865d9..afba0a9285 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
@@ -1,227 +1,227 @@
= Utilization and Placement Strategy =
Pacemaker decides where to place a resource according to the resource
allocation scores on every node. The resource will be allocated to the
node where the resource has the highest score.
If the resource allocation scores on all the nodes are equal, by the default
placement strategy, Pacemaker will choose a node with the least number of
allocated resources for balancing the load. If the number of resources on each
node is equal, the first eligible node listed in the CIB will be chosen to run
the resource.
Often, in real-world situations, different resources use significantly
different proportions of a node's capacities (memory, I/O, etc.).
We cannot balance the load ideally just according to the number of resources
allocated to a node. Besides, if resources are placed such that their combined
requirements exceed the provided capacity, they may fail to start completely or
run with degraded performance.
To take these factors into account, Pacemaker allows you to configure:
. The capacity a certain node provides.
. The capacity a certain resource requires.
. An overall strategy for placement of resources.
== Utilization attributes ==
To configure the capacity that a node provides or a resource requires,
you can use 'utilization attributes' in +node+ and +resource+ objects.
You can name utilization attributes according to your preferences and define as
many name/value pairs as your configuration needs. However, the attributes'
values must be integers.
.Specifying CPU and RAM capacities of two nodes
====
[source,XML]
----
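<!-- Sketch only: the attribute names cpu and memory and all values are
     illustrative assumptions; the node names node1 and node2 are the ones
     used later in this chapter. -->
<node id="node1" uname="node1">
  <utilization id="node1-utilization">
    <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
  </utilization>
</node>
<node id="node2" uname="node2">
  <utilization id="node2-utilization">
    <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
    <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
  </utilization>
</node>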
----
====
.Specifying CPU and RAM consumed by several resources
====
[source,XML]
----
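<!-- Sketch only: the Dummy agent, the attribute names, and all values are
     illustrative assumptions; the resource names rsc-small, rsc-medium, and
     rsc-large are the ones used later in this chapter. -->
<primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-small-utilization">
    <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
    <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
  </utilization>
</primitive>
<primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-medium-utilization">
    <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
  </utilization>
</primitive>
<primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-large-utilization">
    <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
    <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
  </utilization>
</primitive>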
----
====
A node is considered eligible for a resource if it has sufficient free
capacity to satisfy the resource's requirements. The nature of the required
or provided capacities is completely irrelevant to Pacemaker -- it just makes
sure that all capacity requirements of a resource are satisfied before placing
a resource on a node.
== Placement Strategy ==
After you have configured the capacities your nodes provide and the
capacities your resources require, you need to set the +placement-strategy+
in the global cluster options, otherwise the capacity configurations have
'no effect'.
Four values are available for the +placement-strategy+:
+default+::
Utilization values are not taken into account at all.
Resources are allocated according to allocation scores. If scores are equal,
resources are evenly distributed across nodes.
+utilization+::
Utilization values are taken into account 'only' when deciding whether a node
is considered eligible (i.e. whether it has sufficient free capacity to satisfy
the resource's requirements). Load-balancing is still done based on the
number of resources allocated to a node.
+balanced+::
Utilization values are taken into account when deciding whether a node
is eligible to serve a resource 'and' when load-balancing, so an attempt is
made to spread the resources in a way that optimizes resource performance.
+minimal+::
Utilization values are taken into account 'only' when deciding whether a node
is eligible to serve a resource. For load-balancing, an attempt is made to
concentrate the resources on as few nodes as possible, thereby enabling
possible power savings on the remaining nodes.
Set +placement-strategy+ with `crm_attribute`:
----
# crm_attribute --name placement-strategy --update balanced
----
Now Pacemaker will ensure the load from your resources will be distributed
evenly throughout the cluster, without the need for convoluted sets of
colocation constraints.
== Allocation Details ==
=== Which node is preferred to get consumed first when allocating resources? ===
-- The node with the highest weight (cumulative score after taking into account
- location preferences, constraints, etc.) gets consumed first.
+- The node with the highest node weight gets consumed first. Node weight
+ is a score maintained by the cluster to represent node health.
-- If multiple nodes have the same weight:
+- If multiple nodes have the same node weight:
* If +placement-strategy+ is +default+ or +utilization+,
the node that has the least number of allocated resources gets consumed first.
** If their numbers of allocated resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If +placement-strategy+ is +balanced+,
the node that has the most free capacity gets consumed first.
** If the free capacities of the nodes are equal,
the node that has the least number of allocated resources gets consumed first.
*** If their numbers of allocated resources are equal,
the first eligible node listed in the CIB gets consumed first.
* If +placement-strategy+ is +minimal+,
the first eligible node listed in the CIB gets consumed first.
=== Which node has more free capacity? ===
If only one type of utilization attribute has been defined, free capacity
is a simple numeric comparison.
If multiple types of utilization attributes have been defined, then
the node that is numerically highest in the most attribute types
has the most free capacity. For example:
- If +nodeA+ has more free +cpus+, and +nodeB+ has more free +memory+,
then their free capacities are equal.
- If +nodeA+ has more free +cpus+, while +nodeB+ has more free +memory+ and +storage+,
then +nodeB+ has more free capacity.
=== Which resource is preferred to be assigned first? ===
- The resource that has the highest +priority+ (see <>) gets allocated first.
- If their priorities are equal, check whether they are already running. The
resource that has the highest score on the node where it's running gets allocated
first, to prevent resource shuffling.
- If the scores above are equal or the resources are not running, the resource that
has the highest score on the preferred node gets allocated first.
- If the scores above are equal, the first runnable resource listed in the CIB
gets allocated first.
== Limitations and Workarounds ==
The type of problem Pacemaker is dealing with here is known as the
http://en.wikipedia.org/wiki/Knapsack_problem[knapsack problem] and falls into
the http://en.wikipedia.org/wiki/NP-complete[NP-complete] category of computer
science problems -- a fancy way of saying "it takes a really long time
to solve".
Clearly, in an HA cluster, it's not acceptable to spend minutes, let alone hours
or days, finding an optimal solution while services remain unavailable.
So instead of trying to solve the problem completely, Pacemaker uses a
'best effort' algorithm for determining which node should host a particular
service. This means it arrives at a solution much faster than traditional
linear programming algorithms, but at the price of possibly leaving some
services stopped.
In the contrived example at the start of this chapter:
- +rsc-small+ would be allocated to +node1+
- +rsc-medium+ would be allocated to +node2+
- +rsc-large+ would remain inactive
Which is not ideal.
There are various approaches to dealing with the limitations of
Pacemaker's placement strategy:
Ensure you have sufficient physical capacity.::
It might sound obvious, but if the physical capacity of your nodes is (close to)
maxed out by the cluster under normal conditions, then failover isn't going to
go well. Even without the utilization feature, you'll start hitting timeouts and
getting secondary failures.
Build some buffer into the capabilities advertised by the nodes.::
Advertise slightly more resources than we physically have, on the (usually valid)
assumption that a resource will not use 100% of the configured amount of
CPU, memory and so forth 'all' the time. This practice is sometimes called 'overcommit'.
Specify resource priorities.::
If the cluster is going to sacrifice services, it should be the ones you care
about (comparatively) the least. Ensure that resource priorities are properly set
so that your most important resources are scheduled first.
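
As a minimal sketch of the last approach, +priority+ can be set as a resource
meta-attribute. The resource name, the Dummy agent, the IDs, and the value 10
are assumptions chosen for illustration; any resource whose priority is higher
than that of its competitors is considered for placement before them.

.Raising the priority of one resource (illustrative sketch)
====
[source,XML]
----
<primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
  <meta_attributes id="rsc-large-meta_attributes">
    <!-- Resources without an explicit priority keep the default of 0 -->
    <nvpair id="rsc-large-meta_attributes-priority" name="priority" value="10"/>
  </meta_attributes>
</primitive>
----
====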