diff --git a/doc/Pacemaker_Explained/en-US/Author_Group.xml b/doc/Pacemaker_Explained/en-US/Author_Group.xml
index bba7f0bf16..f787962e07 100644
--- a/doc/Pacemaker_Explained/en-US/Author_Group.xml
+++ b/doc/Pacemaker_Explained/en-US/Author_Group.xml
@@ -1,49 +1,58 @@
Andrew Beekhof
Red Hat
Primary author
andrew@beekhof.net
Dan Frîncu
Romanian translation
df.cluster@gmail.com
Philipp Marek
LINBit
Style and formatting updates. Indexing.
philipp.marek@linbit.com
Tanja Roth
SUSE
Utilization chapter
+ Resource Templates chapter
Multi-Site Clusters chapter
- troth@suse.com
+ taroth@suse.com
Lars Marowsky-Bree
SUSE
Multi-Site Clusters chapter
lmb@suse.com
Yan Gao
SUSE
Utilization chapter
+ Resource Templates chapter
Multi-Site Clusters chapter
ygao@suse.com
Thomas Schraitle
SUSE
Utilization chapter
+ Resource Templates chapter
Multi-Site Clusters chapter
toms@suse.com
+
+ Dejan Muhamedagic
+ SUSE
+ Resource Templates chapter
+ dmuhamedagic@suse.com
+
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt b/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
index b691f41a75..911d841bf8 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Multi-site-Clusters.txt
@@ -1,319 +1,321 @@
= Multi-Site Clusters and Tickets =
== Abstract ==
Apart from local clusters, Pacemaker also supports multi-site clusters.
That means you can have multiple, geographically dispersed sites with a
local cluster each. Failover between these clusters can be coordinated
by a higher-level entity, the so-called `CTR (Cluster Ticket Registry)`.
== Challenges for Multi-Site Clusters ==
Typically, multi-site environments are too far apart to support
synchronous communication between the sites and synchronous data
replication. That leads to the following challenges:
- How to make sure that a cluster site is up and running?
- How to make sure that resources are only started once?
- How to make sure that quorum can be reached between the different
sites and a split brain scenario can be avoided?
- How to manage failover between the sites?
- How to deal with high latency in case of resources that need to be
stopped?
In the following sections, learn how to meet these challenges.
== Conceptual Overview ==
Multi-site clusters can be considered as “overlay” clusters where
each cluster site corresponds to a cluster node in a traditional cluster.
The overlay cluster can be managed by a `CTR (Cluster Ticket Registry)`
mechanism. It guarantees that the cluster resources will be highly
available across different cluster sites. This is achieved by using
so-called `tickets` that are treated as failover domains between cluster
sites, in case a site goes down.
The following list explains the individual components and mechanisms
that were introduced for multi-site clusters in more detail.
=== Components and Concepts ===
==== Ticket ====
"Tickets" are, essentially, cluster-wide attributes. A ticket grants the
right to run certain resources on a specific cluster site. Resources can
be bound to a certain ticket by `rsc_ticket` dependencies. The respective
resources are started only if the ticket is available at a site. Vice versa,
if the ticket is revoked, the resources depending on that ticket need to
be stopped.
The ticket thus is similar to a 'site quorum'; i.e., the permission to
manage/own resources associated with that site.
(One can also think of the current `have-quorum` flag as a special, cluster-wide
ticket that is granted in case of node majority.)
These tickets can be granted/revoked either manually by administrators
(which could be the default for the classic enterprise clusters), or via
an automated `CTR` mechanism described further below.
A ticket can only be owned by one site at a time. Initially, none
of the sites has a ticket. Each ticket must be granted once by the cluster
administrator.
The presence or absence of tickets for a site is stored in the CIB as part of
the cluster status. With regard to a certain ticket, there are only two states
for a site: `true` (the site has the ticket) or `false` (the site does
not have the ticket). The absence of a certain ticket (during the initial
state of the multi-site cluster) is also reflected by the value `false`.
==== Dead Man Dependency ====
A site can only activate the resources safely if it can be sure that the
other site has deactivated them. However, after a ticket is revoked, it can
take a long time until all resources depending on that ticket are stopped
"cleanly", especially in case of cascaded resources. To cut that process
short, the concept of a `Dead Man Dependency` was introduced:
- If the ticket is revoked from a site, the nodes that are hosting
dependent resources are fenced. This considerably speeds up the recovery
process of the cluster and makes sure that resources can be migrated more
quickly.
This can be configured by specifying a `loss-policy="fence"` in
`rsc_ticket` constraints.
==== CTR (Cluster Ticket Registry) ====
This is for those scenarios where ticket management is supposed to
be automatic (instead of the administrator revoking the ticket somewhere,
waiting for everything to stop, and then granting it on the desired site).
A `CTR` is a network daemon that handles granting,
revoking, and timing out "tickets". The participating clusters would run
the daemons that would connect to each other, exchange information on
their connectivity details, and vote on which site gets which ticket(s).
A ticket would only be granted to a site once the CTR can be sure that it
has been relinquished by the previous owner, which would need to be
implemented via a timer in most scenarios. If a site loses connection
to its peers, its tickets time out and recovery occurs. After the
connection timeout plus the recovery timeout has passed, the other sites
are allowed to re-acquire the ticket and start the resources again.
This can also be thought of as a "quorum server", except that it is not
a single quorum ticket, but several.
==== Configuration Replication ====
As usual, the CIB is synchronized within each cluster, but it is not synchronized
across cluster sites of a multi-site cluster. You have to configure the resources
that will be highly available across the multi-site cluster accordingly for
every site.
== Configuring Ticket Dependencies ==
The `rsc_ticket` constraint lets you specify the resources depending on a certain
ticket. Together with the constraint, you can set a `loss-policy` that defines
what should happen to the respective resources if the ticket is revoked.
The attribute `loss-policy` can have the following values:
fence:: Fence the nodes that are running the relevant resources.
stop:: Stop the relevant resources.
freeze:: Do nothing to the relevant resources.
demote:: Demote relevant resources that are running in master mode to slave mode.
An example of how to configure a `rsc_ticket` constraint:
[source,XML]
-------
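<!-- Sketch: a rsc_ticket constraint matching the description below -->
<rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" ticket="ticketA" loss-policy="fence"/>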
-------
This creates a constraint with the ID `rsc1-req-ticketA`. It defines that the
resource `rsc1` depends on `ticketA` and that the node running the resource should
be fenced in case `ticketA` is revoked.
If resource `rsc1` were a multi-state resource that can run in master or
slave mode, you may want only `rsc1's` master mode to depend on
`ticketA`. With the following configuration, `rsc1` will be
demoted to slave mode if `ticketA` is revoked:
[source,XML]
-------
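<!-- Sketch: constrain only rsc1's master role to ticketA -->
<rsc_ticket id="rsc1-req-ticketA-master" rsc="rsc1" rsc-role="Master" ticket="ticketA" loss-policy="demote"/>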
-------
-If you want other resources to depend on further tickets, create as many
-constraints as necessary with `rsc_ticket`.
-
-
+You can create more `rsc_ticket` constraints to let multiple resources
+depend on the same ticket.
+
`rsc_ticket` also supports resource sets, so one can easily list all of the
resources in one `rsc_ticket` constraint. For example:
[source,XML]
-------
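<!-- Sketch: one rsc_ticket constraint covering several resources via resource
     sets; the referenced resource IDs are illustrative -->
<rsc_ticket id="resources-dep-ticketA" ticket="ticketA" loss-policy="fence">
  <resource_set id="resources-dep-ticketA-0" role="Started">
    <resource_ref id="rsc1"/>
    <resource_ref id="group1"/>
    <resource_ref id="clone1"/>
  </resource_set>
  <resource_set id="resources-dep-ticketA-1" role="Master">
    <resource_ref id="ms1"/>
  </resource_set>
</rsc_ticket>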
-------
+If you want other resources to depend on further tickets, create as many
+constraints as necessary with `rsc_ticket`.
+
== Managing Multi-Site Clusters ==
=== Managing Tickets Manually ===
You can grant tickets to sites or revoke them from sites manually.
However, if you want to re-distribute a ticket, you should wait for the
dependent resources to stop cleanly at the previous site before you
grant the ticket to the new desired site.
Use the `crm_ticket` command line tool to grant, revoke, or query tickets.
To grant a ticket to this site:
[source,Bash]
-------
# crm_ticket -t ticketA -v true
-------
To revoke a ticket from this site:
[source,Bash]
-------
# crm_ticket -t ticketA -v false
-------
Query if the specified ticket is granted to this site or not:
[source,Bash]
-------
# crm_ticket -t ticketA -G
-------
Query the time the specified ticket was last granted to this site:
[source,Bash]
-------
# crm_ticket -t ticketA -T
-------
[IMPORTANT]
====
If you are managing tickets manually, use the `crm_ticket` command with
great care, as it cannot verify whether the same ticket is already
granted elsewhere.
====
=== Managing Tickets via a Cluster Ticket Registry ===
==== Booth ====
Booth is an implementation of a `Cluster Ticket Registry`, also referred to
as a `Cluster Ticket Manager`.
Booth is the instance managing the ticket distribution and thus
the failover process between the sites of a multi-site cluster. Each of
the participating clusters and arbitrators runs a service, the `boothd`.
It connects to the booth daemons running at the other sites and
exchanges connectivity details. Once a ticket is granted to a site, the
booth mechanism will manage the ticket automatically: If the site which
holds the ticket is out of service, the booth daemons will vote which
of the other sites will get the ticket. To protect against brief
connection failures, sites that lose the vote (either explicitly or
implicitly by being disconnected from the voting body) need to
relinquish the ticket after a time-out. This makes sure that a
ticket will only be re-distributed after it has been relinquished by the
previous site. The resources that depend on that ticket will fail over
to the new site holding the ticket. The nodes that have run the
resources before will be treated according to the `loss-policy` you set
within the `rsc_ticket` constraint.
Before booth can manage a certain ticket within the multi-site cluster,
you initially need to grant it to a site manually via the `booth client` command.
After you have initially granted a ticket to a site, the booth mechanism
will take over and manage the ticket automatically.
[IMPORTANT]
====
The `booth client` command line tool can be used to grant, list, or
revoke tickets. The `booth client` commands work on any machine where
the booth daemon is running.
If you are managing tickets via `Booth`, only use `booth client` for manual
intervention instead of `crm_ticket`. This makes sure that the same ticket
will only be owned by one cluster site at a time.
====
Booth includes an implementation of
http://en.wikipedia.org/wiki/Paxos_algorithm['Paxos'] and the 'Paxos Lease'
algorithm, which guarantees distributed consensus among different
cluster sites.
[TIP]
====
`Arbitrator`
Each site runs one booth instance that is responsible for communicating
with the other sites. If you have a setup with an even number of sites,
you need an additional instance to reach consensus about decisions such
as failover of resources across sites. In this case, add one or more
arbitrators running at additional sites. Arbitrators are single machines
that run a booth instance in a special mode. As all booth instances
communicate with each other, arbitrators help to make more reliable
decisions about granting or revoking tickets.
An arbitrator is especially important for a two-site scenario: For example,
if site `A` can no longer communicate with site `B`, there are two possible
causes for that:
- A network failure between `A` and `B`.
- Site `B` is down.
However, if site `C` (the arbitrator) can still communicate with site `B`,
site `B` must still be up and running.
====
===== Requirements =====
- All clusters that will be part of the multi-site cluster must be based on Pacemaker.
- Booth must be installed on all cluster nodes and on all arbitrators that will
be part of the multi-site cluster.
The most common scenario is probably a multi-site cluster with two sites and a
single arbitrator on a third site. However, technically, there are no limitations
with regard to the number of sites and the number of arbitrators involved.
Nodes belonging to the same cluster site should be synchronized via NTP. However,
time synchronization is not required between the individual cluster sites.
== For more information ==
`Multi-site Clusters`
http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html
`Booth`
https://github.com/ClusterLabs/booth
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
index 79ed0ecb9e..a6cb66f752 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Options.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Options.txt
@@ -1,276 +1,282 @@
= Cluster Options =
== Special Options ==
indexterm:[Special Cluster Options]
indexterm:[Cluster Options,Special Options]
The reason for these fields to be placed at the top level instead of
with the rest of cluster options is simply a matter of parsing. These
options are used by the configuration database which is, by design,
mostly ignorant of the content it holds. So the decision was made to
place them in an easy-to-find location.
== Configuration Version ==
indexterm:[Configuration Version, Cluster Option]
indexterm:[Cluster Options,Configuration Version]
When a node joins the cluster, the cluster will perform a check to see
who has the best configuration based on the fields below. It then
asks the node with the highest (+admin_epoch+, +epoch+, +num_updates+)
tuple to replace the configuration on all the nodes - which makes
setting them, and setting them correctly, very important.
.Configuration Version Properties
[width="95%",cols="1m,5<",options="header",align="center"]
|=========================================================
|Field |Description
| admin_epoch |
indexterm:[admin_epoch Cluster Option]
indexterm:[Cluster Options,admin_epoch]
Never modified by the cluster. Use this to make the configurations on
any inactive nodes obsolete.
_Never set this value to zero_. In such cases, the cluster cannot tell
the difference between your configuration and the "empty" one used
when nothing is found on disk.
| epoch |
indexterm:[epoch Cluster Option]
indexterm:[Cluster Options,epoch]
Incremented every time the configuration is updated (usually by the admin)
| num_updates |
indexterm:[num_updates Cluster Option]
indexterm:[Cluster Options,num_updates]
Incremented every time the configuration or status is updated (usually by the cluster)
|=========================================================
== Other Fields ==
.Properties Controlling Validation
[width="95%",cols="1m,5<",options="header",align="center"]
|=========================================================
|Field |Description
| validate-with |
indexterm:[validate-with Cluster Option]
indexterm:[Cluster Options,validate-with]
Determines the type of validation being done on the configuration. If
set to "none", the cluster will not verify that updates conform to the
DTD (nor reject ones that don't). This option can be useful when
operating a mixed version cluster during an upgrade.
|=========================================================
== Fields Maintained by the Cluster ==
.Properties Maintained by the Cluster
[width="95%",cols="1m,5<",options="header",align="center"]
|=========================================================
|Field |Description
|crm-debug-origin |
indexterm:[crm-debug-origin Cluster Fields]
indexterm:[Cluster Fields,crm-debug-origin]
Indicates where the last update came from. Informational purposes only.
|cib-last-written |
indexterm:[cib-last-written Cluster Fields]
indexterm:[Cluster Fields,cib-last-written]
Indicates when the configuration was last written to disk. Informational purposes only.
|dc-uuid |
indexterm:[dc-uuid Cluster Fields]
indexterm:[Cluster Fields,dc-uuid]
Indicates which cluster node is the current leader. Used by the
cluster when placing resources and determining the order of some
events.
|have-quorum |
indexterm:[have-quorum Cluster Fields]
indexterm:[Cluster Fields,have-quorum]
Indicates if the cluster has quorum. If false, this may mean that the
cluster cannot start resources or fence other nodes. See
+no-quorum-policy+ below.
|=========================================================
Note that although these fields can be written to by the admin, in
most cases the cluster will overwrite any values specified by the
admin with the "correct" ones. To change the +admin_epoch+, for
example, one would use:
pass:[cibadmin --modify --crm_xml '<cib admin_epoch="42"/>']
A complete set of fields will look something like this:
.An example of the fields set for a cib object
[source,XML]
-------
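<!-- Sketch: representative values for the fields described above -->
<cib admin_epoch="1" epoch="27" num_updates="13" validate-with="pacemaker-1.2"
     crm-debug-origin="do_update_resource" cib-last-written="Sat Dec 15 10:32:51 2012"
     have-quorum="1" dc-uuid="1"/>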
-------
== Cluster Options ==
Cluster options, as you might expect, control how the cluster behaves
when confronted with certain situations.
They are grouped into sets and, in advanced configurations, there may
be more than one.
footnote:[This will be described later in the section on where we will show how to have the cluster use
different sets of options during working hours (when downtime is
usually to be avoided at all costs) than it does during the weekends
(when resources can be moved to their preferred hosts without
bothering end users)]
For now we will describe the simple case where each option is present at most once.
== Available Cluster Options ==
.Cluster Options
[width="95%",cols="5m,2m,13",options="header",align="center"]
|=========================================================
|Option |Default |Description
| batch-limit | 30 |
indexterm:[batch-limit Cluster Options]
indexterm:[Cluster Options,batch-limit]
The number of jobs that the TE is allowed to execute in parallel. The
"correct" value will depend on the speed and load of your network and
cluster nodes.
+| migration-limit | -1 (unlimited) |
+indexterm:[migration-limit Cluster Options]
+indexterm:[Cluster Options,migration-limit]
+The number of migration jobs that the TE is allowed to execute in
+parallel on a node.
+
| no-quorum-policy | stop |
indexterm:[no-quorum-policy Cluster Options]
indexterm:[Cluster Options,no-quorum-policy]
What to do when the cluster does not have quorum. Allowed values:
* ignore - continue all resource management
* freeze - continue resource management, but don't recover resources from nodes not in the affected partition
* stop - stop all resources in the affected cluster partition
* suicide - fence all nodes in the affected cluster partition
| symmetric-cluster | TRUE |
indexterm:[symmetric-cluster Cluster Options]
indexterm:[Cluster Options,symmetric-cluster]
Can all resources run on any node by default?
| stonith-enabled | TRUE |
indexterm:[stonith-enabled Cluster Options]
indexterm:[Cluster Options,stonith-enabled]
Should failed nodes and nodes with resources that can't be stopped be
shot? If you value your data, set up a STONITH device and enable this.
If true, or unset, the cluster will refuse to start resources unless
one or more STONITH resources have been configured also.
| stonith-action | reboot |
indexterm:[stonith-action Cluster Options]
indexterm:[Cluster Options,stonith-action]
Action to send to STONITH device. Allowed values: reboot, poweroff.
| cluster-delay | 60s |
indexterm:[cluster-delay Cluster Options]
indexterm:[Cluster Options,cluster-delay]
Round trip delay over the network (excluding action execution). The
"correct" value will depend on the speed and load of your network and
cluster nodes.
| stop-orphan-resources | TRUE |
indexterm:[stop-orphan-resources Cluster Options]
indexterm:[Cluster Options,stop-orphan-resources]
Should deleted resources be stopped?
| stop-orphan-actions | TRUE |
indexterm:[stop-orphan-actions Cluster Options]
indexterm:[Cluster Options,stop-orphan-actions]
Should deleted actions be cancelled?
| start-failure-is-fatal | TRUE |
indexterm:[start-failure-is-fatal Cluster Options]
indexterm:[Cluster Options,start-failure-is-fatal]
When set to FALSE, the cluster will instead use the resource's
+failcount+ and value for +resource-failure-stickiness+.
| pe-error-series-max | -1 (all) |
indexterm:[pe-error-series-max Cluster Options]
indexterm:[Cluster Options,pe-error-series-max]
The number of PE inputs resulting in ERRORs to save. Used when reporting problems.
| pe-warn-series-max | -1 (all) |
indexterm:[pe-warn-series-max Cluster Options]
indexterm:[Cluster Options,pe-warn-series-max]
The number of PE inputs resulting in WARNINGs to save. Used when reporting problems.
| pe-input-series-max | -1 (all) |
indexterm:[pe-input-series-max Cluster Options]
indexterm:[Cluster Options,pe-input-series-max]
The number of "normal" PE inputs to save. Used when reporting problems.
|=========================================================
You can always obtain an up-to-date list of cluster options, including
their default values, by running the pass:[pengine
metadata] command.
== Querying and Setting Cluster Options ==
indexterm:[Querying Cluster Options]
indexterm:[Setting Cluster Options]
indexterm:[Cluster Options,Querying]
indexterm:[Cluster Options,Setting]
Cluster options can be queried and modified using the
pass:[crm_attribute] tool. To get the current
value of +cluster-delay+, simply use:
pass:[crm_attribute --attr-name cluster-delay --get-value]
which is more simply written as
pass:[crm_attribute --get-value -n cluster-delay]
If a value is found, you'll see a result like this:
=======
pass:[ # crm_attribute --get-value -n cluster-delay]
name=cluster-delay value=60s
========
However, if no value is found, the tool will display an error:
=======
pass:[# crm_attribute --get-value -n clusta-deway]
name=clusta-deway value=(null)
Error performing operation: The object/attribute does not exist
========
To use a different value, e.g. +30s+, simply run:
pass:[crm_attribute --attr-name cluster-delay --attr-value 30s]
To go back to the cluster's default value you can delete the value, for example with this command:
pass:[crm_attribute --attr-name cluster-delay --delete-attr]
== When Options are Listed More Than Once ==
If you ever see something like the following, it means that the option you're modifying is present more than once.
.Deleting an option that is listed twice
=======
pass:[# crm_attribute --attr-name batch-limit --delete-attr]
Multiple attributes match name=batch-limit in crm_config:
Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit)
Value: 100 (set=custom, id=custom-batch-limit)
Please choose from one of the matches above and supply the 'id' with --attr-id
=======
In such cases follow the on-screen instructions to perform the
requested action. To determine which value is currently being used by
the cluster, please refer to the section on .
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Resource-Templates.txt b/doc/Pacemaker_Explained/en-US/Ch-Resource-Templates.txt
new file mode 100644
index 0000000000..ccc99d7719
--- /dev/null
+++ b/doc/Pacemaker_Explained/en-US/Ch-Resource-Templates.txt
@@ -0,0 +1,222 @@
+= Resource Templates =
+
+== Abstract ==
+
+If you want to create lots of resources with similar configurations, defining a
+resource template simplifies the task. Once defined, it can be referenced in
+primitives or in certain types of constraints.
+
+
+== Configuring Resources with Templates ==
+
+The primitives referencing the template will inherit all meta
+attributes, instance attributes, utilization attributes and operations defined
+in the template. You can also define specific attributes and operations for any
+of the primitives. If any of these are defined in both the template and the
+primitive, the values defined in the primitive will take precedence over the
+ones defined in the template.
+
+Hence, resource templates help to reduce the amount of configuration work.
+If any changes are needed, they can be done to the template definition and
+will take effect globally in all resource definitions referencing that
+template.
+
+Resource templates have a syntax similar to that of primitives. For example:
+
+[source,XML]
+----
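+<!-- Sketch: a template for Xen virtual machines; the agent and the attribute
+     values are illustrative -->
+<template id="vm-template" class="ocf" provider="heartbeat" type="Xen">
+  <meta_attributes id="vm-template-meta_attributes">
+    <nvpair id="vm-template-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
+  </meta_attributes>
+  <utilization id="vm-template-utilization">
+    <nvpair id="vm-template-utilization-memory" name="memory" value="512"/>
+  </utilization>
+  <operations>
+    <op id="vm-template-monitor-15s" interval="15s" name="monitor" timeout="60s"/>
+    <op id="vm-template-start-0" interval="0" name="start" timeout="60s"/>
+  </operations>
+</template>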
+----
+
+Once you have defined the new resource template, you can use it in primitives:
+
+[source,XML]
+----
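+<!-- Sketch: vm1 references the template and adds its own instance attributes -->
+<primitive id="vm1" template="vm-template">
+  <instance_attributes id="vm1-instance_attributes">
+    <nvpair id="vm1-instance_attributes-name" name="name" value="vm1"/>
+    <nvpair id="vm1-instance_attributes-xmfile" name="xmfile" value="/etc/xen/vm1"/>
+  </instance_attributes>
+</primitive>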
+----
+
+The new primitive `vm1` is going to inherit everything from `vm-template`. For
+example, the equivalent of the above two definitions would be:
+
+[source,XML]
+----
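+<!-- Sketch: what vm1 effectively looks like once the template is expanded -->
+<primitive id="vm1" class="ocf" provider="heartbeat" type="Xen">
+  <meta_attributes id="vm-template-meta_attributes">
+    <nvpair id="vm-template-meta_attributes-allow-migrate" name="allow-migrate" value="true"/>
+  </meta_attributes>
+  <utilization id="vm-template-utilization">
+    <nvpair id="vm-template-utilization-memory" name="memory" value="512"/>
+  </utilization>
+  <instance_attributes id="vm1-instance_attributes">
+    <nvpair id="vm1-instance_attributes-name" name="name" value="vm1"/>
+    <nvpair id="vm1-instance_attributes-xmfile" name="xmfile" value="/etc/xen/vm1"/>
+  </instance_attributes>
+  <operations>
+    <op id="vm-template-monitor-15s" interval="15s" name="monitor" timeout="60s"/>
+    <op id="vm-template-start-0" interval="0" name="start" timeout="60s"/>
+  </operations>
+</primitive>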
+----
+
+If you want to overwrite some attributes or operations, add them to the
+particular primitive's definition.
+
+For instance, the following new primitive `vm2` has special
+attribute values. Its `monitor` operation has a longer `timeout` and `interval`, and
+the primitive has an additional `stop` operation.
+
+[source,XML]
+----
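+<!-- Sketch: vm2 overrides some template values and adds a stop operation -->
+<primitive id="vm2" template="vm-template">
+  <meta_attributes id="vm2-meta_attributes">
+    <nvpair id="vm2-meta_attributes-allow-migrate" name="allow-migrate" value="false"/>
+  </meta_attributes>
+  <utilization id="vm2-utilization">
+    <nvpair id="vm2-utilization-memory" name="memory" value="1024"/>
+  </utilization>
+  <instance_attributes id="vm2-instance_attributes">
+    <nvpair id="vm2-instance_attributes-name" name="name" value="vm2"/>
+    <nvpair id="vm2-instance_attributes-xmfile" name="xmfile" value="/etc/xen/vm2"/>
+  </instance_attributes>
+  <operations>
+    <op id="vm2-monitor-30s" interval="30s" name="monitor" timeout="120s"/>
+    <op id="vm2-stop-0" interval="0" name="stop" timeout="60s"/>
+  </operations>
+</primitive>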
+----
+
+The following command shows the resulting definition of a resource:
+
+[source,Bash]
+----
+# crm_resource --query-xml --resource vm2
+----
+
+The following command shows its raw definition in the CIB:
+
+[source,Bash]
+----
+# crm_resource --query-xml-raw --resource vm2
+----
+
+== Referencing Templates in Constraints ==
+
+A resource template can be referenced in the following types of constraints:
+
+- `order` constraints
+- `colocation` constraints
+- `rsc_ticket` constraints (for multi-site clusters).
+
+Resource templates referenced in constraints stand for all primitives which are
+derived from that template. This means the constraint applies to all primitive
+resources referencing the resource template. Referencing resource templates in
+constraints is an alternative to resource sets and can simplify the cluster
+configuration considerably.
+
+For example:
+
+[source,XML]
+----
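+<!-- Sketch: colocate every primitive derived from vm-template with a
+     hypothetical resource base-rsc -->
+<rsc_colocation id="vm-template-colo-base-rsc" rsc="vm-template" score="INFINITY" with-rsc="base-rsc"/>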
+----
+
+is the equivalent of the following constraint configuration:
+
+[source,XML]
+----
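+<!-- Sketch: assuming vm1 and vm2 are the only primitives referencing vm-template -->
+<rsc_colocation id="vm1-colo-base-rsc" rsc="vm1" score="INFINITY" with-rsc="base-rsc"/>
+<rsc_colocation id="vm2-colo-base-rsc" rsc="vm2" score="INFINITY" with-rsc="base-rsc"/>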
+----
+
+[NOTE]
+======
+In a colocation constraint, only one template may be referenced from either
+`rsc` or `with-rsc`, and the other reference must be a regular resource.
+======
+
+Resource templates can also be referenced in resource sets.
+
+For example:
+
+[source,XML]
+----
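+<!-- Sketch: an ordered set containing the template; base-rsc and top-rsc
+     are hypothetical resources -->
+<rsc_order id="order1" score="INFINITY">
+  <resource_set id="order1-0" sequential="true">
+    <resource_ref id="base-rsc"/>
+    <resource_ref id="vm-template"/>
+    <resource_ref id="top-rsc"/>
+  </resource_set>
+</rsc_order>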
+----
+
+is the equivalent of the following constraint configuration:
+
+[source,XML]
+----
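+<!-- Sketch: the template expands to the primitives derived from it -->
+<rsc_order id="order1" score="INFINITY">
+  <resource_set id="order1-0" sequential="true">
+    <resource_ref id="base-rsc"/>
+    <resource_ref id="vm1"/>
+    <resource_ref id="vm2"/>
+    <resource_ref id="top-rsc"/>
+  </resource_set>
+</rsc_order>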
+----
+
+If the resources referencing the template can run in parallel:
+
+[source,XML]
+----
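+<!-- Sketch: the template goes into its own non-sequential set so the derived
+     resources may start in parallel; base-rsc and top-rsc are hypothetical -->
+<rsc_order id="order2" score="INFINITY">
+  <resource_set id="order2-0" sequential="true">
+    <resource_ref id="base-rsc"/>
+  </resource_set>
+  <resource_set id="order2-1" sequential="false">
+    <resource_ref id="vm-template"/>
+  </resource_set>
+  <resource_set id="order2-2" sequential="true">
+    <resource_ref id="top-rsc"/>
+  </resource_set>
+</rsc_order>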
+----
+
+is the equivalent of the following constraint configuration:
+
+[source,XML]
+----
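+<!-- Sketch: vm1 and vm2 start in parallel between base-rsc and top-rsc -->
+<rsc_order id="order2" score="INFINITY">
+  <resource_set id="order2-0" sequential="true">
+    <resource_ref id="base-rsc"/>
+  </resource_set>
+  <resource_set id="order2-1" sequential="false">
+    <resource_ref id="vm1"/>
+    <resource_ref id="vm2"/>
+  </resource_set>
+  <resource_set id="order2-2" sequential="true">
+    <resource_ref id="top-rsc"/>
+  </resource_set>
+</rsc_order>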
+----
diff --git a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
index a444474668..c95d7c7514 100644
--- a/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
+++ b/doc/Pacemaker_Explained/en-US/Ch-Utilization.txt
@@ -1,225 +1,225 @@
= Utilization and Placement Strategy =
== Background ==
Pacemaker decides where to place a resource according to the resource
allocation scores on every node. The resource will be allocated to the
node where the resource has the highest score. If the resource allocation
scores on all the nodes are equal, then with the `default` placement strategy,
Pacemaker will choose a node with the least number of allocated resources
to balance the load. If the number of resources on each node is equal,
the first eligible node listed in the CIB will be chosen to run the resource.
However, resources are not all equal: they may consume different amounts of
node capacity, so the load cannot always be balanced ideally just according
to the number of resources allocated to a node. Besides, if resources are
placed such that their combined requirements exceed the provided capacity,
they may fail to start completely or run with degraded performance.
To take these into account, Pacemaker allows you to specify the following
configurations:
. The `capacity` a certain `node provides`.
. The `capacity` a certain `resource requires`.
. An overall `strategy` for placement of resources.
== Utilization attributes ==
To configure the capacity a node provides and the resource's requirements,
use `utilization` attributes. You can name the `utilization` attributes
according to your preferences and define as many `name/value` pairs as your
configuration needs. However, the attributes' values must be `integers`.
First, specify the capacities the nodes provide:
-[source,Bash]
+[source,XML]
----
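<!-- Sketch: node capacities; attribute names are user-chosen, values must be integers -->
<node id="node1" type="normal" uname="node1">
  <utilization id="node1-utilization">
    <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
  </utilization>
</node>
<node id="node2" type="normal" uname="node2">
  <utilization id="node2-utilization">
    <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
    <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
  </utilization>
</node>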
----
Then, specify the capacities the resources require:
-[source,Bash]
+[source,XML]
----
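<!-- Sketch: resource requirements, chosen to be consistent with the
     rsc-small/rsc-medium/rsc-large example discussed under Limitations -->
<primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-small-utilization">
    <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
    <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
  </utilization>
</primitive>
<primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-medium-utilization">
    <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
    <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
  </utilization>
</primitive>
<primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
  <utilization id="rsc-large-utilization">
    <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
    <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
  </utilization>
</primitive>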
----
A node is considered eligible for a resource if it has sufficient free
capacity to satisfy the resource's requirements. The nature of the required
or provided capacities is completely irrelevant to Pacemaker; it just makes
sure that all capacity requirements of a resource are satisfied before placing
the resource on a node.
== Placement Strategy ==
After you have configured the capacities your nodes provide and the
capacities your resources require, you need to set the `placement-strategy`
in the global cluster options; otherwise the capacity configurations have
`no effect`.
Four values are available for the `placement-strategy`:
`default`::
Utilization values are not taken into account at all; this is the default behavior.
Resources are allocated according to allocation scores. If scores are equal,
resources are evenly distributed across nodes.
`utilization`::
Utilization values are taken into account only when deciding whether a node
is considered eligible, i.e. whether it has sufficient free capacity to satisfy
the resource's requirements. However, load-balancing is still done based on the
number of resources allocated to a node.
`balanced`::
Utilization values are taken into account when deciding whether a node
is eligible to serve a resource; an attempt is made to spread the resources
evenly, optimizing resource performance.
`minimal`::
Utilization values are taken into account when deciding whether a node
is eligible to serve a resource; an attempt is made to concentrate the
resources on as few nodes as possible, thereby enabling possible power savings
on the remaining nodes.
Set `placement-strategy` with `crm_attribute`:
[source,Bash]
----
# crm_attribute --attr-name placement-strategy --attr-value balanced
----
Now Pacemaker will ensure the load from your resources will be distributed
evenly throughout the cluster - without the need for convoluted sets of
colocation constraints.
== Allocation Details ==
=== Which node is preferred to get consumed first when allocating resources? ===
- The node that is most healthy (which has the highest node weight) gets
consumed first.
- If their weights are equal:
* If `placement-strategy="default|utilization"`,
the node that has the least number of allocated resources gets consumed first.
** If their numbers of allocated resources are equal,
the first eligible node listed in cib gets consumed first.
* If `placement-strategy="balanced"`,
the node that has more free capacity gets consumed first.
** If the free capacities of the nodes are equal,
the node that has the least number of allocated resources gets consumed first.
*** If their numbers of allocated resources are equal,
the first eligible node listed in cib gets consumed first.
* If `placement-strategy="minimal"`,
the first eligible node listed in cib gets consumed first.
==== Which node has more free capacity? ====
This is quite clear if we only define one type of `capacity`. However, if we
define multiple types of `capacity`, for example:
- If `nodeA` has more free `cpus`, `nodeB` has more free `memory`,
their free capacities are equal.
- If `nodeA` has more free `cpus`, while `nodeB` has more free `memory` and `storage`,
`nodeB` has more free capacity.
-=== Which resource is preferred to be chosen to get assigned first?
+=== Which resource is preferred to be chosen to get assigned first? ===
- The resource that has the highest priority gets allocated first.
- If their priorities are equal, check if they are already running. The
resource that has the highest score on the node where it's running gets allocated
first (to prevent resource shuffling).
- If the scores above are equal or the resources are not running, the resource
that has the highest score on the preferred node gets allocated first.
- If the scores above are equal, the first runnable resource listed in cib gets allocated first.
-== Limitations
+== Limitations ==
The type of problem Pacemaker is dealing with here is known as the
http://en.wikipedia.org/wiki/Knapsack_problem[knapsack problem] and falls into
the http://en.wikipedia.org/wiki/NP-complete[NP-complete] category of computer
science problems - which is a fancy way of saying "it takes a really long time
to solve".
Clearly in an HA cluster, it's not acceptable to spend minutes, let alone hours
or days, finding an optimal solution while services remain unavailable.
So instead of trying to solve the problem completely, Pacemaker uses a
'best effort' algorithm for determining which node should host a particular
service. This means it arrives at a solution much faster than traditional
linear programming algorithms, but potentially at the price of leaving some
services stopped.
In the contrived example above:
- `rsc-small` would be allocated to `node1`
- `rsc-medium` would be allocated to `node2`
- `rsc-large` would remain inactive
Which is not ideal.
-== Strategies for Dealing with the Limitations
+== Strategies for Dealing with the Limitations ==
- Ensure you have sufficient physical capacity.
It might sound obvious, but if the physical capacity of your nodes is (close to)
maxed out by the cluster under normal conditions, then failover isn't going to
go well. Even without the Utilization feature, you'll start hitting timeouts and
getting secondary failures.
- Build some buffer into the capabilities advertised by the nodes.
Advertise slightly more resources than we physically have on the (usually valid)
assumption that a resource will not use 100% of the configured number of
cpu/memory/etc `all` the time. This practice is also known as 'over commit'.
- Specify resource priorities.
If the cluster is going to sacrifice services, it should be the ones you care
(comparatively) about the least. Ensure that resource priorities are properly set
so that your most important resources are scheduled first.
diff --git a/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.xml b/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.xml
index 9c3009dc55..54662d8879 100644
--- a/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.xml
+++ b/doc/Pacemaker_Explained/en-US/Pacemaker_Explained.xml
@@ -1,50 +1,51 @@
Receiving Notification for Cluster Events
Configuring Email Notifications
Configuring SNMP Notifications
+
Further Reading
Project Website
Project Documentation
A comprehensive guide to cluster commands has been written by Novell
Heartbeat configuration:
Corosync Configuration:
diff --git a/xml/resources-1.2.rng b/xml/resources-1.2.rng
index d2fe1a8c24..d295b2a347 100644
--- a/xml/resources-1.2.rng
+++ b/xml/resources-1.2.rng
@@ -1,182 +1,221 @@
+
+
+
+
+
+
+
+ ocf
+
+
+
+
+ lsb
+ heartbeat
+ stonith
+ upstart
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
ocf
lsb
heartbeat
stonith
upstart
Stopped
Started
Slave
Master
nothing
quorum
fencing
ignore
block
stop
restart
standby
fence