diff --git a/cts/README.md b/cts/README.md
index 1219fa8e47..cbf319a9f5 100644
--- a/cts/README.md
+++ b/cts/README.md
@@ -1,339 +1,321 @@
# Pacemaker Cluster Test Suite (CTS)
The Cluster Test Suite (CTS) refers to all Pacemaker testing code that can be
run in an installed environment. (Pacemaker also has unit tests that must be
run from a source distribution.)
CTS includes:
* Regression tests: These test specific Pacemaker components individually (no
integration tests). The primary front end is cts-regression in this
directory. Run it with the --help option to see its usage.
cts-regression is a wrapper for individual component regression tests also
in this directory (cts-cli, cts-exec, cts-fencing, and cts-scheduler).
The CLI and scheduler regression tests can also be run from a source
distribution. The other regression tests can only run in an installed
environment, and the cluster should not be running on the node running these
tests.
* The CTS lab: This is a cluster exerciser for intensively testing the behavior
of an entire working cluster. It is primarily for developers and packagers of
the Pacemaker source code, but it can be useful for users who wish to see how
- their cluster will react to various situations. In an installed deployment,
- the CTS lab is in the cts subdirectory of this directory; in a source
- distibution, it is in cts/lab.
+ their cluster will react to various situations. Most of the lab code is in
+ the Pacemaker Python module. The front end, cts-lab, is in this directory.
- The CTS lab runs a randomized series of predefined tests on the cluster. CTS
+ The CTS lab runs a randomized series of predefined tests on the cluster. It
can be run against a pre-existing cluster configuration or overwrite the
existing configuration with a test configuration.
* Helpers: Some of the component regression tests and the CTS lab require
certain helpers to be installed as root. These include a dummy LSB init
script, dummy systemd service, etc. In a source distribution, the source for
these is in cts/support.
The tests will install these as needed and uninstall them when done. This
means that the cluster configuration created by the CTS lab will generate
failures if started manually after the lab exits. However, the helper
installer can be run manually to make the configuration usable, if you want
to do your own further testing with it:
/usr/libexec/pacemaker/cts-support install
As you might expect, you can also remove the helpers with:
/usr/libexec/pacemaker/cts-support uninstall
+ (The actual directory location may vary depending on how Pacemaker was
+ built.)
+
* Cluster benchmark: The benchmark subdirectory of this directory contains some
cluster test environment benchmarking code. It is not particularly useful for
end users.
* Valgrind suppressions: When memory-testing Pacemaker code with valgrind,
various bugs in non-Pacemaker libraries and such can clutter the results. The
valgrind-pcmk.suppressions file in this directory can be used with valgrind's
--suppressions option to eliminate many of these.
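  For example, a Pacemaker command-line tool could be run under valgrind with
  these suppressions like this (a sketch; adjust the path to wherever the file
  is installed):

      valgrind --leak-check=full \
          --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions \
          crm_mon --one-shot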
## Using the CTS lab
### Requirements
* Three or more machines (one test exerciser and at least two cluster nodes).
* The test cluster nodes should be on the same subnet and have journalling
filesystems (ext4, xfs, etc.) for all of their filesystems other than
/boot. You also need a number of free IP addresses on that subnet if you
intend to test IP address takeover.
* The test exerciser machine doesn't need to be on the same subnet as the test
cluster machines. Minimal demands are made on the exerciser; it just has to
stay up during the tests.
* Tracking problems is easier if all machines' clocks are closely synchronized.
NTP does this automatically, but you can do it by hand if you want.
* The account on the exerciser used to run the CTS lab (which does not need to
be root) must be able to ssh as root to the cluster nodes without a password
challenge. See the Mini-HOWTO at the end of this file for details about how
to configure ssh for this.
* The exerciser needs to be able to resolve all cluster node names, whether by
  DNS or /etc/hosts (see the example after this list).
* CTS is not guaranteed to run on all platforms that Pacemaker itself does.
It calls commands such as `service` that may not be provided by all OSes.
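For name resolution via /etc/hosts, a minimal sketch (the node names and
addresses here are hypothetical):

    192.168.9.1   pcmk-1
    192.168.9.2   pcmk-2
    192.168.9.3   pcmk-3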
### Preparation
* Install Pacemaker, including the testing code, on all machines. The testing
code must be the same version as the rest of Pacemaker, and the Pacemaker
version must be the same on the exerciser and all cluster nodes.
You can install from source, although many distributions package the testing
code (named pacemaker-cts or similar). Typically, everything needed by the
CTS lab is installed in /usr/share/pacemaker/tests/cts.
* Configure the cluster layer (Corosync) on the cluster machines (*not* the
exerciser), and verify it works. Node names used in the cluster configuration
*must* match the hosts' names as returned by `uname -n`; they do not have to
match the machines' fully qualified domain names.
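A quick way to confirm the names from the exerciser (using hypothetical node
names) is:

    for n in pcmk-1 pcmk-2 pcmk-3; do ssh root@$n uname -n; done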
### Run
The primary interface to the CTS lab is the cts-lab executable:
/usr/share/pacemaker/tests/cts-lab [options] <number-of-tests-to-run>
+(The actual directory location may vary depending on how Pacemaker was built.)
+
As part of the options, specify the cluster nodes with --nodes, for example:
--nodes "pcmk-1 pcmk-2 pcmk-3"
Most people will want to save the output to a file, for example:
--outputfile ~/cts.log
Unless you want to test a pre-existing cluster configuration, you also want
(*warning*: with these options, any existing configuration will be lost):
--clobber-cib
--populate-resources
You can test floating IP addresses (*not* already used by any host), one per
cluster node, by specifying the first, for example:
--test-ip-base 192.168.9.100
Configure some sort of fencing, for example to use fence\_xvm:
--stonith xvm
Putting all the above together, a command line might look like:
/usr/share/pacemaker/tests/cts-lab --nodes "pcmk-1 pcmk-2 pcmk-3" \
--outputfile ~/cts.log --clobber-cib --populate-resources \
--test-ip-base 192.168.9.100 --stonith xvm 50
For more options, run with the --help option.
There are also a couple of wrappers for cts-lab that some users may find more
convenient: cts, which is typically installed in the same place as the rest of
the testing code; and cluster\_test, which is in the source directory and
typically not installed.
To extract the result of a particular test, run:
crm_report -T $test
### Optional: Memory testing
Pacemaker has various options for testing memory management. On cluster nodes,
Pacemaker components use various environment variables to control these
options. How these variables are set varies by OS, but usually they are set in
a file such as /etc/sysconfig/pacemaker or /etc/default/pacemaker.
Valgrind is a program for detecting memory management problems such as
use-after-free errors. If you have valgrind installed, you can enable it by
setting the following environment variables on all cluster nodes:
PCMK_valgrind_enabled=pacemaker-attrd,pacemaker-based,pacemaker-controld,pacemaker-execd,pacemaker-fenced,pacemaker-schedulerd
VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25
--log-file=/var/lib/pacemaker/valgrind-%p
--suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions
--gen-suppressions=all"
If running the CTS lab with valgrind enabled on the cluster nodes, add these
options to cts-lab:
--valgrind-tests --valgrind-procs "pacemaker-attrd pacemaker-based pacemaker-controld pacemaker-execd pacemaker-schedulerd pacemaker-fenced"
These options should only be set while specifically testing memory management,
because they may slow down the cluster significantly, and they will disable
writes to the CIB. If desired, you can enable valgrind on a subset of pacemaker
components rather than all of them as listed above.
Valgrind will put a text file for each process in the location specified by
valgrind's --log-file option. See
https://www.valgrind.org/docs/manual/mc-manual.html for explanations of the
messages valgrind generates.
Separately, if you are using the GNU C library, the G\_SLICE,
MALLOC\_PERTURB\_, and MALLOC\_CHECK\_ environment variables can be set to
affect the library's memory management functions.
When using valgrind, G\_SLICE should be set to "always-malloc", which helps
valgrind track memory by always using the malloc() and free() routines
directly. When not using valgrind, G\_SLICE can be left unset, or set to
"debug-blocks", which enables the C library to catch many memory errors
but may impact performance.
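For example, when testing with valgrind:

    G_SLICE=always-malloc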
If the MALLOC\_PERTURB\_ environment variable is set to an 8-bit integer, the C
library will initialize all newly allocated bytes of memory to the integer
value, and will set all newly freed bytes of memory to the bitwise inverse of
the integer value. This helps catch uses of uninitialized or freed memory
blocks that might otherwise go unnoticed. Example:
MALLOC_PERTURB_=221
If the MALLOC\_CHECK\_ environment variable is set, the C library will check for
certain heap corruption errors. The most useful value in testing is 3, which
will cause the library to print a message to stderr and abort execution.
Example:
MALLOC_CHECK_=3
Valgrind should be enabled for either all nodes or none when used with the CTS
lab, but the C library variables may be set differently on different nodes.
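As a hypothetical consolidated example, the C library settings above might
appear in /etc/sysconfig/pacemaker (or /etc/default/pacemaker) on a test node
as:

    G_SLICE=always-malloc
    MALLOC_PERTURB_=221
    MALLOC_CHECK_=3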
### Optional: Remote node testing
-If the pacemaker-remoted daemon is installed on all cluster nodes, CTS will
-enable remote node tests.
+If the pacemaker-remoted daemon is installed on all cluster nodes, the CTS lab
+will enable remote node tests.
The remote node tests choose a random node, stop the cluster on it, start
pacemaker-remoted on it, and add an ocf:pacemaker:remote resource to turn it
-into a remote node. When the test is done, CTS will turn the node back into
+into a remote node. When the test is done, the lab will turn the node back into
a cluster node.
-To avoid conflicts, CTS will rename the node, prefixing the original node name
-with "remote-". For example, "pcmk-1" will become "remote-pcmk-1". These names
-do not need to be resolvable.
+To avoid conflicts, the lab will rename the node, prefixing the original node
+name with "remote-". For example, "pcmk-1" will become "remote-pcmk-1". These
+names do not need to be resolvable.
The name change may require special fencing configuration, if the fence agent
expects the node name to be the same as its hostname. A common approach is to
specify the "remote-" names in pcmk\_host\_list. If you use
-pcmk\_host\_list=all, CTS will expand that to all cluster nodes and their
+pcmk\_host\_list=all, the lab will expand that to all cluster nodes and their
"remote-" names. You may additionally need a pcmk\_host\_map argument to map
the "remote-" names to the hostnames. Example:
--stonith xvm --stonith-args \
pcmk_host_list=all,pcmk_host_map=remote-pcmk-1:pcmk-1;remote-pcmk-2:pcmk-2
### Optional: Remote node testing with valgrind
When running the remote node tests, the Pacemaker components on the *cluster*
nodes can be run under valgrind as described in the "Memory testing" section.
However, pacemaker-remoted cannot be run under valgrind that way, because it is
started by the OS's regular boot system and not by Pacemaker.
Details vary by system, but the goal is to set the VALGRIND\_OPTS environment
variable and then start pacemaker-remoted by prefixing it with the path to
valgrind.
The init script and systemd service file provided with pacemaker-remoted will
load the pacemaker environment variables from the same location used by other
Pacemaker components, so VALGRIND\_OPTS will be set correctly if using one of
those.
For an OS using systemd, you can override the ExecStart parameter to run
valgrind. For example:
mkdir /etc/systemd/system/pacemaker_remote.service.d
cat >/etc/systemd/system/pacemaker_remote.service.d/valgrind.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/valgrind /usr/sbin/pacemaker-remoted
EOF
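After creating the drop-in, systemd typically needs to reload its unit
definitions, and pacemaker-remoted must be restarted for the override to take
effect, for example:

    systemctl daemon-reload
    systemctl restart pacemaker_remote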
### Mini-HOWTO: Allow passwordless remote SSH connections
-The CTS scripts run "ssh -l root" so you don't have to do any of your testing
+The CTS lab runs "ssh -l root" so you don't have to do any of your testing
logged in as root on the exerciser. Here is how to allow such connections
without requiring a password to be entered each time:
* On your test exerciser, create an SSH key if you do not already have one.
Most commonly, SSH keys will be in your ~/.ssh directory, with the
private key file not having an extension, and the public key file
named the same with the extension ".pub" (for example, ~/.ssh/id\_rsa.pub).
If you don't already have a key, you can create one with:
ssh-keygen -t rsa
* From your test exerciser, authorize your SSH public key for root on all test
machines (both the exerciser and the cluster test machines):
ssh-copy-id -i ~/.ssh/id_rsa.pub root@$MACHINE
You will probably have to provide your password, and possibly say
"yes" to some questions about accepting the identity of the test machines.
The above assumes you have an RSA SSH key in the specified location;
if you have some other type of key (DSA, ECDSA, etc.), use its file name
in the -i option above.
* To verify, try this command from the exerciser machine for each
of your cluster machines, and for the exerciser machine itself.
ssh -l root $MACHINE
If this works without prompting for a password, you're in business.
If not, look at the documentation for your version of ssh.
-## Note on the maintenance
+## Upgrading scheduler test inputs for new XSLTs
-### Tests for scheduler
-
-The source `*.xml` files are preferably kept in sync with the newest
-major (and only major, which is enough) schema version, since these
-tests are not meant to double as schema upgrade ones (except some cases
+The scheduler/xml inputs should be kept in sync with the latest major schema
+version, since these tests are not meant to test schema upgrades (unless
expressly designated as such).
-Currently and unless something goes wrong, the procedure of upgrading
-these tests en masse is as easy as:
+To upgrade the inputs to a new major schema version:
- cd "$(git rev-parse --show-toplevel)/cts" # if not already
- pushd "$(git rev-parse --show-toplevel)/xml"
+ cd "$(git rev-parse --show-toplevel)/xml"
./regression.sh cts_scheduler -G
- popd
+ cd "$(git rev-parse --show-toplevel)/cts"
git add --interactive .
- git commit -m 'XML: upgrade-M.N.xsl: apply on scheduler CTS test cases'
- git reset HEAD && git checkout . # if some differences still remain
- ./cts-scheduler # absolutely vital to check nothing got broken!
-
-Now, sadly, there's no proved automated way to minimize instances like this:
-
- <primitive id="rsc1" class="ocf" provider="heartbeat" type="apache">
- </primitive>
-
-that may be left behind into more canonical:
-
- <primitive id="rsc1" class="ocf" provider="heartbeat" type="apache"/>
-
-so manual editing is tasked, or perhaps `--format` or `--c14n`
-to `xmllint` will be of help (without any other side effects).
+ git commit -m 'Test: scheduler: upgrade test inputs to schema $X.$Y'
+ ./cts-scheduler || echo 'Investigate what went wrong'
-If the overall process gets stuck anywhere, common sense to the rescue.
-The initial part of the above recipe can be repeated anytime to verify
-there's nothing to upgrade artificially like this, which is a desired
-state. Note that `regression.sh` script performs validation of both
-the input and output, should the upgrade take place, implicitly, so
-there's no need of revalidation in the happy case.
+The first two commands can be run anytime to verify no further upgrades are
+needed.
diff --git a/doc/sphinx/Pacemaker_Explained/options.rst b/doc/sphinx/Pacemaker_Explained/options.rst
index db2946d368..d38a2ab892 100644
--- a/doc/sphinx/Pacemaker_Explained/options.rst
+++ b/doc/sphinx/Pacemaker_Explained/options.rst
@@ -1,802 +1,921 @@
Cluster-Wide Configuration
--------------------------
.. index::
pair: XML element; cib
pair: XML element; configuration
Configuration Layout
####################
The cluster is defined by the Cluster Information Base (CIB), which uses XML
notation. The simplest CIB, an empty one, looks like this:
.. topic:: An empty configuration
.. code-block:: xml
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
<configuration>
<crm_config/>
<nodes/>
<resources/>
<constraints/>
</configuration>
<status/>
</cib>
The empty configuration above contains the major sections that make up a CIB:
* ``cib``: The entire CIB is enclosed with a ``cib`` element. Certain
fundamental settings are defined as attributes of this element.
* ``configuration``: This section -- the primary focus of this document --
contains traditional configuration information such as what resources the
cluster serves and the relationships among them.
* ``crm_config``: cluster-wide configuration options
* ``nodes``: the machines that host the cluster
* ``resources``: the services run by the cluster
* ``constraints``: indications of how resources should be placed
* ``status``: This section contains the history of each resource on each
node. Based on this data, the cluster can construct the complete current
state of the cluster. The authoritative source for this section is the
local executor (pacemaker-execd process) on each cluster node, and the
cluster will occasionally repopulate the entire section. For this reason,
it is never written to disk, and administrators are advised against
modifying it in any way.
In this document, configuration settings will be described as properties or
options based on how they are defined in the CIB:
* Properties are XML attributes of an XML element.
* Options are name-value pairs expressed as ``nvpair`` child elements of an XML
element.
Normally, you will use command-line tools that abstract the XML, so the
distinction will be unimportant; both properties and options are cluster
settings you can tweak.
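For example (a hypothetical excerpt), ``validate-with`` below is a property
(an attribute of the ``cib`` element), while ``no-quorum-policy`` is an option
(an ``nvpair`` within ``crm_config``):

.. code-block:: xml

   <cib validate-with="pacemaker-3.9" epoch="2" num_updates="0" admin_epoch="0">
     <configuration>
       <crm_config>
         <cluster_property_set id="cib-bootstrap-options">
           <nvpair id="cib-bootstrap-options-no-quorum-policy"
                   name="no-quorum-policy" value="stop"/>
         </cluster_property_set>
       </crm_config>
       <nodes/>
       <resources/>
       <constraints/>
     </configuration>
     <status/>
   </cib>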
Configuration Value Types
#########################
Throughout this document, configuration values will be designated as having one
of the following types:
-.. table:: **Configuration Value Types**
+.. list-table:: **Configuration Value Types**
:class: longtable
:widths: 1 3
+ :header-rows: 1
+
+ * - Type
+ - Description
+ * - .. _boolean:
+
+ .. index::
+ pair: type; boolean
+
+ boolean
+ - Case-insensitive text value where ``1``, ``yes``, ``y``, ``on``,
+ and ``true`` evaluate as true and ``0``, ``no``, ``n``, ``off``,
+ ``false``, and unset evaluate as false
+ * - .. _date_time:
+
+ .. index::
+ pair: type; date/time
+
+ date/time
+ - Textual timestamp like ``Sat Dec 21 11:47:45 2013``
+ * - .. _duration:
+
+ .. index::
+ pair: type; duration
+
+ duration
+ - A time duration, specified either like a :ref:`timeout <timeout>` or an
+ `ISO 8601 duration <https://en.wikipedia.org/wiki/ISO_8601#Durations>`_.
+ A duration may be up to approximately 49 days but is intended for much
+ smaller time periods.
+ * - .. _enumeration:
+
+ .. index::
+ pair: type; enumeration
+
+ enumeration
+ - Text that must be one of a set of defined values (which will be listed
+ in the description)
+ * - .. _integer:
+
+ .. index::
+ pair: type; integer
+
+ integer
+ - 32-bit signed integer value (-2,147,483,648 to 2,147,483,647)
+ * - .. _nonnegative_integer:
+
+ .. index::
+ pair: type; nonnegative integer
+
+ nonnegative integer
+ - 32-bit nonnegative integer value (0 to 2,147,483,647)
+ * - .. _port:
+
+ .. index::
+ pair: type; port
+
+ port
+ - Integer TCP port number (0 to 65535)
+ * - .. _score:
+
+ .. index::
+ pair: type; score
+
+ score
+ - A Pacemaker score can be an integer between -1,000,000 and 1,000,000, or
+ a string alias: ``INFINITY`` or ``+INFINITY`` is equivalent to
+ 1,000,000, ``-INFINITY`` is equivalent to -1,000,000, and ``red``,
+ ``yellow``, and ``green`` are equivalent to integers as described in
+ :ref:`node-health`.
+ * - .. _text:
+
+ .. index::
+ pair: type; text
+
+ text
+ - A text string
+ * - .. _timeout:
+
+ .. index::
+ pair: type; timeout
+
+ timeout
+ - A time duration, specified as a bare number (in which case it is
+ considered to be in seconds) or a number with a unit (``ms`` or ``msec``
+ for milliseconds, ``us`` or ``usec`` for microseconds, ``s`` or ``sec``
+ for seconds, ``m`` or ``min`` for minutes, ``h`` or ``hr`` for hours)
+ optionally with whitespace before and/or after the number.
+ * - .. _version:
+
+ .. index::
+ pair: type; version
+
+ version
+ - Version number (any combination of alphanumeric characters, dots, and
+ dashes, starting with a number).
- +-------------------+-------------------------------------------------------+
- | Type | Description |
- +===================+=======================================================+
- | boolean | .. _boolean: |
- | | |
- | | .. index:: |
- | | pair: type; boolean |
- | | |
- | | Case-insensitive true/false value where "1", "yes", |
- | | "y", "on", and "true" evaluate as true and "0", "no", |
- | | "n", "off", "false", and unset evaluate as false |
- +-------------------+-------------------------------------------------------+
- | date/time | .. _date_time: |
- | | |
- | | .. index:: |
- | | pair: type; date/time |
- | | |
- | | Textual timestamp like "Sat Dec 21 11:47:45 2013" |
- +-------------------+-------------------------------------------------------+
- | enumeration | .. _enumeration: |
- | | |
- | | .. index:: |
- | | pair: type; enumeration |
- | | |
- | | Text that must be one of a set of defined values |
- | | (which will be listed in the description) |
- +-------------------+-------------------------------------------------------+
- | integer | .. _integer: |
- | | |
- | | .. index:: |
- | | pair: type; integer |
- | | |
- | | 32-bit signed integer value (-2,147,483,648 to |
- | | 2,147,483,647) |
- +-------------------+-------------------------------------------------------+
- | nonnegative | .. _nonnegative_integer: |
- | integer | |
- | | .. index:: |
- | | pair: type; nonnegative integer |
- | | |
- | | 32-bit nonnegative integer value (0 to 2,147,483,647) |
- +-------------------+-------------------------------------------------------+
- | port | .. _port: |
- | | |
- | | .. index:: |
- | | pair: type; port |
- | | |
- | | Integer TCP port number (0 to 65535) |
- +-------------------+-------------------------------------------------------+
- | score | .. _score: |
- | | |
- | | .. index:: |
- | | pair: type; score |
- | | |
- | | A Pacemaker score can be an integer between |
- | | -1,000,000 and 1,000,000, or a string alias: |
- | | ``INFINITY`` or ``+INFINITY`` is equivalent to |
- | | 1,000,000, ``-INFINITY`` is equivalent to -1,000,000, |
- | | and ``red``, ``yellow``, and ``green`` are equivalent |
- | | to integers as described in :ref:`node-health`. |
- +-------------------+-------------------------------------------------------+
- | text | .. _text: |
- | | |
- | | .. index:: |
- | | pair: type; text |
- | | |
- | | A text string |
- +-------------------+-------------------------------------------------------+
- | version | .. _version: |
- | | |
- | | .. index:: |
- | | pair: type; version |
- | | |
- | | Version number (three integers separated by dots) |
- +-------------------+-------------------------------------------------------+
Scores
______
Scores are integral to how Pacemaker works. Practically everything from moving
a resource to deciding which resource to stop in a degraded cluster is achieved
by manipulating scores in some way.
Scores are calculated per resource and node. Any node with a negative score for
a resource can't run that resource. The cluster places a resource on the node
with the highest score for it.
Score addition and subtraction follow these rules:
* Any value (including ``INFINITY``) - ``INFINITY`` = ``-INFINITY``
* ``INFINITY`` + any value other than ``-INFINITY`` = ``INFINITY``
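For example, if a resource has a location preference of ``INFINITY`` for a
node but is also banned from that node with ``-INFINITY``, the combined score
is ``-INFINITY`` and the resource cannot run there.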
.. note::
What if you want to use a score higher than 1,000,000? Typically this possibility
arises when someone wants to base the score on some external metric that might
go above 1,000,000.
The short answer is you can't.
The long answer is that it is sometimes possible to work around this limitation
creatively. You may be able to set the score to some computed value based on
the external metric rather than use the metric directly. For nodes, you can
store the metric as a node attribute, and query the attribute when computing
the score (possibly as part of a custom resource agent).
CIB Properties
##############
Certain settings are defined by CIB properties (that is, attributes of the
``cib`` tag) rather than with the rest of the cluster configuration in the
``configuration`` section.
The reason is simply a matter of parsing. These options are used by the
configuration database which is, by design, mostly ignorant of the content it
holds. So the decision was made to place them in an easy-to-find location.
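Because these are attributes of the ``cib`` element rather than ``nvpair``
options, they are usually adjusted with lower-level tools. For example, one way
to bump ``admin_epoch`` (a sketch using ``cibadmin``) is:

.. code-block:: none

   cibadmin --modify --xml-text '<cib admin_epoch="42"/>'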
.. list-table:: **CIB Properties**
:class: longtable
:widths: 2 2 2 5
:header-rows: 1
- * - Attribute
+ * - Name
- Type
- Default
- Description
* - .. _admin_epoch:
.. index::
pair: admin_epoch; cib
admin_epoch
- :ref:`nonnegative integer <nonnegative_integer>`
- 0
- When a node joins the cluster, the cluster asks the node with the
highest (``admin_epoch``, ``epoch``, ``num_updates``) tuple to replace
the configuration on all the nodes -- which makes setting them correctly
very important. ``admin_epoch`` is never modified by the cluster; you
can use this to make the configurations on any inactive nodes obsolete.
* - .. _epoch:
.. index::
pair: epoch; cib
epoch
- :ref:`nonnegative integer <nonnegative_integer>`
- 0
- The cluster increments this every time the CIB's configuration section
is updated.
* - .. _num_updates:
.. index::
pair: num_updates; cib
num_updates
- :ref:`nonnegative integer <nonnegative_integer>`
- 0
- The cluster increments this every time the CIB's configuration or status
sections are updated, and resets it to 0 when epoch changes.
* - .. _validate_with:
.. index::
pair: validate-with; cib
validate-with
- :ref:`enumeration <enumeration>`
-
- Determines the type of XML validation that will be done on the
configuration. Allowed values are ``none`` (in which case the cluster
will not require that updates conform to expected syntax) and the base
names of schema files installed on the local machine (for example,
"pacemaker-3.9")
* - .. _remote_tls_port:
.. index::
pair: remote-tls-port; cib
remote-tls-port
- :ref:`port <port>`
-
- If set, the CIB manager will listen for anonymously encrypted remote
connections on this port, to allow CIB administration from hosts not in
the cluster. No key is used, so this should be used only on a protected
network where man-in-the-middle attacks can be avoided.
* - .. _remote_clear_port:
.. index::
pair: remote-clear-port; cib
remote-clear-port
- :ref:`port <port>`
-
- If set to a TCP port number, the CIB manager will listen for remote
connections on this port, to allow for CIB administration from hosts not
in the cluster. No encryption is used, so this should be used only on a
protected network.
* - .. _cib_last_written:
.. index::
pair: cib-last-written; cib
cib-last-written
- :ref:`date/time <date_time>`
-
- Indicates when the configuration was last written to disk. Maintained by
the cluster; for informational purposes only.
* - .. _have_quorum:
.. index::
pair: have-quorum; cib
have-quorum
- :ref:`boolean <boolean>`
-
- Indicates whether the cluster has quorum. If false, the cluster's
response is determined by ``no-quorum-policy`` (see below). Maintained
by the cluster.
* - .. _dc_uuid:
.. index::
pair: dc-uuid; cib
dc-uuid
- :ref:`text <text>`
-
- Node ID of the cluster's current designated controller (DC). Used and
maintained by the cluster.
.. _cluster_options:
Cluster Options
###############
Cluster options, as you might expect, control how the cluster behaves when
confronted with various situations.
They are grouped into sets within the ``crm_config`` section. In advanced
configurations, there may be more than one set. (This will be described later
in the chapter on :ref:`rules` where we will show how to have the cluster use
different sets of options during working hours than during weekends.) For now,
we will describe the simple case where each option is present at most once.
You can obtain an up-to-date list of cluster options, including their default
values, by running the ``man pacemaker-schedulerd`` and
``man pacemaker-controld`` commands.
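Cluster options are usually managed with higher-level tools, but they can also
be set directly with ``crm_attribute``; for example (a sketch):

.. code-block:: none

   # Set the no-quorum-policy cluster option
   crm_attribute --type crm_config --name no-quorum-policy --update freeze

   # Check its current value
   crm_attribute --type crm_config --name no-quorum-policy --query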
-.. table:: **Cluster Options**
+.. list-table:: **Cluster Options**
:class: longtable
- :widths: 2 1 4
-
- +---------------------------+---------+----------------------------------------------------+
- | Option | Default | Description |
- +===========================+=========+====================================================+
- | cluster-name | | .. index:: |
- | | | pair: cluster option; cluster-name |
- | | | |
- | | | An (optional) name for the cluster as a whole. |
- | | | This is mostly for users' convenience for use |
- | | | as desired in administration, but this can be |
- | | | used in the Pacemaker configuration in |
- | | | :ref:`rules` (as the ``#cluster-name`` |
- | | | :ref:`node attribute |
- | | | <node-attribute-expressions-special>`. It may |
- | | | also be used by higher-level tools when |
- | | | displaying cluster information, and by |
- | | | certain resource agents (for example, the |
- | | | ``ocf:heartbeat:GFS2`` agent stores the |
- | | | cluster name in filesystem meta-data). |
- +---------------------------+---------+----------------------------------------------------+
- | dc-version | | .. index:: |
- | | | pair: cluster option; dc-version |
- | | | |
- | | | Version of Pacemaker on the cluster's DC. |
- | | | Determined automatically by the cluster. Often |
- | | | includes the hash which identifies the exact |
- | | | Git changeset it was built from. Used for |
- | | | diagnostic purposes. |
- +---------------------------+---------+----------------------------------------------------+
- | cluster-infrastructure | | .. index:: |
- | | | pair: cluster option; cluster-infrastructure |
- | | | |
- | | | The messaging stack on which Pacemaker is |
- | | | currently running. Determined automatically by |
- | | | the cluster. Used for informational and |
- | | | diagnostic purposes. |
- +---------------------------+---------+----------------------------------------------------+
- | no-quorum-policy | stop | .. index:: |
- | | | pair: cluster option; no-quorum-policy |
- | | | |
- | | | What to do when the cluster does not have |
- | | | quorum. Allowed values: |
- | | | |
- | | | * ``ignore:`` continue all resource management |
- | | | * ``freeze:`` continue resource management, but |
- | | | don't recover resources from nodes not in the |
- | | | affected partition |
- | | | * ``stop:`` stop all resources in the affected |
- | | | cluster partition |
- | | | * ``demote:`` demote promotable resources and |
- | | | stop all other resources in the affected |
- | | | cluster partition *(since 2.0.5)* |
- | | | * ``suicide:`` fence all nodes in the affected |
- | | | cluster partition |
- +---------------------------+---------+----------------------------------------------------+
- | batch-limit | 0 | .. index:: |
- | | | pair: cluster option; batch-limit |
- | | | |
- | | | The maximum number of actions that the cluster |
- | | | may execute in parallel across all nodes. The |
- | | | "correct" value will depend on the speed and |
- | | | load of your network and cluster nodes. If zero, |
- | | | the cluster will impose a dynamically calculated |
- | | | limit only when any node has high load. If -1, the |
- | | | cluster will not impose any limit. |
- +---------------------------+---------+----------------------------------------------------+
- | migration-limit | -1 | .. index:: |
- | | | pair: cluster option; migration-limit |
- | | | |
- | | | The number of |
- | | | :ref:`live migration <live-migration>` actions |
- | | | that the cluster is allowed to execute in |
- | | | parallel on a node. A value of -1 means |
- | | | unlimited. |
- +---------------------------+---------+----------------------------------------------------+
- | symmetric-cluster | true | .. index:: |
- | | | pair: cluster option; symmetric-cluster |
- | | | |
- | | | Whether resources can run on any node by default |
- | | | (if false, a resource is allowed to run on a |
- | | | node only if a |
- | | | :ref:`location constraint <location-constraint>` |
- | | | enables it) |
- +---------------------------+---------+----------------------------------------------------+
- | stop-all-resources | false | .. index:: |
- | | | pair: cluster option; stop-all-resources |
- | | | |
- | | | Whether all resources should be disallowed from |
- | | | running (can be useful during maintenance) |
- +---------------------------+---------+----------------------------------------------------+
- | stop-orphan-resources | true | .. index:: |
- | | | pair: cluster option; stop-orphan-resources |
- | | | |
- | | | Whether resources that have been deleted from |
- | | | the configuration should be stopped. This value |
- | | | takes precedence over |
- | | | :ref:`is-managed <is_managed>` (that is, even |
- | | | unmanaged resources will be stopped when orphaned |
- | | | if this value is ``true``). |
- +---------------------------+---------+----------------------------------------------------+
- | stop-orphan-actions | true | .. index:: |
- | | | pair: cluster option; stop-orphan-actions |
- | | | |
- | | | Whether recurring :ref:`operations <operation>` |
- | | | that have been deleted from the configuration |
- | | | should be cancelled |
- +---------------------------+---------+----------------------------------------------------+
- | start-failure-is-fatal | true | .. index:: |
- | | | pair: cluster option; start-failure-is-fatal |
- | | | |
- | | | Whether a failure to start a resource on a |
- | | | particular node prevents further start attempts |
- | | | on that node? If ``false``, the cluster will |
- | | | decide whether the node is still eligible based |
- | | | on the resource's current failure count and |
- | | | :ref:`migration-threshold <failure-handling>`. |
- +---------------------------+---------+----------------------------------------------------+
- | enable-startup-probes | true | .. index:: |
- | | | pair: cluster option; enable-startup-probes |
- | | | |
- | | | Whether the cluster should check the |
- | | | pre-existing state of resources when the cluster |
- | | | starts |
- +---------------------------+---------+----------------------------------------------------+
- | maintenance-mode | false | .. _maintenance_mode: |
- | | | |
- | | | .. index:: |
- | | | pair: cluster option; maintenance-mode |
- | | | |
- | | | If true, the cluster will not start or stop any |
- | | | resource in the cluster, and any recurring |
- | | | operations (expect those specifying ``role`` as |
- | | | ``Stopped``) will be paused. If true, this |
- | | | overrides the |
- | | | :ref:`maintenance <node_maintenance>` node |
- | | | attribute, :ref:`is-managed <is_managed>` and |
- | | | :ref:`maintenance <rsc_maintenance>` resource |
- | | | meta-attributes, and :ref:`enabled <op_enabled>` |
- | | | operation meta-attribute. |
- +---------------------------+---------+----------------------------------------------------+
- | stonith-enabled | true | .. index:: |
- | | | pair: cluster option; stonith-enabled |
- | | | |
- | | | Whether the cluster is allowed to fence nodes |
- | | | (for example, failed nodes and nodes with |
- | | | resources that can't be stopped). |
- | | | |
- | | | If true, at least one fence device must be |
- | | | configured before resources are allowed to run. |
- | | | |
- | | | If false, unresponsive nodes are immediately |
- | | | assumed to be running no resources, and resource |
- | | | recovery on online nodes starts without any |
- | | | further protection (which can mean *data loss* |
- | | | if the unresponsive node still accesses shared |
- | | | storage, for example). See also the |
- | | | :ref:`requires <requires>` resource |
- | | | meta-attribute. |
- +---------------------------+---------+----------------------------------------------------+
- | stonith-action | reboot | .. index:: |
- | | | pair: cluster option; stonith-action |
- | | | |
- | | | Action the cluster should send to the fence agent |
- | | | when a node must be fenced. Allowed values are |
- | | | ``reboot``, ``off``, and (for legacy agents only) |
- | | | ``poweroff``. |
- +---------------------------+---------+----------------------------------------------------+
- | stonith-timeout | 60s | .. index:: |
- | | | pair: cluster option; stonith-timeout |
- | | | |
- | | | How long to wait for ``on``, ``off``, and |
- | | | ``reboot`` fence actions to complete by default. |
- +---------------------------+---------+----------------------------------------------------+
- | stonith-max-attempts | 10 | .. index:: |
- | | | pair: cluster option; stonith-max-attempts |
- | | | |
- | | | How many times fencing can fail for a target |
- | | | before the cluster will no longer immediately |
- | | | re-attempt it. |
- +---------------------------+---------+----------------------------------------------------+
- | stonith-watchdog-timeout | 0 | .. index:: |
- | | | pair: cluster option; stonith-watchdog-timeout |
- | | | |
- | | | If nonzero, and the cluster detects |
- | | | ``have-watchdog`` as ``true``, then watchdog-based |
- | | | self-fencing will be performed via SBD when |
- | | | fencing is required, without requiring a fencing |
- | | | resource explicitly configured. |
- | | | |
- | | | If this is set to a positive value, unseen nodes |
- | | | are assumed to self-fence within this much time. |
- | | | |
- | | | **Warning:** It must be ensured that this value is |
- | | | larger than the ``SBD_WATCHDOG_TIMEOUT`` |
- | | | environment variable on all nodes. Pacemaker |
- | | | verifies the settings individually on all nodes |
- | | | and prevents startup or shuts down if configured |
- | | | wrongly on the fly. It is strongly recommended |
- | | | that ``SBD_WATCHDOG_TIMEOUT`` be set to the same |
- | | | value on all nodes. |
- | | | |
- | | | If this is set to a negative value, and |
- | | | ``SBD_WATCHDOG_TIMEOUT`` is set, twice that value |
- | | | will be used. |
- | | | |
- | | | **Warning:** In this case, it is essential (and |
- | | | currently not verified by pacemaker) that |
- | | | ``SBD_WATCHDOG_TIMEOUT`` is set to the same |
- | | | value on all nodes. |
- +---------------------------+---------+----------------------------------------------------+
- | concurrent-fencing | false | .. index:: |
- | | | pair: cluster option; concurrent-fencing |
- | | | |
- | | | Whether the cluster is allowed to initiate |
- | | | multiple fence actions concurrently. Fence actions |
- | | | initiated externally, such as via the |
- | | | ``stonith_admin`` tool or an application such as |
- | | | DLM, or by the fencer itself such as recurring |
- | | | device monitors and ``status`` and ``list`` |
- | | | commands, are not limited by this option. |
- +---------------------------+---------+----------------------------------------------------+
- | fence-reaction | stop | .. index:: |
- | | | pair: cluster option; fence-reaction |
- | | | |
- | | | How should a cluster node react if notified of its |
- | | | own fencing? A cluster node may receive |
- | | | notification of its own fencing if fencing is |
- | | | misconfigured, or if fabric fencing is in use that |
- | | | doesn't cut cluster communication. Allowed values |
- | | | are ``stop`` to attempt to immediately stop |
- | | | pacemaker and stay stopped, or ``panic`` to |
- | | | attempt to immediately reboot the local node, |
- | | | falling back to stop on failure. The default is |
- | | | likely to be changed to ``panic`` in a future |
- | | | release. *(since 2.0.3)* |
- +---------------------------+---------+----------------------------------------------------+
- | priority-fencing-delay | 0 | .. index:: |
- | | | pair: cluster option; priority-fencing-delay |
- | | | |
- | | | Apply this delay to any fencing targeting the lost |
- | | | nodes with the highest total resource priority in |
- | | | case we don't have the majority of the nodes in |
- | | | our cluster partition, so that the more |
- | | | significant nodes potentially win any fencing |
- | | | match (especially meaningful in a split-brain of a |
- | | | 2-node cluster). A promoted resource instance |
- | | | takes the resource's priority plus 1 if the |
- | | | resource's priority is not 0. Any static or random |
- | | | delays introduced by ``pcmk_delay_base`` and |
- | | | ``pcmk_delay_max`` configured for the |
- | | | corresponding fencing resources will be added to |
- | | | this delay. This delay should be significantly |
- | | | greater than (safely twice) the maximum delay from |
- | | | those parameters. *(since 2.0.4)* |
- +---------------------------+---------+----------------------------------------------------+
- | node-pending-timeout | 2h | .. index:: |
- | | | pair: cluster option; node-pending-timeout |
- | | | |
- | | | Fence nodes that do not join the controller |
- | | | process group within this much time after joining |
- | | | the cluster, to allow the cluster to continue |
- | | | managing resources. A value of 0 means never fence |
- | | | pending nodes. *(since 2.1.7)* |
- +---------------------------+---------+----------------------------------------------------+
- | cluster-delay | 60s | .. index:: |
- | | | pair: cluster option; cluster-delay |
- | | | |
- | | | Estimated maximum round-trip delay over the |
- | | | network (excluding action execution). If the DC |
- | | | requires an action to be executed on another node, |
- | | | it will consider the action failed if it does not |
- | | | get a response from the other node in this time |
- | | | (after considering the action's own timeout). The |
- | | | "correct" value will depend on the speed and load |
- | | | of your network and cluster nodes. |
- +---------------------------+---------+----------------------------------------------------+
- | dc-deadtime | 20s | .. index:: |
- | | | pair: cluster option; dc-deadtime |
- | | | |
- | | | How long to wait for a response from other nodes |
- | | | during startup. The "correct" value will depend on |
- | | | the speed/load of your network and the type of |
- | | | switches used. |
- +---------------------------+---------+----------------------------------------------------+
- | cluster-ipc-limit | 500 | .. index:: |
- | | | pair: cluster option; cluster-ipc-limit |
- | | | |
- | | | The maximum IPC message backlog before one cluster |
- | | | daemon will disconnect another. This is of use in |
- | | | large clusters, for which a good value is the |
- | | | number of resources in the cluster multiplied by |
- | | | the number of nodes. The default of 500 is also |
- | | | the minimum. Raise this if you see |
- | | | "Evicting client" messages for cluster daemon PIDs |
- | | | in the logs. |
- +---------------------------+---------+----------------------------------------------------+
- | pe-error-series-max | -1 | .. index:: |
- | | | pair: cluster option; pe-error-series-max |
- | | | |
- | | | The number of scheduler inputs resulting in errors |
- | | | to save. Used when reporting problems. A value of |
- | | | -1 means unlimited (report all), and 0 means none. |
- +---------------------------+---------+----------------------------------------------------+
- | pe-warn-series-max | 5000 | .. index:: |
- | | | pair: cluster option; pe-warn-series-max |
- | | | |
- | | | The number of scheduler inputs resulting in |
- | | | warnings to save. Used when reporting problems. A |
- | | | value of -1 means unlimited (report all), and 0 |
- | | | means none. |
- +---------------------------+---------+----------------------------------------------------+
- | pe-input-series-max | 4000 | .. index:: |
- | | | pair: cluster option; pe-input-series-max |
- | | | |
- | | | The number of "normal" scheduler inputs to save. |
- | | | Used when reporting problems. A value of -1 means |
- | | | unlimited (report all), and 0 means none. |
- +---------------------------+---------+----------------------------------------------------+
- | enable-acl | false | .. index:: |
- | | | pair: cluster option; enable-acl |
- | | | |
- | | | Whether :ref:`acl` should be used to authorize |
- | | | modifications to the CIB |
- +---------------------------+---------+----------------------------------------------------+
- | placement-strategy | default | .. index:: |
- | | | pair: cluster option; placement-strategy |
- | | | |
- | | | How the cluster should assign resources to nodes |
- | | | (see :ref:`utilization`). Allowed values are |
- | | | ``default``, ``utilization``, ``balanced``, and |
- | | | ``minimal``. |
- +---------------------------+---------+----------------------------------------------------+
- | node-health-strategy | none | .. index:: |
- | | | pair: cluster option; node-health-strategy |
- | | | |
- | | | How the cluster should react to node health |
- | | | attributes (see :ref:`node-health`). Allowed values|
- | | | are ``none``, ``migrate-on-red``, ``only-green``, |
- | | | ``progressive``, and ``custom``. |
- +---------------------------+---------+----------------------------------------------------+
- | node-health-base | 0 | .. index:: |
- | | | pair: cluster option; node-health-base |
- | | | |
- | | | The base health score assigned to a node. Only |
- | | | used when ``node-health-strategy`` is |
- | | | ``progressive``. |
- +---------------------------+---------+----------------------------------------------------+
- | node-health-green | 0 | .. index:: |
- | | | pair: cluster option; node-health-green |
- | | | |
- | | | The score to use for a node health attribute whose |
- | | | value is ``green``. Only used when |
- | | | ``node-health-strategy`` is ``progressive`` or |
- | | | ``custom``. |
- +---------------------------+---------+----------------------------------------------------+
- | node-health-yellow | 0 | .. index:: |
- | | | pair: cluster option; node-health-yellow |
- | | | |
- | | | The score to use for a node health attribute whose |
- | | | value is ``yellow``. Only used when |
- | | | ``node-health-strategy`` is ``progressive`` or |
- | | | ``custom``. |
- +---------------------------+---------+----------------------------------------------------+
- | node-health-red | 0 | .. index:: |
- | | | pair: cluster option; node-health-red |
- | | | |
- | | | The score to use for a node health attribute whose |
- | | | value is ``red``. Only used when |
- | | | ``node-health-strategy`` is ``progressive`` or |
- | | | ``custom``. |
- +---------------------------+---------+----------------------------------------------------+
- | cluster-recheck-interval | 15min | .. index:: |
- | | | pair: cluster option; cluster-recheck-interval |
- | | | |
- | | | Pacemaker is primarily event-driven, and looks |
- | | | ahead to know when to recheck the cluster for |
- | | | failure timeouts and most time-based rules |
- | | | *(since 2.0.3)*. However, it will also recheck the |
- | | | cluster after this amount of inactivity. This has |
- | | | two goals: rules with ``date_spec`` are only |
- | | | guaranteed to be checked this often, and it also |
- | | | serves as a fail-safe for some kinds of scheduler |
- | | | bugs. A value of 0 disables this polling; positive |
- | | | values are a time interval. |
- +---------------------------+---------+----------------------------------------------------+
- | shutdown-lock | false | .. index:: |
- | | | pair: cluster option; shutdown-lock |
- | | | |
- | | | The default of false allows active resources to be |
- | | | recovered elsewhere when their node is cleanly |
- | | | shut down, which is what the vast majority of |
- | | | users will want. However, some users prefer to |
- | | | make resources highly available only for failures, |
- | | | with no recovery for clean shutdowns. If this |
- | | | option is true, resources active on a node when it |
- | | | is cleanly shut down are kept "locked" to that |
- | | | node (not allowed to run elsewhere) until they |
- | | | start again on that node after it rejoins (or for |
- | | | at most ``shutdown-lock-limit``, if set). Stonith |
- | | | resources and Pacemaker Remote connections are |
- | | | never locked. Clone and bundle instances and the |
- | | | promoted role of promotable clones are currently |
- | | | never locked, though support could be added in a |
- | | | future release. Locks may be manually cleared |
- | | | using the ``--refresh`` option of ``crm_resource`` |
- | | | (both the resource and node must be specified; |
- | | | this works with remote nodes if their connection |
- | | | resource's ``target-role`` is set to ``Stopped``, |
- | | | but not if Pacemaker Remote is stopped on the |
- | | | remote node without disabling the connection |
- | | | resource). *(since 2.0.4)* |
- +---------------------------+---------+----------------------------------------------------+
- | shutdown-lock-limit | 0 | .. index:: |
- | | | pair: cluster option; shutdown-lock-limit |
- | | | |
- | | | If ``shutdown-lock`` is true, and this is set to a |
- | | | nonzero time duration, locked resources will be |
- | | | allowed to start after this much time has passed |
- | | | since the node shutdown was initiated, even if the |
- | | | node has not rejoined. (This works with remote |
- | | | nodes only if their connection resource's |
- | | | ``target-role`` is set to ``Stopped``.) |
- | | | *(since 2.0.4)* |
- +---------------------------+---------+----------------------------------------------------+
- | remove-after-stop | false | .. index:: |
- | | | pair: cluster option; remove-after-stop |
- | | | |
- | | | *Deprecated* Should the cluster remove |
- | | | resources from Pacemaker's executor after they are |
- | | | stopped? Values other than the default are, at |
- | | | best, poorly tested and potentially dangerous. |
- | | | This option is deprecated and will be removed in a |
- | | | future release. |
- +---------------------------+---------+----------------------------------------------------+
- | startup-fencing | true | .. index:: |
- | | | pair: cluster option; startup-fencing |
- | | | |
- | | | *Advanced Use Only:* Should the cluster fence |
- | | | unseen nodes at start-up? Setting this to false is |
- | | | unsafe, because the unseen nodes could be active |
- | | | and running resources but unreachable. |
- +---------------------------+---------+----------------------------------------------------+
- | election-timeout | 2min | .. index:: |
- | | | pair: cluster option; election-timeout |
- | | | |
- | | | *Advanced Use Only:* If you need to adjust this |
- | | | value, it probably indicates the presence of a bug.|
- +---------------------------+---------+----------------------------------------------------+
- | shutdown-escalation | 20min | .. index:: |
- | | | pair: cluster option; shutdown-escalation |
- | | | |
- | | | *Advanced Use Only:* If you need to adjust this |
- | | | value, it probably indicates the presence of a bug.|
- +---------------------------+---------+----------------------------------------------------+
- | join-integration-timeout | 3min | .. index:: |
- | | | pair: cluster option; join-integration-timeout |
- | | | |
- | | | *Advanced Use Only:* If you need to adjust this |
- | | | value, it probably indicates the presence of a bug.|
- +---------------------------+---------+----------------------------------------------------+
- | join-finalization-timeout | 30min | .. index:: |
- | | | pair: cluster option; join-finalization-timeout |
- | | | |
- | | | *Advanced Use Only:* If you need to adjust this |
- | | | value, it probably indicates the presence of a bug.|
- +---------------------------+---------+----------------------------------------------------+
- | transition-delay | 0s | .. index:: |
- | | | pair: cluster option; transition-delay |
- | | | |
- | | | *Advanced Use Only:* Delay cluster recovery for |
- | | | the configured interval to allow for additional or |
- | | | related events to occur. This can be useful if |
- | | | your configuration is sensitive to the order in |
- | | | which ping updates arrive. Enabling this option |
- | | | will slow down cluster recovery under all |
- | | | conditions. |
- +---------------------------+---------+----------------------------------------------------+
+ :widths: 2 2 2 5
+ :header-rows: 1
+
+ * - Name
+ - Type
+ - Default
+ - Description
+ * - .. _cluster_name:
+
+ .. index::
+ pair: cluster option; cluster-name
+
+ cluster-name
+ - :ref:`text <text>`
+ -
+ - An (optional) name for the cluster as a whole. This is mostly for users'
+ convenience for use as desired in administration, but can be used in the
+ Pacemaker configuration in :ref:`rules` (as the ``#cluster-name``
+ :ref:`node attribute <node-attribute-expressions-special>`). It may also
+ be used by higher-level tools when displaying cluster information, and
+ by certain resource agents (for example, the ``ocf:heartbeat:GFS2``
+ agent stores the cluster name in filesystem meta-data).
+ * - .. _dc_version:
+
+ .. index::
+ pair: cluster option; dc-version
+
+ dc-version
+ - :ref:`version <version>`
+ - *detected*
+ - Version of Pacemaker on the cluster's designated controller (DC).
+ Maintained by the cluster, and intended for diagnostic purposes.
+ * - .. _cluster_infrastructure:
+
+ .. index::
+ pair: cluster option; cluster-infrastructure
+
+ cluster-infrastructure
+ - :ref:`text <text>`
+ - *detected*
+ - The messaging layer with which Pacemaker is currently running.
+ Maintained by the cluster, and intended for informational and diagnostic
+ purposes.
+ * - .. _no_quorum_policy:
+
+ .. index::
+ pair: cluster option; no-quorum-policy
+
+ no-quorum-policy
+ - :ref:`enumeration <enumeration>`
+ - stop
+ - What to do when the cluster does not have quorum. Allowed values:
+
+ * ``ignore:`` continue all resource management
+ * ``freeze:`` continue resource management, but don't recover resources
+ from nodes not in the affected partition
+ * ``stop:`` stop all resources in the affected cluster partition
+ * ``demote:`` demote promotable resources and stop all other resources
+ in the affected cluster partition *(since 2.0.5)*
+ * ``suicide:`` fence all nodes in the affected cluster partition
+ * - .. _batch_limit:
+
+ .. index::
+ pair: cluster option; batch-limit
+
+ batch-limit
+ - :ref:`integer <integer>`
+ - 0
+ - The maximum number of actions that the cluster may execute in parallel
+ across all nodes. The ideal value will depend on the speed and load
+ of your network and cluster nodes. If zero, the cluster will impose a
+ dynamically calculated limit only when any node has high load. If -1,
+ the cluster will not impose any limit.
+ * - .. _migration_limit:
+
+ .. index::
+ pair: cluster option; migration-limit
+
+ migration-limit
+ - :ref:`integer <integer>`
+ - -1
+ - The number of :ref:`live migration <live-migration>` actions that the
+ cluster is allowed to execute in parallel on a node. A value of -1 means
+ unlimited.
+ * - .. _symmetric_cluster:
+
+ .. index::
+ pair: cluster option; symmetric-cluster
+
+ symmetric-cluster
+ - :ref:`boolean <boolean>`
+ - true
+ - If true, resources can run on any node by default. If false, a resource
+ is allowed to run on a node only if a
+ :ref:`location constraint <location-constraint>` enables it.
+ * - .. _stop_all_resources:
+
+ .. index::
+ pair: cluster option; stop-all-resources
+
+ stop-all-resources
+ - :ref:`boolean <boolean>`
+ - false
+ - Whether all resources should be disallowed from running (can be useful
+ during maintenance or troubleshooting)
+ * - .. _stop_orphan_resources:
+
+ .. index::
+ pair: cluster option; stop-orphan-resources
+
+ stop-orphan-resources
+ - :ref:`boolean <boolean>`
+ - true
+ - Whether resources that have been deleted from the configuration should
+ be stopped. This value takes precedence over
+ :ref:`is-managed <is_managed>` (that is, even unmanaged resources will
+ be stopped when orphaned if this value is ``true``).
+ * - .. _stop_orphan_actions:
+
+ .. index::
+ pair: cluster option; stop-orphan-actions
+
+ stop-orphan-actions
+ - :ref:`boolean <boolean>`
+ - true
+ - Whether recurring :ref:`operations <operation>` that have been deleted
+ from the configuration should be cancelled
+ * - .. _start_failure_is_fatal:
+
+ .. index::
+ pair: cluster option; start-failure-is-fatal
+
+ start-failure-is-fatal
+ - :ref:`boolean <boolean>`
+ - true
+ - Whether a failure to start a resource on a particular node prevents
+ further start attempts on that node. If ``false``, the cluster will
+ decide whether the node is still eligible based on the resource's
+ current failure count and ``migration-threshold``.
+ * - .. _enable_startup_probes:
+
+ .. index::
+ pair: cluster option; enable-startup-probes
+
+ enable-startup-probes
+ - :ref:`boolean <boolean>`
+ - true
+ - Whether the cluster should check the pre-existing state of resources
+ when the cluster starts
+ * - .. _maintenance_mode:
+
+ .. index::
+ pair: cluster option; maintenance-mode
+
+ maintenance-mode
+ - :ref:`boolean <boolean>`
+ - false
+ - If true, the cluster will not start or stop any resource in the cluster,
+ and any recurring operations (except those specifying ``role`` as
+ ``Stopped``) will be paused. If true, this overrides the
+ :ref:`maintenance <node_maintenance>` node attribute,
+ :ref:`is-managed <is_managed>` and :ref:`maintenance <rsc_maintenance>`
+ resource meta-attributes, and :ref:`enabled <op_enabled>` operation
+ meta-attribute.
+ * - .. _stonith_enabled:
+
+ .. index::
+ pair: cluster option; stonith-enabled
+
+ stonith-enabled
+ - :ref:`boolean <boolean>`
+ - true
+ - Whether the cluster is allowed to fence nodes (for example, failed nodes
+ and nodes with resources that can't be stopped).
+
+ If true, at least one fence device must be configured before resources
+ are allowed to run.
+
+ If false, unresponsive nodes are immediately assumed to be running no
+ resources, and resource recovery on online nodes starts without any
+ further protection (which can mean *data loss* if the unresponsive node
+ still accesses shared storage, for example). See also the
+ :ref:`requires <requires>` resource meta-attribute.
+ * - .. _stonith_action:
+
+ .. index::
+ pair: cluster option; stonith-action
+
+ stonith-action
+ - :ref:`enumeration <enumeration>`
+ - reboot
+ - Action the cluster should send to the fence agent when a node must be
+ fenced. Allowed values are ``reboot``, ``off``, and (for legacy agents
+ only) ``poweroff``.
+ * - .. _stonith_timeout:
+
+ .. index::
+ pair: cluster option; stonith-timeout
+
+ stonith-timeout
+ - :ref:`duration <duration>`
+ - 60s
+ - How long to wait for ``on``, ``off``, and ``reboot`` fence actions to
+ complete by default.
+ * - .. _stonith_max_attempts:
+
+ .. index::
+ pair: cluster option; stonith-max-attempts
+
+ stonith-max-attempts
+ - :ref:`score <score>`
+ - 10
+ - How many times fencing can fail for a target before the cluster will no
+ longer immediately re-attempt it. Any value below 1 will be ignored, and
+ the default will be used instead.
+ * - .. _stonith_watchdog_timeout:
+
+ .. index::
+ pair: cluster option; stonith-watchdog-timeout
+
+ stonith-watchdog-timeout
+ - :ref:`timeout <timeout>`
+ - 0
+ - If nonzero, and the cluster detects ``have-watchdog`` as ``true``, then
+ watchdog-based self-fencing will be performed via SBD when fencing is
+ required, without requiring a fencing resource explicitly configured.
+
+ If this is set to a positive value, unseen nodes are assumed to
+ self-fence within this much time.
+
+ **Warning:** It must be ensured that this value is larger than the
+ ``SBD_WATCHDOG_TIMEOUT`` environment variable on all nodes. Pacemaker
+ verifies the setting individually on each node and will refuse to start
+ (or will shut down, if the value is changed at runtime) when it is
+ misconfigured. It is strongly recommended that ``SBD_WATCHDOG_TIMEOUT``
+ be set to the same value on all nodes.
+
+ If this is set to a negative value, and ``SBD_WATCHDOG_TIMEOUT`` is set,
+ twice that value will be used.
+
+ **Warning:** In this case, it is essential (and currently not verified
+ by pacemaker) that ``SBD_WATCHDOG_TIMEOUT`` is set to the same value on
+ all nodes.
+ * - .. _concurrent-fencing:
+
+ .. index::
+ pair: cluster option; concurrent-fencing
+
+ concurrent-fencing
+ - :ref:`boolean <boolean>`
+ - false
+ - Whether the cluster is allowed to initiate multiple fence actions
+ concurrently. Fence actions initiated externally (for example, via the
+ ``stonith_admin`` tool or an application such as DLM) or by the fencer
+ itself (such as recurring device monitors and ``status`` and ``list``
+ commands) are not limited by this option.
+ * - .. _fence_reaction:
+
+ .. index::
+ pair: cluster option; fence-reaction
+
+ fence-reaction
+ - :ref:`enumeration <enumeration>`
+ - stop
+ - How should a cluster node react if notified of its own fencing? A
+ cluster node may receive notification of its own fencing if fencing is
+ misconfigured, or if fabric fencing is in use that doesn't cut cluster
+ communication. Allowed values are ``stop`` to attempt to immediately
+ stop Pacemaker and stay stopped, or ``panic`` to attempt to immediately
+ reboot the local node, falling back to stop on failure. The default is
+ likely to be changed to ``panic`` in a future release. *(since 2.0.3)*
+ * - .. _priority_fencing_delay:
+
+ .. index::
+ pair: cluster option; priority-fencing-delay
+
+ priority-fencing-delay
+ - :ref:`duration <duration>`
+ - 0
+ - Apply this delay to any fencing targeting the lost nodes with the
+ highest total resource priority when our cluster partition does not
+ hold a majority of the nodes, so that the more significant nodes are
+ more likely to win any fencing match (especially meaningful in a
+ split-brain of a 2-node cluster). A promoted resource instance takes the
+ resource's priority plus 1 if the resource's priority is not 0. Any
+ static or random delays introduced by ``pcmk_delay_base`` and
+ ``pcmk_delay_max`` configured for the corresponding fencing resources
+ will be added to this delay. This delay should be significantly greater
+ than (safely twice) the maximum delay from those parameters. *(since
+ 2.0.4)*
+ * - .. _node_pending_timeout:
+
+ .. index::
+ pair: cluster option; node-pending-timeout
+
+ node-pending-timeout
+ - :ref:`duration <duration>`
+ - 2h
+ - Fence nodes that do not join the controller process group within this
+ much time after joining the cluster, to allow the cluster to continue
+ managing resources. A value of 0 means never fence pending nodes.
+ *(since 2.1.7)*
+ * - .. _cluster_delay:
+
+ .. index::
+ pair: cluster option; cluster-delay
+
+ cluster-delay
+ - :ref:`duration <duration>`
+ - 60s
+ - If the DC requires an action to be executed on another node, it will
+ consider the action failed if it does not get a response from the other
+ node within this time (beyond the action's own timeout). The ideal value
+ will depend on the speed and load of your network and cluster nodes.
+ * - .. _dc_deadtime:
+
+ .. index::
+ pair: cluster option; dc-deadtime
+
+ dc-deadtime
+ - :ref:`duration <duration>`
+ - 20s
+ - How long to wait for a response from other nodes when electing a DC. The
+ ideal value will depend on the speed and load of your network and
+ cluster nodes.
+ * - .. _cluster_ipc_limit:
+
+ .. index::
+ pair: cluster option; cluster-ipc-limit
+
+ cluster-ipc-limit
+ - :ref:`nonnegative integer <nonnegative_integer>`
+ - 500
+ - The maximum IPC message backlog before one cluster daemon will
+ disconnect another. This is of use in large clusters, for which a good
+ value is the number of resources in the cluster multiplied by the number
+ of nodes. The default of 500 is also the minimum. Raise this if you see
+ "Evicting client" log messages for cluster daemon process IDs.
+ * - .. _pe_error_series_max:
+
+ .. index::
+ pair: cluster option; pe-error-series-max
+
+ pe-error-series-max
+ - :ref:`integer <integer>`
+ - -1
+ - The number of scheduler inputs resulting in errors to save. These inputs
+ can be helpful during troubleshooting and when reporting issues. A
+ negative value means save all inputs, and 0 means save none.
+ * - .. _pe_warn_series_max:
+
+ .. index::
+ pair: cluster option; pe-warn-series-max
+
+ pe-warn-series-max
+ - :ref:`integer <integer>`
+ - 5000
+ - The number of scheduler inputs resulting in warnings to save. These
+ inputs can be helpful during troubleshooting and when reporting issues.
+ A negative value means save all inputs, and 0 means save none.
+ * - .. _pe_input_series_max:
+
+ .. index::
+ pair: cluster option; pe-input-series-max
+
+ pe-input-series-max
+ - :ref:`integer <integer>`
+ - 4000
+ - The number of "normal" scheduler inputs to save. These inputs can be
+ helpful during troubleshooting and when reporting issues. A negative
+ value means save all inputs, and 0 means save none.
+ * - .. _enable_acl:
+
+ .. index::
+ pair: cluster option; enable-acl
+
+ enable-acl
+ - :ref:`boolean <boolean>`
+ - false
+ - Whether :ref:`access control lists <acl>` should be used to authorize
+ CIB modifications
+ * - .. _placement_strategy:
+
+ .. index::
+ pair: cluster option; placement-strategy
+
+ placement-strategy
+ - :ref:`enumeration <enumeration>`
+ - default
+ - How the cluster should assign resources to nodes (see
+ :ref:`utilization`). Allowed values are ``default``, ``utilization``,
+ ``balanced``, and ``minimal``.
+ * - .. _node_health_strategy:
+
+ .. index::
+ pair: cluster option; node-health-strategy
+
+ node-health-strategy
+ - :ref:`enumeration <enumeration>`
+ - none
+ - How the cluster should react to :ref:`node health <node-health>`
+ attributes. Allowed values are ``none``, ``migrate-on-red``,
+ ``only-green``, ``progressive``, and ``custom``.
+ * - .. _node_health_base:
+
+ .. index::
+ pair: cluster option; node-health-base
+
+ node-health-base
+ - :ref:`score <score>`
+ - 0
+ - The base health score assigned to a node. Only used when
+ ``node-health-strategy`` is ``progressive``.
+ * - .. _node_health_green:
+
+ .. index::
+ pair: cluster option; node-health-green
+
+ node-health-green
+ - :ref:`score <score>`
+ - 0
+ - The score to use for a node health attribute whose value is ``green``.
+ Only used when ``node-health-strategy`` is ``progressive`` or
+ ``custom``.
+ * - .. _node_health_yellow:
+
+ .. index::
+ pair: cluster option; node-health-yellow
+
+ node-health-yellow
+ - :ref:`score <score>`
+ - 0
+ - The score to use for a node health attribute whose value is ``yellow``.
+ Only used when ``node-health-strategy`` is ``progressive`` or
+ ``custom``.
+ * - .. _node_health_red:
+
+ .. index::
+ pair: cluster option; node-health-red
+
+ node-health-red
+ - :ref:`score <score>`
+ - 0
+ - The score to use for a node health attribute whose value is ``red``.
+ Only used when ``node-health-strategy`` is ``progressive`` or
+ ``custom``.
+ * - .. _cluster_recheck_interval:
+
+ .. index::
+ pair: cluster option; cluster-recheck-interval
+
+ cluster-recheck-interval
+ - :ref:`duration <duration>`
+ - 15min
+ - Pacemaker is primarily event-driven, and looks ahead to know when to
+ recheck the cluster for failure timeouts and most time-based rules
+ *(since 2.0.3)*. However, it will also recheck the cluster after this
+ amount of inactivity. This has two goals: rules with ``date_spec`` are
+ only guaranteed to be checked this often, and it also serves as a
+ fail-safe for some kinds of scheduler bugs. A value of 0 disables this
+ polling.
+ * - .. _shutdown_lock:
+
+ .. index::
+ pair: cluster option; shutdown-lock
+
+ shutdown-lock
+ - :ref:`boolean <boolean>`
+ - false
+ - The default of false allows active resources to be recovered elsewhere
+ when their node is cleanly shut down, which is what the vast majority of
+ users will want. However, some users prefer to make resources highly
+ available only for failures, with no recovery for clean shutdowns. If
+ this option is true, resources active on a node when it is cleanly shut
+ down are kept "locked" to that node (not allowed to run elsewhere) until
+ they start again on that node after it rejoins (or for at most
+ ``shutdown-lock-limit``, if set). Stonith resources and Pacemaker Remote
+ connections are never locked. Clone and bundle instances and the
+ promoted role of promotable clones are currently never locked, though
+ support could be added in a future release. Locks may be manually
+ cleared using the ``--refresh`` option of ``crm_resource`` (both the
+ resource and node must be specified; this works with remote nodes if
+ their connection resource's ``target-role`` is set to ``Stopped``, but
+ not if Pacemaker Remote is stopped on the remote node without disabling
+ the connection resource). *(since 2.0.4)*
+ * - .. _shutdown_lock_limit:
+
+ .. index::
+ pair: cluster option; shutdown-lock-limit
+
+ shutdown-lock-limit
+ - :ref:`duration <duration>`
+ - 0
+ - If ``shutdown-lock`` is true, and this is set to a nonzero time
+ duration, locked resources will be allowed to start after this much time
+ has passed since the node shutdown was initiated, even if the node has
+ not rejoined. (This works with remote nodes only if their connection
+ resource's ``target-role`` is set to ``Stopped``.) *(since 2.0.4)*
+ * - .. _remove_after_stop:
+
+ .. index::
+ pair: cluster option; remove-after-stop
+
+ remove-after-stop
+ - :ref:`boolean <boolean>`
+ - false
+ - *Deprecated* Whether the cluster should remove resources from
+ Pacemaker's executor after they are stopped. Values other than the
+ default are, at best, poorly tested and potentially dangerous. This
+ option is deprecated and will be removed in a future release.
+ * - .. _startup_fencing:
+
+ .. index::
+ pair: cluster option; startup-fencing
+
+ startup-fencing
+ - :ref:`boolean <boolean>`
+ - true
+ - *Advanced Use Only:* Whether the cluster should fence unseen nodes at
+ start-up. Setting this to false is unsafe, because the unseen nodes
+ could be active and running resources but unreachable. ``dc-deadtime``
+ acts as a grace period before this fencing, since a DC must be elected
+ to schedule fencing.
+ * - .. _election_timeout:
+
+ .. index::
+ pair: cluster option; election-timeout
+
+ election-timeout
+ - :ref:`duration <duration>`
+ - 2min
+ - *Advanced Use Only:* If a winner is not declared within this much time
+ of starting an election, the node that initiated the election will
+ declare itself the winner.
+ * - .. _shutdown_escalation:
+
+ .. index::
+ pair: cluster option; shutdown-escalation
+
+ shutdown-escalation
+ - :ref:`duration <duration>`
+ - 20min
+ - *Advanced Use Only:* The controller will exit immediately if a shutdown
+ does not complete within this much time.
+ * - .. _join_integration_timeout:
+
+ .. index::
+ pair: cluster option; join-integration-timeout
+
+ join-integration-timeout
+ - :ref:`duration <duration>`
+ - 3min
+ - *Advanced Use Only:* If you need to adjust this value, it probably
+ indicates the presence of a bug.
+ * - .. _join_finalization_timeout:
+
+ .. index::
+ pair: cluster option; join-finalization-timeout
+
+ join-finalization-timeout
+ - :ref:`duration <duration>`
+ - 30min
+ - *Advanced Use Only:* If you need to adjust this value, it probably
+ indicates the presence of a bug.
+ * - .. _transition_delay:
+
+ .. index::
+ pair: cluster option; transition-delay
+
+ transition-delay
+ - :ref:`duration <duration>`
+ - 0s
+ - *Advanced Use Only:* Delay cluster recovery for the configured interval
+ to allow for additional or related events to occur. This can be useful
+ if your configuration is sensitive to the order in which ping updates
+ arrive. Enabling this option will slow down cluster recovery under all
+ conditions.
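(As a quick illustration of where these options live: cluster options are
stored as ``nvpair`` entries in a ``cluster_property_set`` within the CIB's
``crm_config`` section. The IDs and values below are arbitrary examples, not
recommendations.)

.. code-block:: xml

   <crm_config>
     <cluster_property_set id="cib-bootstrap-options">
       <nvpair id="option-no-quorum-policy" name="no-quorum-policy" value="freeze"/>
       <nvpair id="option-stonith-enabled" name="stonith-enabled" value="true"/>
     </cluster_property_set>
   </crm_config>

Tools such as ``crm_attribute`` or a higher-level configuration shell normally
maintain these entries, so editing the XML directly is rarely necessary.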
diff --git a/doc/sphinx/Pacemaker_Explained/status.rst b/doc/sphinx/Pacemaker_Explained/status.rst
index 2d7dd7e81c..6384edaef6 100644
--- a/doc/sphinx/Pacemaker_Explained/status.rst
+++ b/doc/sphinx/Pacemaker_Explained/status.rst
@@ -1,372 +1,368 @@
.. index::
single: status
single: XML element, status
Status -- Here be dragons
-------------------------
Most users never need to understand the contents of the status section
and can be happy with the output from ``crm_mon``.
However for those with a curious inclination, this section attempts to
provide an overview of its contents.
.. index::
single: node; status
Node Status
###########
In addition to the cluster's configuration, the CIB holds an
up-to-date representation of each cluster node in the ``status`` section.
.. topic:: A bare-bones status entry for a healthy node **cl-virt-1**
.. code-block:: xml
<node_state id="1" uname="cl-virt-1" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
<transient_attributes id="1"/>
<lrm id="1"/>
</node_state>
Users are highly recommended *not* to modify any part of a node's
state *directly*. The cluster will periodically regenerate the entire
section from authoritative sources, so any changes should be done
with the tools appropriate to those sources.
-
+
.. table:: **Authoritative Sources for State Information**
:widths: 1 1
+----------------------+----------------------+
| CIB Object | Authoritative Source |
+======================+======================+
| node_state | pacemaker-controld |
+----------------------+----------------------+
| transient_attributes | pacemaker-attrd |
+----------------------+----------------------+
| lrm | pacemaker-execd |
+----------------------+----------------------+
The fields used in the ``node_state`` objects are named as they are
-largely for historical reasons and are rooted in Pacemaker's origins
-as the resource manager for the older Heartbeat project. They have remained
-unchanged to preserve compatibility with older versions.
+largely for historical reasons, to maintain compatibility with older versions.
.. table:: **Node Status Fields**
:widths: 1 3
+------------------+----------------------------------------------------------+
| Field | Description |
+==================+==========================================================+
| id | .. index: |
| | single: id; node status |
| | single: node; status, id |
| | |
| | Unique identifier for the node. Corosync-based clusters |
| | use a numeric counter. |
+------------------+----------------------------------------------------------+
| uname | .. index:: |
| | single: uname; node status |
| | single: node; status, uname |
| | |
| | The node's name as known by the cluster |
+------------------+----------------------------------------------------------+
| in_ccm | .. index:: |
| | single: in_ccm; node status |
| | single: node; status, in_ccm |
| | |
| | Is the node a member at the cluster communication layer? |
| | Allowed values: ``true``, ``false``. |
+------------------+----------------------------------------------------------+
| crmd | .. index:: |
| | single: crmd; node status |
| | single: node; status, crmd |
| | |
| | Is the node a member at the pacemaker layer? Allowed |
| | values: ``online``, ``offline``. |
+------------------+----------------------------------------------------------+
| crm-debug-origin | .. index:: |
| | single: crm-debug-origin; node status |
| | single: node; status, crm-debug-origin |
| | |
| | The name of the source function that made the most |
| | recent change (for debugging purposes). |
+------------------+----------------------------------------------------------+
| join | .. index:: |
| | single: join; node status |
| | single: node; status, join |
| | |
| | Does the node participate in hosting resources? |
| | Allowed values: ``down``, ``pending``, ``member``, |
| | ``banned``. |
+------------------+----------------------------------------------------------+
| expected | .. index:: |
| | single: expected; node status |
| | single: node; status, expected |
| | |
| | Expected value for ``join``. |
+------------------+----------------------------------------------------------+
The cluster uses these fields to determine whether, at the node level, the
node is healthy or is in a failed state and needs to be fenced.
Transient Node Attributes
#########################
Like regular :ref:`node_attributes`, the name/value
pairs listed in the ``transient_attributes`` section help to describe the
node. However they are forgotten by the cluster when the node goes offline.
This can be useful, for instance, when you want a node to be in standby mode
(not able to run resources) just until the next reboot.
In addition to any values the administrator sets, the cluster will
also store information about failed resources here.
.. topic:: A set of transient node attributes for node **cl-virt-1**
.. code-block:: xml
<transient_attributes id="cl-virt-1">
<instance_attributes id="status-cl-virt-1">
<nvpair id="status-cl-virt-1-pingd" name="pingd" value="3"/>
<nvpair id="status-cl-virt-1-probe_complete" name="probe_complete" value="true"/>
<nvpair id="status-cl-virt-1-fail-count-pingd:0.monitor_30000" name="fail-count-pingd:0#monitor_30000" value="1"/>
<nvpair id="status-cl-virt-1-last-failure-pingd:0" name="last-failure-pingd:0" value="1239009742"/>
</instance_attributes>
</transient_attributes>
In the above example, we can see that a monitor on the ``pingd:0`` resource has
failed once, at 09:22:22 UTC 6 April 2009 [#]_.
We also see that the node is connected to three **pingd** peers and that
all known resources have been checked for on this machine (``probe_complete``).
.. index::
single: Operation History
Operation History
#################
-A node's resource history is held in the ``lrm_resources`` tag (a child
-of the ``lrm`` tag). The information stored here includes enough
+A node's resource history is held in the ``lrm_resources`` element (a child
+of the ``lrm`` element). The information stored here includes enough
information for the cluster to stop the resource safely if it is
removed from the ``configuration`` section. Specifically, the resource's
``id``, ``class``, ``type`` and ``provider`` are stored.
.. topic:: A record of the ``apcstonith`` resource
.. code-block:: xml
<lrm_resource id="apcstonith" type="fence_apc_snmp" class="stonith"/>
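As a rough sketch of how these elements nest inside a node's ``status`` entry
(attribute values are taken from the examples in this section; the
``lrm_rsc_op`` history entries are described next):

.. code-block:: xml

   <node_state id="1" uname="cl-virt-1">
     <lrm id="1">
       <lrm_resources>
         <lrm_resource id="apcstonith" type="fence_apc_snmp" class="stonith">
           <!-- lrm_rsc_op history entries go here -->
         </lrm_resource>
       </lrm_resources>
     </lrm>
   </node_state>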
-Additionally, we store the last job for every combination of
-``resource``, ``action`` and ``interval``. The concatenation of the values in
-this tuple are used to create the id of the ``lrm_rsc_op`` object.
+Additionally, we store history entries for certain actions.
-.. table:: **Contents of an lrm_rsc_op job**
+.. table:: **Attributes of an lrm_rsc_op element**
:class: longtable
:widths: 1 3
+------------------+----------------------------------------------------------+
| Field | Description |
+==================+==========================================================+
| id | .. index:: |
| | single: id; action status |
| | single: action; status, id |
| | |
- | | Identifier for the job constructed from the resource's |
- | | ``operation`` and ``interval``. |
+ | | Identifier for the history entry constructed from the |
+ | | resource ID, action name, and operation interval. |
+------------------+----------------------------------------------------------+
| call-id | .. index:: |
| | single: call-id; action status |
| | single: action; status, call-id |
| | |
- | | The job's ticket number. Used as a sort key to determine |
- | | the order in which the jobs were executed. |
+ | | A node-specific counter used to determine the order in |
+ | | which actions were executed. |
+------------------+----------------------------------------------------------+
| operation | .. index:: |
| | single: operation; action status |
| | single: action; status, operation |
| | |
- | | The action the resource agent was invoked with. |
+ | | The action name the resource agent was invoked with. |
+------------------+----------------------------------------------------------+
| interval | .. index:: |
| | single: interval; action status |
| | single: action; status, interval |
| | |
| | The frequency, in milliseconds, at which the operation |
- | | will be repeated. A one-off job is indicated by 0. |
+ | | will be repeated. One-time execution is indicated by 0. |
+------------------+----------------------------------------------------------+
| op-status | .. index:: |
| | single: op-status; action status |
| | single: action; status, op-status |
| | |
- | | The job's status. Generally this will be either 0 (done) |
- | | or -1 (pending). Rarely used in favor of ``rc-code``. |
+ | | The execution status of this action. The meanings of |
+ | | these codes are internal to Pacemaker. |
+------------------+----------------------------------------------------------+
| rc-code | .. index:: |
| | single: rc-code; action status |
| | single: action; status, rc-code |
| | |
- | | The job's result. Refer to the *Resource Agents* chapter |
- | | of *Pacemaker Administration* for details on what the |
- | | values here mean and how they are interpreted. |
+ | | The resource agent's exit status for this action. Refer |
+ | | to the *Resource Agents* chapter of |
+ | | *Pacemaker Administration* for how these values are |
+ | | interpreted. |
+------------------+----------------------------------------------------------+
| last-rc-change | .. index:: |
| | single: last-rc-change; action status |
| | single: action; status, last-rc-change |
| | |
| | Machine-local date/time, in seconds since epoch, at |
- | | which the job first returned the current value of |
+ | | which the action first returned the current value of |
| | ``rc-code``. For diagnostic purposes. |
+------------------+----------------------------------------------------------+
| exec-time | .. index:: |
| | single: exec-time; action status |
| | single: action; status, exec-time |
| | |
- | | Time, in milliseconds, that the job was running for. |
+ | | Time, in milliseconds, that the action was running for. |
| | For diagnostic purposes. |
+------------------+----------------------------------------------------------+
| queue-time | .. index:: |
| | single: queue-time; action status |
| | single: action; status, queue-time |
| | |
- | | Time, in seconds, that the job was queued for in the |
+ | | Time, in seconds, that the action was queued for in the |
| | local executor. For diagnostic purposes. |
+------------------+----------------------------------------------------------+
| crm_feature_set | .. index:: |
| | single: crm_feature_set; action status |
| | single: action; status, crm_feature_set |
| | |
- | | The version which this job description conforms to. Used |
- | | when processing ``op-digest``. |
+ | | The Pacemaker feature set used to record this entry. |
+------------------+----------------------------------------------------------+
| transition-key | .. index:: |
| | single: transition-key; action status |
| | single: action; status, transition-key |
| | |
- | | A concatenation of the job's graph action number, the |
+ | | A concatenation of the action's graph action number, the |
| | graph number, the expected result and the UUID of the |
| | controller instance that scheduled it. This is used to |
| | construct ``transition-magic`` (below). |
+------------------+----------------------------------------------------------+
| transition-magic | .. index:: |
| | single: transition-magic; action status |
| | single: action; status, transition-magic |
| | |
- | | A concatenation of the job's ``op-status``, ``rc-code`` |
+ | | A concatenation of ``op-status``, ``rc-code`` |
| | and ``transition-key``. Guaranteed to be unique for the |
| | life of the cluster (which ensures it is part of CIB |
| | update notifications) and contains all the information |
| | needed for the controller to correctly analyze and |
- | | process the completed job. Most importantly, the |
- | | decomposed elements tell the controller if the job |
+ | | process the completed action. Most importantly, the |
+ | | decomposed elements tell the controller if the history |
| | entry was expected and whether it failed. |
+------------------+----------------------------------------------------------+
| op-digest | .. index:: |
| | single: op-digest; action status |
| | single: action; status, op-digest |
| | |
| | An MD5 sum representing the parameters passed to the |
- | | job. Used to detect changes to the configuration, to |
+ | | action. Used to detect changes to the configuration, to |
| | restart resources if necessary. |
+------------------+----------------------------------------------------------+
| crm-debug-origin | .. index:: |
| | single: crm-debug-origin; action status |
| | single: action; status, crm-debug-origin |
| | |
| | The origin of the current values. For diagnostic |
| | purposes. |
+------------------+----------------------------------------------------------+
Simple Operation History Example
________________________________
.. topic:: A monitor operation (determines current state of the ``apcstonith`` resource)
.. code-block:: xml
<lrm_resource id="apcstonith" type="fence_apc_snmp" class="stonith">
<lrm_rsc_op id="apcstonith_monitor_0" operation="monitor" call-id="2"
rc-code="7" op-status="0" interval="0"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
op-digest="2e3da9274d3550dc6526fb24bfcbcba0"
transition-key="22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
transition-magic="0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
last-rc-change="1239008085" exec-time="10" queue-time="0"/>
</lrm_resource>
-In the above example, the job is a non-recurring monitor operation
+In the above example, the action is a non-recurring monitor operation
often referred to as a "probe" for the ``apcstonith`` resource.
The cluster schedules probes for every configured resource on a node when
the node first starts, in order to determine the resource's current state
before it takes any further action.
From the ``transition-key``, we can see that this was the 22nd action of
the 2nd graph produced by this instance of the controller
(2668bbeb-06d5-40f9-936d-24cb7f87006a).
The third field of the ``transition-key`` contains a 7, which indicates
-that the job expects to find the resource inactive. By looking at the ``rc-code``
-property, we see that this was the case.
+that the cluster expects to find the resource inactive. By looking at the
+``rc-code`` property, we see that this was the case.
-As that is the only job recorded for this node, we can conclude that
+As that is the only action recorded for this node, we can conclude that
the cluster started the resource elsewhere.
Complex Operation History Example
_________________________________
-.. topic:: Resource history of a ``pingd`` clone with multiple jobs
+.. topic:: Resource history of a ``pingd`` clone with multiple entries
.. code-block:: xml
<lrm_resource id="pingd:0" type="pingd" class="ocf" provider="pacemaker">
<lrm_rsc_op id="pingd:0_monitor_30000" operation="monitor" call-id="34"
rc-code="0" op-status="0" interval="30000"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
transition-key="10:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
last-rc-change="1239009741" exec-time="10" queue-time="0"/>
<lrm_rsc_op id="pingd:0_stop_0" operation="stop"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" call-id="32"
rc-code="0" op-status="0" interval="0"
transition-key="11:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
last-rc-change="1239009741" exec-time="10" queue-time="0"/>
<lrm_rsc_op id="pingd:0_start_0" operation="start" call-id="33"
rc-code="0" op-status="0" interval="0"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
transition-key="31:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
last-rc-change="1239009741" exec-time="10" queue-time="0" />
<lrm_rsc_op id="pingd:0_monitor_0" operation="monitor" call-id="3"
rc-code="0" op-status="0" interval="0"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
transition-key="23:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
last-rc-change="1239008085" exec-time="20" queue-time="0"/>
</lrm_resource>
-When more than one job record exists, it is important to first sort
+When more than one history entry exists, it is important to first sort
them by ``call-id`` before interpreting them.
Once sorted, the above example can be summarized as:
#. A non-recurring monitor operation returning 7 (not running), with a ``call-id`` of 3
#. A stop operation returning 0 (success), with a ``call-id`` of 32
#. A start operation returning 0 (success), with a ``call-id`` of 33
#. A recurring monitor returning 0 (success), with a ``call-id`` of 34
-The cluster processes each job record to build up a picture of the
+The cluster processes each history entry to build up a picture of the
resource's state. After the first and second entries, it is
considered stopped, and after the third it is considered active.
Based on the last operation, we can tell that the resource is
currently active.
Additionally, from the presence of a ``stop`` operation with a lower
``call-id`` than that of the ``start`` operation, we can conclude that the
resource has been restarted. Specifically this occurred as part of
actions 11 and 31 of transition 11 from the controller instance with the key
``2668bbeb...``. This information can be helpful for locating the
relevant section of the logs when looking for the source of a failure.
.. [#] You can use the standard ``date`` command to print a human-readable version
of any seconds-since-epoch value, for example ``date -d @1239009742``.
diff --git a/lib/pacemaker/pcmk_sched_nodes.c b/lib/pacemaker/pcmk_sched_nodes.c
index f7b1428c5c..03f09ef344 100644
--- a/lib/pacemaker/pcmk_sched_nodes.c
+++ b/lib/pacemaker/pcmk_sched_nodes.c
@@ -1,429 +1,434 @@
/*
* Copyright 2004-2023 the Pacemaker project contributors
*
* The version control history for this file may have further details.
*
* This source code is licensed under the GNU General Public License version 2
* or later (GPLv2+) WITHOUT ANY WARRANTY.
*/
#include <crm_internal.h>
#include <crm/msg_xml.h>
#include <crm/common/xml_internal.h>
#include <pacemaker-internal.h>
#include <pacemaker.h>
#include "libpacemaker_private.h"
/*!
* \internal
* \brief Check whether a node is available to run resources
*
* \param[in] node Node to check
* \param[in] consider_score If true, consider a negative score unavailable
* \param[in] consider_guest If true, consider a guest node unavailable whose
* resource will not be active
*
* \return true if node is online and not shutting down, unclean, or in standby
* or maintenance mode, otherwise false
*/
bool
pcmk__node_available(const pcmk_node_t *node, bool consider_score,
bool consider_guest)
{
if ((node == NULL) || (node->details == NULL) || !node->details->online
|| node->details->shutdown || node->details->unclean
|| node->details->standby || node->details->maintenance) {
return false;
}
if (consider_score && (node->weight < 0)) {
return false;
}
// @TODO Go through all callers to see which should set consider_guest
if (consider_guest && pe__is_guest_node(node)) {
pcmk_resource_t *guest = node->details->remote_rsc->container;
if (guest->fns->location(guest, NULL, FALSE) == NULL) {
return false;
}
}
return true;
}
/*!
* \internal
* \brief Copy a hash table of node objects
*
* \param[in] nodes Hash table to copy
*
* \return New copy of nodes (or NULL if nodes is NULL)
*/
GHashTable *
pcmk__copy_node_table(GHashTable *nodes)
{
GHashTable *new_table = NULL;
GHashTableIter iter;
pcmk_node_t *node = NULL;
if (nodes == NULL) {
return NULL;
}
new_table = pcmk__strkey_table(NULL, free);
g_hash_table_iter_init(&iter, nodes);
while (g_hash_table_iter_next(&iter, NULL, (gpointer *) &node)) {
pcmk_node_t *new_node = pe__copy_node(node);
g_hash_table_insert(new_table, (gpointer) new_node->details->id,
new_node);
}
return new_table;
}
/*!
* \internal
* \brief Free a table of node tables
*
* \param[in,out] data Table to free
*
* \note This is a \c GDestroyNotify wrapper for \c g_hash_table_destroy().
*/
static void
destroy_node_tables(gpointer data)
{
g_hash_table_destroy((GHashTable *) data);
}
/*!
* \internal
* \brief Recursively copy the node tables of a resource
*
* Build a hash table containing copies of the allowed nodes tables of \p rsc
* and its entire tree of descendants. The key is the resource ID, and the value
* is a copy of the resource's node table.
*
* \param[in] rsc Resource whose node table to copy
* \param[in,out] copy Where to store the copied node tables
*
* \note \p *copy should be \c NULL for the top-level call.
* \note The caller is responsible for freeing \p copy using
* \c g_hash_table_destroy().
*/
void
pcmk__copy_node_tables(const pcmk_resource_t *rsc, GHashTable **copy)
{
CRM_ASSERT((rsc != NULL) && (copy != NULL));
if (*copy == NULL) {
*copy = pcmk__strkey_table(NULL, destroy_node_tables);
}
g_hash_table_insert(*copy, rsc->id,
pcmk__copy_node_table(rsc->allowed_nodes));
for (const GList *iter = rsc->children; iter != NULL; iter = iter->next) {
pcmk__copy_node_tables((const pcmk_resource_t *) iter->data, copy);
}
}
/*!
* \internal
* \brief Recursively restore the node tables of a resource from backup
*
* Given a hash table containing backup copies of the allowed nodes tables of
* \p rsc and its entire tree of descendants, replace the resources' current
* node tables with the backed-up copies.
*
* \param[in,out] rsc Resource whose node tables to restore
* \param[in] backup Table of backup node tables (created by
* \c pcmk__copy_node_tables())
*
* \note This function frees the resources' current node tables.
*/
void
pcmk__restore_node_tables(pcmk_resource_t *rsc, GHashTable *backup)
{
CRM_ASSERT((rsc != NULL) && (backup != NULL));
g_hash_table_destroy(rsc->allowed_nodes);
// Copy to avoid danger with multiple restores
rsc->allowed_nodes = g_hash_table_lookup(backup, rsc->id);
rsc->allowed_nodes = pcmk__copy_node_table(rsc->allowed_nodes);
for (GList *iter = rsc->children; iter != NULL; iter = iter->next) {
pcmk__restore_node_tables((pcmk_resource_t *) iter->data, backup);
}
}
/*!
* \internal
* \brief Copy a list of node objects
*
* \param[in] list List to copy
* \param[in] reset Set copies' scores to 0
*
* \return New list of shallow copies of nodes in original list
*/
GList *
pcmk__copy_node_list(const GList *list, bool reset)
{
GList *result = NULL;
for (const GList *iter = list; iter != NULL; iter = iter->next) {
pcmk_node_t *new_node = NULL;
pcmk_node_t *this_node = iter->data;
new_node = pe__copy_node(this_node);
if (reset) {
new_node->weight = 0;
}
result = g_list_prepend(result, new_node);
}
return result;
}
/*!
* \internal
* \brief Compare two nodes for assignment preference
*
* Given two nodes, check which one is more preferred by assignment criteria
* such as node score and utilization.
*
* \param[in] a First node to compare
* \param[in] b Second node to compare
- * \param[in] data Node that resource being assigned is active on, if any
+ * \param[in] data Node to prefer if all else equal
*
* \return -1 if \p a is preferred, +1 if \p b is preferred, or 0 if they are
* equally preferred
*/
static gint
compare_nodes(gconstpointer a, gconstpointer b, gpointer data)
{
const pcmk_node_t *node1 = (const pcmk_node_t *) a;
const pcmk_node_t *node2 = (const pcmk_node_t *) b;
- const pcmk_node_t *active = (const pcmk_node_t *) data;
+ const pcmk_node_t *preferred = (const pcmk_node_t *) data;
int node1_score = -INFINITY;
int node2_score = -INFINITY;
int result = 0;
if (a == NULL) {
return 1;
}
if (b == NULL) {
return -1;
}
// Compare node scores
if (pcmk__node_available(node1, false, false)) {
node1_score = node1->weight;
}
if (pcmk__node_available(node2, false, false)) {
node2_score = node2->weight;
}
if (node1_score > node2_score) {
- crm_trace("%s (%d) > %s (%d) : score",
- pe__node_name(node1), node1_score, pe__node_name(node2),
- node2_score);
+ crm_trace("%s before %s (score %d > %d)",
+ pe__node_name(node1), pe__node_name(node2),
+ node1_score, node2_score);
return -1;
}
if (node1_score < node2_score) {
- crm_trace("%s (%d) < %s (%d) : score",
- pe__node_name(node1), node1_score, pe__node_name(node2),
- node2_score);
+ crm_trace("%s after %s (score %d < %d)",
+ pe__node_name(node1), pe__node_name(node2),
+ node1_score, node2_score);
return 1;
}
- crm_trace("%s (%d) == %s (%d) : score",
- pe__node_name(node1), node1_score, pe__node_name(node2),
- node2_score);
-
// If appropriate, compare node utilization
if (pcmk__str_eq(node1->details->data_set->placement_strategy, "minimal",
pcmk__str_casei)) {
goto equal;
}
if (pcmk__str_eq(node1->details->data_set->placement_strategy, "balanced",
pcmk__str_casei)) {
result = pcmk__compare_node_capacities(node1, node2);
if (result < 0) {
- crm_trace("%s > %s : capacity (%d)",
- pe__node_name(node1), pe__node_name(node2), result);
+ crm_trace("%s before %s (greater capacity by %d attributes)",
+ pe__node_name(node1), pe__node_name(node2), result * -1);
return -1;
} else if (result > 0) {
- crm_trace("%s < %s : capacity (%d)",
+ crm_trace("%s after %s (lower capacity by %d attributes)",
pe__node_name(node1), pe__node_name(node2), result);
return 1;
}
}
// Compare number of resources already assigned to node
if (node1->details->num_resources < node2->details->num_resources) {
- crm_trace("%s (%d) > %s (%d) : resources",
- pe__node_name(node1), node1->details->num_resources,
- pe__node_name(node2), node2->details->num_resources);
+ crm_trace("%s before %s (%d resources < %d)",
+ pe__node_name(node1), pe__node_name(node2),
+ node1->details->num_resources, node2->details->num_resources);
return -1;
} else if (node1->details->num_resources > node2->details->num_resources) {
- crm_trace("%s (%d) < %s (%d) : resources",
- pe__node_name(node1), node1->details->num_resources,
- pe__node_name(node2), node2->details->num_resources);
+ crm_trace("%s after %s (%d resources > %d)",
+ pe__node_name(node1), pe__node_name(node2),
+ node1->details->num_resources, node2->details->num_resources);
return 1;
}
// Check whether one node is already running desired resource
- if (active != NULL) {
- if (pe__same_node(active, node1)) {
- crm_trace("%s (%d) > %s (%d) : active",
- pe__node_name(node1), node1->details->num_resources,
- pe__node_name(node2), node2->details->num_resources);
+ if (preferred != NULL) {
+ if (pe__same_node(preferred, node1)) {
+ crm_trace("%s before %s (preferred node)",
+ pe__node_name(node1), pe__node_name(node2));
return -1;
- } else if (pe__same_node(active, node2)) {
- crm_trace("%s (%d) < %s (%d) : active",
- pe__node_name(node1), node1->details->num_resources,
- pe__node_name(node2), node2->details->num_resources);
+ } else if (pe__same_node(preferred, node2)) {
+ crm_trace("%s after %s (not preferred node)",
+ pe__node_name(node1), pe__node_name(node2));
return 1;
}
}
// If all else is equal, prefer node with lowest-sorting name
equal:
- crm_trace("%s = %s", pe__node_name(node1), pe__node_name(node2));
- return strcmp(node1->details->uname, node2->details->uname);
+ result = strcmp(node1->details->uname, node2->details->uname);
+ if (result < 0) {
+ crm_trace("%s before %s (name)",
+ pe__node_name(node1), pe__node_name(node2));
+ return -1;
+ } else if (result > 0) {
+ crm_trace("%s after %s (name)",
+ pe__node_name(node1), pe__node_name(node2));
+ return 1;
+ }
+
+ crm_trace("%s == %s", pe__node_name(node1), pe__node_name(node2));
+ return 0;
}
/*!
* \internal
* \brief Sort a list of nodes by assignment preference
*
* \param[in,out] nodes Node list to sort
* \param[in] active_node Node where resource being assigned is active
*
* \return New head of sorted list
*/
GList *
pcmk__sort_nodes(GList *nodes, pcmk_node_t *active_node)
{
return g_list_sort_with_data(nodes, compare_nodes, active_node);
}
/*!
* \internal
* \brief Check whether any node is available to run resources
*
* \param[in] nodes Nodes to check
*
* \return true if any node in \p nodes is available to run resources,
* otherwise false
*/
bool
pcmk__any_node_available(GHashTable *nodes)
{
GHashTableIter iter;
const pcmk_node_t *node = NULL;
if (nodes == NULL) {
return false;
}
g_hash_table_iter_init(&iter, nodes);
while (g_hash_table_iter_next(&iter, NULL, (void **) &node)) {
if (pcmk__node_available(node, true, false)) {
return true;
}
}
return false;
}
/*!
* \internal
* \brief Apply node health values for all nodes in cluster
*
* \param[in,out] data_set Cluster working set
*/
void
pcmk__apply_node_health(pcmk_scheduler_t *data_set)
{
int base_health = 0;
enum pcmk__health_strategy strategy;
const char *strategy_str = pe_pref(data_set->config_hash,
PCMK__OPT_NODE_HEALTH_STRATEGY);
strategy = pcmk__parse_health_strategy(strategy_str);
if (strategy == pcmk__health_strategy_none) {
return;
}
crm_info("Applying node health strategy '%s'", strategy_str);
// The progressive strategy can use a base health score
if (strategy == pcmk__health_strategy_progressive) {
base_health = pe__health_score(PCMK__OPT_NODE_HEALTH_BASE, data_set);
}
for (GList *iter = data_set->nodes; iter != NULL; iter = iter->next) {
pcmk_node_t *node = (pcmk_node_t *) iter->data;
int health = pe__sum_node_health_scores(node, base_health);
// An overall health score of 0 has no effect
if (health == 0) {
continue;
}
crm_info("Overall system health of %s is %d",
pe__node_name(node), health);
// Use node health as a location score for each resource on the node
for (GList *r = data_set->resources; r != NULL; r = r->next) {
pcmk_resource_t *rsc = (pcmk_resource_t *) r->data;
bool constrain = true;
if (health < 0) {
/* Negative health scores do not apply to resources with
* allow-unhealthy-nodes=true.
*/
constrain = !crm_is_true(g_hash_table_lookup(rsc->meta,
PCMK__META_ALLOW_UNHEALTHY_NODES));
}
if (constrain) {
pcmk__new_location(strategy_str, rsc, health, NULL, node);
} else {
pe_rsc_trace(rsc, "%s is immune from health ban on %s",
rsc->id, pe__node_name(node));
}
}
}
}
/*!
* \internal
* \brief Check for a node in a resource's parent's allowed nodes
*
* \param[in] rsc Resource whose parent should be checked
* \param[in] node Node to check for
*
* \return Equivalent of \p node from \p rsc's parent's allowed nodes if any,
* otherwise NULL
*/
pcmk_node_t *
pcmk__top_allowed_node(const pcmk_resource_t *rsc, const pcmk_node_t *node)
{
GHashTable *allowed_nodes = NULL;
if ((rsc == NULL) || (node == NULL)) {
return NULL;
} else if (rsc->parent == NULL) {
allowed_nodes = rsc->allowed_nodes;
} else {
allowed_nodes = rsc->parent->allowed_nodes;
}
return g_hash_table_lookup(allowed_nodes, node->details->id);
}
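The node health logic above (``pcmk__apply_node_health()``) turns node health
attributes into location scores for every resource on the affected node. Node
health attributes are ordinary node attributes whose names start with
``#health``; a rough sketch of one set as a transient attribute (the IDs and
the attribute suffix are invented for illustration):

.. code-block:: xml

   <transient_attributes id="node1">
     <instance_attributes id="status-node1">
       <nvpair id="status-node1-health-smart" name="#health-smart" value="red"/>
     </instance_attributes>
   </transient_attributes>

With the ``progressive`` strategy, such values are mapped to scores via the
``node-health-red``/``yellow``/``green`` options described earlier.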
diff --git a/lib/pacemaker/pcmk_sched_utilization.c b/lib/pacemaker/pcmk_sched_utilization.c
index e65b2fb2c3..437dd665d8 100644
--- a/lib/pacemaker/pcmk_sched_utilization.c
+++ b/lib/pacemaker/pcmk_sched_utilization.c
@@ -1,466 +1,466 @@
/*
* Copyright 2014-2023 the Pacemaker project contributors
*
* The version control history for this file may have further details.
*
* This source code is licensed under the GNU General Public License version 2
* or later (GPLv2+) WITHOUT ANY WARRANTY.
*/
#include <crm_internal.h>
#include <crm/msg_xml.h>
#include <pacemaker-internal.h>
#include "libpacemaker_private.h"
/*!
* \internal
* \brief Get integer utilization from a string
*
* \param[in] s String representation of a node utilization value
*
* \return Integer equivalent of \p s
* \todo It would make sense to restrict utilization values to nonnegative
* integers, but the documentation just says "integers" and we didn't
* restrict them initially, so for backward compatibility, allow any
* integer.
*/
static int
utilization_value(const char *s)
{
int value = 0;
if ((s != NULL) && (pcmk__scan_min_int(s, &value, INT_MIN) == EINVAL)) {
pe_warn("Using 0 for utilization instead of invalid value '%s'", value);
value = 0;
}
return value;
}
/*
* Functions for comparing node capacities
*/
struct compare_data {
const pcmk_node_t *node1;
const pcmk_node_t *node2;
bool node2_only;
int result;
};
/*!
* \internal
* \brief Compare a single utilization attribute for two nodes
*
- * Compare one utilization attribute for two nodes, incrementing the result if
- * the first node has greater capacity, and decrementing it if the second node
+ * Compare one utilization attribute for two nodes, decrementing the result if
+ * the first node has greater capacity, and incrementing it if the second node
* has greater capacity.
*
* \param[in] key Utilization attribute name to compare
* \param[in] value Utilization attribute value to compare
* \param[in,out] user_data Comparison data (as struct compare_data*)
*/
static void
compare_utilization_value(gpointer key, gpointer value, gpointer user_data)
{
int node1_capacity = 0;
int node2_capacity = 0;
struct compare_data *data = user_data;
const char *node2_value = NULL;
if (data->node2_only) {
if (g_hash_table_lookup(data->node1->details->utilization, key)) {
return; // We've already compared this attribute
}
} else {
node1_capacity = utilization_value((const char *) value);
}
node2_value = g_hash_table_lookup(data->node2->details->utilization, key);
node2_capacity = utilization_value(node2_value);
if (node1_capacity > node2_capacity) {
data->result--;
} else if (node1_capacity < node2_capacity) {
data->result++;
}
}
/*!
* \internal
* \brief Compare utilization capacities of two nodes
*
* \param[in] node1 First node to compare
* \param[in] node2 Second node to compare
*
* \return Negative integer if node1 has more free capacity,
* 0 if the capacities are equal, or a positive integer
* if node2 has more free capacity
*/
int
pcmk__compare_node_capacities(const pcmk_node_t *node1,
const pcmk_node_t *node2)
{
struct compare_data data = {
.node1 = node1,
.node2 = node2,
.node2_only = false,
.result = 0,
};
// Compare utilization values that node1 and maybe node2 have
g_hash_table_foreach(node1->details->utilization, compare_utilization_value,
&data);
// Compare utilization values that only node2 has
data.node2_only = true;
g_hash_table_foreach(node2->details->utilization, compare_utilization_value,
&data);
return data.result;
}
/*
* Functions for updating node capacities
*/
struct calculate_data {
GHashTable *current_utilization;
bool plus;
};
/*!
* \internal
* \brief Update a single utilization attribute with a new value
*
* \param[in] key Name of utilization attribute to update
* \param[in] value Value to add or subtract
* \param[in,out] user_data Calculation data (as struct calculate_data *)
*/
static void
update_utilization_value(gpointer key, gpointer value, gpointer user_data)
{
int result = 0;
const char *current = NULL;
struct calculate_data *data = user_data;
current = g_hash_table_lookup(data->current_utilization, key);
if (data->plus) {
result = utilization_value(current) + utilization_value(value);
} else if (current) {
result = utilization_value(current) - utilization_value(value);
}
g_hash_table_replace(data->current_utilization,
strdup(key), pcmk__itoa(result));
}
/*!
* \internal
* \brief Subtract a resource's utilization from node capacity
*
* \param[in,out] current_utilization Current node utilization attributes
* \param[in] rsc Resource with utilization to subtract
*/
void
pcmk__consume_node_capacity(GHashTable *current_utilization,
const pcmk_resource_t *rsc)
{
struct calculate_data data = {
.current_utilization = current_utilization,
.plus = false,
};
g_hash_table_foreach(rsc->utilization, update_utilization_value, &data);
}
/*!
* \internal
* \brief Add a resource's utilization to node capacity
*
* \param[in,out] current_utilization Current node utilization attributes
* \param[in] rsc Resource with utilization to add
*/
void
pcmk__release_node_capacity(GHashTable *current_utilization,
const pcmk_resource_t *rsc)
{
struct calculate_data data = {
.current_utilization = current_utilization,
.plus = true,
};
g_hash_table_foreach(rsc->utilization, update_utilization_value, &data);
}
/*
* Functions for checking for sufficient node capacity
*/
struct capacity_data {
const pcmk_node_t *node;
const char *rsc_id;
bool is_enough;
};
/*!
* \internal
* \brief Check whether a single utilization attribute has sufficient capacity
*
* \param[in] key Name of utilization attribute to check
* \param[in] value Amount of utilization required
* \param[in,out] user_data Capacity data (as struct capacity_data *)
*/
static void
check_capacity(gpointer key, gpointer value, gpointer user_data)
{
int required = 0;
int remaining = 0;
const char *node_value_s = NULL;
struct capacity_data *data = user_data;
node_value_s = g_hash_table_lookup(data->node->details->utilization, key);
required = utilization_value(value);
remaining = utilization_value(node_value_s);
if (required > remaining) {
crm_debug("Remaining capacity for %s on %s (%d) is insufficient "
"for resource %s usage (%d)",
(const char *) key, pe__node_name(data->node), remaining,
data->rsc_id, required);
data->is_enough = false;
}
}
/*!
* \internal
* \brief Check whether a node has sufficient capacity for a resource
*
* \param[in] node Node to check
* \param[in] rsc_id ID of resource to check (for debug logs only)
* \param[in] utilization Required utilization amounts
*
* \return true if node has sufficient capacity for resource, otherwise false
*/
static bool
have_enough_capacity(const pcmk_node_t *node, const char *rsc_id,
GHashTable *utilization)
{
struct capacity_data data = {
.node = node,
.rsc_id = rsc_id,
.is_enough = true,
};
g_hash_table_foreach(utilization, check_capacity, &data);
return data.is_enough;
}
/*!
* \internal
* \brief Sum the utilization requirements of a list of resources
*
* \param[in] orig_rsc Resource being assigned (for logging purposes)
* \param[in] rscs Resources whose utilization should be summed
*
* \return Newly allocated hash table with sum of all utilization values
* \note It is the caller's responsibility to free the return value using
* g_hash_table_destroy().
*/
static GHashTable *
sum_resource_utilization(const pcmk_resource_t *orig_rsc, GList *rscs)
{
GHashTable *utilization = pcmk__strkey_table(free, free);
for (GList *iter = rscs; iter != NULL; iter = iter->next) {
pcmk_resource_t *rsc = (pcmk_resource_t *) iter->data;
rsc->cmds->add_utilization(rsc, orig_rsc, rscs, utilization);
}
return utilization;
}
/*!
* \internal
* \brief Ban resource from nodes with insufficient utilization capacity
*
* \param[in,out] rsc Resource to check
*
* \return Allowed node for \p rsc with most spare capacity, if there are no
* nodes with enough capacity for \p rsc and all its colocated resources
*/
const pcmk_node_t *
pcmk__ban_insufficient_capacity(pcmk_resource_t *rsc)
{
bool any_capable = false;
char *rscs_id = NULL;
pcmk_node_t *node = NULL;
const pcmk_node_t *most_capable_node = NULL;
GList *colocated_rscs = NULL;
GHashTable *unassigned_utilization = NULL;
GHashTableIter iter;
CRM_CHECK(rsc != NULL, return NULL);
// The default placement strategy ignores utilization
if (pcmk__str_eq(rsc->cluster->placement_strategy, "default",
pcmk__str_casei)) {
return NULL;
}
// Check whether any resources are colocated with this one
colocated_rscs = rsc->cmds->colocated_resources(rsc, NULL, NULL);
if (colocated_rscs == NULL) {
return NULL;
}
rscs_id = crm_strdup_printf("%s and its colocated resources", rsc->id);
// If rsc isn't in the list, add it so we include its utilization
if (g_list_find(colocated_rscs, rsc) == NULL) {
colocated_rscs = g_list_append(colocated_rscs, rsc);
}
// Sum utilization of colocated resources that haven't been assigned yet
unassigned_utilization = sum_resource_utilization(rsc, colocated_rscs);
// Check whether any node has enough capacity for all the resources
g_hash_table_iter_init(&iter, rsc->allowed_nodes);
while (g_hash_table_iter_next(&iter, NULL, (void **) &node)) {
if (!pcmk__node_available(node, true, false)) {
continue;
}
if (have_enough_capacity(node, rscs_id, unassigned_utilization)) {
any_capable = true;
}
// Keep track of node with most free capacity
if ((most_capable_node == NULL)
|| (pcmk__compare_node_capacities(node, most_capable_node) < 0)) {
most_capable_node = node;
}
}
if (any_capable) {
// If so, ban resource from any node with insufficient capacity
g_hash_table_iter_init(&iter, rsc->allowed_nodes);
while (g_hash_table_iter_next(&iter, NULL, (void **) &node)) {
if (pcmk__node_available(node, true, false)
&& !have_enough_capacity(node, rscs_id,
unassigned_utilization)) {
pe_rsc_debug(rsc, "%s does not have enough capacity for %s",
pe__node_name(node), rscs_id);
resource_location(rsc, node, -INFINITY, "__limit_utilization__",
rsc->cluster);
}
}
most_capable_node = NULL;
} else {
// Otherwise, ban from nodes with insufficient capacity for rsc alone
g_hash_table_iter_init(&iter, rsc->allowed_nodes);
while (g_hash_table_iter_next(&iter, NULL, (void **) &node)) {
if (pcmk__node_available(node, true, false)
&& !have_enough_capacity(node, rsc->id, rsc->utilization)) {
pe_rsc_debug(rsc, "%s does not have enough capacity for %s",
pe__node_name(node), rsc->id);
resource_location(rsc, node, -INFINITY, "__limit_utilization__",
rsc->cluster);
}
}
}
g_hash_table_destroy(unassigned_utilization);
g_list_free(colocated_rscs);
free(rscs_id);
pe__show_node_scores(true, rsc, "Post-utilization", rsc->allowed_nodes,
rsc->cluster);
return most_capable_node;
}
/*!
* \internal
* \brief Create a new load_stopped pseudo-op for a node
*
* \param[in,out] node Node to create op for
*
* \return Newly created load_stopped op
*/
static pcmk_action_t *
new_load_stopped_op(pcmk_node_t *node)
{
char *load_stopped_task = crm_strdup_printf(PCMK_ACTION_LOAD_STOPPED "_%s",
node->details->uname);
pcmk_action_t *load_stopped = get_pseudo_op(load_stopped_task,
node->details->data_set);
if (load_stopped->node == NULL) {
load_stopped->node = pe__copy_node(node);
pe__clear_action_flags(load_stopped, pcmk_action_optional);
}
free(load_stopped_task);
return load_stopped;
}
/*!
* \internal
* \brief Create utilization-related internal constraints for a resource
*
* \param[in,out] rsc Resource to create constraints for
* \param[in] allowed_nodes List of allowed next nodes for \p rsc
*/
void
pcmk__create_utilization_constraints(pcmk_resource_t *rsc,
const GList *allowed_nodes)
{
const GList *iter = NULL;
pcmk_action_t *load_stopped = NULL;
pe_rsc_trace(rsc, "Creating utilization constraints for %s - strategy: %s",
rsc->id, rsc->cluster->placement_strategy);
// "stop rsc then load_stopped" constraints for current nodes
for (iter = rsc->running_on; iter != NULL; iter = iter->next) {
load_stopped = new_load_stopped_op(iter->data);
pcmk__new_ordering(rsc, stop_key(rsc), NULL, NULL, NULL, load_stopped,
pcmk__ar_if_on_same_node_or_target, rsc->cluster);
}
// "load_stopped then start/migrate_to rsc" constraints for allowed nodes
for (iter = allowed_nodes; iter; iter = iter->next) {
load_stopped = new_load_stopped_op(iter->data);
pcmk__new_ordering(NULL, NULL, load_stopped, rsc, start_key(rsc), NULL,
pcmk__ar_if_on_same_node_or_target, rsc->cluster);
pcmk__new_ordering(NULL, NULL, load_stopped,
rsc,
pcmk__op_key(rsc->id, PCMK_ACTION_MIGRATE_TO, 0),
NULL,
pcmk__ar_if_on_same_node_or_target, rsc->cluster);
}
}
/*!
* \internal
* \brief Output node capacities if enabled
*
* \param[in] desc Prefix for output
* \param[in,out] data_set Cluster working set
*/
void
pcmk__show_node_capacities(const char *desc, pcmk_scheduler_t *data_set)
{
if (!pcmk_is_set(data_set->flags, pcmk_sched_show_utilization)) {
return;
}
for (const GList *iter = data_set->nodes; iter != NULL; iter = iter->next) {
const pcmk_node_t *node = (const pcmk_node_t *) iter->data;
pcmk__output_t *out = data_set->priv;
out->message(out, "node-capacity", node, desc);
}
}
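For context, the capacities and requirements compared in this file come from
``utilization`` sections configured on nodes and resources in the CIB, and
they are only consulted when ``placement-strategy`` is not ``default``. A
rough sketch with made-up attribute names and values:

.. code-block:: xml

   <node id="1" uname="node1">
     <utilization id="node1-utilization">
       <nvpair id="node1-utilization-cpu" name="cpu" value="8"/>
       <nvpair id="node1-utilization-memory" name="memory" value="16384"/>
     </utilization>
   </node>

   <primitive id="big-rsc" class="ocf" provider="pacemaker" type="Dummy">
     <utilization id="big-rsc-utilization">
       <nvpair id="big-rsc-utilization-cpu" name="cpu" value="2"/>
       <nvpair id="big-rsc-utilization-memory" name="memory" value="4096"/>
     </utilization>
   </primitive>

``pcmk__compare_node_capacities()`` then compares nodes attribute by
attribute, and ``pcmk__ban_insufficient_capacity()`` bans nodes that cannot
hold a resource together with its not-yet-assigned colocated resources.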