diff --git a/doc/Pacemaker_Administration/en-US/Book_Info.xml b/doc/Pacemaker_Administration/en-US/Book_Info.xml
index 8622da75c6..fd1bc36d72 100644
--- a/doc/Pacemaker_Administration/en-US/Book_Info.xml
+++ b/doc/Pacemaker_Administration/en-US/Book_Info.xml
@@ -1,36 +1,36 @@
%BOOK_ENTITIES;
]>
Pacemaker Administration
Managing Pacemaker Clusters
1
- 0
+ 1
This document has instructions and tips for system
administrators who need to manage high-availability
clusters using Pacemaker.
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Agents.txt b/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
index c5afcb6b4a..0d8ff1f1fb 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Agents.txt
@@ -1,338 +1,350 @@
:compat-mode: legacy
= Resource Agents =
+== Resource Agent Actions ==
+
+If one resource depends on another resource via constraints, the cluster will
+interpret an expected result as sufficient to continue with dependent actions.
+This may cause timing issues if the resource agent start returns before the
+service is fully ready to perform its function (not merely launched), or if the
+resource agent stop returns before the service has fully released all its
+claims on system resources. At a minimum, the start or stop should not return
+before a status command would return the expected (started or stopped) result.
+
== OCF Resource Agents ==
=== Location of Custom Scripts ===
indexterm:[OCF Resource Agents]
OCF Resource Agents are found in +/usr/lib/ocf/resource.d/pass:[provider]+.
When creating your own agents, you are encouraged to create a new
directory under +/usr/lib/ocf/resource.d/+ so that they are not
confused with (or overwritten by) the agents shipped by existing providers.
So, for example, if you choose the provider name of bigCorp and want
a new resource named bigApp, you would create a resource agent called
+/usr/lib/ocf/resource.d/bigCorp/bigApp+ and define a resource:
[source,XML]
----
----
=== Actions ===
All OCF resource agents are required to implement the following actions.
.Required Actions for OCF Agents
[width="95%",cols="3m,3,7",options="header",align="center"]
|=========================================================
|Action
|Description
|Instructions
|start
|Start the resource
|Return 0 on success and an appropriate error code otherwise. Must not
report success until the resource is fully active.
indexterm:[start,OCF Action]
indexterm:[OCF,Action,start]
|stop
|Stop the resource
|Return 0 on success and an appropriate error code otherwise. Must not
report success until the resource is fully stopped.
indexterm:[stop,OCF Action]
indexterm:[OCF,Action,stop]
|monitor
|Check the resource's state
|Exit 0 if the resource is running, 7 if it is stopped, and anything
else if it is failed.
indexterm:[monitor,OCF Action]
indexterm:[OCF,Action,monitor]
NOTE: The monitor script should test the state of the resource on the local machine only.
|meta-data
|Describe the resource
|Provide information about this resource as an XML snippet. Exit with 0.
indexterm:[meta-data,OCF Action]
indexterm:[OCF,Action,meta-data]
NOTE: This is _not_ performed as root.
|validate-all
|Verify the supplied parameters
|Return 0 if parameters are valid, 2 if not valid, and 6 if resource is not configured.
indexterm:[validate-all,OCF Action]
indexterm:[OCF,Action,validate-all]
|=========================================================
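As a minimal sketch of how the required actions and their exit codes fit together, the following shell fragment dispatches on the requested action. Everything named here (`bigapp_*`, `STATE_FILE`, the one-line meta-data stub) is hypothetical and stands in for a real service; it is not taken from any shipped agent.

```shell
#!/bin/sh
# Hypothetical skeleton: a state file stands in for a real service.
STATE_FILE="${STATE_FILE:-/tmp/bigApp.state}"

bigapp_monitor() {
    # Exit 0 if running, 7 if stopped
    [ -f "$STATE_FILE" ] && return 0
    return 7
}

bigapp_start() {
    touch "$STATE_FILE" || return 1
    # Report success only once monitor agrees the resource is active
    bigapp_monitor
}

bigapp_stop() {
    rm -f "$STATE_FILE" || return 1
    # Report success only once monitor agrees the resource is stopped
    mon_rc=0
    bigapp_monitor || mon_rc=$?
    [ "$mon_rc" -eq 7 ]
}

bigapp_dispatch() {
    case "$1" in
        start)        bigapp_start ;;
        stop)         bigapp_stop ;;
        monitor)      bigapp_monitor ;;
        # Placeholder only; real meta-data is a much fuller XML document
        meta-data)    echo '<resource-agent name="bigApp"/>'; return 0 ;;
        # Stub: a real agent would return 2 or 6 for bad parameters
        validate-all) return 0 ;;
        # 3 is OCF_ERR_UNIMPLEMENTED
        *)            return 3 ;;
    esac
}
```

A real agent script would typically end by calling the dispatcher with its first argument and exiting with the result, for example `bigapp_dispatch "$1"; exit $?`.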
Additional requirements (not part of the OCF specification) are placed on
agents that will be used for advanced concepts such as clone resources.
.Optional Actions for OCF Resource Agents
[width="95%",cols="2m,6,3",options="header",align="center"]
|=========================================================
|Action
|Description
|Instructions
|promote
|Promote the local instance of a promotable clone resource to the master (primary) state.
|Return 0 on success
indexterm:[promote,OCF Action]
indexterm:[OCF,Action,promote]
|demote
|Demote the local instance of a promotable clone resource to the slave (secondary) state.
|Return 0 on success
indexterm:[demote,OCF Action]
indexterm:[OCF,Action,demote]
|notify
|Used by the cluster to send the agent pre- and post-notification
events telling the resource what has happened and will happen.
|Must not fail. Must exit with 0
indexterm:[notify,OCF Action]
indexterm:[OCF,Action,notify]
|=========================================================
One action specified in the OCF specs, +recover+, is not currently used by the
cluster. It is intended to be a variant of the +start+ action that tries to
recover a resource locally.
[IMPORTANT]
====
If you create a new OCF resource agent, use indexterm:[ocf-tester]`ocf-tester`
to verify that the agent complies with the OCF standard properly.
====
=== How are OCF Return Codes Interpreted? ===
The first thing the cluster does is to check the return code against
the expected result. If the result does not match the expected value,
then the operation is considered to have failed, and recovery action is
initiated.
There are three types of failure recovery:
.Types of recovery performed by the cluster
[width="95%",cols="1m,4,4",options="header",align="center"]
|=========================================================
|Type
|Description
|Action Taken by the Cluster
|soft
|A transient error occurred
|Restart the resource or move it to a new location
indexterm:[soft,OCF error]
indexterm:[OCF,error,soft]
|hard
|A non-transient error that may be specific to the current node occurred
|Move the resource elsewhere and prevent it from being retried on the current node
indexterm:[hard,OCF error]
indexterm:[OCF,error,hard]
|fatal
|A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified)
|Stop the resource and prevent it from being started on any cluster node
indexterm:[fatal,OCF error]
indexterm:[OCF,error,fatal]
|=========================================================
[[s-ocf-return-codes]]
=== OCF Return Codes ===
The following table outlines the different OCF return codes and the type of
recovery the cluster will initiate when a failure code is received.
Although counterintuitive, even actions that return 0
(also known as +OCF_SUCCESS+) can be considered to have failed, if 0 was not
the expected return value.
.OCF Return Codes and their Recovery Types
[width="95%",cols="1m,<4m,<6,1m",options="header",align="center"]
|=========================================================
|RC
|OCF Alias
|Description
|RT
|0
|OCF_SUCCESS
|Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.
indexterm:[Return Code,OCF_SUCCESS]
indexterm:[Return Code,0,OCF_SUCCESS]
|soft
|1
|OCF_ERR_GENERIC
|Generic "there was a problem" error code.
indexterm:[Return Code,OCF_ERR_GENERIC]
indexterm:[Return Code,1,OCF_ERR_GENERIC]
|soft
|2
|OCF_ERR_ARGS
|The resource's configuration is not valid on this machine. E.g. it refers to a location not found on the node.
indexterm:[Return Code,OCF_ERR_ARGS]
indexterm:[Return Code,2,OCF_ERR_ARGS]
|hard
|3
|OCF_ERR_UNIMPLEMENTED
|The requested action is not implemented.
indexterm:[Return Code,OCF_ERR_UNIMPLEMENTED]
indexterm:[Return Code,3,OCF_ERR_UNIMPLEMENTED]
|hard
|4
|OCF_ERR_PERM
|The resource agent does not have sufficient privileges to complete the task.
indexterm:[Return Code,OCF_ERR_PERM]
indexterm:[Return Code,4,OCF_ERR_PERM]
|hard
|5
|OCF_ERR_INSTALLED
|The tools required by the resource are not installed on this machine.
indexterm:[Return Code,OCF_ERR_INSTALLED]
indexterm:[Return Code,5,OCF_ERR_INSTALLED]
|hard
|6
|OCF_ERR_CONFIGURED
|The resource's configuration is invalid. E.g. required parameters are missing.
indexterm:[Return Code,OCF_ERR_CONFIGURED]
indexterm:[Return Code,6,OCF_ERR_CONFIGURED]
|fatal
|7
|OCF_NOT_RUNNING
|The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action.
indexterm:[Return Code,OCF_NOT_RUNNING]
indexterm:[Return Code,7,OCF_NOT_RUNNING]
|N/A
|8
|OCF_RUNNING_MASTER
|The resource is running in master mode.
indexterm:[Return Code,OCF_RUNNING_MASTER]
indexterm:[Return Code,8,OCF_RUNNING_MASTER]
|soft
|9
|OCF_FAILED_MASTER
|The resource is in master mode but has failed. The resource will be demoted,
stopped and then started (and possibly promoted) again.
indexterm:[Return Code,OCF_FAILED_MASTER]
indexterm:[Return Code,9,OCF_FAILED_MASTER]
|soft
|other
|N/A
|Custom error code.
indexterm:[Return Code,other]
|soft
|=========================================================
Exceptions to the recovery handling described above:
* Probes (non-recurring monitor actions) that find a resource active
(or in master mode) will not result in recovery action unless it is
also found active elsewhere.
* The recovery action taken when a resource is found active more than
once is determined by the resource's +multiple-active+ property.
* Recurring actions that return +OCF_ERR_UNIMPLEMENTED+
do not cause any type of recovery.
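The recovery-type column of the table above can be restated as a small lookup. This is only an illustrative sketch: the function name `ocf_recovery_type` is made up here, the mapping applies only when the return code differs from the expected one, and it does not model the exceptions just listed.

```shell
#!/bin/sh
# Hypothetical helper: print the recovery type for an unexpected
# OCF return code, following the table above.
ocf_recovery_type() {
    case "$1" in
        0|1|8|9) echo soft ;;
        2|3|4|5) echo hard ;;
        6)       echo fatal ;;
        7)       echo N/A ;;
        # Any other (custom) error code is treated as soft
        *)       echo soft ;;
    esac
}
```

For example, `ocf_recovery_type 6` prints `fatal`, matching the +OCF_ERR_CONFIGURED+ row.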
-== Init Script LSB Compliance ==
+== LSB Resource Agents (Init Scripts) ==
+
+=== LSB Compliance ===
The relevant part of the
http://refspecs.linuxfoundation.org/lsb.shtml[LSB specifications]
includes a description of all the return codes listed here.
Assuming `some_service` is configured correctly and currently
inactive, the following sequence will help you determine if it is
LSB-compatible:
. Start (stopped):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
.. Did the service start?
- .. Did the command print *result: 0* (in addition to its usual output)?
+ .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (running):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
.. Did the script accept the command?
.. Did the script indicate the service was running?
- .. Did the command print *result: 0* (in addition to its usual output)?
+ .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Start (running):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
.. Is the service still running?
- .. Did the command print *result: 0* (in addition to its usual output)?
+ .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Stop (running):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
.. Was the service stopped?
- .. Did the command print *result: 0* (in addition to its usual output)?
+ .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (stopped):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
.. Did the script accept the command?
.. Did the script indicate the service was not running?
- .. Did the command print *result: 3* (in addition to its usual output)?
+ .. Did the echo command print *result: 3* (in addition to the init script's usual output)?
+
. Stop (stopped):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
.. Is the service still stopped?
- .. Did the command print *result: 0* (in addition to its usual output)?
+ .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (failed):
+
.. This step is not readily testable and relies on manual inspection of the script.
+
The script can use one of the error codes (other than 3) listed in the
LSB spec to indicate that it is active but failed. This tells the
cluster that before moving the resource to another node, it needs to
stop it on the existing one first.
If the answer to any of the above questions is no, then the script is
not LSB-compliant. Your options are then to either fix the script or
write an OCF agent based on the existing script.
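The exit-code part of the sequence above could be automated roughly as follows. This is a sketch under stated assumptions: the function name `lsb_check` is hypothetical, the service must be inactive when the check begins, and only the return codes are compared. Whether the service actually started or stopped, and the "failed" status case, still need manual inspection.

```shell
#!/bin/sh
# Hypothetical helper: run the six testable steps against an init
# script and compare each exit code with the expected value.
lsb_check() {
    lsb_script="$1"
    lsb_failures=0
    # action:expected-exit-code pairs, in the order given above
    for lsb_step in start:0 status:0 start:0 stop:0 status:3 stop:0; do
        lsb_action="${lsb_step%%:*}"
        lsb_expected="${lsb_step##*:}"
        lsb_rc=0
        "$lsb_script" "$lsb_action" >/dev/null 2>&1 || lsb_rc=$?
        if [ "$lsb_rc" -ne "$lsb_expected" ]; then
            echo "FAIL: $lsb_action returned $lsb_rc, expected $lsb_expected"
            lsb_failures=$((lsb_failures + 1))
        fi
    done
    # Succeed only if every step returned the expected code
    [ "$lsb_failures" -eq 0 ]
}
```

It would be invoked as `lsb_check /etc/init.d/some_service`, run as root like the manual commands above.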
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Troubleshooting.txt b/doc/Pacemaker_Administration/en-US/Ch-Troubleshooting.txt
new file mode 100644
index 0000000000..f01d2f04cf
--- /dev/null
+++ b/doc/Pacemaker_Administration/en-US/Ch-Troubleshooting.txt
@@ -0,0 +1,64 @@
+:compat-mode: legacy
+= Troubleshooting Cluster Problems =
+
+== Logging ==
+
+Pacemaker by default logs messages of notice severity and higher to the system
+log, and messages of info severity and higher to the detail log, which by
+default is /var/log/pacemaker/pacemaker.log.
+
+Logging options can be controlled via environment variables at Pacemaker
+start-up. Where these are set varies by operating system (often
++/etc/sysconfig/pacemaker+ or +/etc/default/pacemaker+).
+
+Because cluster problems are often highly complex, involving multiple machines,
+cluster daemons, and managed services, Pacemaker logs rather verbosely to
+provide as much context as possible. It is an ongoing priority to make these
+logs more user-friendly, but by necessity there is a lot of obscure, low-level
+information that can make them difficult to follow.
+
+The default log rotation configuration shipped with Pacemaker (typically
+installed in /etc/logrotate.d/pacemaker) rotates the log when it reaches 100MB
+in size, or weekly, whichever comes first.
+
+If you configure debug or (Heaven forbid) trace-level logging, the logs can
+grow enormous quite quickly. Because rotated logs are by default named with the
+year, month, and day only, this can cause name collisions if your logs exceed
+100MB in a single day. You can add +dateformat -%Y%m%d-%H+ to the rotation
+configuration to avoid this.
+
+== Transitions ==
+
+A key concept in understanding how a Pacemaker cluster functions is a
+'transition'. A transition is a set of actions that need to be taken to bring
+the cluster from its current state to the desired state (as expressed by the
+configuration).
+
+Whenever a relevant event happens (a node joining or leaving the cluster,
+a resource failing, etc.), the controller will ask the scheduler to recalculate
+the status of the cluster, which generates a new transition. The controller
+then performs the actions in the transition in the proper order.
+
+Each transition can be identified in the logs by a line like:
+
+----
+Nov 30 20:28:16 rhel7-1 pacemaker-schedulerd[36417] (process_pe_message) notice: Calculated transition 19, saving inputs in /var/lib/pacemaker/pengine/pe-input-1463.bz2
+----
+
+The file listed as the "inputs" is a snapshot of the cluster configuration and
+state at that moment (the CIB). This file can help determine why particular
+actions were scheduled. The `crm_simulate` command, described in
+<>, can be used to replay the file.
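To find the transitions in a log, a filter along these lines may help. It is a sketch that assumes the message wording shown in the example line above, which can differ between Pacemaker versions; the function name `transition_inputs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print
# "<transition number> <input file>" for each calculated transition.
transition_inputs() {
    sed -n 's/.*Calculated transition \([0-9][0-9]*\), saving inputs in \(.*\)$/\1 \2/p'
}
```

For example, `transition_inputs < /var/log/pacemaker/pacemaker.log` lists every transition in the detail log along with its input file. A listed file can then be replayed with something like `crm_simulate --simulate --xml-file <file>`; check `crm_simulate --help` on your version for the exact options.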
+
+== Further Information About Troubleshooting ==
+
+Andrew Beekhof wrote a series of articles about troubleshooting in his blog,
+ http://blog.clusterlabs.org/[The Cluster Guy]:
+
+* http://blog.clusterlabs.org/blog/2013/debugging-pacemaker[Debugging Pacemaker]
+* http://blog.clusterlabs.org/blog/2013/debugging-pengine[Debugging the Policy Engine]
+* http://blog.clusterlabs.org/blog/2013/pacemaker-logging[Pacemaker Logging]
+
+The articles were written for an earlier version of Pacemaker, so many of the
+specific names and log messages to look for have changed, but the concepts are
+still valid.
diff --git a/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt b/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
index 166a98c4f7..a7e60e3a90 100644
--- a/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
+++ b/doc/Pacemaker_Administration/en-US/Ch-Upgrading.txt
@@ -1,455 +1,456 @@
:compat-mode: legacy
= Upgrading a Pacemaker Cluster =
== Pacemaker Versioning ==
Pacemaker has an overall release version, plus separate version numbers for
certain internal components.
* *Pacemaker release version:* This version consists of three numbers
(_x.y.z_).
+
The major version number (the _x_ in _x.y.z_) increases when at least some
rolling upgrades are not possible from the previous major version. For example,
a rolling upgrade from 1.0.8 to 1.1.15 should always be supported, but a
rolling upgrade from 1.0.8 to 2.0.0 may not be possible.
+
The minor version (the _y_ in _x.y.z_) increases when there are significant
changes in cluster default behavior, tool behavior, and/or the API interface
(for software that utilizes Pacemaker libraries). The main benefit is to alert
you to pay closer attention to the release notes, to see if you might be
affected.
+
The release counter (the _z_ in _x.y.z_) is increased with all public releases
of Pacemaker, which typically include both bug fixes and new features.
* *CRM feature set:* This version number applies to the communication between
full cluster nodes, and is used to avoid problems in mixed-version clusters.
+
The major version number increases when nodes with different versions would not
work (rolling upgrades are not allowed). The minor version number increases
when mixed-version clusters are allowed only during rolling upgrades. The
minor-minor version number is ignored, but allows resource agents to detect
cluster support for various features. footnote:[
Before CRM feature set 3.1.0 (Pacemaker 2.0.0), the minor-minor
version number was treated the same as the minor version.
]
+
Pacemaker ensures that the longest-running node is the cluster's DC. This
ensures new features are not enabled until all nodes are upgraded to support
them.
* *LRMD protocol version:* This version applies to communication between a
Pacemaker Remote node and the cluster. It increases when an older cluster
node would have problems hosting the connection to a newer Pacemaker Remote
node. To avoid these problems, Pacemaker Remote nodes will accept connections
only from cluster nodes with the same or newer LRMD protocol version.
+
Unlike with CRM feature set differences between full cluster nodes,
mixed LRMD protocol versions between Pacemaker Remote nodes and full cluster
nodes are fine, as long as the Pacemaker Remote nodes have the older version.
This can be useful, for example, to host a legacy application in an
older operating system version used as a Pacemaker Remote node.
* *XML schema version:* Pacemaker’s configuration syntax — what's allowed in
the Configuration Information Base (CIB) — has its own version. This allows
the configuration syntax to evolve over time while still allowing clusters
with older configurations to work without change.
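The CRM feature set rules described above can be summarized in a short comparison. This is an illustrative sketch only: the function name and the output labels are made up here, and it does not model the footnoted behavior before CRM feature set 3.1.0, when the minor-minor number was treated like the minor number.

```shell
#!/bin/sh
# Hypothetical helper: classify two CRM feature sets (x.y.z) by how
# nodes running them may be mixed, per the rules described above.
compare_feature_sets() {
    a_major="${1%%.*}"; a_rest="${1#*.}"; a_minor="${a_rest%%.*}"
    b_major="${2%%.*}"; b_rest="${2#*.}"; b_minor="${b_rest%%.*}"
    if [ "$a_major" -ne "$b_major" ]; then
        # Different major: nodes would not work together
        echo incompatible
    elif [ "$a_minor" -ne "$b_minor" ]; then
        # Different minor: mix only during a rolling upgrade
        echo rolling-upgrade-only
    else
        # Only the minor-minor number differs (or nothing does)
        echo compatible
    fi
}
```

With sample values, `compare_feature_sets 3.1.0 3.2.0` prints `rolling-upgrade-only`, while `compare_feature_sets 3.1.0 3.1.2` prints `compatible`.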
== Upgrading Cluster Software ==
There are three approaches to upgrading a cluster, each with advantages and
disadvantages.
.Upgrade Methods
[width="95%",cols="s,6*",options="header",align="center"]
|=========================================================
|Method
|Available between all versions
|Can be used with Pacemaker Remote nodes
|Service outage during upgrade
|Service recovery during upgrade
|Exercises failover logic
|Allows change of messaging layer
indexterm:[Cluster,switching between stacks]
indexterm:[Changing cluster stack]
footnote:[Currently, Corosync version 2 and greater is the only supported
cluster stack, but other stacks have been supported by past versions, and may
be supported by future versions.]
|Complete cluster shutdown
indexterm:[upgrade,shutdown]
indexterm:[shutdown upgrade]
|yes
|yes
|always
|N/A
|no
|yes
|Rolling (node by node)
indexterm:[upgrade,rolling]
indexterm:[rolling upgrade]
|no
|yes
|always
footnote:[Any active resources will be moved off the node being upgraded,
so there will be at least a brief outage unless all resources can be
migrated "live".]
|yes
|yes
|no
|Detach and reattach
indexterm:[upgrade,reattach]
indexterm:[reattach upgrade]
|yes
|no
|only due to failure
|no
|no
|yes
|=========================================================
=== Complete Cluster Shutdown ===
In this scenario, one shuts down all cluster nodes and resources,
then upgrades all the nodes before restarting the cluster.
. On each node:
.. Shut down the cluster software (pacemaker and the messaging layer).
.. Upgrade the Pacemaker software. This may also include upgrading the
messaging layer and/or the underlying operating system.
.. Check the configuration with the `crm_verify` tool.
. On each node:
.. Start the cluster software.
Currently, only Corosync version 2 and greater is supported as the cluster
layer, but if another stack is supported in the future, the stack does not
need to be the same one used before the upgrade.
One variation of this approach is to build a new cluster on new hosts.
This allows the new version to be tested beforehand, and minimizes downtime by
having the new nodes ready to be placed in production as soon as the old nodes
are shut down.
=== Rolling (node by node) ===
In this scenario, each node is removed from the cluster, upgraded, and then
brought back online, until all nodes are running the newest version.
Special considerations when planning a rolling upgrade:
* If you plan to upgrade other cluster software -- such as the messaging layer --
at the same time, consult that software's documentation for its compatibility
with a rolling upgrade.
* If the major version number is changing in the Pacemaker version you are
upgrading to, a rolling upgrade may not be possible. Read the new version's
release notes (as well as the information here) for what limitations may exist.
* If the CRM feature set is changing in the Pacemaker version you are upgrading
to, you should run a mixed-version cluster only during a small rolling
upgrade window. If one of the older nodes drops out of the cluster for any
reason, it will not be able to rejoin until it is upgraded.
* If the LRMD protocol version is changing, all cluster nodes should be
upgraded before upgrading any Pacemaker Remote nodes.
See the ClusterLabs wiki's
http://clusterlabs.org/wiki/ReleaseCalendar[Release Calendar] to figure out
whether the CRM feature set and/or LRMD protocol version changed between the
Pacemaker release versions in your rolling upgrade.
To perform a rolling upgrade, on each node in turn:
. Put the node into standby mode, and wait for any active resources
to be moved cleanly to another node. (This step is optional, but
allows you to deal with any resource issues before the upgrade.)
. Shut down the cluster software (pacemaker and the messaging layer) on the node.
. Upgrade the Pacemaker software. This may also include upgrading the
messaging layer and/or the underlying operating system.
. If this is the first node to be upgraded, check the configuration
with the `crm_verify` tool.
. Start the messaging layer.
This must be the same messaging layer (currently only Corosync version 2 and
greater is supported) that the rest of the cluster is using.
[NOTE]
====
Even if a rolling upgrade from the current version of the cluster to the newest
version is not directly possible, it may be possible to perform a rolling
upgrade in multiple steps, by upgrading to an intermediate version first.
.Version Compatibility Table
[width="95%",cols="2*",options="header",align="center"]
|=========================================================
|Version being Installed
|Oldest Compatible Version
|Pacemaker 2.y.z
|Pacemaker 1.1.11
footnote:[Rolling upgrades from Pacemaker 1.1.z to 2.y.z are possible only if
the cluster uses corosync version 2 or greater as its messaging layer, and the
Cluster Information Base (CIB) uses schema 1.0 or higher in its validate-with
property.]
|Pacemaker 1.y.z
|Pacemaker 1.0.0
|Pacemaker 0.7.z
|Pacemaker 0.6.z
|=========================================================
====
=== Detach and Reattach ===
The reattach method is a variant of a complete cluster shutdown, where the
resources are left active and get re-detected when the cluster is restarted.
This method may not be used if the cluster contains any Pacemaker Remote nodes.
. Tell the cluster to stop managing services. This is required to allow the
services to remain active after the cluster shuts down.
+
----
# crm_attribute --name maintenance-mode --update true
----
. On each node, shut down the cluster software (pacemaker and the messaging
layer), and upgrade the Pacemaker software. This may also include upgrading
the messaging layer. While the underlying operating system may be upgraded
at the same time, that will be more likely to cause outages in the detached
services (certainly, if a reboot is required).
. Check the configuration with the `crm_verify` tool.
. On each node, start the cluster software.
Currently, only Corosync version 2 and greater is supported as the cluster
layer, but if another stack is supported in the future, the stack does not
need to be the same one used before the upgrade.
. Verify that the cluster re-detected all resources correctly.
. Allow the cluster to resume managing resources again:
+
----
# crm_attribute --name maintenance-mode --delete
----
== Upgrading the Configuration ==
indexterm:[upgrade,Configuration]
indexterm:[Configuration,upgrading]
The CIB schema version can change from one Pacemaker version to another.
After cluster software is upgraded, the cluster will continue to use
the older schema version that it was previously using. This can be useful, for
example, when administrators have written tools that modify the configuration,
and are based on the older syntax.
footnote:[As of Pacemaker 2.0.0, only schema versions pacemaker-1.0 and higher
are supported (excluding pacemaker-1.1, which was an experimental schema
now known as pacemaker-next).]
However, when using an older syntax, new features may be unavailable, and there
is a performance impact, since the cluster must do a non-persistent
configuration upgrade before each transition. So while using the old syntax is
possible, it is not advisable to continue using it indefinitely.
Even if you wish to continue using the old syntax, it is a good idea to
follow the upgrade procedure outlined below, except for the last step, to ensure
that the new software has no problems with your existing configuration (since it
will perform much the same task internally).
If you are brave, it is sufficient simply to run `cibadmin --upgrade`.
A more cautious approach would proceed like this:
. Create a shadow copy of the configuration. The later commands will automatically
operate on this copy, rather than the live configuration.
+
-----
# crm_shadow --create shadow
-----
. Verify the configuration is valid with the new software (which may be
stricter about syntax mistakes, or may have dropped support for deprecated
features):
indexterm:[Configuration,verify]
indexterm:[verify,Configuration]
+
-----
# crm_verify --live-check
-----
. Fix any errors or warnings.
. Perform the upgrade:
+
-----
# cibadmin --upgrade
-----
. If this step fails, there are three main possibilities:
.. The configuration was not valid to start with (did you do steps 2 and 3?).
.. The transformation failed - http://bugs.clusterlabs.org/[report a bug] or
mailto:users@clusterlabs.org?subject=Transformation%20failed%20during%20upgrade[email the project].
.. The transformation was successful but produced an invalid result.
+
If the result of the transformation is invalid, you may see a number of errors
from the validation library. If these are not helpful, visit the
http://clusterlabs.org/wiki/Validation_FAQ[Validation FAQ wiki page] and/or try
the manual upgrade procedure described below.
+
. Check the changes:
+
-----
# crm_shadow --diff
-----
+
If at this point there is anything about the upgrade that you wish to fine-tune
(for example, to change some of the automatic IDs), now is the time to do so:
+
-----
# crm_shadow --edit
-----
+
This will open the configuration in your favorite editor (whichever is
specified by the standard *$EDITOR* environment variable).
+
. Preview how the cluster will react:
+
------
# crm_simulate --live-check --save-dotfile shadow.dot -S
-# graphviz shadow.dot
+# dot -Tsvg shadow.dot -o shadow.svg
------
+
+You can then view shadow.svg with any compatible image viewer or web browser.
Verify that either no resource actions will occur or that you are
happy with any that are scheduled. If the output contains actions you
do not expect (possibly due to changes to the score calculations), you
may need to make further manual changes. See
<> for further details on how to interpret
-the output of `crm_simulate` and `graphviz`.
+the output of `crm_simulate` and `dot`.
+
. Upload the changes:
+
-----
# crm_shadow --commit shadow --force
-----
+
In the unlikely event this step fails, please report a bug.
[NOTE]
====
indexterm:[Configuration,upgrade manually]
It is also possible to perform the configuration upgrade steps manually:
. Locate the +upgrade*.xsl+ conversion scripts provided with the source code. These will often
be installed in a location such as +/usr/share/pacemaker+, or may be obtained from
the https://github.com/ClusterLabs/pacemaker/tree/master/xml[source repository].
. Run the conversion scripts that apply to your older version, for example:
indexterm:[XML,convert]
+
-----
# xsltproc /path/to/upgrade06.xsl config06.xml > config10.xml
-----
+
. Locate the +pacemaker.rng+ schema (from the same location as the xsl files).
. Check the XML validity: indexterm:[validate configuration]indexterm:[Configuration,validate XML]
+
----
# xmllint --relaxng /path/to/pacemaker.rng config10.xml
----
The advantage of this method is that it can be performed without the
cluster running, and any validation errors are often more informative.
====
== What Changed in 2.0 ==
The main goal of the 2.0 release was to remove support for deprecated syntax,
along with some small changes in default configuration behavior and tool
behavior. Highlights:
* Only Corosync version 2 and greater is now supported as the underlying
cluster layer. Support for Heartbeat and Corosync 1 (including CMAN) is
removed.
* The Pacemaker detail log file is now stored in
/var/log/pacemaker/pacemaker.log by default.
* The record-pending cluster property now defaults to true, which
allows status tools such as crm_mon to show operations that are in
progress.
* Support for a number of deprecated build options, environment variables,
and configuration settings has been removed.
* The +master+ tag has been deprecated in favor of using a +clone+ tag with the
new +promotable+ meta-attribute set to +true+. "Master/slave" clone resources
are now referred to as "promotable" clone resources, though it will take
longer for the full terminology change to be completed.
* The public API for Pacemaker libraries that software applications can use
has changed significantly.
For a detailed list of changes, see the release notes and the
https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes[Pacemaker 2.0 Changes]
page on the ClusterLabs wiki.
== What Changed in 1.0 ==
=== New ===
* Failure timeouts.
* New section for resource and operation defaults.
* Tool for making offline configuration changes.
* +Rules, instance_attributes, meta_attributes+ and sets of operations can be defined once and referenced in multiple places.
* The CIB now accepts XPath-based create/modify/delete operations. See the pass:[cibadmin] help text.
* Multi-dimensional colocation and ordering constraints.
* The ability to connect to the CIB from non-cluster machines.
* Allow recurring actions to be triggered at known times.
=== Changed ===
* Syntax
** All resource and cluster options now use dashes (-) instead of underscores (_)
** +master_slave+ was renamed to +master+
** The +attributes+ container tag was removed
** The operation field +pre-req+ has been renamed +requires+
** All operations must have an +interval+; for +start+/+stop+ it must be set to zero
* The +stonith-enabled+ option now defaults to true.
* The cluster will refuse to start resources if +stonith-enabled+ is true (or unset) and no STONITH resources have been defined
* The attributes of colocation and ordering constraints were renamed for clarity.
* +resource-failure-stickiness+ has been replaced by +migration-threshold+.
* The parameters for command-line tools have been made consistent
* Switched to 'RelaxNG' schema validation and 'libxml2' parser
** id fields are now XML IDs which have the following limitations:
*** id's cannot contain colons (:)
*** id's cannot begin with a number
*** id's must be globally unique (not just unique for that tag)
** Some fields (such as those in constraints that refer to resources) are IDREFs.
+
This means that they must reference existing resources or objects in
order for the configuration to be valid. Removing an object which is
referenced elsewhere will therefore fail.
+
** The CIB representation, from which an MD5 digest is calculated to verify CIBs on the nodes, has changed.
+
This means that every CIB update will require a full refresh on any
upgraded nodes until the cluster is fully upgraded to 1.0. This will
result in significant performance degradation and it is therefore
highly inadvisable to run a mixed 1.0/0.6 cluster for any longer than
absolutely necessary.
+
* Ping node information no longer needs to be added to _ha.cf_.
+
Simply include the lists of hosts in your ping resource(s).
=== Removed ===
* Syntax
** It is no longer possible to set resource meta options as top-level
attributes. Use meta attributes instead.
** Resource and operation defaults are no longer read from
+crm_config+.
diff --git a/doc/Pacemaker_Administration/en-US/Pacemaker_Administration.xml b/doc/Pacemaker_Administration/en-US/Pacemaker_Administration.xml
index 07a6b77ddc..03ce6bcbc0 100644
--- a/doc/Pacemaker_Administration/en-US/Pacemaker_Administration.xml
+++ b/doc/Pacemaker_Administration/en-US/Pacemaker_Administration.xml
@@ -1,18 +1,19 @@
%BOOK_ENTITIES;
]>
-
+
+
diff --git a/doc/Pacemaker_Administration/en-US/Revision_History.xml b/doc/Pacemaker_Administration/en-US/Revision_History.xml
index 56d3c70687..eaaacd6457 100644
--- a/doc/Pacemaker_Administration/en-US/Revision_History.xml
+++ b/doc/Pacemaker_Administration/en-US/Revision_History.xml
@@ -1,28 +1,45 @@
%BOOK_ENTITIES;
]>
Revision History
+
+ 1-1
+ Tue Dec 4 2018
+
+ KenGaillot
+ kgaillot@redhat.com
+
+
+ JanPokorný
+ jpokorny@redhat.com
+
+
+ Add "Troubleshooting" chapter, minor
+ clarifications and reformatting
+
+
+
1-0
Tue Jan 23 2018
KenGaillot
kgaillot@redhat.com
Move administration-oriented information from
- Pacemaker Explained into its own
- book
+ Pacemaker Explained into its own
+ book