diff --git a/doc/Pacemaker_Explained/en-US/Book_Info.xml b/doc/Pacemaker_Explained/en-US/Book_Info.xml index de4ddbebe4..5321593452 100644 --- a/doc/Pacemaker_Explained/en-US/Book_Info.xml +++ b/doc/Pacemaker_Explained/en-US/Book_Info.xml @@ -1,35 +1,35 @@ Configuration Explained An A-Z guide to Pacemaker's Configuration Options Pacemaker 1.1 - 7 - 1 + 8 + 0 The purpose of this document is to definitively explain the concepts used to configure Pacemaker. To achieve this, it will focus exclusively on the XML syntax used to configure Pacemaker's Cluster Information Base (CIB). diff --git a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt index 9527b1ab18..e3f4634c5a 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Advanced-Options.txt @@ -1,821 +1,823 @@ = Advanced Configuration = [[s-remote-connection]] == Connecting from a Remote Machine == indexterm:[Cluster,Remote connection] indexterm:[Cluster,Remote administration] Provided Pacemaker is installed on a machine, it is possible to connect to the cluster even if the machine itself is not in the same cluster. To do this, one simply sets up a number of environment variables and runs the same commands as when working on a cluster node. .Environment Variables Used to Connect to Remote Instances of the CIB [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Environment Variable |Default |Description |CIB_user |$USER |The user to connect as. Needs to be part of the +hacluster+ group on the target host. indexterm:[Environment Variable,CIB_user] |CIB_passwd | |The user's password. Read from the command line if unset. indexterm:[Environment Variable,CIB_passwd] |CIB_server |localhost |The host to contact indexterm:[Environment Variable,CIB_server] |CIB_port | |The port on which to contact the server; required. indexterm:[Environment Variable,CIB_port] |CIB_encrypted |TRUE |Whether to encrypt network traffic indexterm:[Environment Variable,CIB_encrypted] |========================================================= So, if *c001n01* is an active cluster node and is listening on port 1234 for connections, and *someuser* is a member of the *hacluster* group, then the following would prompt for *someuser*'s password and return the cluster's current configuration: ---- # export CIB_port=1234; export CIB_server=c001n01; export CIB_user=someuser; # cibadmin -Q ---- For security reasons, the cluster does not listen for remote connections by default. If you wish to allow remote access, you need to set the +remote-tls-port+ (encrypted) or +remote-clear-port+ (unencrypted) CIB properties (i.e., those kept in the +cib+ tag, like +num_updates+ and +epoch+). .Extra top-level CIB properties for remote access [width="95%",cols="1m,1,3<",options="header",align="center"] |========================================================= |Field |Default |Description |remote-tls-port |_none_ |Listen for encrypted remote connections on this port. indexterm:[remote-tls-port,Remote Connection Option] indexterm:[Remote Connection,Option,remote-tls-port] |remote-clear-port |_none_ |Listen for plaintext remote connections on this port. indexterm:[remote-clear-port,Remote Connection Option] indexterm:[Remote Connection,Option,remote-clear-port] |========================================================= [[s-recurring-start]] == Specifying When Recurring Actions are Performed == By default, recurring actions are scheduled relative to when the resource started. So if your resource was last started at 14:32 and you have a backup set to be performed every 24 hours, then the backup will always run in the middle of the business day -- hardly desirable. To specify a date and time that the operation should be relative to, set the operation's +interval-origin+. The cluster uses this point to calculate the correct +start-delay+ such that the operation will occur at _origin + (interval * N)_. So, if the operation's interval is 24h, its interval-origin is set to 02:00 and it is currently 14:32, then the cluster would initiate the operation with a start delay of 11 hours and 28 minutes. If the resource is moved to another node before 2am, then the operation is cancelled. The value specified for +interval+ and +interval-origin+ can be any date/time conforming to the http://en.wikipedia.org/wiki/ISO_8601[ISO8601 standard]. By way of example, to specify an operation that would run on the first Monday of 2009 and every Monday after that, you would add: .Specifying a Base for Recurring Action Intervals ===== [source,XML] ===== == Moving Resources == indexterm:[Moving,Resources] indexterm:[Resource,Moving] === Moving Resources Manually === There are primarily two occasions when you would want to move a resource from its current location: when the whole node is under maintenance, and when a single resource needs to be moved. ==== Standby Mode ==== Since everything eventually comes down to a score, you could create constraints for every resource to prevent them from running on one node. While pacemaker configuration can seem convoluted at times, not even we would require this of administrators. Instead, one can set a special node attribute which tells the cluster "don't let anything run here". There is even a helpful tool to help query and set it, called `crm_standby`. To check the standby status of the current machine, run: ---- # crm_standby -G ---- A value of +on+ indicates that the node is _not_ able to host any resources, while a value of +off+ says that it _can_. You can also check the status of other nodes in the cluster by specifying the `--node` option: ---- # crm_standby -G --node sles-2 ---- To change the current node's standby status, use `-v` instead of `-G`: ---- # crm_standby -v on ---- Again, you can change another host's value by supplying a hostname with `--node`. ==== Moving One Resource ==== When only one resource is required to move, we could do this by creating location constraints. However, once again we provide a user-friendly shortcut as part of the `crm_resource` command, which creates and modifies the extra constraints for you. If +Email+ were running on +sles-1+ and you wanted it moved to a specific location, the command would look something like: ---- # crm_resource -M -r Email -H sles-2 ---- Behind the scenes, the tool will create the following location constraint: [source,XML] It is important to note that subsequent invocations of `crm_resource -M` are not cumulative. So, if you ran these commands ---- # crm_resource -M -r Email -H sles-2 # crm_resource -M -r Email -H sles-3 ---- then it is as if you had never performed the first command. To allow the resource to move back again, use: ---- # crm_resource -U -r Email ---- Note the use of the word _allow_. The resource can move back to its original location but, depending on +resource-stickiness+, it might stay where it is. To be absolutely certain that it moves back to +sles-1+, move it there before issuing the call to `crm_resource -U`: ---- # crm_resource -M -r Email -H sles-1 # crm_resource -U -r Email ---- Alternatively, if you only care that the resource should be moved from its current location, try: ---- # crm_resource -B -r Email ---- Which will instead create a negative constraint, like [source,XML] This will achieve the desired effect, but will also have long-term consequences. As the tool will warn you, the creation of a +-INFINITY+ constraint will prevent the resource from running on that node until `crm_resource -U` is used. This includes the situation where every other cluster node is no longer available! In some cases, such as when +resource-stickiness+ is set to +INFINITY+, it is possible that you will end up with the problem described in <>. The tool can detect some of these cases and deals with them by creating both positive and negative constraints. E.g. +Email+ prefers +sles-1+ with a score of +-INFINITY+ +Email+ prefers +sles-2+ with a score of +INFINITY+ which has the same long-term consequences as discussed earlier. [[s-failure-migration]] === Moving Resources Due to Failure === Normally, if a running resource fails, pacemaker will try to start it again on the same node. However if a resource fails repeatedly, it is possible that there is an underlying problem on that node, and you might desire trying a different node in such a case. indexterm:[migration-threshold] indexterm:[failure-timeout] indexterm:[start-failure-is-fatal] Pacemaker allows you to set your preference via the +migration-threshold+ resource option. footnote:[ The naming of this option was perhaps unfortunate as it is easily confused with live migration, the process of moving a resource from one node to another without stopping it. Xen virtual guests are the most common example of resources that can be migrated in this manner. ] Simply define +migration-threshold=pass:[N]+ for a resource and it will migrate to a new node after 'N' failures. There is no threshold defined by default. To determine the resource's current failure status and limits, run `crm_mon --failcounts`. By default, once the threshold has been reached, the troublesome node will no longer be allowed to run the failed resource until the administrator manually resets the resource's failcount using `crm_failcount` (after hopefully first fixing the failure's cause). Alternatively, it is possible to expire them by setting the +failure-timeout+ option for the resource. For example, a setting of +migration-threshold=2+ and +failure-timeout=60s+ would cause the resource to move to a new node after 2 failures, and allow it to move back (depending on stickiness and constraint scores) after one minute. There are two exceptions to the migration threshold concept: when a resource either fails to start or fails to stop. If the cluster property +start-failure-is-fatal+ is set to +true+ (which is the default), start failures cause the failcount to be set to +INFINITY+ and thus always cause the resource to move immediately. Stop failures are slightly different and crucial. If a resource fails to stop and STONITH is enabled, then the cluster will fence the node in order to be able to start the resource elsewhere. If STONITH is not enabled, then the cluster has no way to continue and will not try to start the resource elsewhere, but will try to stop it again after the failure timeout. [IMPORTANT] Please read <> to understand how timeouts work before configuring a +failure-timeout+. === Moving Resources Due to Connectivity Changes === You can configure the cluster to move resources when external connectivity is lost in two steps. ==== Tell Pacemaker to Monitor Connectivity ==== First, add an *ocf:pacemaker:ping* resource to the cluster. The *ping* resource uses the system utility of the same name to a test whether list of machines (specified by DNS hostname or IPv4/IPv6 address) are reachable and uses the results to maintain a node attribute called +pingd+ by default. footnote:[ The attribute name is customizable, in order to allow multiple ping groups to be defined. ] [NOTE] =========== Older versions of Heartbeat required users to add ping nodes to +ha.cf+, but this is no longer required. Older versions of Pacemaker used a different agent *ocf:pacemaker:pingd* which is now deprecated in favor of *ping*. If your version of Pacemaker does not contain the *ping* resource agent, download the latest version from https://github.com/ClusterLabs/pacemaker/tree/master/extra/resources/ping =========== Normally, the ping resource should run on all cluster nodes, which means that you'll need to create a clone. A template for this can be found below along with a description of the most interesting parameters. .Common Options for a 'ping' Resource [width="95%",cols="1m,4<",options="header",align="center"] |========================================================= |Field |Description |dampen |The time to wait (dampening) for further changes to occur. Use this to prevent a resource from bouncing around the cluster when cluster nodes notice the loss of connectivity at slightly different times. indexterm:[dampen,Ping Resource Option] indexterm:[Ping Resource,Option,dampen] |multiplier |The number of connected ping nodes gets multiplied by this value to get a score. Useful when there are multiple ping nodes configured. indexterm:[multiplier,Ping Resource Option] indexterm:[Ping Resource,Option,multiplier] |host_list |The machines to contact in order to determine the current connectivity status. Allowed values include resolvable DNS host names, IPv4 and IPv6 addresses. indexterm:[host_list,Ping Resource Option] indexterm:[Ping Resource,Option,host_list] |========================================================= .An example ping cluster resource that checks node connectivity once every minute ===== [source,XML] ------------ ------------ ===== [IMPORTANT] =========== You're only half done. The next section deals with telling Pacemaker how to deal with the connectivity status that +ocf:pacemaker:ping+ is recording. =========== ==== Tell Pacemaker How to Interpret the Connectivity Data ==== [IMPORTANT] ====== Before attempting the following, make sure you understand <>. ====== There are a number of ways to use the connectivity data. The most common setup is for people to have a single ping target (e.g. the service network's default gateway), to prevent the cluster from running a resource on any unconnected node. .Don't run a resource on unconnected nodes ===== [source,XML] ------- ------- ===== A more complex setup is to have a number of ping targets configured. You can require the cluster to only run resources on nodes that can connect to all (or a minimum subset) of them. .Run only on nodes connected to three or more ping targets. ===== [source,XML] ------- ... ... ... ------- ===== Alternatively, you can tell the cluster only to _prefer_ nodes with the best connectivity. Just be sure to set +multiplier+ to a value higher than that of +resource-stickiness+ (and don't set either of them to +INFINITY+). .Prefer the node with the most connected ping nodes ===== [source,XML] ------- ------- ===== It is perhaps easier to think of this in terms of the simple constraints that the cluster translates it into. For example, if *sles-1* is connected to all five ping nodes but *sles-2* is only connected to two, then it would be as if you instead had the following constraints in your configuration: .How the cluster translates the above location constraint ===== [source,XML] ------- ------- ===== The advantage is that you don't have to manually update any constraints whenever your network connectivity changes. You can also combine the concepts above into something even more complex. The example below shows how you can prefer the node with the most connected ping nodes provided they have connectivity to at least three (again assuming that +multiplier+ is set to 1000). .A more complex example of choosing a location based on connectivity ===== [source,XML] ------- ------- ===== [[s-migrating-resources]] === Migrating Resources === Normally, when the cluster needs to move a resource, it fully restarts the resource (i.e. stops the resource on the current node and starts it on the new node). However, some types of resources, such as Xen virtual guests, are able to move to another location without loss of state (often referred to as live migration or hot migration). In pacemaker, this is called resource migration. Pacemaker can be configured to migrate a resource when moving it, rather than restarting it. Not all resources are able to migrate; see the Migration Checklist below, and those that can, won't do so in all situations. Conceptually, there are two requirements from which the other prerequisites follow: * The resource must be active and healthy at the old location; and * everything required for the resource to run must be available on both the old and new locations. The cluster is able to accommodate both 'push' and 'pull' migration models by requiring the resource agent to support two special actions: +migrate_to+ (performed on the current location) and +migrate_from+ (performed on the destination). In push migration, the process on the current location transfers the resource to the new location where is it later activated. In this scenario, most of the work would be done in the +migrate_to+ action and, if anything, the activation would occur during +migrate_from+. Conversely for pull, the +migrate_to+ action is practically empty and +migrate_from+ does most of the work, extracting the relevant resource state from the old location and activating it. There is no wrong or right way for a resource agent to implement migration, as long as it works. .Migration Checklist * The resource may not be a clone. * The resource must use an OCF style agent. * The resource must not be in a failed or degraded state. * The resource agent must support +migrate_to+ and +migrate_from+ actions, and advertise them in its metadata. * The resource must have the +allow-migrate+ meta-attribute set to +true+ (which is not the default). If an otherwise migratable resource depends on another resource via an ordering constraint, there are special situations in which it will be restarted rather than migrated. For example, if the resource depends on a clone, and at the time the resource needs to be moved, the clone has instances that are stopping and instances that are starting, then the resource will be restarted. The Policy Engine is not yet able to model this situation correctly and so takes the safer (if less optimal) path. In pacemaker 1.1.11 and earlier, a migratable resource will be restarted when moving if it directly or indirectly depends on 'any' primitive or group resources. Even in newer versions, if a migratable resource depends on a non-migratable resource, and both need to be moved, the migratable resource will be restarted. [[s-node-health]] == Tracking Node Health == A node may be functioning adequately as far as cluster membership is concerned, and yet be "unhealthy" in some respect that makes it an undesirable location for resources. For example, a disk drive may be reporting SMART errors, or the CPU may be highly loaded. Pacemaker offers a way to automatically move resources off unhealthy nodes. === Node Health Attributes === Pacemaker will treat any node attribute whose name starts with +#health+ as an indicator of node health. Node health attributes may have one of the following values: .Allowed Values for Node Health Attributes [width="95%",cols="1,3<",options="header",align="center"] |========================================================= |Value |Intended significance |+red+ |This indicator is unhealthy indexterm:[Node health,red] |+yellow+ |This indicator is becoming unhealthy indexterm:[Node health,yellow] |+green+ |This indicator is healthy indexterm:[Node health,green] |'integer' |A numeric score to apply to all resources on this node (0 or positive is healthy, negative is unhealthy) indexterm:[Node health,score] |========================================================= === Node Health Strategy === Pacemaker assigns a node health score to each node, as the sum of the values of all its node health attributes. This score will be used as a location constraint applied to this node for all resources. The +node-health-strategy+ cluster option controls how Pacemaker responds to changes in node health attributes, and how it translates +red+, +yellow+, and +green+ to scores. Allowed values are: .Node Health Strategies [width="95%",cols="1m,3<",options="header",align="center"] |========================================================= |Value |Effect |none |Do not track node health attributes at all. indexterm:[Node health,none] |migrate-on-red |Assign the value of +-INFINITY+ to +red+, and 0 to +yellow+ and +green+. This will cause all resources to move off the node if any attribute is +red+. indexterm:[Node health,migrate-on-red] |only-green |Assign the value of +-INFINITY+ to +red+ and +yellow+, and 0 to +green+. This will cause all resources to move off the node if any attribute is +red+ or +yellow+. indexterm:[Node health,only-green] |progressive |Assign the value of the +node-health-red+ cluster option to +red+, the value of +node-health-yellow+ to +yellow+, and the value of +node-health-green+ to - +green+. This strategy gives the administrator finer control over how - important each value is. + +green+. Each node is additionally assigned a score of +node-health-base+ + (this allows resources to start even if some attributes are +yellow+). This + strategy gives the administrator finer control over how important each value + is. indexterm:[Node health,progressive] |custom |Track node health attributes using the same values as +progressive+ for +red+, +yellow+, and +green+, but do not take them into account. The administrator is expected to implement a policy by defining rules (see <>) referencing node health attributes. indexterm:[Node health,custom] |========================================================= === Measuring Node Health === Since Pacemaker calculates node health based on node attributes, any method that sets node attributes may be used to measure node health. The most common ways are resource agents or separate daemons. Pacemaker provides examples that can be used directly or as a basis for custom code. The +ocf:pacemaker:HealthCPU+ and +ocf:pacemaker:HealthSMART+ resource agents set node health attributes based on CPU and disk parameters. The +ipmiservicelogd+ daemon sets node health attributes based on IPMI values (the +ocf:pacemaker:SystemHealth+ resource agent can be used to manage the daemon as a cluster resource). [[s-reusing-config-elements]] == Reusing Rules, Options and Sets of Operations == Sometimes a number of constraints need to use the same set of rules, and resources need to set the same options and parameters. To simplify this situation, you can refer to an existing object using an +id-ref+ instead of an id. So if for one resource you have [source,XML] ------ ------ Then instead of duplicating the rule for all your other resources, you can instead specify: .Referencing rules from other constraints ===== [source,XML] ------- ------- ===== [IMPORTANT] =========== The cluster will insist that the +rule+ exists somewhere. Attempting to add a reference to a non-existing rule will cause a validation failure, as will attempting to remove a +rule+ that is referenced elsewhere. =========== The same principle applies for +meta_attributes+ and +instance_attributes+ as illustrated in the example below: .Referencing attributes, options, and operations from other resources ===== [source,XML] ------- ------- ===== == Reloading Services After a Definition Change == The cluster automatically detects changes to the definition of services it manages. The normal response is to stop the service (using the old definition) and start it again (with the new definition). This works well, but some services are smarter and can be told to use a new set of options without restarting. To take advantage of this capability, the resource agent must: . Accept the +reload+ operation and perform any required actions. _The actions here depend completely on your application!_ + .The DRBD agent's logic for supporting +reload+ ===== [source,Bash] ------- case $1 in start) drbd_start ;; stop) drbd_stop ;; reload) drbd_reload ;; monitor) drbd_monitor ;; *) drbd_usage exit $OCF_ERR_UNIMPLEMENTED ;; esac exit $? ------- ===== . Advertise the +reload+ operation in the +actions+ section of its metadata + .The DRBD Agent Advertising Support for the +reload+ Operation ===== [source,XML] ------- 1.1 Master/Slave OCF Resource Agent for DRBD ... ------- ===== . Advertise one or more parameters that can take effect using +reload+. + Any parameter with the +unique+ set to 0 is eligible to be used in this way. + .Parameter that can be changed using reload ===== [source,XML] ------- Full path to the drbd.conf file. Path to drbd.conf ------- ===== Once these requirements are satisfied, the cluster will automatically know to reload the resource (instead of restarting) when a non-unique field changes. [NOTE] ====== Metadata will not be re-read unless the resource needs to be started. This may mean that the resource will be restarted the first time, even though you changed a parameter with +unique=0+. ====== [NOTE] ====== If both a unique and non-unique field are changed simultaneously, the resource will still be restarted. ====== diff --git a/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt b/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt index 2f5bec7b11..05dc42e5f1 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Constraints.txt @@ -1,836 +1,846 @@ = Resource Constraints = indexterm:[Resource,Constraints] == Scores == Scores of all kinds are integral to how the cluster works. Practically everything from moving a resource to deciding which resource to stop in a degraded cluster is achieved by manipulating scores in some way. Scores are calculated per resource and node. Any node with a negative score for a resource can't run that resource. The cluster places a resource on the node with the highest score for it. === Infinity Math === Pacemaker implements +INFINITY+ (or equivalently, ++INFINITY+) internally as a score of 1,000,000. Addition and subtraction with it follow these three basic rules: * Any value + +INFINITY+ = +INFINITY+ * Any value - +INFINITY+ = +-INFINITY+ * +INFINITY+ - +INFINITY+ = +-INFINITY+ [NOTE] ====== What if you want to use a score higher than 1,000,000? Typically this possibility arises when someone wants to base the score on some external metric that might go above 1,000,000. The short answer is you can't. The long answer is it is sometimes possible work around this limitation creatively. You may be able to set the score to some computed value based on the external metric rather than use the metric directly. For nodes, you can store the metric as a node attribute, and query the attribute when computing the score (possibly as part of a custom resource agent). ====== == Deciding Which Nodes a Resource Can Run On == indexterm:[Location Constraints] indexterm:[Resource,Constraints,Location] 'Location constraints' tell the cluster which nodes a resource can run on. There are two alternative strategies. One way is to say that, by default, resources can run anywhere, and then the location constraints specify nodes that are not allowed (an 'opt-out' cluster). The other way is to start with nothing able to run anywhere, and use location constraints to selectively enable allowed nodes (an 'opt-in' cluster). Whether you should choose opt-in or opt-out depends on your personal preference and the make-up of your cluster. If most of your resources can run on most of the nodes, then an opt-out arrangement is likely to result in a simpler configuration. On the other-hand, if most resources can only run on a small subset of nodes, an opt-in configuration might be simpler. === Location Properties === .Properties of a rsc_location Constraint [width="95%",cols="2m,1,5>), the + submatches can be referenced as +%0+ through +%9+ in the rule's + +score-attribute+ or a rule expression's +attribute+ '(since 1.1.16)' +indexterm:[rsc-pattern,Location Constraints] +indexterm:[Constraints,Location,rsc-pattern] + |node | |A node's name indexterm:[node,Location Constraints] indexterm:[Constraints,Location,node] |score | |Positive values indicate the resource should run on this node. Negative values indicate the resource should not run on this node. Values of \+/- +INFINITY+ change "should"/"should not" to "must"/"must not". indexterm:[score,Location Constraints] indexterm:[Constraints,Location,score] |resource-discovery |always |Whether Pacemaker should perform resource discovery (that is, check whether the resource is already running) for this resource on this node. This should normally be left as the default, so that rogue instances of a service can be stopped when they are running where they are not supposed to be. However, there are two situations where disabling resource discovery is a good idea: when a service is not installed on a node, discovery might return an error (properly written OCF agents will not, so this is usually only seen with other agent types); and when Pacemaker Remote is used to scale a cluster to hundreds of nodes, limiting resource discovery to allowed nodes can significantly boost performance. '(since 1.1.13)' * +always:+ Always perform resource discovery for the specified resource on this node. * +never:+ Never perform resource discovery for the specified resource on this node. This option should generally be used with a -INFINITY score, although that is not strictly required. * +exclusive:+ Perform resource discovery for the specified resource only on this node (and other nodes similarly marked as +exclusive+). Multiple location constraints using +exclusive+ discovery for the same resource across different nodes creates a subset of nodes resource-discovery is exclusive to. If a resource is marked for +exclusive+ discovery on one or more nodes, that resource is only allowed to be placed within that subset of nodes. indexterm:[Resource Discovery,Location Constraints] indexterm:[Constraints,Location,Resource Discovery] |========================================================= [WARNING] ========= Setting resource-discovery to +never+ or +exclusive+ removes Pacemaker's ability to detect and stop unwanted instances of a service running where it's not supposed to be. It is up to the system administrator (you!) to make sure that the service can 'never' be active on nodes without resource-discovery (such as by leaving the relevant software uninstalled). ========= === Asymmetrical "Opt-In" Clusters === indexterm:[Asymmetrical Opt-In Clusters] indexterm:[Cluster Type,Asymmetrical Opt-In] To create an opt-in cluster, start by preventing resources from running anywhere by default: ---- # crm_attribute --name symmetric-cluster --update false ---- Then start enabling nodes. The following fragment says that the web server prefers *sles-1*, the database prefers *sles-2* and both can fail over to *sles-3* if their most preferred node fails. .Opt-in location constraints for two resources ====== [source,XML] ------- ------- ====== === Symmetrical "Opt-Out" Clusters === indexterm:[Symmetrical Opt-Out Clusters] indexterm:[Cluster Type,Symmetrical Opt-Out] To create an opt-out cluster, start by allowing resources to run anywhere by default: ---- # crm_attribute --name symmetric-cluster --update true ---- Then start disabling nodes. The following fragment is the equivalent of the above opt-in configuration. .Opt-out location constraints for two resources ====== [source,XML] ------- ------- ====== [[node-score-equal]] === What if Two Nodes Have the Same Score === If two nodes have the same score, then the cluster will choose one. This choice may seem random and may not be what was intended, however the cluster was not given enough information to know any better. .Constraints where a resource prefers two nodes equally ====== [source,XML] ------- ------- ====== In the example above, assuming no other constraints and an inactive cluster, +Webserver+ would probably be placed on +sles-1+ and +Database+ on +sles-2+. It would likely have placed +Webserver+ based on the node's uname and +Database+ based on the desire to spread the resource load evenly across the cluster. However other factors can also be involved in more complex configurations. [[s-resource-ordering]] == Specifying the Order in which Resources Should Start/Stop == indexterm:[Resource,Constraints,Ordering] indexterm:[Resource,Start Order] indexterm:[Ordering Constraints] 'Ordering constraints' tell the cluster the order in which resources should start. [IMPORTANT] ==== Ordering constraints affect 'only' the ordering of resources; they do 'not' require that the resources be placed on the same node. If you want resources to be started on the same node 'and' in a specific order, you need both an ordering constraint 'and' a colocation constraint (see <>), or alternatively, a group (see <>). ==== === Ordering Properties === .Properties of a rsc_order Constraint [width="95%",cols="1m,1,4> resources. === Optional and mandatory ordering === Here is an example of ordering constraints where +Database+ 'must' start before +Webserver+, and +IP+ 'should' start before +Webserver+ if they both need to be started: .Optional and mandatory ordering constraints ====== [source,XML] ------- ------- ====== Because the above example lets +symmetrical+ default to TRUE, +Webserver+ must be stopped before +Database+ can be stopped, and +Webserver+ should be stopped before +IP+ if they both need to be stopped. [[s-resource-colocation]] == Placing Resources Relative to other Resources == indexterm:[Resource,Constraints,Colocation] indexterm:[Resource,Location Relative to other Resources] 'Colocation constraints' tell the cluster that the location of one resource depends on the location of another one. Colocation has an important side-effect: it affects the order in which resources are assigned to a node. Think about it: You can't place A relative to B unless you know where B is. footnote:[ While the human brain is sophisticated enough to read the constraint in any order and choose the correct one depending on the situation, the cluster is not quite so smart. Yet. ] So when you are creating colocation constraints, it is important to consider whether you should colocate A with B, or B with A. Another thing to keep in mind is that, assuming A is colocated with B, the cluster will take into account A's preferences when deciding which node to choose for B. For a detailed look at exactly how this occurs, see http://clusterlabs.org/doc/Colocation_Explained.pdf[Colocation Explained]. [IMPORTANT] ==== Colocation constraints affect 'only' the placement of resources; they do 'not' require that the resources be started in a particular order. If you want resources to be started on the same node 'and' in a specific order, you need both an ordering constraint (see <>) 'and' a colocation constraint, or alternatively, a group (see <>). ==== === Colocation Properties === .Properties of a rsc_colocation Constraint [width="95%",cols="2m,5<",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the constraint. indexterm:[id,Colocation Constraints] indexterm:[Constraints,Colocation,id] |rsc |The name of a resource that should be located relative to +with-rsc+. indexterm:[rsc,Colocation Constraints] indexterm:[Constraints,Colocation,rsc] |with-rsc |The name of the resource used as the colocation target. The cluster will decide where to put this resource first and then decide where to put +rsc+. indexterm:[with-rsc,Colocation Constraints] indexterm:[Constraints,Colocation,with-rsc] |score |Positive values indicate the resources should run on the same node. Negative values indicate the resources should run on different nodes. Values of \+/- +INFINITY+ change "should" to "must". indexterm:[score,Colocation Constraints] indexterm:[Constraints,Colocation,score] |========================================================= === Mandatory Placement === Mandatory placement occurs when the constraint's score is ++INFINITY+ or +-INFINITY+. In such cases, if the constraint can't be satisfied, then the +rsc+ resource is not permitted to run. For +score=INFINITY+, this includes cases where the +with-rsc+ resource is not active. If you need resource +A+ to always run on the same machine as resource +B+, you would add the following constraint: .Mandatory colocation constraint for two resources ==== [source,XML] ==== Remember, because +INFINITY+ was used, if +B+ can't run on any of the cluster nodes (for whatever reason) then +A+ will not be allowed to run. Whether +A+ is running or not has no effect on +B+. Alternatively, you may want the opposite -- that +A+ 'cannot' run on the same machine as +B+. In this case, use +score="-INFINITY"+. .Mandatory anti-colocation constraint for two resources ==== [source,XML] ==== Again, by specifying +-INFINITY+, the constraint is binding. So if the only place left to run is where +B+ already is, then +A+ may not run anywhere. As with +INFINITY+, +B+ can run even if +A+ is stopped. However, in this case +A+ also can run if +B+ is stopped, because it still meets the constraint of +A+ and +B+ not running on the same node. === Advisory Placement === If mandatory placement is about "must" and "must not", then advisory placement is the "I'd prefer if" alternative. For constraints with scores greater than +-INFINITY+ and less than +INFINITY+, the cluster will try to accommodate your wishes but may ignore them if the alternative is to stop some of the cluster resources. As in life, where if enough people prefer something it effectively becomes mandatory, advisory colocation constraints can combine with other elements of the configuration to behave as if they were mandatory. .Advisory colocation constraint for two resources ==== [source,XML] ==== [[s-resource-sets]] == Resource Sets == 'Resource sets' allow multiple resources to be affected by a single constraint. .A set of 3 resources ==== [source,XML] ---- ---- ==== Resource sets are valid inside +rsc_location+, +rsc_order+ (see <>), +rsc_colocation+ (see <>), and +rsc_ticket+ (see <>) constraints. A resource set has a number of properties that can be set, though not all have an effect in all contexts. .Properties of a resource_set [width="95%",cols="2m,1,5 ------- ====== .Visual representation of the four resources' start order for the above constraints image::images/resource-set.png["Ordered set",width="16cm",height="2.5cm",align="center"] === Ordered Set === To simplify this situation, resource sets (see <>) can be used within ordering constraints: .A chain of ordered resources expressed as a set ====== [source,XML] ------- ------- ====== While the set-based format is not less verbose, it is significantly easier to get right and maintain. [IMPORTANT] ========= If you use a higher-level tool, pay attention to how it exposes this functionality. Depending on the tool, creating a set +A B+ may be equivalent to +A then B+, or +B then A+. ========= === Ordering Multiple Sets === The syntax can be expanded to allow sets of resources to be ordered relative to each other, where the members of each individual set may be ordered or unordered (controlled by the +sequential+ property). In the example below, +A+ and +B+ can both start in parallel, as can +C+ and +D+, however +C+ and +D+ can only start once _both_ +A+ _and_ +B+ are active. .Ordered sets of unordered resources ====== [source,XML] ------- ------- ====== .Visual representation of the start order for two ordered sets of unordered resources image::images/two-sets.png["Two ordered sets",width="13cm",height="7.5cm",align="center"] Of course either set -- or both sets -- of resources can also be internally ordered (by setting +sequential="true"+) and there is no limit to the number of sets that can be specified. .Advanced use of set ordering - Three ordered sets, two of which are internally unordered ====== [source,XML] ------- ------- ====== .Visual representation of the start order for the three sets defined above image::images/three-sets.png["Three ordered sets",width="16cm",height="7.5cm",align="center"] [IMPORTANT] ==== An ordered set with +sequential=false+ makes sense only if there is another set in the constraint. Otherwise, the constraint has no effect. ==== === Resource Set OR Logic === The unordered set logic discussed so far has all been "AND" logic. To illustrate this take the 3 resource set figure in the previous section. Those sets can be expressed, +(A and B) then \(C) then (D) then (E and F)+. Say for example we want to change the first set, +(A and B)+, to use "OR" logic so the sets look like this: +(A or B) then \(C) then (D) then (E and F)+. This functionality can be achieved through the use of the +require-all+ option. This option defaults to TRUE which is why the "AND" logic is used by default. Setting +require-all=false+ means only one resource in the set needs to be started before continuing on to the next set. .Resource Set "OR" logic: Three ordered sets, where the first set is internally unordered with "OR" logic ====== [source,XML] ------- ------- ====== [IMPORTANT] ==== An ordered set with +require-all=false+ makes sense only in conjunction with +sequential=false+. Think of it like this: +sequential=false+ modifies the set to be an unordered set using "AND" logic by default, and adding +require-all=false+ flips the unordered set's "AND" logic to "OR" logic. ==== [[s-resource-sets-colocation]] == Colocating Sets of Resources == Another common situation is for an administrator to create a set of colocated resources. One way to do this would be to define a resource group (see <>), but that cannot always accurately express the desired state. Another way would be to define each relationship as an individual constraint, but that causes a constraint explosion as the number of resources and combinations grow. An example of this approach: .Chain of colocated resources ====== [source,XML] ------- ------- ====== To make things easier, resource sets (see <>) can be used within colocation constraints. As with the chained version, a resource that can't be active prevents any resource that must be colocated with it from being active. For example, if +B+ is not able to run, then both +C+ and by inference +D+ must also remain stopped. Here is an example +resource_set+: .Equivalent colocation chain expressed using +resource_set+ ====== [source,XML] ------- ------- ====== [IMPORTANT] ========= If you use a higher-level tool, pay attention to how it exposes this functionality. Depending on the tool, creating a set +A B+ may be equivalent to +A with B+, or +B with A+. ========= This notation can also be used to tell the cluster that sets of resources must be colocated relative to each other, where the individual members of each set may or may not depend on each other being active (controlled by the +sequential+ property). In this example, +A+, +B+, and +C+ will each be colocated with +D+. +D+ must be active, but any of +A+, +B+, or +C+ may be inactive without affecting any other resources. .Using colocated sets to specify a common peer ====== [source,XML] ------- ------- ====== [IMPORTANT] ==== A colocated set with +sequential=false+ makes sense only if there is another set in the constraint. Otherwise, the constraint has no effect. ==== There is no inherent limit to the number and size of the sets used. The only thing that matters is that in order for any member of one set in the constraint to be active, all members of sets listed after it must also be active (and naturally on the same node); and if a set has +sequential="true"+, then in order for one member of that set to be active, all members listed before it must also be active. If desired, you can restrict the dependency to instances of multistate resources that are in a specific role, using the set's +role+ property. .Colocation chain in which the members of the middle set have no interdependencies, and the last listed set (which the cluster places first) is restricted to instances in master status. ====== [source,XML] ------- ------- ====== .Visual representation the above example (resources to the left are placed first) image::images/three-sets-complex.png["Colocation chain",width="16cm",height="9cm",align="center"] [NOTE] ==== Pay close attention to the order in which resources and sets are listed. While the colocation dependency for members of any one set is last-to-first, the colocation dependency for multiple sets is first-to-last. In the above example, +B+ is colocated with +A+, but +colocated-set-1+ is colocated with +colocated-set-2+. Unlike ordered sets, colocated sets do not use the +require-all+ option. ==== diff --git a/doc/Pacemaker_Explained/en-US/Ch-Options.txt b/doc/Pacemaker_Explained/en-US/Ch-Options.txt index f8a3daffc8..894458d0ce 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Options.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Options.txt @@ -1,435 +1,441 @@ = Cluster-Wide Configuration = == CIB Properties == Certain settings are defined by CIB properties (that is, attributes of the +cib+ tag) rather than with the rest of the cluster configuration in the +configuration+ section. The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location. .CIB Properties [width="95%",cols="2m,5<",options="header",align="center"] |========================================================= |Field |Description | admin_epoch | indexterm:[Configuration Version,Cluster] indexterm:[Cluster,Option,Configuration Version] indexterm:[admin_epoch,Cluster Option] indexterm:[Cluster,Option,admin_epoch] When a node joins the cluster, the cluster performs a check to see which node has the best configuration. It asks the node with the highest (+admin_epoch+, +epoch+, +num_updates+) tuple to replace the configuration on all the nodes -- which makes setting them, and setting them correctly, very important. +admin_epoch+ is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete. _Never set this value to zero_. In such cases, the cluster cannot tell the difference between your configuration and the "empty" one used when nothing is found on disk. | epoch | indexterm:[epoch,Cluster Option] indexterm:[Cluster,Option,epoch] The cluster increments this every time the configuration is updated (usually by the administrator). | num_updates | indexterm:[num_updates,Cluster Option] indexterm:[Cluster,Option,num_updates] The cluster increments this every time the configuration or status is updated (usually by the cluster) and resets it to 0 when epoch changes. | validate-with | indexterm:[validate-with,Cluster Option] indexterm:[Cluster,Option,validate-with] Determines the type of XML validation that will be done on the configuration. If set to +none+, the cluster will not verify that updates conform to the DTD (nor reject ones that don't). This option can be useful when operating a mixed-version cluster during an upgrade. |cib-last-written | indexterm:[cib-last-written,Cluster Property] indexterm:[Cluster,Property,cib-last-written] Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only. |have-quorum | indexterm:[have-quorum,Cluster Property] indexterm:[Cluster,Property,have-quorum] Indicates if the cluster has quorum. If false, this may mean that the cluster cannot start resources or fence other nodes (see +no-quorum-policy+ below). Maintained by the cluster. |dc-uuid | indexterm:[dc-uuid,Cluster Property] indexterm:[Cluster,Property,dc-uuid] Indicates which cluster node is the current leader. Used by the cluster when placing resources and determining the order of some events. Maintained by the cluster. |========================================================= === Working with CIB Properties === Although these fields can be written to by the user, in most cases the cluster will overwrite any values specified by the user with the "correct" ones. To change the ones that can be specified by the user, for example +admin_epoch+, one should use: ---- # cibadmin --modify --xml-text '' ---- A complete set of CIB properties will look something like this: .Attributes set for a cib object ====== [source,XML] ------- ------- ====== [[s-cluster-options]] == Cluster Options == Cluster options, as you might expect, control how the cluster behaves when confronted with certain situations. They are grouped into sets within the +crm_config+ section, and, in advanced configurations, there may be more than one set. (This will be described later in the section on <> where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once. You can obtain an up-to-date list of cluster options, including their default values, by running the `man pengine` and `man crmd` commands. .Cluster Options [width="95%",cols="5m,2,11>). | enable-startup-probes | TRUE | indexterm:[enable-startup-probes,Cluster Option] indexterm:[Cluster,Option,enable-startup-probes] Should the cluster check for active resources during startup? | maintenance-mode | FALSE | indexterm:[maintenance-mode,Cluster Option] indexterm:[Cluster,Option,maintenance-mode] Should the cluster refrain from monitoring, starting and stopping resources? | stonith-enabled | TRUE | indexterm:[stonith-enabled,Cluster Option] indexterm:[Cluster,Option,stonith-enabled] Should failed nodes and nodes with resources that can't be stopped be shot? If you value your data, set up a STONITH device and enable this. If true, or unset, the cluster will refuse to start resources unless one or more STONITH resources have been configured. If false, unresponsive nodes are immediately assumed to be running no resources, and resource takeover to online nodes starts without any further protection (which means _data loss_ if the unresponsive node still accesses shared storage, for example). See also the +requires+ meta-attribute in <>. | stonith-action | reboot | indexterm:[stonith-action,Cluster Option] indexterm:[Cluster,Option,stonith-action] Action to send to STONITH device. Allowed values are +reboot+ and +off+. The value +poweroff+ is also allowed, but is only used for legacy devices. | stonith-timeout | 60s | indexterm:[stonith-timeout,Cluster Option] indexterm:[Cluster,Option,stonith-timeout] How long to wait for STONITH actions (reboot, on, off) to complete | concurrent-fencing | FALSE | indexterm:[concurrent-fencing,Cluster Option] indexterm:[Cluster,Option,concurrent-fencing] Is the cluster allowed to initiate multiple fence actions concurrently? | cluster-delay | 60s | indexterm:[cluster-delay,Cluster Option] indexterm:[Cluster,Option,cluster-delay] Estimated maximum round-trip delay over the network (excluding action execution). If the TE requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node in this time (after considering the action's own timeout). The "correct" value will depend on the speed and load of your network and cluster nodes. | dc-deadtime | 20s | indexterm:[dc-deadtime,Cluster Option] indexterm:[Cluster,Option,dc-deadtime] How long to wait for a response from other nodes during startup. The "correct" value will depend on the speed/load of your network and the type of switches used. | cluster-recheck-interval | 15min | indexterm:[cluster-recheck-interval,Cluster Option] indexterm:[Cluster,Option,cluster-recheck-interval] Polling interval for time-based changes to options, resource parameters and constraints. The Cluster is primarily event-driven, but your configuration can have elements that take effect based on the time of day. To ensure these changes take effect, we can optionally poll the cluster's status for changes. A value of 0 disables polling. Positive values are an interval (in seconds unless other SI units are specified, e.g. 5min). | pe-error-series-max | -1 | indexterm:[pe-error-series-max,Cluster Option] indexterm:[Cluster,Option,pe-error-series-max] The number of PE inputs resulting in ERRORs to save. Used when reporting problems. A value of -1 means unlimited (report all). | pe-warn-series-max | -1 | indexterm:[pe-warn-series-max,Cluster Option] indexterm:[Cluster,Option,pe-warn-series-max] The number of PE inputs resulting in WARNINGs to save. Used when reporting problems. A value of -1 means unlimited (report all). | pe-input-series-max | -1 | indexterm:[pe-input-series-max,Cluster Option] indexterm:[Cluster,Option,pe-input-series-max] The number of "normal" PE inputs to save. Used when reporting problems. A value of -1 means unlimited (report all). | node-health-strategy | none | indexterm:[node-health-strategy,Cluster Option] indexterm:[Cluster,Option,node-health-strategy] How the cluster should react to node health attributes (see <>). Allowed values are +none+, +migrate-on-red+, +only-green+, +progressive+, and +custom+. +| node-health-base | 0 | +indexterm:[node-health-base,Cluster Option] +indexterm:[Cluster,Option,node-health-base] + The base health score assigned to a node. Only used when + +node-health-strategy+ is +progressive+. '(since 1.1.16)' + | node-health-green | 0 | indexterm:[node-health-green,Cluster Option] indexterm:[Cluster,Option,node-health-green] The score to use for a node health attribute whose value is +green+. Only used when +node-health-strategy+ is +progressive+ or +custom+. | node-health-yellow | 0 | indexterm:[node-health-yellow,Cluster Option] indexterm:[Cluster,Option,node-health-yellow] The score to use for a node health attribute whose value is +yellow+. Only used when +node-health-strategy+ is +progressive+ or +custom+. | node-health-red | 0 | indexterm:[node-health-red,Cluster Option] indexterm:[Cluster,Option,node-health-red] The score to use for a node health attribute whose value is +red+. Only used when +node-health-strategy+ is +progressive+ or +custom+. | remove-after-stop | FALSE | indexterm:[remove-after-stop,Cluster Option] indexterm:[Cluster,Option,remove-after-stop] _Advanced Use Only:_ Should the cluster remove resources from the LRM after they are stopped? Values other than the default are, at best, poorly tested and potentially dangerous. | startup-fencing | TRUE | indexterm:[startup-fencing,Cluster Option] indexterm:[Cluster,Option,startup-fencing] _Advanced Use Only:_ Should the cluster shoot unseen nodes? Not using the default is very unsafe! | election-timeout | 2min | indexterm:[election-timeout,Cluster Option] indexterm:[Cluster,Option,election-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | shutdown-escalation | 20min | indexterm:[shutdown-escalation,Cluster Option] indexterm:[Cluster,Option,shutdown-escalation] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | crmd-integration-timeout | 3min | indexterm:[crmd-integration-timeout,Cluster Option] indexterm:[Cluster,Option,crmd-integration-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | crmd-finalization-timeout | 30min | indexterm:[crmd-finalization-timeout,Cluster Option] indexterm:[Cluster,Option,crmd-finalization-timeout] _Advanced Use Only:_ If you need to adjust this value, it probably indicates the presence of a bug. | crmd-transition-delay | 0s | indexterm:[crmd-transition-delay,Cluster Option] indexterm:[Cluster,Option,crmd-transition-delay] _Advanced Use Only:_ Delay cluster recovery for the configured interval to allow for additional/related events to occur. Useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions. |default-resource-stickiness | 0 | indexterm:[default-resource-stickiness,Cluster Option] indexterm:[Cluster,Option,default-resource-stickiness] _Deprecated:_ See <> instead | is-managed-default | TRUE | indexterm:[is-managed-default,Cluster Option] indexterm:[Cluster,Option,is-managed-default] _Deprecated:_ See <> instead | default-action-timeout | 20s | indexterm:[default-action-timeout,Cluster Option] indexterm:[Cluster,Option,default-action-timeout] _Deprecated:_ See <> instead |========================================================= === Querying and Setting Cluster Options === indexterm:[Querying,Cluster Option] indexterm:[Setting,Cluster Option] indexterm:[Cluster,Querying Options] indexterm:[Cluster,Setting Options] Cluster options can be queried and modified using the `crm_attribute` tool. To get the current value of +cluster-delay+, you can run: ---- # crm_attribute --query --name cluster-delay ---- which is more simply written as ---- # crm_attribute -G -n cluster-delay ---- If a value is found, you'll see a result like this: ---- # crm_attribute -G -n cluster-delay scope=crm_config name=cluster-delay value=60s ---- If no value is found, the tool will display an error: ---- # crm_attribute -G -n clusta-deway scope=crm_config name=clusta-deway value=(null) Error performing operation: No such device or address ---- To use a different value (for example, 30 seconds), simply run: ---- # crm_attribute --name cluster-delay --update 30s ---- To go back to the cluster's default value, you can delete the value, for example: ---- # crm_attribute --name cluster-delay --delete Deleted crm_config option: id=cib-bootstrap-options-cluster-delay name=cluster-delay ---- === When Options are Listed More Than Once === If you ever see something like the following, it means that the option you're modifying is present more than once. .Deleting an option that is listed twice ======= ------ # crm_attribute --name batch-limit --delete Multiple attributes match name=batch-limit in crm_config: Value: 50 (set=cib-bootstrap-options, id=cib-bootstrap-options-batch-limit) Value: 100 (set=custom, id=custom-batch-limit) Please choose from one of the matches above and supply the 'id' with --id ------- ======= In such cases, follow the on-screen instructions to perform the requested action. To determine which value is currently being used by the cluster, refer to <>. diff --git a/doc/Pacemaker_Explained/en-US/Revision_History.xml b/doc/Pacemaker_Explained/en-US/Revision_History.xml index c781a023e8..bdf18a409c 100644 --- a/doc/Pacemaker_Explained/en-US/Revision_History.xml +++ b/doc/Pacemaker_Explained/en-US/Revision_History.xml @@ -1,96 +1,108 @@ Revision History 1-0 19 Oct 2009 AndrewBeekhofandrew@beekhof.net Import from Pages.app 2-0 26 Oct 2009 AndrewBeekhofandrew@beekhof.net Cleanup and reformatting of docbook xml complete 3-0 Tue Nov 12 2009 AndrewBeekhofandrew@beekhof.net Split book into chapters and pass validation Re-organize book for use with Publican 4-0 Mon Oct 8 2012 AndrewBeekhofandrew@beekhof.net Converted to asciidoc (which is converted to docbook for use with Publican) 5-0 Mon Feb 23 2015 KenGaillotkgaillot@redhat.com Update for clarity, stylistic consistency and current command-line syntax 6-0 Tue Dec 8 2015 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.14 7-0 Tue May 3 2016 KenGaillotkgaillot@redhat.com Update for Pacemaker 1.1.15 7-1 Fri Oct 28 2016 KenGaillotkgaillot@redhat.com Overhaul upgrade documentation, and document node health strategies + + 8-0 + Tue Oct 25 2016 + KenGaillotkgaillot@redhat.com + + + + Update for Pacemaker 1.1.16 + + + + diff --git a/doc/Pacemaker_Remote/en-US/Book_Info.xml b/doc/Pacemaker_Remote/en-US/Book_Info.xml index 1e3675b9d1..64e6b32237 100644 --- a/doc/Pacemaker_Remote/en-US/Book_Info.xml +++ b/doc/Pacemaker_Remote/en-US/Book_Info.xml @@ -1,75 +1,75 @@ %BOOK_ENTITIES; ]> Pacemaker Remote Scaling High Availablity Clusters - 6 + 7 0 The document exists as both a reference and deployment guide for the Pacemaker Remote service. The example commands in this document will use: &DISTRO; &DISTRO_VERSION; as the host operating system Pacemaker Remote to perform resource management within guest nodes and remote nodes KVM for virtualization libvirt to manage guest nodes Corosync to provide messaging and membership services on cluster nodes Pacemaker to perform resource management on cluster nodes pcs as the cluster configuration toolset The concepts are the same for other distributions, virtualization platforms, toolsets, and messaging layers, and should be easily adaptable. diff --git a/doc/Pacemaker_Remote/en-US/Ch-Intro.txt b/doc/Pacemaker_Remote/en-US/Ch-Intro.txt index df5e1eae23..72bd15d9d0 100644 --- a/doc/Pacemaker_Remote/en-US/Ch-Intro.txt +++ b/doc/Pacemaker_Remote/en-US/Ch-Intro.txt @@ -1,205 +1,201 @@ = Scaling a Pacemaker Cluster = == Overview == In a basic Pacemaker high-availability cluster,footnote:[See the http://www.clusterlabs.org/doc/[Pacemaker documentation], especially 'Clusters From Scratch' and 'Pacemaker Explained', for basic information about high-availability using Pacemaker] each node runs the full cluster stack of corosync and all Pacemaker components. This allows great flexibility but limits scalability to around 16 nodes. To allow for scalability to dozens or even hundreds of nodes, Pacemaker allows nodes not running the full cluster stack to integrate into the cluster and have the cluster manage their resources as if they were a cluster node. == Terms == cluster node:: A node running the full high-availability stack of corosync and all Pacemaker components. Cluster nodes may run cluster resources, run all Pacemaker command-line tools (`crm_mon`, `crm_resource` and so on), execute fencing actions, count toward cluster quorum, and serve as the cluster's Designated Controller (DC). (((cluster node))) (((node,cluster node))) pacemaker_remote:: A small service daemon that allows a host to be used as a Pacemaker node without running the full cluster stack. Nodes running pacemaker_remote may run cluster resources and most command-line tools, but cannot perform other functions of full cluster nodes such as fencing execution, quorum voting or DC eligibility. The pacemaker_remote daemon is an enhanced version of Pacemaker's local resource management daemon (LRMD). (((pacemaker_remote))) remote node:: A physical host running pacemaker_remote. Remote nodes have a special resource that manages communication with the cluster. This is sometimes referred to as the 'baremetal' case. (((remote node))) (((node,remote node))) guest node:: A virtual host running pacemaker_remote. Guest nodes differ from remote nodes mainly in that the guest node is itself a resource that the cluster manages. (((guest node))) (((node,guest node))) [NOTE] ====== 'Remote' in this document refers to the node not being a part of the underlying corosync cluster. It has nothing to do with physical proximity. Remote nodes and guest nodes are subject to the same latency requirements as cluster nodes, which means they are typically in the same data center. ====== [NOTE] ====== It is important to distinguish the various roles a virtual machine can serve in Pacemaker clusters: * A virtual machine can run the full cluster stack, in which case it is a cluster node and is not itself managed by the cluster. * A virtual machine can be managed by the cluster as a resource, without the cluster having any awareness of the services running inside the virtual machine. The virtual machine is 'opaque' to the cluster. * A virtual machine can be a cluster resource, and run pacemaker_remote to make it a guest node, allowing the cluster to manage services inside it. The virtual machine is 'transparent' to the cluster. ====== == Support in Pacemaker Versions == It is recommended to run Pacemaker 1.1.12 or later when using pacemaker_remote due to important bug fixes. An overview of changes in pacemaker_remote -capability by version: +capability by version (aside from bug fixes, which are included in every +version): + +.1.1.16 +* Support for watchdog-based fencing (sbd) on remote nodes .1.1.15 * If pacemaker_remote is stopped on an active node, it will wait for the cluster to migrate all resources off before exiting, rather than exit immediately and get fenced. -* Bug fixes .1.1.14 * Resources that create guest nodes can be included in groups * reconnect_interval option for remote nodes -* Bug fixes, including a memory leak .1.1.13 * Support for maintenance mode * Remote nodes can recover without being fenced when the cluster node hosting their connection fails * Running pacemaker_remote within LXC environments is deprecated due to newly added Pacemaker support for isolated resources * +#kind+ built-in node attribute for use with rules -* Bug fixes .1.1.12 * Support for permanent node attributes * Support for migration -* Bug fixes .1.1.11 * Support for IPv6 * Support for remote nodes * Support for transient node attributes * Support for clusters with mixed endian architectures -* Bug fixes - -.1.1.10 -* Bug fixes .1.1.9 * Initial version to include pacemaker_remote * Limited to guest nodes in KVM/LXC environments using only IPv4; all nodes' architectures must have same endianness == Guest Nodes == (((guest node))) (((node,guest node))) *"I want a Pacemaker cluster to manage virtual machine resources, but I also want Pacemaker to be able to manage the resources that live within those virtual machines."* Without pacemaker_remote, the possibilities for implementing the above use case have significant limitations: * The cluster stack could be run on the physical hosts only, which loses the ability to monitor resources within the guests. * A separate cluster could be on the virtual guests, which quickly hits scalability issues. * The cluster stack could be run on the guests using the same cluster as the physical hosts, which also hits scalability issues and complicates fencing. With pacemaker_remote: * The physical hosts are cluster nodes (running the full cluster stack). * The virtual machines are guest nodes (running the pacemaker_remote service). Nearly zero configuration is required on the virtual machine. * The cluster stack on the cluster nodes launches the virtual machines and immediately connects to the pacemaker_remote service on them, allowing the virtual machines to integrate into the cluster. The key difference here between the guest nodes and the cluster nodes is that the guest nodes do not run the cluster stack. This means they will never become the DC, initiate fencing actions or participate in quorum voting. On the other hand, this also means that they are not bound to the scalability limits associated with the cluster stack (no 16-node corosync member limits to deal with). That isn't to say that guest nodes can scale indefinitely, but it is known that guest nodes scale horizontally much further than cluster nodes. Other than the quorum limitation, these guest nodes behave just like cluster nodes with respect to resource management. The cluster is fully capable of managing and monitoring resources on each guest node. You can build constraints against guest nodes, put them in standby, or do whatever else you'd expect to be able to do with cluster nodes. They even show up in `crm_mon` output as nodes. To solidify the concept, below is an example that is very similar to an actual deployment we test in our developer environment to verify guest node scalability: * 16 cluster nodes running the full corosync + pacemaker stack * 64 Pacemaker-managed virtual machine resources running pacemaker_remote configured as guest nodes * 64 Pacemaker-managed webserver and database resources configured to run on the 64 guest nodes With this deployment, you would have 64 webservers and databases running on 64 virtual machines on 16 hardware nodes, all of which are managed and monitored by the same Pacemaker deployment. It is known that pacemaker_remote can scale to these lengths and possibly much further depending on the specific scenario. == Remote Nodes == (((remote node))) (((node,remote node))) *"I want my traditional high-availability cluster to scale beyond the limits imposed by the corosync messaging layer."* Ultimately, the primary advantage of remote nodes over cluster nodes is scalability. There are likely some other use cases related to geographically distributed HA clusters that remote nodes may serve a purpose in, but those use cases are not well understood at this point. Like guest nodes, remote nodes will never become the DC, initiate fencing actions or participate in quorum voting. That is not to say, however, that fencing of a remote node works any differently than that of a cluster node. The Pacemaker policy engine understands how to fence remote nodes. As long as a fencing device exists, the cluster is capable of ensuring remote nodes are fenced in the exact same way as cluster nodes. == Expanding the Cluster Stack == With pacemaker_remote, the traditional view of the high-availability stack can be expanded to include a new layer: .Traditional HA Stack image::images/pcmk-ha-cluster-stack.png["Traditional Pacemaker+Corosync Stack",width="17cm",height="9cm",align="center"] .HA Stack With Guest Nodes image::images/pcmk-ha-remote-stack.png["Pacemaker+Corosync Stack With pacemaker_remote",width="20cm",height="10cm",align="center"] diff --git a/doc/Pacemaker_Remote/en-US/Revision_History.xml b/doc/Pacemaker_Remote/en-US/Revision_History.xml index b3d1fd285d..d0ad93af96 100644 --- a/doc/Pacemaker_Remote/en-US/Revision_History.xml +++ b/doc/Pacemaker_Remote/en-US/Revision_History.xml @@ -1,49 +1,55 @@ %BOOK_ENTITIES; ]> Revision History 1-0 Tue Mar 19 2013 DavidVosseldavidvossel@gmail.com Import from Pages.app 2-0 Tue May 13 2013 DavidVosseldavidvossel@gmail.com Added Future Features Section 3-0 Fri Oct 18 2013 DavidVosseldavidvossel@gmail.com Added Baremetal remote-node feature documentation 4-0 Tue Aug 25 2015 KenGaillotkgaillot@redhat.com Targeted CentOS 7.1 and Pacemaker 1.1.12+, updated for current terminology and practice 5-0 Tue Dec 8 2015 KenGaillotkgaillot@redhat.com Updated for Pacemaker 1.1.14 6-0 Tue May 3 2016 KenGaillotkgaillot@redhat.com Updated for Pacemaker 1.1.15 + + 7-0 + Mon Oct 31 2016 + KenGaillotkgaillot@redhat.com + Updated for Pacemaker 1.1.16 + diff --git a/xml/Readme.md b/xml/Readme.md index 88d9137828..fdcf2aef7a 100644 --- a/xml/Readme.md +++ b/xml/Readme.md @@ -1,84 +1,85 @@ # Schema Reference Besides the version of Pacemaker itself, the XML schema of the Pacemaker configuration has its own version. ## Versioned Schema Evolution A versioned schema offers transparent backward/forward compatibility. - It reflects the timeline of schema-backed features (introduction, changes to the syntax, possibly deprecation) through the versioned stable schema increments, while keeping schema versions used by default by older Pacemaker versions untouched. - Pacemaker internally uses the latest stable schema version, and relies on supplemental transformations to promote cluster configurations based on older, incompatible schema versions into the desired form. - It allows experimental features with a possibly unstable configuration interface to be developed using the special `next` version of the schema. ## Mapping Pacemaker Versions to Schema Versions | Pacemaker | Latest Schema | Changed | --------- | ------------- | ---------------------------------------------- +| `1.1.16` | `2.6` | `constraints` | `1.1.15` | `2.5` | `alerts` | `1.1.14` | `2.4` | `fencing` | `1.1.13` | `2.3` | `constraints` | `1.1.12` | `2.0` | `nodes`, `nvset`, `resources`, `tags` + `acls` | `1.1.8`+ | `1.2` | # Updating schema files # ## Experimental features ## Experimental features go into `${base}-next.rng` Create from the most recent `${base}-${X}.${Y}.rng` if it does not already exist ## Stable features ## The current stable version is determined at runtime when __xml_build_schema_list() interrogates the CRM_DTD_DIRECTORY. It will have the form `pacemaker-${X}.${Y}` and the highest `${X}.${Y}` wins. ### Simple Additions When the new syntax is a simple addition to the previous one, create a new entry with `${Y} = ${Yold} + 1` ### Feature Removal or otherwise Incompatible Changes When the new syntax is not a simple addition to the previous one, create a new entry with `${X} = ${Xold} + 1` and `${Y} = 0`. An XSLT file is also required that converts an old syntax to the new one and must be named `upgrade-${Xold}.${Yold}.xsl`. See `xml/upgrade06.xsl` for an example. ### General Proceedure 1. Copy the most recent version of `${base}-*.rng` to `${base}-${X}.${Y}.rng` 1. Commit the copy, eg. `"Clone the latest ${base} schema in preparation for changes"`. This way the actual change will be obvious in the commit history. 1. Modify `${base}-${X}.${Y}.rng` as required 1. Add an XSLT file if required and update `xslt_SCRIPTS` in `xml/Makefile.am` 1. Commit ## Admin Tasks New features will not be available until the admin 1. Updates all the nodes 1. Runs the equivalent of `cibadmin --upgrade` ## Random Notes From the source directory, run `make -C xml diff` to see the changes in the current schema (compared to the previous ones) and also the pending changes in `pacemaker-next`. Alternatively, if the intention is to grok the overall historical schema evolution, use `make -C xml fulldiff`.