diff --git a/dev-guides/ra-dev-guide.txt b/dev-guides/ra-dev-guide.txt index 372129460..3d38852dd 100644 --- a/dev-guides/ra-dev-guide.txt +++ b/dev-guides/ra-dev-guide.txt @@ -1,1339 +1,1339 @@ The OCF Resource Agent Developer's Guide ======================================== Florian Haas == Introduction This document is to serve as a guide and reference for all developers, maintainers, and contributors working on OCF (Open Cluster Framework) compliant cluster resource agents. It explains the anatomy and general functionality of a resource agent, illustrates the resource agent API, and provides valuable hints and tips to resource agent authors. === What is a resource agent? A resource agent is an executable that manages a cluster resource. No formal definition of a cluster resource exists, other than "anything a cluster manages is a resource." Cluster resources can be as diverse as IP addresses, file systems, database services, and entire virtual machines -- to name just a few examples. === Who or what uses a resource agent? Any Open Cluster Framework (OCF) compliant cluster management application is capable of managing resources using the resource agents described in this document. At the time of writing, two OCF compliant cluster management applications exist for the Linux platform: * _Pacemaker_, a cluster manager supporting both the Corosync and Heartbeat cluster messaging frameworks. Pacemaker evolved out of the Linux-HA project. * _RGmanager_, the cluster manager bundled in Red Hat Cluster Suite. It supports the Corosync cluster messaging framework exclusively. === Which language is a resource agent written in? An OCF compliant resource agent can be implemented in _any_ programming language. The API is not language specific. However, most resource agents are implemented as shell scripts, which is why this guide primarily uses example code written in shell language. == API definitions === Environment variables A resource agent receives all configuration information about the resource it manages via environment variables. The names of these environment variables are always the name of the resource parameter, prefixed with +OCF_RESKEY_+. For example, if the resource has an +ip+ parameter set to +192.168.1.1+, then the resource agent will have access to an environment variable +OCF_RESKEY_ip+ holding that value. For any resource parameter that is not required to be set by the user -- that is, its parameter definition in the resource agent metadata does not specify +required="true"+ -- then the resource agent must * Provide a reasonable default. This should be advertised in the metadata. By convention, the resource agent uses a variable named +OCF_RESKEY__default+ that holds this default. * Alternatively, cater correctly for the value being empty. In addition, the cluster manager may also support _meta_ resource parameters. These do not apply directly to the resource configuration, but rather specify _how_ the cluster resource manager is expected to manage the resource. For example, the Pacemaker cluster manager uses the +target-role+ meta parameter to specify whether the resource should be started or stopped. Meta parameters are passed into the resource agent in the +OCF_RESKEY_CRM_meta_+ namespace, with any hypens converted to underscores. Thus, the +target-role+ attribute maps to an environment variable named +OCF_RESKEY_CRM_meta_target_role+. === Actions Any resource agent must support one command-line argument which specifies the action the resource agent is about to execute. The following actions must be supported by any resource agent: * +start+ -- starts the resource. * +stop+ -- shuts down the resource. * +monitor+ -- queries the resource for its state. * +meta-data+ -- dumps the resource agent metadata. In addition, resource agents may optionally support the following actions: * +promote+ -- turns a resource into the +Master+ role (Master/Slave resources only). * +demote+ -- turns a resource into the +Slave+ role (Master/Slave resources only). * +migrate_to+ and +migrate_from+ -- implement live migration of resources. * +validate-all+ -- validates a resource's configuration. * +usage+ or +help+ -- displays a usage message when the resource agent is invoked from the command line, rather than by the cluster manager. * +status+ -- historical (deprecated) synonym for +monitor+. === Return codes For any invocation, resource agents must exit with a defined return code that informs the caller of the outcome of the invoked action. The following return codes are available for any action: * +OCF_SUCCESS+ (0) -- the action completed successfully. * +OCF_ERR_GENERIC+ (1) -- the action returned a generic error. A resource agent should use this exit code only when none of the more specific error codes, defined below, accurately describes the problem. * +OCF_ERR_ARGS+ (2) -- the resource agent was invoked with incorrect arguments. * +OCF_ERR_UNIMPLEMENTED+ (3) -- the resource agent was instructed to execute an optional action that the agent does not implement. * +OCF_ERR_PERM+ (4) -- the action failed due to insufficient permissions. * +OCF_ERR_INSTALLED+ (5) -- the action failed because a required component is missing on the node where the action was executed. * +OCF_ERR_CONFIGURED+ (6) -- the action failed because the user misconfigured the resource. The following addition exit codes may be returned by the +monitor+ action only: * +OCF_NOT_RUNNING+ (7) -- the resource was found not to be running. * +OCF_RUNNING_MASTER+ (8) -- the resource was found to be running in the +Master+ role (Master/Slave resources only). * +OCF_FAILED_MASTER+ (9) -- the resource was found to have failed in the +Master+ role (Master/Slave resources only). === Timeouts Action timeouts are enforced outside the resource agent proper. It is the cluster manager's responsibility to monitor how long a resource agent action has been running, and terminate it if it does not meet its completion deadline. Thus, resource agents need not themselves check for any timeout expiry. They can, however, _advise_ the cluster manager about sensible timeout values. === Metadata Every resource agent must describe its own purpose and supported parameters in a set of XML metadata. This metadata is used by cluster management applications for on-line help, and resource agent man pages are generated from it as well. The following is a fictitious set of metadata from an imaginary resource agent: [source,xml] -------------------------------------------------------------------------- 0.1 This is a fictitious example resource agent written for the OCF Resource Agent Developers Guide. Example resource agent for budding OCF RA developers Number of eggs, an example numeric parameter Number of eggs Enable superfrobnication, an example boolean parameter Enable superfrobnication Data directory, an example string parameter Data directory -------------------------------------------------------------------------- The +resource-agent+ element, of which there must only be one per resource agent, defines the resource agent +name+ and +version+. The +longdesc+ and +shortdesc+ elements in +resource-agent+ provide a long and short description of the resource agent's functionality. While +shortdesc+ is a one-line description of what the resource agent does and is usually used in terse listings, +longdesc+ should give a full-blown description of the resource agent in as much detail as possible. The +parameters+ element describes the resource agent parameters, and should hold any number of +parameter+ children -- one for each parameter that the resource agent supports. Every +parameter+ should, like the +resource-agent+ as a whole, come with a +shortdesc+ and a +longdesc+, and also a +content+ child that describes the parameter's expected content. On the +content+ element, there may be four different attributes: * +type+ describes the parameter type (+string+, +integer+, or +boolean+). If unset, +type+ defaults to +string+. * +required+ indicates whether setting the parameter is mandatory (+required="true"+) or optional (+required="false"+). * For optional parameters, it is customary to provide a sensible default via the +default+ attribute. * Finally, the +unique+ attribute (allowed values: +true+ or +false+) indicates that a specific value must be unique across the cluster, for this parameter of this particular resource type. For example, a highly available floating IP address is declared +unique+ -- as that one IP address should run only once throughout the cluster, avoiding duplicates. The +actions+ list defines the actions that the resource agent advertises as supported. Every +action+ should list its own +timeout+ value. This is a hint to the user what _minimal_ timeout should be configured for the action. This is meant to cater for the fact that some resources are quick to start and stop (IP addresses or filesystems, for example), some may take several minutes to do so (such as databases). In addition, recurring actions (such as +monitor+) should also specify a recommended minimum +interval+, which is the time between two consecutive invocations of the same action. Like +timeout+, this value does not constitute a default -- it is merely a hint for the user which action interval to configure, at minimum. == Resource agent structure A typical (shell-based) resource agent contains standard structural items, in the order as listed in this section. It describes the expected behavior of a resource agent with respect to the various actions it supports, using a fictitous resource agent named +foobar+ as an example. === Resource agent interpreter Any resource agent implemented as a script must specify its interpreter using standard "shebang" (+#!+) header syntax. [source,bash] -------------------------------------------------------------------------- #!/bin/sh -------------------------------------------------------------------------- If a resource agent is written in shell, specifying the generic shell interpreter (+#!/bin/sh+) is generally preferred, though not required. Resource agents declared as +/bin/sh+ compatible must not use constructs native to a specific shell (such as, for example, +${!variable}+ syntax native to +bash+). It is considered a regression to introduce a patch that will make a previously +sh+ compatible resource agent suitable only for +bash+, +ksh+, or any other non-generic shell. It is, however, perfectly acceptable for a new resource agent to explicitly define a specific shell, such as +/bin/bash+, as its interpreter. === Author and license information The resource agent should contain a comment listing the resource agent author(s) and/or copyright holder(s), and stating the license that applies to the resource agent: [source,bash] -------------------------------------------------------------------------- # # Resource Agent for managing foobar resources. # # License: GNU General Public License (GPL) # (c) 2008-2010 John Doe, Jane Roe, # and Linux-HA contributors -------------------------------------------------------------------------- When a resource agent refers to a license for which multiple versions exist, it is assumed that the current version applies. === Initialization Any shell resource agent should source the +.ocf-shellfuncs+ function library. With the syntax below, this is done in terms of +$OCF_FUNCTIONS_DIR+, which -- for testing purposes, and also for generating documentation -- may be overridden from the command line. [source,bash] -------------------------------------------------------------------------- # Initialization: : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat} . ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs -------------------------------------------------------------------------- Defaults for resource agent parameters should be set by initializing variables with the suffix +_default+: [source,bash] -------------------------------------------------------------------------- # Defaults OCF_RESKEY_superfrobnicate_default=0 : ${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}} -------------------------------------------------------------------------- NOTE: The resource agent should make sure that it sets a default for any parameter not marked as +required+ in the metadata. === Functions implementing resource agent actions What follows next are the functions implementing the resource agent's advertised actions. The individual actions are described in detail in <<_resource_agent_behavior>>. === Execution block This is the part of the resource agent that actually executes when the resource agent is invoked. It typically follows a fairly standard structure: [source,bash] -------------------------------------------------------------------------- # Make sure meta-data and usage always succeed case $__OCF_ACTION in meta-data) meta_data # OCF variables are not set when querying meta-data exit 0 ;; usage|help) foobar_usage exit $OCF_SUCCESS ;; esac # Anything other than meta-data and usage must pass validation foobar_validate || exit $? # Translate each action into the appropriate function call case $__OCF_ACTION in start) foobar_start;; stop) foobar_stop;; status|monitor) foobar_monitor;; promote) foobar_promote;; demote) foobar_demote;; reload) ocf_log info "Reloading..." foobar_start ;; validate-all) ;; *) foobar_usage exit $OCF_ERR_UNIMPLEMENTED ;; esac rc=$? # The resource agent may optionally log a debug message ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION returned $rc" exit $rc -------------------------------------------------------------------------- == Resource agent behavior Each action is typically implemented in a separate function or method in the resource agent. By convention, these are usually named +_+, so the function implementing the +start+ action in +foobar+ would be named +foobar_start()+. As a general rule, whenever the resource agent encounters an error -that is not able to recover, it is permitted to immediately exit, +that it is not able to recover, it is permitted to immediately exit, throw an exception, or otherwise cease execution. Examples for this include configuration issues, missing binaries, permission problems, etc. It is not necessary to pass these errors up the call stack. It is the cluster manager's responsibility to initiate the appropriate recovery action based on the user's configuration. The resource agent should not guess at said configuration. === +start+ action When invoked with the +start+ action, the resource agent must start the resource if it is not yet running. This means that the agent must verify the resource's configuration, query its state, and then start it only if it is not running. A common way of doing this would be to invoke the +validate_all+ and +monitor+ function first, as in the following example: [source,bash] -------------------------------------------------------------------------- foobar_start() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # if resource is already running, bail out early if foobar_monitor; then ocf_log info "Resource is already running" return $OCF_SUCCESS fi # actually start up the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ... # After the resource has been started, check whether it started up # correctly. If the resource starts asynchronously, the agent may # spin on the monitor function here -- if the resource does not # start up within the defined timeout, the cluster manager will # consider the start action failed while ! foobar_monitor; do ocf_log debug "Resource has not started yet, waiting" sleep 1 done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- === +stop+ action When invoked with the +stop+ action, the resource agent must stop the resource, if it is running. This means that the agent must verify the resource configuration, query its state, and then stop it only if it is currently running. A common way of doing this would be to invoke the +validate_all+ and +monitor+ function first. It is important to understand that +stop+ is a force operation -- the resource agent must do everything in its power to shut down, the resource, short of rebooting the node or shutting it off. Consider the following example: [source,bash] -------------------------------------------------------------------------- foobar_stop() { local rc # exit immediately if configuration is not valid foobar_validate_all || exit $? foobar_monitor rc=$? case "$rc" in) "$OCF_SUCCESS") # Currently running. Normal, expected behavior. ocf_log debug "Resource is currently running" ;; "$OCF_RUNNING_MASTER") # Running as a Master. Need to demote before stopping. ocf_log info "Resource is currently running as Master" foobar_demote ;; "$OCF_NOT_RUNNING") # Currently not running. Nothing to do. ocf_log info "Resource is already stopped" return $OCF_SUCCESS ;; esac # actually shut down the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ... # After the resource has been stopped, check whether it shut down # correctly. If the resource stops asynchronously, the agent may # spin on the monitor function here -- if the resource does not # shut down within the defined timeout, the cluster manager will # consider the stop action failed while foobar_monitor; do ocf_log debug "Resource has not stopped yet, waiting" sleep 1 done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- NOTE: The expected exit code for a successful stop operation is +$OCF_SUCCESS+, _not_ +$OCF_NOT_RUNNING+. === +monitor+ action The +monitor+ action queries the current status of a resource. It must discern between three different states: * resource is currently running (return +$OCF_SUCCESS+); * resource has stopped gracefully (return +$OCF_NOT_RUNNING+); * resource has run into a problem and must be considered failed (return the appropriate +$OCF_ERR_+ code to indicate the nature of the problem). [source,bash] -------------------------------------------------------------------------- foobar_monitor() { local rc # exit immediately if configuration is not valid foobar_validate_all || exit $? ocf_run frobnicate --test # This example assumes the following exit code convention # for frobnicate: # 0: running, and fully caught up with master # 1: gracefully stopped # any other: error case "$?" in 0) rc=$OCF_SUCCESS ocf_log debug "Resource is running" ;; 1) rc=$OCF_NOT_RUNNING ocf_log debug "Resource is not running" ;; *) ocf_log err "Resource has failed" exit $OCF_ERR_GENERIC esac return $rc } -------------------------------------------------------------------------- Stateful (master/slave) resource agents may use a more elaborate monitoring scheme where they can provide "hints" to the cluster manager identifying which instance is best suited to assume the +Master+ role. <<_specifying_a_master_preference>> explains the details. NOTE: The cluster manager may invoke the +monitor+ action for a _probe_, which is a test whether the resource is currently running. Normally, the monitor operation would behave exactly the same during a probe and a "real" monitor action. If a specific resource does require special treatment for probes, however, the +ocf_is_probe+ convenience function is available in the OCF shell functions library for that purpose. === +validate-all+ action The +validate-all+ action tests for correct resource agent configuration and a working environment. +validate-all+ should exit with one of the following return codes: * +$OCF_SUCCESS+ -- all is well, the configuration is valid and usable. * +$OCF_ERR_CONFIGURED+ -- the user has misconfigured the resource. * +$OCF_ERR_INSTALLED+ -- the resource has possibly been configured correctly, but a vital component is missing on the node where +validate-all+ is being executed. * +$OCF_ERR_PERM+ -- the resource is configured correctly and is not missing any required components, but is suffering from a permission issue (such as not being able to create a necessary file). +validate-all+ is usually wrapped in a function that is not only called when explicitly invoking the corresponding action, but also -- as a sanity check -- from just about any other function. Therefore, the resource agent author must keep in mind that the function may be invoked during the +start+, +stop+, and +monitor+ operations, and also during probes. Probes pose a separate challenge for validation. During a probe (when the cluster manager may expect the resource _not_ to be running on the node where the probe is executed), some required components may be _expected_ to not be available on the affected node. For example, this includes any shared data on storage devices not available for reading during the probe. The +validate-all+ function may thus need to treat probes specially, using the +ocf_is_probe+ convenience function: [source,bash] -------------------------------------------------------------------------- foobar_validate_all() { # Test for configuration errors first if ! ocf_is_decimal $OCF_RESKEY_eggs; then ocf_log err "eggs is not numeric!" exit $OCF_ERR_CONFIGURED fi # Test for required binaries check_binary frobnicate # Check for data directory (this may be on shared storage, so # disable this test during probes) if ! ocf_is_probe; then if ! [ -d $OCF_RESKEY_datadir ]; then ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!" exit $OCF_ERR_INSTALLED fi fi return $OCF_SUCCESS } -------------------------------------------------------------------------- === +meta-data+ action The +meta-data+ action dumps the resource agent metadata to standard output. The output must follow the metadata format as specified in <<_metadata>>. [source,bash] -------------------------------------------------------------------------- foobar_meta_data { cat < 0.1 ... EOF } -------------------------------------------------------------------------- === +promote+ action The +promote+ action is optional. It must only be supported by _stateful_ resource agents, which means agents that discern between two distinct _roles_: +Master+ and +Slave+. +Slave+ is functionally identical to the +Started+ state in a stateless resource agent. Thus, while a regular (stateless) resource agent only needs to implement +start+ and +stop+, a stateful resource agent must also support the +promote+ action to be able to make a transition between the +Started+ (+Slave+) and +Master+ roles. [source,bash] -------------------------------------------------------------------------- foobar_promote() { local rc # exit immediately if configuration is not valid foobar_validate_all || exit $? # test the resource's current state foobar_monitor rc=$? case "$rc" in) "$OCF_SUCCESS") # Running as slave. Normal, expected behavior. ocf_log debug "Resource is currently running as Slave" ;; "$OCF_RUNNING_MASTER") # Already a master. Unexpected, but not a problem. ocf_log info "Resource is already running as Master" return $OCF_SUCCESS ;; "$OCF_NOT_RUNNING") # Currently not running. Need to start before promoting. ocf_log info "Resource is currently not running" foobar_start ;; *) # Failed resource. Let the cluster manager recover. ocf_log err "Unexpected error, cannot promote" exit $rc ;; esac # actually promote the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC # After the resource has been promoted, check whether the # promotion worked. If the resource promotion is asynchronous, the # agent may spin on the monitor function here -- if the resource # does not assume the Master role within the defined timeout, the # cluster manager will consider the promote action failed. while true; do foobar_monitor if [ $? -eq $OCF_RUNNING_MASTER ]; then ocf_log debug "Resource promoted" break else ocf_log debug "Resource still awaiting promotion" sleep 1 fi done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- === +demote+ action The +demote+ action is optional. It must only be supported by _stateful_ resource agents, which means agents that discern between two distict _roles_: +Master+ and +Slave+. +Slave+ is functionally identical to the +Started+ state in a stateless resource agent. Thus, while a regular (stateless) resource agent only needs to implement +start+ and +stop+, a stateful resource agent must also support the +demote+ action to be able to make a transition between the +Master+ and +Started+ (+Slave+) roles. [source,bash] -------------------------------------------------------------------------- foobar_demote() { local rc # exit immediately if configuration is not valid foobar_validate_all || exit $? # test the resource's current state foobar_monitor rc=$? case "$rc" in) "$OCF_RUNNING_MASTER") # Running as master. Normal, expected behavior. ocf_log debug "Resource is currently running as Master" ;; "$OCF_SUCCESS") # Alread running as slave. Nothing to do. ocf_log debug "Resource is currently running as Slave" return $OCF_SUCCESS ;; "$OCF_NOT_RUNNING") # Currently not running. Getting a demote action # in this state is unexpected. Exit with an error # and let the cluster manager recover. ocf_log err "Resource is currently not running" exit $OCF_ERR_GENERIC ;; *) # Failed resource. Let the cluster manager recover. ocf_log err "Unexpected error, cannot demote" exit $rc ;; esac # actually promote the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC # After the resource has been promoted, check whether the # promotion worked. If the resource promotion is asynchronous, the # agent may spin on the monitor function here -- if the resource # does not assume the Master role within the defined timeout, the # cluster manager will consider the promote action failed. while true; do foobar_monitor if [ $? -eq $OCF_RUNNING_MASTER ]; then ocf_log debug "Resource still awaiting promotion" sleep 1 else ocf_log debug "Resource demoted" break fi done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- === +migrate_to+ action The +migrate_to+ action can serve one of two purposes: * Initiate a native _push_ type migration for the resource. In other words, instruct the resource to move _to_ a specific node from the node it is currently running on. The resource agent knows about its destination node via the +$OCF_RESKEY_CRM_meta_migrate_target+ environment variable. * Freeze the resource in a _freeze/thaw_ (also known as _suspend/resume_) type migration. In this mode, the resource does not need any information about its destination node at this point. The example below illustrates a push type migration: [source,bash] -------------------------------------------------------------------------- foobar_migrate_to() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # if resource is not running, bail out early if ! foobar_monitor; then ocf_log err "Resource is not running" exit $OCF_ERR_GENERIC fi # actually start up the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ocf_run frobnicate --migrate \ --dest=$OCF_RESKEY_CRM_meta_migrate_target \ || exit OCF_ERR_GENERIC ... # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- In contrast, a freeze/thaw type migration may implement its freeze operation like this: [source,bash] -------------------------------------------------------------------------- foobar_migrate_to() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # if resource is not running, bail out early if ! foobar_monitor; then ocf_log err "Resource is not running" exit $OCF_ERR_GENERIC fi # actually start up the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ocf_run frobnicate --freeze || exit OCF_ERR_GENERIC ... # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- === +migrate_from+ action The +migrate_from+ action can serve one of two purposes: * Complete a native _push_ type migration for the resource. In other words, check whether the migration has succeeded properly, and the resource is running on the local node. The resource agent knows about its the migration source via the +$OCF_RESKEY_CRM_meta_migrate_source+ environment variable. * Thaw the resource in a _freeze/thaw_ (also known as _suspend/resume_) type migration. In this mode, the resource usually not need any information about its source node at this point. The example below illustrates a push type migration: [source,bash] -------------------------------------------------------------------------- foobar_migrate_from() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # After the resource has been migrated, check whether it resumed # correctly. If the resource starts asynchronously, the agent may # spin on the monitor function here -- if the resource does not # run within the defined timeout, the cluster manager will # consider the migrate_from action failed while ! foobar_monitor; do ocf_log debug "Resource has not yet migrated, waiting" sleep 1 done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- In contrast, a freeze/thaw type migration may implement its thaw operation like this: [source,bash] -------------------------------------------------------------------------- foobar_migrate_from() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # actually start up the resource here (make sure to immediately # exit with an $OCF_ERR_ error code if anything goes seriously # wrong) ocf_run frobnicate --thaw || exit OCF_ERR_GENERIC # After the resource has been migrated, check whether it resumed # correctly. If the resource starts asynchronously, the agent may # spin on the monitor function here -- if the resource does not # run within the defined timeout, the cluster manager will # consider the migrate_from action failed while ! foobar_monitor; do ocf_log debug "Resource has not yet migrated, waiting" sleep 1 done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- === +notify+ action With notifications, instances of clones (and of master/slave resources, which are an extended kind of clones) can inform each other about their state. When notifications are enabled, any action on any instance of a clone carries a +pre+ and +post+ notification. Then, the cluster manager invokes the +notify+ operation on _all_ clone instances. For +notify+ operations, additional environment variables are passed into the resource agent during execution: * +$OCF_RESKEY_CRM_meta_notify_type+ -- the notification type (+pre+ or +post+) * +$OCF_RESKEY_CRM_meta_notify_operation+ -- the operation (action) that the notification is about (+start+, +stop+, +promote+, +demote+ etc.) * +$OCF_RESKEY_CRM_meta_start_uname+ -- node name of the node where the resource is being started (+start+ notifications only) * +$OCF_RESKEY_CRM_meta_stop_uname+ -- node name of the node where the resource is being stopped (+stop+ notifications only) * +$OCF_RESKEY_CRM_meta_master_uname+ -- node name of the node where the resource currently _is in_ the Master role * +$OCF_RESKEY_CRM_meta_promote_uname+ -- node name of the node where the resource currently _is being promoted to_ the Master role (+promote+ notifications only) * +$OCF_RESKEY_CRM_meta_demote_uname+ -- node name of the node where the resource currently _is being demoted to_ the Slave role (+demote+ notifications only) Notifications come in particularly handy for master/slave resources using a "pull" scheme, where the master is a publisher and the slave a subscriber. Since the master is obviously only available as such when a promotion has occurred, the slaves can use a "pre-promote" notification to configure themselves to subscribe to the right publisher. Likewise, the subscribers may want to unsubscribe from the publisher after it has relinquished its master status, and a "post-demote" notification can be used for that purpose. Consider the example below to illustrate the concept. [source,bash] -------------------------------------------------------------------------- foobar_notify() { local type_op type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}" ocf_log debug "Received $type_op notification." case "$type_op" in 'pre-promote') ocf_run frobnicate --slave-mode \ --master=$OCF_RESKEY_CRM_meta_notify_promote_uname \ || exit $OCF_ERR_GENERIC ;; 'post-demote') ocf_run frobnicate --unset-slave-mode || exit $OCF_ERR_GENERIC ;; esac return $OCF_SUCCESS } -------------------------------------------------------------------------- NOTE: A master/slave resource agent may support a _multi-master_ configuration, where there is possibly more than one master at any given time. If that is the case, then the +$OCF_RESKEY_CRM_meta_*_uname+ variables may each contain a space-separated lists of hostnames, rather than a single host name as shown in the example. Under those circumstances the resource agent would have to properly iterate over this list. == Script variables This section outlines variables typically available to resource agents, primarily for convenience purposes. For additional variables available while the agent is being executed, refer to <<_environment_variables>> and <<_return_codes>>. === +$OCF_ROOT+ The root of the OCF resource agent hierarchy. This should never be changed by a resource agent. This is usually +/usr/lib/ocf+. === +$OCF_FUNCTIONS_DIR+ The directory where the resource agents shell function library, +.ocf-shellfuncs+, resides. This is usually defined in terms of +$OCF_ROOT+ and should never be changed by a resource agent. This variable may, however, be overridden from the command line while testing a new or modified resource agent. === +$OCF_RESOURCE_INSTANCE+ The resource instance name. For primitive (non-clone, non-stateful) resources, this is simply the resource name. For clones and stateful resources, this is the primitive name, followed by a colon an the clone instance number (such as +p_foobar:0+). === +$__OCF_ACTION+ The currently invoked action. This is exactly the first command-line argument that the cluster manager specifies when it invokes the resource agent. === +$__SCRIPT_NAME+ The name of the resource agent. This is exactly the base name of the resource agent script, with leading directory names removed. === +$HA_RSCTMP+ A temporary directory for use by resource agents. The cluster manager guarantees that this directory is emptied on system startup, so this directory will not contain any stale data after a node reboot. == Convenience functions === Logging: +ocf_log+ Resource agents should use the +ocf_log+ function for logging purposes. This convenient logging wrapper is invoked as follows: [source,bash] -------------------------------------------------------------------------- ocf_log "Log message" -------------------------------------------------------------------------- It supports following the following severity levels: * +debug+ -- for debugging messages. Most logging configurations suppress this level by default. * +info+ -- for informational messages about the agent's behavior or status. * +warn+ -- for warnings. This is for any messages which reflect unexpected behavior that does _not_ constitute an unrecoverable error. * +err+ -- for errors. As a general rule, this logging level should only be used immediately prior to an +exit+ with the appropriate error code. * +crit+ -- for critical errors. As with +err+, this logging level should not be used unless the resource agent also exits with an error code. Very rarely used. === Testing for binaries: +have_binary+ and +check_binary+ A resource agent may need to test for the availability of a specific executable. The +have_binary+ convenience function comes in handy here: [source,bash] -------------------------------------------------------------------------- if ! have_binary frobnicate; then ocf_log warn "Missing frobnicate binary, frobnication disabled!" fi -------------------------------------------------------------------------- If a missing binary is a fatal problem for the resource, then the +check_binary+ function should be used: [source,bash] -------------------------------------------------------------------------- check_binary frobnicate -------------------------------------------------------------------------- Using +check_binary+ is a shorthand method for testing for the existence (and executability) of the specified binary, and exiting with +$OCF_ERR_INSTALLED+ if it cannot be found or executed. NOTE: Both +have_binary+ and +check_binary+ honor +$PATH+ when the binary to test for is not specified as a full path. It is usually wise to _not_ test for a full path, as binary installations path may vary by distribution or user policy. === Executing commands and capturing their output: +ocf_run+ Whenever a resource agent needs to execute a command and capture its output, it should use the +ocf_run+ convenience function, invoked as in this example: [source,bash] -------------------------------------------------------------------------- ocf_run "frobnicate --spam=eggs" || exit $OCF_ERR_GENERIC -------------------------------------------------------------------------- With the command specified above, the resource agent will invoke +frobnicate --spam=eggs+ and capture its output and exit code. If the exit code is nonzero (indicating an error), +ocf_run+ logs the command output with the +err+ logging severity, and the resource agent subsequently exits. If the resource agent wishes to capture the output of _both_ a successful and a failed command execution, it can use the +-v+ flag with +ocf_run+. In the example below, +ocf_run+ will log any output from the command with the +info+ severity if the command exit code is zero (indicating success), and with +err+ if it is nonzero. [source,bash] -------------------------------------------------------------------------- ocf_run -v "frobnicate --spam=eggs" || exit $OCF_ERR_GENERIC -------------------------------------------------------------------------- Finally, if the resource agent wants to log the output of a command with a nonzero exit code with a severity _other_ than error, it may do so by adding the +-info+ or +-warn+ option to +ocf_run+: [source,bash] -------------------------------------------------------------------------- ocf_run -warn "frobnicate --spam=eggs" -------------------------------------------------------------------------- === Locks: +ocf_take_lock+ and +ocf_release_lock_on_exit+ Occasionally, there may be different resources of the same type in a cluster configuration that should not execute actions in parallel. When a resource agent needs to guard against parallel execution on the same machine, it can use the +ocf_take_lock+ and +ocf_release_lock_on_exit+ convenience functions: [source,bash] -------------------------------------------------------------------------- LOCKFILE=${HA_RSCTMP}/foobar ocf_release_lock_on_exit $LOCKFILE foobar_start() { ... ocf_take_lock $LOCKFILE ... } -------------------------------------------------------------------------- +ocf_take_lock+ attempts to acquire the designated +$LOCKFILE+. When it is unavailable, it sleeps a random amount of time between 0 and 1 seconds, and retries. +ocf_release_lock_on_exit+ releases the lock file when the agent exits (for any reason). === Testing for numerical values: +ocf_is_decimal+ Specifically for parameter validation, it can be helpful to test whether a given value is numeric. The +ocf_is_decimal+ function exists for that purpose: -------------------------------------------------------------------------- foobar_validate_all() { if ! ocf_is_decimal $OCF_RESKEY_eggs; then ocf_log err "eggs is not numeric!" exit $OCF_ERR_CONFIGURED fi ... } -------------------------------------------------------------------------- === Testing for boolean values: +ocf_is_true+ When a resource agent defines a boolean parameter, the value for this parameter may be specified by the user as +0+/+1+, +true+/+false+, or +on+/+off+. Since it is tedious to test for all these values from within the resource agent, the agent should instead use the +ocf_is_true+ convenience function: [source,bash] -------------------------------------------------------------------------- if ocf_is_true $OCF_RESKEY_superfrobnicate; then ocf_run "frobnicate --super" fi -------------------------------------------------------------------------- NOTE: If +ocf_is_true+ is used against an empty or non-existant variable, it always returns an exit code of +1+, which is equivalent to +false+. === Pseudo resources: +ha_pseudo_resource+ "Pseudo resources" are those where the resource agent in fact does not actually start or stop something akin to a runnable process, but merely executes a single action and then needs some form of tracing whether that action has been executed or not. The +portblock+ resource agent is an example of this. Resource agents for pseudo resources can use a convenience function, +ha_pseudo_resource+, which makes use of _tracking files_ to keep tabs on the status of a resource. If +foobar+ was designed to manage a pseudo resource, then its +start+ action could look like this: [source,bash] -------------------------------------------------------------------------- foobar_start() { # exit immediately if configuration is not valid foobar_validate_all || exit $? # if resource is already running, bail out early if foobar_monitor; then ocf_log info "Resource is already running" return $OCF_SUCCESS fi # start the pseudo resource ha_pseudo_resource ${OCF_RESOURCE_INSTANCE} start # After the resource has been started, check whether it started up # correctly. If the resource starts asynchronously, the agent may # spin on the monitor function here -- if the resource does not # start up within the defined timeout, the cluster manager will # consider the start action failed while ! foobar_monitor; do ocf_log debug "Resource has not started yet, waiting" sleep 1 done # only return $OCF_SUCCESS if _everything_ succeeded as expected return $OCF_SUCCESS } -------------------------------------------------------------------------- == Special considerations === Licensing Whenever possible, resource agent contributors are _encouraged_ to use the GNU General Public License (GPL), version 2 and later, for any new resource agents. The shell functions library does not strictly mandate this, however, as it is licensed under the GNU Lesser General Public License (LGPL), version 2.1 and later (so it can be used by non-GPL agents). The resource agent _must_ explicitly state its own license in the agent source code. === Locale settings When sourcing +.ocf-shellfuncs+ as explained in <<_initialization>>, any resource agent automatically sets +LANG+ and +LC_ALL+ to the +C+ locale. Resource agents can thus expect to always operate in the +C+ locale, and need not reset +LANG+ or any of the +LC_+ environment variables themselves. === Specifying a master preference Stateful (master/slave) resources may influence their own _master preference_ -- they can provide hints to the cluster manager which is the the best instance to promote to the +Master+ role. For this purpose, +crm_master+ comes in handy. This convenience wrapper around the +crm_attribute+ sets a node attribute named +master-<<_literal_ocf_resource_instance_literal,$OCF_RESOURCE_INSTANCE>>+ for the node it is being executed on, and fills this attribute with the specified value. The cluster manager is then expected to translate this into a promotion score for the corresponding instance, and base its promotion preference on that score. Stateful resource agents typically execute +crm_master+ during the <<_literal_monitor_literal_action,+monitor+>> and/or <<_literal_notify_literal_action,+notify+>> action. The following example assumes that the +foobar+ resource agent can test the application's status by executing a binary that returns certain exit codes based on whether * the resource is either in the master role, or is a slave that is fully caught up with the master (at any rate, it has current data), or * the resource is in the slave role, but through some form of asynchronous replication has "fallen behind" the master, or * the resource has gracefully stopped, or * the resource has unexpectedly failed. [source,bash] -------------------------------------------------------------------------- foobar_monitor() { local rc # exit immediately if configuration is not valid foobar_validate_all || exit $? ocf_run frobnicate --test # This example assumes the following exit code convention # for frobnicate: # 0: running, and fully caught up with master # 1: gracefully stopped # 2: running, but lagging behind master # any other: error case "$?" in 0) rc=$OCF_SUCCESS ocf_log debug "Resource is running" # Set a high master preference. The current master # will always get this, plus 1. Any current slaves # will get a high preference so that if the master # fails, they are next in line to take over. crm_master -l reboot -v 100 ;; 1) rc=$OCF_NOT_RUNNING ocf_log debug "Resource is not running" # Remove the master preference for this node crm_master -l reboot -D ;; 2) rc=$OCF_SUCCESS ocf_log debug "Resource is lagging behind master" # Set a low master preference: if the master fails # right now, and there is another slave that does # not lag behind the master, its higher master # preference will win and that slave will become # the new master crm_master -l reboot -v 5 ;; *) ocf_log err "Resource has failed" exit $OCF_ERR_GENERIC esac return $rc } --------------------------------------------------------------------------