diff --git a/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt b/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt index 02525d6f08..a3c02cbc4c 100644 --- a/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt +++ b/doc/Pacemaker_Explained/en-US/Ch-Stonith.txt @@ -1,833 +1,859 @@ = STONITH = //// We prefer [[ch-stonith]], but older versions of asciidoc don't deal well with that construct for chapter headings //// anchor:ch-stonith[Chapter 13, STONITH] indexterm:[STONITH, Configuration] == What Is STONITH? == STONITH (an acronym for "Shoot The Other Node In The Head"), also called 'fencing', protects your data from being corrupted by rogue nodes or concurrent access. Just because a node is unresponsive, this doesn't mean it isn't accessing your data. The only way to be 100% sure that your data is safe, is to use STONITH so we can be certain that the node is truly offline, before allowing the data to be accessed from another node. STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the service elsewhere. == What STONITH Device Should You Use? == It is crucial that the STONITH device can allow the cluster to differentiate between a node failure and a network one. The biggest mistake people make in choosing a STONITH device is to use a remote power switch (such as many on-board IPMI controllers) that shares power with the node it controls. In such cases, the cluster cannot be sure if the node is really offline, or active and suffering from a network fault. Likewise, any device that relies on the machine being active (such as SSH-based "devices" used during testing) is inappropriate. == Special Treatment of STONITH Resources == STONITH resources are somewhat special in Pacemaker. STONITH may be initiated by pacemaker or by other parts of the cluster (such as resources like DRBD or DLM).
To accommodate this, pacemaker does not require the STONITH resource to be in the 'started' state in order to be used, thus allowing reliable use of STONITH devices in such a case. [NOTE] ==== In pacemaker versions 1.1.9 and earlier, this feature either did not exist or did not work well. Only "running" STONITH resources could be used by Pacemaker for fencing, and if another component tried to fence a node while Pacemaker was moving STONITH resources, the fencing could fail. ==== All nodes have access to STONITH devices' definitions and instantiate them on-the-fly when needed, but preference is given to 'verified' instances, which are the ones that are 'started' according to the cluster's knowledge. In the case of a cluster split, the partition with a verified instance will have a slight advantage, because the STONITH daemon in the other partition will have to hear from all its current peers before choosing a node to perform the fencing. Fencing resources do work the same as regular resources in some respects: * +target-role+ can be used to enable or disable the resource * Location constraints can be used to prevent a specific node from using the resource [IMPORTANT] =========== Currently there is a limitation that fencing resources may only have one set of meta-attributes and one set of instance attributes. This can be revisited if it becomes a significant limitation for people. 
=========== .Properties of Fencing Resources [width="95%",cols="5m,2,3,10 ---- ==== Based on that, we would create a STONITH resource fragment that might look like this: .An IPMI-based STONITH Resource ==== [source,XML] ---- ---- ==== Finally, we need to enable STONITH: ---- # crm_attribute -t crm_config -n stonith-enabled -v true ---- == Advanced STONITH Configurations == Some people consider that having one fencing device is a single point of failure footnote:[Not true, since a node or resource must fail before fencing even has a chance to]; others prefer removing the node from the storage and network instead of turning it off. Whatever the reason, Pacemaker supports fencing nodes with multiple devices through a feature called 'fencing topologies'. Simply create the individual devices as you normally would, then define one or more +fencing-level+ entries in the +fencing-topology+ section of the configuration. * Each fencing level is attempted in order of ascending +index+. * If a device fails, processing terminates for the current level. No further devices in that level are exercised, and the next level is attempted instead. * If the operation succeeds for all the listed devices in a level, the level is deemed to have passed. * The operation is finished when a level has passed (success), or all levels have been attempted (failed). * If the operation failed, the next step is determined by the Policy Engine and/or `crmd`. 
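The level-processing rules above can be sketched in C. This is only a minimal illustration under stated assumptions, not Pacemaker's implementation: `struct fence_level`, `run_device()` and `run_topology()` are hypothetical names invented for this sketch, the levels are assumed to be already sorted by ascending +index+, and the stub device call simply fails for any device whose name starts with `ipmi`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVICES 4

/* Hypothetical model of one fencing level: its index and its device list */
struct fence_level {
    int index;
    const char *devices[MAX_DEVICES];
    int n_devices;
};

/* Stub standing in for a real fencing agent call.
 * Assumption for this sketch only: devices named "ipmi*" always fail. */
static bool run_device(const char *device)
{
    bool ok = (strncmp(device, "ipmi", 4) != 0);

    printf("  device %s: %s\n", device, ok ? "ok" : "FAILED");
    return ok;
}

/* Try each level in order; a level passes only if every device succeeds.
 * Returns the index of the first level that passed, or -1 if none did.
 * The caller must supply levels sorted by ascending index. */
static int run_topology(const struct fence_level *levels, int n_levels)
{
    assert(levels != NULL);

    for (int i = 0; i < n_levels; i++) {
        bool passed = true;

        printf("level %d\n", levels[i].index);
        for (int d = 0; d < levels[i].n_devices; d++) {
            if (!run_device(levels[i].devices[d])) {
                /* First failure ends this level; skip its other devices */
                passed = false;
                break;
            }
        }
        if (passed) {
            return levels[i].index;   /* operation finished: success */
        }
    }
    return -1;                        /* all levels attempted: failed */
}
```

With a level 1 holding a single `ipmi_node1` device and a level 2 holding `pdu1_node1` and `pdu2_node1`, the stub makes level 1 fail and level 2 pass, so `run_topology()` would return 2. What happens after a return of -1 is outside this sketch; as noted above, that decision belongs to the Policy Engine and/or `crmd`.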
Some possible uses of topologies include: * Try poison-pill and fall back to power * Try disk and network, and fall back to power if either fails * Initiate a kdump and then power off the node .Properties of Fencing Levels [width="95%",cols="1m,6<",options="header",align="center"] |========================================================= |Field |Description |id |A unique name for the level indexterm:[id,fencing-level] indexterm:[Fencing,fencing-level,id] |target |The node to which this level applies indexterm:[target,fencing-level] indexterm:[Fencing,fencing-level,target] |index |The order in which to attempt the levels. Levels are attempted in ascending order 'until one succeeds'. indexterm:[index,fencing-level] indexterm:[Fencing,fencing-level,index] |devices |A comma-separated list of devices that must all be tried for this level indexterm:[devices,fencing-level] indexterm:[Fencing,fencing-level,devices] |========================================================= .Fencing topology with different devices for different nodes ==== [source,XML] ---- ... ... 
---- ==== === Example Dual-Layer, Dual-Device Fencing Topologies === The following example illustrates an advanced use of +fencing-topology+ in a cluster with the following properties: * 3 nodes (2 active prod-mysql nodes, 1 prod_mysql-rep in standby for quorum purposes) * the active nodes have an IPMI-controlled power board reached at 192.0.2.1 and 192.0.2.2 * the active nodes also have two independent PSUs (Power Supply Units) connected to two independent PDUs (Power Distribution Units) reached at 198.51.100.1 (port 10 and port 11) and 203.0.113.1 (port 10 and port 11) * the first fencing method uses the `fence_ipmi` agent * the second fencing method uses the `fence_apc_snmp` agent targeting 2 fencing devices (one per PSU, either port 10 or 11) * fencing is only implemented for the active nodes and has location constraints * fencing topology is set to try IPMI fencing first then default to a "sure-kill" dual PDU fencing In a normal failure scenario, STONITH will first select +fence_ipmi+ to try to kill the faulty node. Using a fencing topology, if that first method fails, STONITH will then move on to selecting +fence_apc_snmp+ twice: * once for the first PDU * again for the second PDU The fence action is considered successful only if both PDUs report the required status. If any of them fails, STONITH loops back to the first fencing method, +fence_ipmi+, and so on until the node is fenced or the fencing action is cancelled. .First fencing method: single IPMI device Each cluster node has its own dedicated IPMI channel that can be called for fencing using the following primitives: [source,XML] ---- ---- .Second fencing method: dual PDU devices Each cluster node also has two distinct power channels controlled by two distinct PDUs. 
That means a total of 4 fencing devices configured as follows: - Node 1, PDU 1, PSU 1 @ port 10 - Node 1, PDU 2, PSU 2 @ port 10 - Node 2, PDU 1, PSU 1 @ port 11 - Node 2, PDU 2, PSU 2 @ port 11 The matching fencing agents are configured as follows: [source,XML] ---- ---- .Location Constraints To prevent STONITH from trying to run a fencing agent on the same node it is supposed to fence, constraints are placed on all the fencing primitives: [source,XML] ---- ---- .Fencing topology Now that all the fencing resources are defined, it's time to create the right topology. We want to first fence using IPMI and if that does not work, fence both PDUs to effectively and surely kill the node. [source,XML] ---- ---- Please note, in +fencing-topology+, the lowest +index+ value determines the priority of the first fencing method. .Final configuration Put together, the configuration looks like this: [source,XML] ---- ... ... ---- + +== Remapping Reboots == + +When the cluster needs to reboot a node, whether because +stonith-action+ is +reboot+ or because +a reboot was manually requested (such as by `stonith_admin --reboot`), it will remap that to +other commands in two cases: + +. If the chosen fencing device does not support the +reboot+ command, the cluster + will ask it to perform +off+ instead. + +. If a fencing topology level with multiple devices must be executed, the cluster + will ask all the devices to perform +off+, then ask the devices to perform +on+. + +To understand the second case, consider the example of a node with redundant +power supplies connected to intelligent power switches. Rebooting one switch +and then the other would have no effect on the node. Turning both switches off, +and then on, actually reboots the node. + +In such a case, the fencing operation will be treated as successful as long as +the +off+ commands succeed, because then it is safe for the cluster to recover +any resources that were on the node. 
Timeouts and errors in the +on+ phase will +be logged but ignored. + +When a reboot operation is remapped, any action-specific timeout for the +remapped action will be used (for example, +pcmk_off_timeout+ will be used when +executing the +off+ command, not +pcmk_reboot_timeout+). diff --git a/fencing/commands.c b/fencing/commands.c index 10d6976234..0d2d614137 100644 --- a/fencing/commands.c +++ b/fencing/commands.c @@ -1,2402 +1,2480 @@ /* * Copyright (C) 2009 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if SUPPORT_CIBSECRETS # include #endif #include GHashTable *device_list = NULL; GHashTable *topology = NULL; GList *cmd_list = NULL; struct device_search_s { /* target of fence action */ char *host; /* requested fence action */ char *action; /* timeout to use if a device is queried dynamically for possible targets */ int per_device_timeout; /* number of registered fencing devices at time of request */ int replies_needed; /* number of device replies received so far */ int replies_received; /* whether the target is eligible to perform requested action (or off) */ bool allow_suicide; /* private data to pass to search 
callback function */ void *user_data; /* function to call when all replies have been received */ void (*callback) (GList * devices, void *user_data); /* devices capable of performing requested action (or off if remapping) */ GListPtr capable; }; static gboolean stonith_device_dispatch(gpointer user_data); static void st_child_done(GPid pid, int rc, const char *output, gpointer user_data); static void stonith_send_reply(xmlNode * reply, int call_options, const char *remote_peer, const char *client_id); static void search_devices_record_result(struct device_search_s *search, const char *device, gboolean can_fence); typedef struct async_command_s { int id; int pid; int fd_stdout; int options; int default_timeout; /* seconds */ int timeout; /* seconds */ int start_delay; /* milliseconds */ int delay_id; char *op; char *origin; char *client; char *client_name; char *remote_op_id; char *victim; uint32_t victim_nodeid; char *action; char *device; char *mode; GListPtr device_list; GListPtr device_next; void *internal_user_data; void (*done_cb) (GPid pid, int rc, const char *output, gpointer user_data); guint timer_sigterm; guint timer_sigkill; /*! 
If the operation timed out, this is the last signal * we sent to the process to get it to terminate */ int last_timeout_signo; } async_command_t; static xmlNode *stonith_construct_async_reply(async_command_t * cmd, const char *output, xmlNode * data, int rc); static gboolean is_action_required(const char *action, stonith_device_t *device) { if(device == NULL) { return FALSE; } else if (device->required_actions == NULL) { return FALSE; } else if (strstr(device->required_actions, action)) { return TRUE; } return FALSE; } static int get_action_delay_max(stonith_device_t * device, const char * action) { const char *value = NULL; int delay_max_ms = 0; if (safe_str_neq(action, "off") && safe_str_neq(action, "reboot")) { return 0; } value = g_hash_table_lookup(device->params, STONITH_ATTR_DELAY_MAX); if (value) { delay_max_ms = crm_get_msec(value); } return delay_max_ms; } /*! * \internal * \brief Override STONITH timeout with pcmk_*_timeout if available * * \param[in] device STONITH device to use * \param[in] action STONITH action name * \param[in] default_timeout Timeout to use if device does not have * a pcmk_*_timeout parameter for action * * \return Value of pcmk_(action)_timeout if available, otherwise default_timeout * \note For consistency, it would be nice if reboot/off/on timeouts could be * set the same way as start/stop/monitor timeouts, i.e. with an * entry in the fencing resource configuration. However that * is insufficient because fencing devices may be registered directly via * the STONITH register_device() API instead of going through the CIB * (e.g. stonith_admin uses it for its -R option, and the LRMD uses it to * ensure a device is registered when a command is issued). As device * properties, pcmk_*_timeout parameters can be grabbed by stonithd when * the device is registered, whether by CIB change or API call. 
*/ static int get_action_timeout(stonith_device_t * device, const char *action, int default_timeout) { if (action && device && device->params) { char buffer[64] = { 0, }; const char *value = NULL; /* If "reboot" was requested but the device does not support it, * we will remap to "off", so check timeout for "off" instead */ if (safe_str_eq(action, "reboot") && is_not_set(device->flags, st_device_supports_reboot)) { crm_trace("%s doesn't support reboot, using timeout for off instead", device->id); action = "off"; } /* If the device config specified an action-specific timeout, use it */ snprintf(buffer, sizeof(buffer) - 1, "pcmk_%s_timeout", action); value = g_hash_table_lookup(device->params, buffer); if (value) { return atoi(value); } } return default_timeout; } static void free_async_command(async_command_t * cmd) { if (!cmd) { return; } if (cmd->delay_id) { g_source_remove(cmd->delay_id); } cmd_list = g_list_remove(cmd_list, cmd); g_list_free_full(cmd->device_list, free); free(cmd->device); free(cmd->action); free(cmd->victim); free(cmd->remote_op_id); free(cmd->client); free(cmd->client_name); free(cmd->origin); free(cmd->mode); free(cmd->op); free(cmd); } static async_command_t * create_async_command(xmlNode * msg) { async_command_t *cmd = NULL; xmlNode *op = get_xpath_object("//@" F_STONITH_ACTION, msg, LOG_ERR); const char *action = crm_element_value(op, F_STONITH_ACTION); CRM_CHECK(action != NULL, crm_log_xml_warn(msg, "NoAction"); return NULL); crm_log_xml_trace(msg, "Command"); cmd = calloc(1, sizeof(async_command_t)); crm_element_value_int(msg, F_STONITH_CALLID, &(cmd->id)); crm_element_value_int(msg, F_STONITH_CALLOPTS, &(cmd->options)); crm_element_value_int(msg, F_STONITH_TIMEOUT, &(cmd->default_timeout)); cmd->timeout = cmd->default_timeout; cmd->origin = crm_element_value_copy(msg, F_ORIG); cmd->remote_op_id = crm_element_value_copy(msg, F_STONITH_REMOTE_OP_ID); cmd->client = crm_element_value_copy(msg, F_STONITH_CLIENTID); cmd->client_name = 
crm_element_value_copy(msg, F_STONITH_CLIENTNAME); cmd->op = crm_element_value_copy(msg, F_STONITH_OPERATION); cmd->action = strdup(action); cmd->victim = crm_element_value_copy(op, F_STONITH_TARGET); cmd->mode = crm_element_value_copy(op, F_STONITH_MODE); cmd->device = crm_element_value_copy(op, F_STONITH_DEVICE); CRM_CHECK(cmd->op != NULL, crm_log_xml_warn(msg, "NoOp"); free_async_command(cmd); return NULL); CRM_CHECK(cmd->client != NULL, crm_log_xml_warn(msg, "NoClient")); cmd->done_cb = st_child_done; cmd_list = g_list_append(cmd_list, cmd); return cmd; } static gboolean stonith_device_execute(stonith_device_t * device) { int exec_rc = 0; const char *action_str = NULL; async_command_t *cmd = NULL; stonith_action_t *action = NULL; CRM_CHECK(device != NULL, return FALSE); if (device->active_pid) { crm_trace("%s is still active with pid %u", device->id, device->active_pid); return TRUE; } if (device->pending_ops) { GList *first = device->pending_ops; cmd = first->data; if (cmd && cmd->delay_id) { crm_trace ("Operation %s%s%s on %s was asked to run too early, waiting for start_delay timeout of %dms", cmd->action, cmd->victim ? " for node " : "", cmd->victim ? cmd->victim : "", device->id, cmd->start_delay); return TRUE; } device->pending_ops = g_list_remove_link(device->pending_ops, first); g_list_free_1(first); } if (cmd == NULL) { crm_trace("Nothing further to do for %s", device->id); return TRUE; } if(safe_str_eq(device->agent, STONITH_WATCHDOG_AGENT)) { if(safe_str_eq(cmd->action, "reboot")) { pcmk_panic(__FUNCTION__); return TRUE; } else if(safe_str_eq(cmd->action, "off")) { pcmk_panic(__FUNCTION__); return TRUE; } else { crm_info("Faking success for %s watchdog operation", cmd->action); cmd->done_cb(0, 0, NULL, cmd); return TRUE; } } #if SUPPORT_CIBSECRETS if (replace_secret_params(device->id, device->params) < 0) { /* replacing secrets failed! */ if (safe_str_eq(cmd->action,"stop")) { /* don't fail on stop! 
*/ crm_info("proceeding with the stop operation for %s", device->id); } else { crm_err("failed to get secrets for %s, " "considering resource not configured", device->id); exec_rc = PCMK_OCF_NOT_CONFIGURED; cmd->done_cb(0, exec_rc, NULL, cmd); return TRUE; } } #endif action_str = cmd->action; if (safe_str_eq(cmd->action, "reboot") && is_not_set(device->flags, st_device_supports_reboot)) { crm_warn("Agent '%s' does not advertise support for 'reboot', performing 'off' action instead", device->agent); action_str = "off"; } action = stonith_action_create(device->agent, action_str, cmd->victim, cmd->victim_nodeid, cmd->timeout, device->params, device->aliases); /* for async exec, exec_rc is pid if positive and error code if negative/zero */ exec_rc = stonith_action_execute_async(action, (void *)cmd, cmd->done_cb); if (exec_rc > 0) { crm_debug("Operation %s%s%s on %s now running with pid=%d, timeout=%ds", cmd->action, cmd->victim ? " for node " : "", cmd->victim ? cmd->victim : "", device->id, exec_rc, cmd->timeout); device->active_pid = exec_rc; } else { crm_warn("Operation %s%s%s on %s failed: %s (%d)", cmd->action, cmd->victim ? " for node " : "", cmd->victim ? cmd->victim : "", device->id, pcmk_strerror(exec_rc), exec_rc); cmd->done_cb(0, exec_rc, NULL, cmd); } return TRUE; } static gboolean stonith_device_dispatch(gpointer user_data) { return stonith_device_execute(user_data); } static gboolean start_delay_helper(gpointer data) { async_command_t *cmd = data; stonith_device_t *device = NULL; cmd->delay_id = 0; device = cmd->device ? 
g_hash_table_lookup(device_list, cmd->device) : NULL; if (device) { mainloop_set_trigger(device->work); } return FALSE; } static void schedule_stonith_command(async_command_t * cmd, stonith_device_t * device) { int delay_max = 0; CRM_CHECK(cmd != NULL, return); CRM_CHECK(device != NULL, return); if (cmd->device) { free(cmd->device); } if (device->include_nodeid && cmd->victim) { crm_node_t *node = crm_get_peer(0, cmd->victim); cmd->victim_nodeid = node->id; } cmd->device = strdup(device->id); cmd->timeout = get_action_timeout(device, cmd->action, cmd->default_timeout); if (cmd->remote_op_id) { crm_debug("Scheduling %s on %s for remote peer %s with op id (%s) (timeout=%ds)", cmd->action, device->id, cmd->origin, cmd->remote_op_id, cmd->timeout); } else { crm_debug("Scheduling %s on %s for %s (timeout=%ds)", cmd->action, device->id, cmd->client, cmd->timeout); } device->pending_ops = g_list_append(device->pending_ops, cmd); mainloop_set_trigger(device->work); delay_max = get_action_delay_max(device, cmd->action); if (delay_max > 0) { cmd->start_delay = rand() % delay_max; crm_notice("Delaying %s on %s for %dms (timeout=%ds)", cmd->action, device->id, cmd->start_delay, cmd->timeout); cmd->delay_id = g_timeout_add(cmd->start_delay, start_delay_helper, cmd); } } void free_device(gpointer data) { GListPtr gIter = NULL; stonith_device_t *device = data; g_hash_table_destroy(device->params); g_hash_table_destroy(device->aliases); for (gIter = device->pending_ops; gIter != NULL; gIter = gIter->next) { async_command_t *cmd = gIter->data; crm_warn("Removal of device '%s' purged operation %s", device->id, cmd->action); cmd->done_cb(0, -ENODEV, NULL, cmd); free_async_command(cmd); } g_list_free(device->pending_ops); g_list_free_full(device->targets, free); mainloop_destroy_trigger(device->work); free_xml(device->agent_metadata); free(device->namespace); free(device->on_target_actions); free(device->required_actions); free(device->agent); free(device->id); free(device); } 
static GHashTable * build_port_aliases(const char *hostmap, GListPtr * targets) { char *name = NULL; int last = 0, lpc = 0, max = 0, added = 0; GHashTable *aliases = g_hash_table_new_full(crm_strcase_hash, crm_strcase_equal, g_hash_destroy_str, g_hash_destroy_str); if (hostmap == NULL) { return aliases; } max = strlen(hostmap); for (; lpc <= max; lpc++) { switch (hostmap[lpc]) { /* Assignment chars */ case '=': case ':': if (lpc > last) { free(name); name = calloc(1, 1 + lpc - last); memcpy(name, hostmap + last, lpc - last); } last = lpc + 1; break; /* Delimiter chars */ /* case ',': Potentially used to specify multiple ports */ case 0: case ';': case ' ': case '\t': if (name) { char *value = NULL; value = calloc(1, 1 + lpc - last); memcpy(value, hostmap + last, lpc - last); crm_debug("Adding alias '%s'='%s'", name, value); g_hash_table_replace(aliases, name, value); if (targets) { *targets = g_list_append(*targets, strdup(value)); } value = NULL; name = NULL; added++; } else if (lpc > last) { crm_debug("Parse error at offset %d near '%s'", lpc - last, hostmap + last); } last = lpc + 1; break; } if (hostmap[lpc] == 0) { break; } } if (added == 0) { crm_info("No host mappings detected in '%s'", hostmap); } free(name); return aliases; } static void parse_host_line(const char *line, int max, GListPtr * output) { int lpc = 0; int last = 0; if (max <= 0) { return; } /* Check for any complaints about additional parameters that the device doesn't understand */ if (strstr(line, "invalid") || strstr(line, "variable")) { crm_debug("Skipping: %s", line); return; } crm_trace("Processing %d bytes: [%s]", max, line); /* Skip initial whitespace */ for (lpc = 0; lpc <= max && isspace(line[lpc]); lpc++) { last = lpc + 1; } /* Now the actual content */ for (lpc = 0; lpc <= max; lpc++) { gboolean a_space = isspace(line[lpc]); if (a_space && lpc < max && isspace(line[lpc + 1])) { /* fast-forward to the end of the spaces */ } else if (a_space || line[lpc] == ',' || line[lpc] == ';' || 
line[lpc] == 0) { int rc = 1; char *entry = NULL; if (lpc != last) { entry = calloc(1, 1 + lpc - last); rc = sscanf(line + last, "%[a-zA-Z0-9_-.]", entry); } if (entry == NULL) { /* Skip */ } else if (rc != 1) { crm_warn("Could not parse (%d %d): %s", last, lpc, line + last); } else if (safe_str_neq(entry, "on") && safe_str_neq(entry, "off")) { crm_trace("Adding '%s'", entry); *output = g_list_append(*output, entry); entry = NULL; } free(entry); last = lpc + 1; } } } static GListPtr parse_host_list(const char *hosts) { int lpc = 0; int max = 0; int last = 0; GListPtr output = NULL; if (hosts == NULL) { return output; } max = strlen(hosts); for (lpc = 0; lpc <= max; lpc++) { if (hosts[lpc] == '\n' || hosts[lpc] == 0) { char *line = NULL; int len = lpc - last; if(len > 1) { line = malloc(1 + len); } if(line) { snprintf(line, 1 + len, "%s", hosts + last); line[len] = 0; /* Because it might be '\n' */ parse_host_line(line, len, &output); free(line); } last = lpc + 1; } } crm_trace("Parsed %d entries from '%s'", g_list_length(output), hosts); return output; } GHashTable *metadata_cache = NULL; static xmlNode * get_agent_metadata(const char *agent) { xmlNode *xml = NULL; char *buffer = NULL; if(metadata_cache == NULL) { metadata_cache = g_hash_table_new_full( crm_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str); } buffer = g_hash_table_lookup(metadata_cache, agent); if(safe_str_eq(agent, STONITH_WATCHDOG_AGENT)) { return NULL; } else if(buffer == NULL) { stonith_t *st = stonith_api_new(); int rc = st->cmds->metadata(st, st_opt_sync_call, agent, NULL, &buffer, 10); stonith_api_delete(st); if (rc || !buffer) { crm_err("Could not retrieve metadata for fencing agent %s", agent); return NULL; } g_hash_table_replace(metadata_cache, strdup(agent), buffer); } xml = string2xml(buffer); return xml; } static gboolean is_nodeid_required(xmlNode * xml) { xmlXPathObjectPtr xpath = NULL; if (stand_alone) { return FALSE; } if (!xml) { return FALSE; } xpath = 
xpath_search(xml, "//parameter[@name='nodeid']"); if (numXpathResults(xpath) <= 0) { freeXpathObject(xpath); return FALSE; } freeXpathObject(xpath); return TRUE; } static char * add_action(char *actions, const char *action) { static size_t len = 256; int offset = 0; if (actions == NULL) { actions = calloc(1, len); } else { offset = strlen(actions); } if (offset > 0) { offset += snprintf(actions+offset, len-offset, " "); } offset += snprintf(actions+offset, len-offset, "%s", action); return actions; } static void read_action_metadata(stonith_device_t *device) { xmlXPathObjectPtr xpath = NULL; int max = 0; int lpc = 0; if (device->agent_metadata == NULL) { return; } xpath = xpath_search(device->agent_metadata, "//action"); max = numXpathResults(xpath); if (max <= 0) { freeXpathObject(xpath); return; } for (lpc = 0; lpc < max; lpc++) { const char *on_target = NULL; const char *action = NULL; const char *automatic = NULL; const char *required = NULL; xmlNode *match = getXpathResult(xpath, lpc); CRM_LOG_ASSERT(match != NULL); if(match == NULL) { continue; }; on_target = crm_element_value(match, "on_target"); action = crm_element_value(match, "name"); automatic = crm_element_value(match, "automatic"); required = crm_element_value(match, "required"); if(safe_str_eq(action, "list")) { set_bit(device->flags, st_device_supports_list); } else if(safe_str_eq(action, "status")) { set_bit(device->flags, st_device_supports_status); } else if(safe_str_eq(action, "reboot")) { set_bit(device->flags, st_device_supports_reboot); } else if(safe_str_eq(action, "on") && (crm_is_true(automatic))) { /* this setting implies required=true for unfencing */ required = "true"; } if (action && crm_is_true(on_target)) { device->on_target_actions = add_action(device->on_target_actions, action); } if (action && crm_is_true(required)) { device->required_actions = add_action(device->required_actions, action); } } freeXpathObject(xpath); } static stonith_device_t * build_device_from_xml(xmlNode * msg) 
{ const char *value = NULL; xmlNode *dev = get_xpath_object("//" F_STONITH_DEVICE, msg, LOG_ERR); stonith_device_t *device = NULL; device = calloc(1, sizeof(stonith_device_t)); device->id = crm_element_value_copy(dev, XML_ATTR_ID); device->agent = crm_element_value_copy(dev, "agent"); device->namespace = crm_element_value_copy(dev, "namespace"); device->params = xml2list(dev); value = g_hash_table_lookup(device->params, STONITH_ATTR_HOSTLIST); if (value) { device->targets = parse_host_list(value); } value = g_hash_table_lookup(device->params, STONITH_ATTR_HOSTMAP); device->aliases = build_port_aliases(value, &(device->targets)); device->agent_metadata = get_agent_metadata(device->agent); read_action_metadata(device); value = g_hash_table_lookup(device->params, "nodeid"); if (!value) { device->include_nodeid = is_nodeid_required(device->agent_metadata); } value = crm_element_value(dev, "rsc_provides"); if (safe_str_eq(value, "unfencing")) { /* if this agent requires unfencing, 'on' is considered a required action */ device->required_actions = add_action(device->required_actions, "on"); } if (is_action_required("on", device)) { crm_info("The fencing device '%s' requires unfencing", device->id); } if (device->on_target_actions) { crm_info("The fencing device '%s' requires actions (%s) to be executed on the target node", device->id, device->on_target_actions); } device->work = mainloop_add_trigger(G_PRIORITY_HIGH, stonith_device_dispatch, device); /* TODO: Hook up priority */ return device; } static const char * target_list_type(stonith_device_t * dev) { const char *check_type = NULL; check_type = g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTCHECK); if (check_type == NULL) { if (g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTLIST)) { check_type = "static-list"; } else if (g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTMAP)) { check_type = "static-list"; } else if(is_set(dev->flags, st_device_supports_list)){ check_type = "dynamic-list"; } else 
if(is_set(dev->flags, st_device_supports_status)){ check_type = "status"; } else { check_type = "none"; } } return check_type; } void schedule_internal_command(const char *origin, stonith_device_t * device, const char *action, const char *victim, int timeout, void *internal_user_data, void (*done_cb) (GPid pid, int rc, const char *output, gpointer user_data)) { async_command_t *cmd = NULL; cmd = calloc(1, sizeof(async_command_t)); cmd->id = -1; cmd->default_timeout = timeout ? timeout : 60; cmd->timeout = cmd->default_timeout; cmd->action = strdup(action); cmd->victim = victim ? strdup(victim) : NULL; cmd->device = strdup(device->id); cmd->origin = strdup(origin); cmd->client = strdup(crm_system_name); cmd->client_name = strdup(crm_system_name); cmd->internal_user_data = internal_user_data; cmd->done_cb = done_cb; /* cmd, not internal_user_data, is passed to 'done_cb' as the userdata */ schedule_stonith_command(cmd, device); } gboolean string_in_list(GListPtr list, const char *item) { int lpc = 0; int max = g_list_length(list); for (lpc = 0; lpc < max; lpc++) { const char *value = g_list_nth_data(list, lpc); if (safe_str_eq(item, value)) { return TRUE; } else { crm_trace("%d: '%s' != '%s'", lpc, item, value); } } return FALSE; } static void status_search_cb(GPid pid, int rc, const char *output, gpointer user_data) { async_command_t *cmd = user_data; struct device_search_s *search = cmd->internal_user_data; stonith_device_t *dev = cmd->device ? 
g_hash_table_lookup(device_list, cmd->device) : NULL; gboolean can = FALSE; free_async_command(cmd); if (!dev) { search_devices_record_result(search, NULL, FALSE); return; } dev->active_pid = 0; mainloop_set_trigger(dev->work); if (rc == 1 /* unknown */ ) { crm_trace("Host %s is not known by %s", search->host, dev->id); } else if (rc == 0 /* active */ || rc == 2 /* inactive */ ) { crm_trace("Host %s is known by %s", search->host, dev->id); can = TRUE; } else { crm_notice("Unknown result when testing if %s can fence %s: rc=%d", dev->id, search->host, rc); } search_devices_record_result(search, dev->id, can); } static void dynamic_list_search_cb(GPid pid, int rc, const char *output, gpointer user_data) { async_command_t *cmd = user_data; struct device_search_s *search = cmd->internal_user_data; stonith_device_t *dev = cmd->device ? g_hash_table_lookup(device_list, cmd->device) : NULL; gboolean can_fence = FALSE; free_async_command(cmd); /* Host/alias must be in the list output to be eligible to be fenced * * Will cause problems if down'd nodes aren't listed or (for virtual nodes) * if the guest is still listed despite being moved to another machine */ if (!dev) { search_devices_record_result(search, NULL, FALSE); return; } dev->active_pid = 0; mainloop_set_trigger(dev->work); /* If we successfully got the targets earlier, don't disable. 
*/ if (rc != 0 && !dev->targets) { crm_notice("Disabling port list queries for %s (%d): %s", dev->id, rc, output); /* Fall back to status */ g_hash_table_replace(dev->params, strdup(STONITH_ATTR_HOSTCHECK), strdup("status")); g_list_free_full(dev->targets, free); dev->targets = NULL; } else if (!rc) { crm_info("Refreshing port list for %s", dev->id); g_list_free_full(dev->targets, free); dev->targets = parse_host_list(output); dev->targets_age = time(NULL); } if (dev->targets) { const char *alias = g_hash_table_lookup(dev->aliases, search->host); if (!alias) { alias = search->host; } if (string_in_list(dev->targets, alias)) { can_fence = TRUE; } } search_devices_record_result(search, dev->id, can_fence); } /*! * \internal * \brief Checks to see if an identical device already exists in the device_list */ static stonith_device_t * device_has_duplicate(stonith_device_t * device) { char *key = NULL; char *value = NULL; GHashTableIter gIter; stonith_device_t *dup = g_hash_table_lookup(device_list, device->id); if (!dup) { crm_trace("No match for %s", device->id); return NULL; } else if (safe_str_neq(dup->agent, device->agent)) { crm_trace("Different agent: %s != %s", dup->agent, device->agent); return NULL; } /* Use calculate_operation_digest() here? 
*/ g_hash_table_iter_init(&gIter, device->params); while (g_hash_table_iter_next(&gIter, (void **)&key, (void **)&value)) { if(strstr(key, "CRM_meta") == key) { continue; } else if(strcmp(key, "crm_feature_set") == 0) { continue; } else { char *other_value = g_hash_table_lookup(dup->params, key); if (!other_value || safe_str_neq(other_value, value)) { crm_trace("Different value for %s: %s != %s", key, other_value, value); return NULL; } } } crm_trace("Match"); return dup; } int stonith_device_register(xmlNode * msg, const char **desc, gboolean from_cib) { stonith_device_t *dup = NULL; stonith_device_t *device = build_device_from_xml(msg); dup = device_has_duplicate(device); if (dup) { crm_debug("Device '%s' already existed in device list (%d active devices)", device->id, g_hash_table_size(device_list)); free_device(device); device = dup; } else { stonith_device_t *old = g_hash_table_lookup(device_list, device->id); if (from_cib && old && old->api_registered) { /* If the cib is writing over an entry that is shared with a stonith client, * copy any pending ops that currently exist on the old entry to the new one. 
* Otherwise the pending ops will be reported as failures */ crm_info("Overwriting an existing entry for %s from the cib", device->id); device->pending_ops = old->pending_ops; device->api_registered = TRUE; old->pending_ops = NULL; if (device->pending_ops) { mainloop_set_trigger(device->work); } } g_hash_table_replace(device_list, device->id, device); crm_notice("Added '%s' to the device list (%d active devices)", device->id, g_hash_table_size(device_list)); } if (desc) { *desc = device->id; } if (from_cib) { device->cib_registered = TRUE; } else { device->api_registered = TRUE; } return pcmk_ok; } int stonith_device_remove(const char *id, gboolean from_cib) { stonith_device_t *device = g_hash_table_lookup(device_list, id); if (!device) { crm_info("Device '%s' not found (%d active devices)", id, g_hash_table_size(device_list)); return pcmk_ok; } if (from_cib) { device->cib_registered = FALSE; } else { device->verified = FALSE; device->api_registered = FALSE; } if (!device->cib_registered && !device->api_registered) { g_hash_table_remove(device_list, id); crm_info("Removed '%s' from the device list (%d active devices)", id, g_hash_table_size(device_list)); } return pcmk_ok; } /*! * \internal * \brief Return the number of stonith levels registered for a node * * \param[in] tp Node's topology table entry * * \return Number of non-NULL levels in topology entry * \note This function is used only for log messages. */ static int count_active_levels(stonith_topology_t * tp) { int lpc = 0; int count = 0; for (lpc = 0; lpc < ST_LEVEL_MAX; lpc++) { if (tp->levels[lpc] != NULL) { count++; } } return count; } void free_topology_entry(gpointer data) { stonith_topology_t *tp = data; int lpc = 0; for (lpc = 0; lpc < ST_LEVEL_MAX; lpc++) { if (tp->levels[lpc] != NULL) { g_list_free_full(tp->levels[lpc], free); } } free(tp->node); free(tp); } /*! 
* \internal * \brief Register a STONITH level for a node * * Given an XML request specifying the node name, level index, and device IDs * for the level, this will create an entry for the node in the global topology * table if one does not already exist, then append the specified device IDs to * the entry's device list for the specified level. * * \param[in] msg XML request for STONITH level registration * \param[out] desc If not NULL, will be set to string representation ("NODE[LEVEL]") * * \return pcmk_ok on success, -EINVAL if XML does not specify valid level index */ int stonith_level_register(xmlNode * msg, char **desc) { int id = 0; int rc = pcmk_ok; xmlNode *child = NULL; xmlNode *level = get_xpath_object("//" F_STONITH_LEVEL, msg, LOG_ERR); const char *node = crm_element_value(level, F_STONITH_TARGET); stonith_topology_t *tp = g_hash_table_lookup(topology, node); CRM_LOG_ASSERT(node != NULL); crm_element_value_int(level, XML_ATTR_ID, &id); if (desc) { *desc = crm_strdup_printf("%s[%d]", node, id); } if (id <= 0 || id >= ST_LEVEL_MAX) { return -EINVAL; } if (tp == NULL) { tp = calloc(1, sizeof(stonith_topology_t)); tp->node = strdup(node); g_hash_table_replace(topology, tp->node, tp); crm_trace("Added %s to the topology (%d active entries)", node, g_hash_table_size(topology)); } if (tp->levels[id] != NULL) { crm_info("Adding to the existing %s[%d] topology entry (%d active entries)", node, id, count_active_levels(tp)); } for (child = __xml_first_child(level); child != NULL; child = __xml_next(child)) { const char *device = ID(child); crm_trace("Adding device '%s' for %s (%d)", device, node, id); tp->levels[id] = g_list_append(tp->levels[id], strdup(device)); } crm_info("Node %s has %d active fencing levels", node, count_active_levels(tp)); return rc; } int stonith_level_remove(xmlNode * msg, char **desc) { int id = 0; xmlNode *level = get_xpath_object("//" F_STONITH_LEVEL, msg, LOG_ERR); const char *node = crm_element_value(level, F_STONITH_TARGET); 
stonith_topology_t *tp = g_hash_table_lookup(topology, node); CRM_LOG_ASSERT(node != NULL); /* Read the level index first, so that desc reports the requested level rather than 0 */ crm_element_value_int(level, XML_ATTR_ID, &id); if (desc) { *desc = crm_strdup_printf("%s[%d]", node, id); } if (tp == NULL) { crm_info("Node %s not found (%d active entries)", node, g_hash_table_size(topology)); return pcmk_ok; } else if (id < 0 || id >= ST_LEVEL_MAX) { return -EINVAL; } if (id == 0 && g_hash_table_remove(topology, node)) { crm_info("Removed all %s related entries from the topology (%d active entries)", node, g_hash_table_size(topology)); } else if (id > 0 && tp->levels[id] != NULL) { g_list_free_full(tp->levels[id], free); tp->levels[id] = NULL; crm_info("Removed entry '%d' from %s's topology (%d active entries remaining)", id, node, count_active_levels(tp)); } return pcmk_ok; } static int stonith_device_action(xmlNode * msg, char **output) { int rc = pcmk_ok; xmlNode *dev = get_xpath_object("//" F_STONITH_DEVICE, msg, LOG_ERR); const char *id = crm_element_value(dev, F_STONITH_DEVICE); async_command_t *cmd = NULL; stonith_device_t *device = NULL; if (id) { crm_trace("Looking for '%s'", id); device = g_hash_table_lookup(device_list, id); } if (device && device->api_registered == FALSE) { rc = -ENODEV; } else if (device) { cmd = create_async_command(msg); if (cmd == NULL) { free_device(device); return -EPROTO; } schedule_stonith_command(cmd, device); rc = -EINPROGRESS; } else { crm_info("Device %s not found", id ? id : ""); rc = -ENODEV; } return rc; } static void search_devices_record_result(struct device_search_s *search, const char *device, gboolean can_fence) { search->replies_received++; if (can_fence && device) { search->capable = g_list_append(search->capable, strdup(device)); } if (search->replies_needed == search->replies_received) { crm_debug("Finished Search. %d devices can perform action (%s) on node %s", g_list_length(search->capable), search->action ? search->action : "", search->host ? 
search->host : ""); search->callback(search->capable, search->user_data); free(search->host); free(search->action); free(search); } } /* * \internal * \brief Check whether the local host is allowed to execute a fencing action * * \param[in] device Fence device to check * \param[in] action Fence action to check * \param[in] target Hostname of fence target * \param[in] allow_suicide Whether self-fencing is allowed for this operation * * \return TRUE if local host is allowed to execute action, FALSE otherwise */ static gboolean localhost_is_eligible(const stonith_device_t *device, const char *action, const char *target, gboolean allow_suicide) { gboolean localhost_is_target = safe_str_eq(target, stonith_our_uname); if (device && action && device->on_target_actions && strstr(device->on_target_actions, action)) { if (!localhost_is_target) { crm_trace("%s operation with %s can only be executed for localhost not %s", action, device->id, target); return FALSE; } } else if (localhost_is_target && !allow_suicide) { crm_trace("%s operation does not support self-fencing", action); return FALSE; } return TRUE; } static void can_fence_host_with_device(stonith_device_t * dev, struct device_search_s *search) { gboolean can = FALSE; const char *check_type = NULL; const char *host = search->host; const char *alias = NULL; CRM_LOG_ASSERT(dev != NULL); if (dev == NULL) { goto search_report_results; } else if (host == NULL) { can = TRUE; goto search_report_results; } - /* Short-circuit the query if the local host is not allowed to perform the - * desired action. - */ - if (!localhost_is_eligible(dev, search->action, host, - search->allow_suicide)) { + /* Short-circuit query if this host is not allowed to perform the action */ + if (safe_str_eq(search->action, "reboot")) { + /* A "reboot" *might* get remapped to "off" then "on", so short-circuit + * only if all three are disallowed. If only one or two are disallowed, + * we'll report that with the results. 
We never allow suicide for + * remapped "on" operations because the host is off at that point. + */ + if (!localhost_is_eligible(dev, "reboot", host, search->allow_suicide) + && !localhost_is_eligible(dev, "off", host, search->allow_suicide) + && !localhost_is_eligible(dev, "on", host, FALSE)) { + goto search_report_results; + } + } else if (!localhost_is_eligible(dev, search->action, host, + search->allow_suicide)) { goto search_report_results; } alias = g_hash_table_lookup(dev->aliases, host); if (alias == NULL) { alias = host; } check_type = target_list_type(dev); if (safe_str_eq(check_type, "none")) { can = TRUE; } else if (safe_str_eq(check_type, "static-list")) { /* Presence in the hostmap is sufficient * Only use if all hosts on which the device can be active can always fence all listed hosts */ if (string_in_list(dev->targets, host)) { can = TRUE; } else if (g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTMAP) && g_hash_table_lookup(dev->aliases, host)) { can = TRUE; } } else if (safe_str_eq(check_type, "dynamic-list")) { time_t now = time(NULL); if (dev->targets == NULL || dev->targets_age + 60 < now) { crm_trace("Running %s command to see if %s can fence %s (%s)", check_type, dev?dev->id:"N/A", search->host, search->action); schedule_internal_command(__FUNCTION__, dev, "list", NULL, search->per_device_timeout, search, dynamic_list_search_cb); /* we'll respond to this search request async in the cb */ return; } if (string_in_list(dev->targets, alias)) { can = TRUE; } } else if (safe_str_eq(check_type, "status")) { crm_trace("Running %s command to see if %s can fence %s (%s)", check_type, dev?dev->id:"N/A", search->host, search->action); schedule_internal_command(__FUNCTION__, dev, "status", search->host, search->per_device_timeout, search, status_search_cb); /* we'll respond to this search request async in the cb */ return; } else { crm_err("Unknown check type: %s", check_type); } if (safe_str_eq(host, alias)) { crm_notice("%s can%s fence (%s) %s: %s", 
dev->id, can ? "" : " not", search->action, host, check_type); } else { crm_notice("%s can%s fence (%s) %s (aka. '%s'): %s", dev->id, can ? "" : " not", search->action, host, alias, check_type); } search_report_results: search_devices_record_result(search, dev ? dev->id : NULL, can); } static void search_devices(gpointer key, gpointer value, gpointer user_data) { stonith_device_t *dev = value; struct device_search_s *search = user_data; can_fence_host_with_device(dev, search); } #define DEFAULT_QUERY_TIMEOUT 20 static void get_capable_devices(const char *host, const char *action, int timeout, bool suicide, void *user_data, void (*callback) (GList * devices, void *user_data)) { struct device_search_s *search; int per_device_timeout = DEFAULT_QUERY_TIMEOUT; int devices_needing_async_query = 0; char *key = NULL; const char *check_type = NULL; GHashTableIter gIter; stonith_device_t *device = NULL; if (!g_hash_table_size(device_list)) { callback(NULL, user_data); return; } search = calloc(1, sizeof(struct device_search_s)); if (!search) { callback(NULL, user_data); return; } g_hash_table_iter_init(&gIter, device_list); while (g_hash_table_iter_next(&gIter, (void **)&key, (void **)&device)) { check_type = target_list_type(device); if (safe_str_eq(check_type, "status") || safe_str_eq(check_type, "dynamic-list")) { devices_needing_async_query++; } } /* If we have devices that require an async event in order to know what * nodes they can fence, we have to give the events a timeout. The total * query timeout is divided among those events. 
*/ if (devices_needing_async_query) { per_device_timeout = timeout / devices_needing_async_query; if (!per_device_timeout) { crm_err("STONITH timeout %ds is too low; using %ds, but consider raising to at least %ds", timeout, DEFAULT_QUERY_TIMEOUT, DEFAULT_QUERY_TIMEOUT * devices_needing_async_query); per_device_timeout = DEFAULT_QUERY_TIMEOUT; } else if (per_device_timeout < DEFAULT_QUERY_TIMEOUT) { crm_notice("STONITH timeout %ds is low for the current configuration;" " consider raising to at least %ds", timeout, DEFAULT_QUERY_TIMEOUT * devices_needing_async_query); } } search->host = host ? strdup(host) : NULL; search->action = action ? strdup(action) : NULL; search->per_device_timeout = per_device_timeout; /* We are guaranteed this many replies. Even if a device gets * unregistered somehow during the async search, we will get * the correct number of replies. */ search->replies_needed = g_hash_table_size(device_list); search->allow_suicide = suicide; search->callback = callback; search->user_data = user_data; /* kick off the search */ crm_debug("Searching through %d devices to see what is capable of action (%s) for target %s", search->replies_needed, search->action ? search->action : "", search->host ? 
search->host : ""); g_hash_table_foreach(device_list, search_devices, search); } struct st_query_data { xmlNode *reply; char *remote_peer; char *client_id; char *target; char *action; int call_options; }; /* * \internal * \brief Add action-specific attributes to query reply XML * * \param[in,out] xml XML to add attributes to * \param[in] action Fence action * \param[in] device Fence device */ static void add_action_specific_attributes(xmlNode *xml, const char *action, stonith_device_t *device) { int action_specific_timeout; int delay_max; CRM_CHECK(xml && action && device, return); if (is_action_required(action, device)) { crm_trace("Action %s is required on %s", action, device->id); crm_xml_add_int(xml, F_STONITH_DEVICE_REQUIRED, 1); } action_specific_timeout = get_action_timeout(device, action, 0); if (action_specific_timeout) { crm_trace("Action %s has timeout %dms on %s", action, action_specific_timeout, device->id); crm_xml_add_int(xml, F_STONITH_ACTION_TIMEOUT, action_specific_timeout); } delay_max = get_action_delay_max(device, action); if (delay_max > 0) { crm_trace("Action %s has maximum random delay %dms on %s", action, delay_max, device->id); crm_xml_add_int(xml, F_STONITH_DELAY_MAX, delay_max / 1000); } } +/* + * \internal + * \brief Add "disallowed" attribute to query reply XML if appropriate + * + * \param[in,out] xml XML to add attribute to + * \param[in] action Fence action + * \param[in] device Fence device + * \param[in] target Fence target + * \param[in] allow_suicide Whether self-fencing is allowed + */ +static void +add_disallowed(xmlNode *xml, const char *action, stonith_device_t *device, + const char *target, gboolean allow_suicide) +{ + if (!localhost_is_eligible(device, action, target, allow_suicide)) { + crm_trace("Action %s on %s is disallowed for local host", + action, device->id); + crm_xml_add(xml, F_STONITH_ACTION_DISALLOWED, XML_BOOLEAN_TRUE); + } +} + +/* + * \internal + * \brief Add child element with action-specific values to 
query reply XML + * + * \param[in,out] xml XML to add attribute to + * \param[in] action Fence action + * \param[in] device Fence device + * \param[in] target Fence target + * \param[in] allow_suicide Whether self-fencing is allowed + */ +static void +add_action_reply(xmlNode *xml, const char *action, stonith_device_t *device, + const char *target, gboolean allow_suicide) +{ + xmlNode *child = create_xml_node(xml, F_STONITH_ACTION); + + crm_xml_add(child, XML_ATTR_ID, action); + add_action_specific_attributes(child, action, device); + add_disallowed(child, action, device, target, allow_suicide); +} + static void stonith_query_capable_device_cb(GList * devices, void *user_data) { struct st_query_data *query = user_data; int available_devices = 0; xmlNode *dev = NULL; xmlNode *list = NULL; GListPtr lpc = NULL; /* Pack the results into XML */ list = create_xml_node(NULL, __FUNCTION__); crm_xml_add(list, F_STONITH_TARGET, query->target); for (lpc = devices; lpc != NULL; lpc = lpc->next) { stonith_device_t *device = g_hash_table_lookup(device_list, lpc->data); const char *action = query->action; if (!device) { /* It is possible the device got unregistered while * determining who can fence the target */ continue; } available_devices++; dev = create_xml_node(list, F_STONITH_DEVICE); crm_xml_add(dev, XML_ATTR_ID, device->id); crm_xml_add(dev, "namespace", device->namespace); crm_xml_add(dev, "agent", device->agent); crm_xml_add_int(dev, F_STONITH_DEVICE_VERIFIED, device->verified); /* If the originating stonithd wants to reboot the node, and we have a * capable device that doesn't support "reboot", remap to "off" instead. 
*/ if (is_not_set(device->flags, st_device_supports_reboot) && safe_str_eq(query->action, "reboot")) { crm_trace("%s doesn't support reboot, using values for off instead", device->id); action = "off"; } /* Add action-specific values if available */ add_action_specific_attributes(dev, action, device); + if (safe_str_eq(query->action, "reboot")) { + /* A "reboot" *might* get remapped to "off" then "on", so after + * sending the "reboot"-specific values in the main element, we add + * sub-elements for "off" and "on" values. + * + * We short-circuited earlier if "reboot", "off" and "on" are all + * disallowed for the local host. However if only one or two are + * disallowed, we send back the results and mark which ones are + * disallowed. If "reboot" is disallowed, this might cause problems + * with older stonithd versions, which won't check for it. Older + * versions will ignore "off" and "on", so they are not a problem. + */ + add_disallowed(dev, action, device, query->target, + is_set(query->call_options, st_opt_allow_suicide)); + add_action_reply(dev, "off", device, query->target, + is_set(query->call_options, st_opt_allow_suicide)); + add_action_reply(dev, "on", device, query->target, FALSE); + } + /* A query without a target wants device parameters */ if (query->target == NULL) { xmlNode *attrs = create_xml_node(dev, XML_TAG_ATTRS); g_hash_table_foreach(device->params, hash2field, attrs); } } crm_xml_add_int(list, F_STONITH_AVAILABLE_DEVICES, available_devices); if (query->target) { crm_debug("Found %d matching devices for '%s'", available_devices, query->target); } else { crm_debug("%d devices installed", available_devices); } if (list != NULL) { crm_log_xml_trace(list, "Add query results"); add_message_xml(query->reply, F_STONITH_CALLDATA, list); } stonith_send_reply(query->reply, query->call_options, query->remote_peer, query->client_id); free_xml(query->reply); free(query->remote_peer); free(query->client_id); free(query->target); free(query->action); 
free(query); free_xml(list); g_list_free_full(devices, free); } static void stonith_query(xmlNode * msg, const char *remote_peer, const char *client_id, int call_options) { struct st_query_data *query = NULL; const char *action = NULL; const char *target = NULL; int timeout = 0; xmlNode *dev = get_xpath_object("//@" F_STONITH_ACTION, msg, LOG_DEBUG_3); crm_element_value_int(msg, F_STONITH_TIMEOUT, &timeout); if (dev) { const char *device = crm_element_value(dev, F_STONITH_DEVICE); target = crm_element_value(dev, F_STONITH_TARGET); action = crm_element_value(dev, F_STONITH_ACTION); if (device && safe_str_eq(device, "manual_ack")) { /* No query or reply necessary */ return; } } crm_log_xml_debug(msg, "Query"); query = calloc(1, sizeof(struct st_query_data)); query->reply = stonith_construct_reply(msg, NULL, NULL, pcmk_ok); query->remote_peer = remote_peer ? strdup(remote_peer) : NULL; query->client_id = client_id ? strdup(client_id) : NULL; query->target = target ? strdup(target) : NULL; query->action = action ? strdup(action) : NULL; query->call_options = call_options; get_capable_devices(target, action, timeout, is_set(call_options, st_opt_allow_suicide), query, stonith_query_capable_device_cb); } #define ST_LOG_OUTPUT_MAX 512 static void log_operation(async_command_t * cmd, int rc, int pid, const char *next, const char *output) { if (rc == 0) { next = NULL; } if (cmd->victim != NULL) { do_crm_log(rc == 0 ? LOG_NOTICE : LOG_ERR, "Operation '%s' [%d] (call %d from %s) for host '%s' with device '%s' returned: %d (%s)%s%s", cmd->action, pid, cmd->id, cmd->client_name, cmd->victim, cmd->device, rc, pcmk_strerror(rc), next ? ". Trying: " : "", next ? next : ""); } else { do_crm_log_unlikely(rc == 0 ? LOG_DEBUG : LOG_NOTICE, "Operation '%s' [%d] for device '%s' returned: %d (%s)%s%s", cmd->action, pid, cmd->device, rc, pcmk_strerror(rc), next ? ". Trying: " : "", next ? 
next : ""); } if (output) { /* Logging the whole string confuses syslog when the string is xml */ char *prefix = crm_strdup_printf("%s:%d", cmd->device, pid); crm_log_output(rc == 0 ? LOG_DEBUG : LOG_WARNING, prefix, output); free(prefix); } } static void stonith_send_async_reply(async_command_t * cmd, const char *output, int rc, GPid pid) { xmlNode *reply = NULL; gboolean bcast = FALSE; reply = stonith_construct_async_reply(cmd, output, NULL, rc); if (safe_str_eq(cmd->action, "metadata")) { /* Too verbose to log */ crm_trace("Metadata query for %s", cmd->device); output = NULL; } else if (crm_str_eq(cmd->action, "monitor", TRUE) || crm_str_eq(cmd->action, "list", TRUE) || crm_str_eq(cmd->action, "status", TRUE)) { crm_trace("Never broadcast %s replies", cmd->action); } else if (!stand_alone && safe_str_eq(cmd->origin, cmd->victim) && safe_str_neq(cmd->action, "on")) { crm_trace("Broadcast %s reply for %s", cmd->action, cmd->victim); crm_xml_add(reply, F_SUBTYPE, "broadcast"); bcast = TRUE; } log_operation(cmd, rc, pid, NULL, output); crm_log_xml_trace(reply, "Reply"); if (bcast) { crm_xml_add(reply, F_STONITH_OPERATION, T_STONITH_NOTIFY); send_cluster_message(NULL, crm_msg_stonith_ng, reply, FALSE); } else if (cmd->origin) { crm_trace("Directed reply to %s", cmd->origin); send_cluster_message(crm_get_peer(0, cmd->origin), crm_msg_stonith_ng, reply, FALSE); } else { crm_trace("Directed local %ssync reply to %s", (cmd->options & st_opt_sync_call) ? 
"" : "a-", cmd->client_name); do_local_reply(reply, cmd->client, cmd->options & st_opt_sync_call, FALSE); } if (stand_alone) { /* Do notification with a clean data object */ xmlNode *notify_data = create_xml_node(NULL, T_STONITH_NOTIFY_FENCE); crm_xml_add_int(notify_data, F_STONITH_RC, rc); crm_xml_add(notify_data, F_STONITH_TARGET, cmd->victim); crm_xml_add(notify_data, F_STONITH_OPERATION, cmd->op); crm_xml_add(notify_data, F_STONITH_DELEGATE, "localhost"); crm_xml_add(notify_data, F_STONITH_DEVICE, cmd->device); crm_xml_add(notify_data, F_STONITH_REMOTE_OP_ID, cmd->remote_op_id); crm_xml_add(notify_data, F_STONITH_ORIGIN, cmd->client); do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data); } free_xml(reply); } void unfence_cb(GPid pid, int rc, const char *output, gpointer user_data) { async_command_t * cmd = user_data; stonith_device_t *dev = g_hash_table_lookup(device_list, cmd->device); log_operation(cmd, rc, pid, NULL, output); if(dev) { dev->active_pid = 0; mainloop_set_trigger(dev->work); } else { crm_trace("Device %s does not exist", cmd->device); } if(rc != 0) { crm_exit(DAEMON_RESPAWN_STOP); } } static void cancel_stonith_command(async_command_t * cmd) { stonith_device_t *device; CRM_CHECK(cmd != NULL, return); if (!cmd->device) { return; } device = g_hash_table_lookup(device_list, cmd->device); if (device) { crm_trace("Cancel scheduled %s on %s", cmd->action, device->id); device->pending_ops = g_list_remove(device->pending_ops, cmd); } } static void st_child_done(GPid pid, int rc, const char *output, gpointer user_data) { stonith_device_t *device = NULL; stonith_device_t *next_device = NULL; async_command_t *cmd = user_data; GListPtr gIter = NULL; GListPtr gIterNext = NULL; CRM_CHECK(cmd != NULL, return); /* The device is ready to do something else now */ device = g_hash_table_lookup(device_list, cmd->device); if (device) { device->active_pid = 0; if (rc == pcmk_ok && (safe_str_eq(cmd->action, "list") || safe_str_eq(cmd->action, "monitor") || 
safe_str_eq(cmd->action, "status"))) { device->verified = TRUE; } mainloop_set_trigger(device->work); } crm_debug("Operation '%s' on '%s' completed with rc=%d (%d remaining)", cmd->action, cmd->device, rc, g_list_length(cmd->device_next)); if (rc == 0) { GListPtr iter; /* see if there are any required devices left to execute for this op */ for (iter = cmd->device_next; iter != NULL; iter = iter->next) { next_device = g_hash_table_lookup(device_list, iter->data); if (next_device != NULL && is_action_required(cmd->action, next_device)) { cmd->device_next = iter->next; break; } next_device = NULL; } } else if (rc != 0 && cmd->device_next && (is_action_required(cmd->action, device) == FALSE)) { /* if this device didn't work out, see if there are any others we can try. * if the failed device was 'required', we can't pick another device. */ next_device = g_hash_table_lookup(device_list, cmd->device_next->data); cmd->device_next = cmd->device_next->next; } /* this operation requires more fencing, hooray! */ if (next_device) { log_operation(cmd, rc, pid, cmd->device, output); schedule_stonith_command(cmd, next_device); /* Prevent cmd from being freed */ cmd = NULL; goto done; } if (rc > 0) { /* Try to provide _something_ useful */ if(output == NULL) { rc = -ENODATA; } else if(strstr(output, "imed out")) { rc = -ETIMEDOUT; } else if(strstr(output, "Unrecognised action")) { rc = -EOPNOTSUPP; } else { rc = -pcmk_err_generic; } } stonith_send_async_reply(cmd, output, rc, pid); if (rc != 0) { goto done; } /* Check to see if any operations are scheduled to do the exact * same thing that just completed. If so, rather than * performing the same fencing operation twice, return the result * of this operation for all pending commands it matches. */ for (gIter = cmd_list; gIter != NULL; gIter = gIterNext) { async_command_t *cmd_other = gIter->data; gIterNext = gIter->next; if (cmd == cmd_other) { continue; } /* A pending scheduled command matches the command that just finished if. 
* 1. The client connections are different. * 2. The node victim is the same. * 3. The fencing action is the same. * 4. The device scheduled to execute the action is the same. */ if (safe_str_eq(cmd->client, cmd_other->client) || safe_str_neq(cmd->victim, cmd_other->victim) || safe_str_neq(cmd->action, cmd_other->action) || safe_str_neq(cmd->device, cmd_other->device)) { continue; } + /* Duplicate merging will do the right thing for either type of remapped + * reboot. If the executing stonithd remapped an unsupported reboot to + * off, then cmd->action will be reboot and will be merged with any + * other reboot requests. If the originating stonithd remapped a + * topology reboot to off then on, we will get here once with + * cmd->action "off" and once with "on", and they will be merged + * separately with similar requests. + */ crm_notice ("Merging stonith action %s for node %s originating from client %s with identical stonith request from client %s", cmd_other->action, cmd_other->victim, cmd_other->client_name, cmd->client_name); cmd_list = g_list_remove_link(cmd_list, gIter); stonith_send_async_reply(cmd_other, output, rc, pid); cancel_stonith_command(cmd_other); free_async_command(cmd_other); g_list_free_1(gIter); } done: free_async_command(cmd); } static gint sort_device_priority(gconstpointer a, gconstpointer b) { const stonith_device_t *dev_a = a; const stonith_device_t *dev_b = b; if (dev_a->priority > dev_b->priority) { return -1; } else if (dev_a->priority < dev_b->priority) { return 1; } return 0; } static void stonith_fence_get_devices_cb(GList * devices, void *user_data) { async_command_t *cmd = user_data; stonith_device_t *device = NULL; crm_info("Found %d matching devices for '%s'", g_list_length(devices), cmd->victim); if (g_list_length(devices) > 0) { /* Order based on priority */ devices = g_list_sort(devices, sort_device_priority); device = g_hash_table_lookup(device_list, devices->data); if (device) { cmd->device_list = devices; cmd->device_next = 
devices->next; devices = NULL; /* list owned by cmd now */ } } /* we have a device, schedule it for fencing. */ if (device) { schedule_stonith_command(cmd, device); /* in progress */ return; } /* no device found! */ stonith_send_async_reply(cmd, NULL, -ENODEV, 0); free_async_command(cmd); g_list_free_full(devices, free); } static int stonith_fence(xmlNode * msg) { const char *device_id = NULL; stonith_device_t *device = NULL; async_command_t *cmd = create_async_command(msg); xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, msg, LOG_ERR); if (cmd == NULL) { return -EPROTO; } device_id = crm_element_value(dev, F_STONITH_DEVICE); if (device_id) { device = g_hash_table_lookup(device_list, device_id); if (device == NULL) { crm_err("Requested device '%s' is not available", device_id); return -ENODEV; } schedule_stonith_command(cmd, device); } else { const char *host = crm_element_value(dev, F_STONITH_TARGET); if (cmd->options & st_opt_cs_nodeid) { int nodeid = crm_atoi(host, NULL); crm_node_t *node = crm_get_peer(nodeid, NULL); if (node) { host = node->uname; } } /* If we get to here, then self-fencing is implicitly allowed */ get_capable_devices(host, cmd->action, cmd->default_timeout, TRUE, cmd, stonith_fence_get_devices_cb); } return -EINPROGRESS; } xmlNode * stonith_construct_reply(xmlNode * request, const char *output, xmlNode * data, int rc) { int lpc = 0; xmlNode *reply = NULL; const char *name = NULL; const char *value = NULL; const char *names[] = { F_STONITH_OPERATION, F_STONITH_CALLID, F_STONITH_CLIENTID, F_STONITH_CLIENTNAME, F_STONITH_REMOTE_OP_ID, F_STONITH_CALLOPTS }; crm_trace("Creating a basic reply"); reply = create_xml_node(NULL, T_STONITH_REPLY); crm_xml_add(reply, "st_origin", __FUNCTION__); crm_xml_add(reply, F_TYPE, T_STONITH_NG); crm_xml_add(reply, "st_output", output); crm_xml_add_int(reply, F_STONITH_RC, rc); CRM_CHECK(request != NULL, crm_warn("Can't create a sane reply"); return reply); for (lpc = 0; lpc < DIMOF(names); lpc++) { name = 
names[lpc]; value = crm_element_value(request, name); crm_xml_add(reply, name, value); } if (data != NULL) { crm_trace("Attaching reply output"); add_message_xml(reply, F_STONITH_CALLDATA, data); } return reply; } static xmlNode * stonith_construct_async_reply(async_command_t * cmd, const char *output, xmlNode * data, int rc) { xmlNode *reply = NULL; crm_trace("Creating a basic reply"); reply = create_xml_node(NULL, T_STONITH_REPLY); crm_xml_add(reply, "st_origin", __FUNCTION__); crm_xml_add(reply, F_TYPE, T_STONITH_NG); crm_xml_add(reply, F_STONITH_OPERATION, cmd->op); crm_xml_add(reply, F_STONITH_DEVICE, cmd->device); crm_xml_add(reply, F_STONITH_REMOTE_OP_ID, cmd->remote_op_id); crm_xml_add(reply, F_STONITH_CLIENTID, cmd->client); crm_xml_add(reply, F_STONITH_CLIENTNAME, cmd->client_name); crm_xml_add(reply, F_STONITH_TARGET, cmd->victim); crm_xml_add(reply, F_STONITH_ACTION, cmd->op); crm_xml_add(reply, F_STONITH_ORIGIN, cmd->origin); crm_xml_add_int(reply, F_STONITH_CALLID, cmd->id); crm_xml_add_int(reply, F_STONITH_CALLOPTS, cmd->options); crm_xml_add_int(reply, F_STONITH_RC, rc); crm_xml_add(reply, "st_output", output); if (data != NULL) { crm_info("Attaching reply output"); add_message_xml(reply, F_STONITH_CALLDATA, data); } return reply; } bool fencing_peer_active(crm_node_t *peer) { if (peer == NULL) { return FALSE; } else if (peer->uname == NULL) { return FALSE; } else if (is_set(peer->processes, crm_get_cluster_proc())) { return TRUE; } return FALSE; } /*! * \internal * \brief Determine if we need to use an alternate node to * fence the target. 
If so return that node's uname * * \retval NULL, no alternate host * \retval uname, uname of alternate host to use */ static const char * check_alternate_host(const char *target) { const char *alternate_host = NULL; if (find_topology_for_host(target) && safe_str_eq(target, stonith_our_uname)) { GHashTableIter gIter; crm_node_t *entry = NULL; g_hash_table_iter_init(&gIter, crm_peer_cache); while (g_hash_table_iter_next(&gIter, NULL, (void **)&entry)) { crm_trace("Checking for %s.%d != %s", entry->uname, entry->id, target); if (fencing_peer_active(entry) && safe_str_neq(entry->uname, target)) { alternate_host = entry->uname; break; } } if (alternate_host == NULL) { crm_err("No alternate host available to handle complex self fencing request"); g_hash_table_iter_init(&gIter, crm_peer_cache); while (g_hash_table_iter_next(&gIter, NULL, (void **)&entry)) { crm_notice("Peer[%d] %s", entry->id, entry->uname); } } } return alternate_host; } static void stonith_send_reply(xmlNode * reply, int call_options, const char *remote_peer, const char *client_id) { if (remote_peer) { send_cluster_message(crm_get_peer(0, remote_peer), crm_msg_stonith_ng, reply, FALSE); } else { do_local_reply(reply, client_id, is_set(call_options, st_opt_sync_call), remote_peer != NULL); } } static int handle_request(crm_client_t * client, uint32_t id, uint32_t flags, xmlNode * request, const char *remote_peer) { int call_options = 0; int rc = -EOPNOTSUPP; xmlNode *data = NULL; xmlNode *reply = NULL; char *output = NULL; const char *op = crm_element_value(request, F_STONITH_OPERATION); const char *client_id = crm_element_value(request, F_STONITH_CLIENTID); crm_element_value_int(request, F_STONITH_CALLOPTS, &call_options); if (is_set(call_options, st_opt_sync_call)) { CRM_ASSERT(client == NULL || client->request_id == id); } if (crm_str_eq(op, CRM_OP_REGISTER, TRUE)) { xmlNode *reply = create_xml_node(NULL, "reply"); CRM_ASSERT(client); crm_xml_add(reply, F_STONITH_OPERATION, CRM_OP_REGISTER); 
crm_xml_add(reply, F_STONITH_CLIENTID, client->id); crm_ipcs_send(client, id, reply, flags); client->request_id = 0; free_xml(reply); return 0; } else if (crm_str_eq(op, STONITH_OP_EXEC, TRUE)) { rc = stonith_device_action(request, &output); } else if (crm_str_eq(op, STONITH_OP_TIMEOUT_UPDATE, TRUE)) { const char *call_id = crm_element_value(request, F_STONITH_CALLID); const char *client_id = crm_element_value(request, F_STONITH_CLIENTID); int op_timeout = 0; crm_element_value_int(request, F_STONITH_TIMEOUT, &op_timeout); do_stonith_async_timeout_update(client_id, call_id, op_timeout); return 0; } else if (crm_str_eq(op, STONITH_OP_QUERY, TRUE)) { if (remote_peer) { create_remote_stonith_op(client_id, request, TRUE); /* Record it for the future notification */ } stonith_query(request, remote_peer, client_id, call_options); return 0; } else if (crm_str_eq(op, T_STONITH_NOTIFY, TRUE)) { const char *flag_name = NULL; CRM_ASSERT(client); flag_name = crm_element_value(request, F_STONITH_NOTIFY_ACTIVATE); if (flag_name) { crm_debug("Setting %s callbacks for %s (%s): ON", flag_name, client->name, client->id); client->options |= get_stonith_flag(flag_name); } flag_name = crm_element_value(request, F_STONITH_NOTIFY_DEACTIVATE); if (flag_name) { crm_debug("Setting %s callbacks for %s (%s): off", flag_name, client->name, client->id); /* Deactivation must clear the flag; OR-ing it in would enable the callback instead */ client->options &= ~get_stonith_flag(flag_name); } if (flags & crm_ipc_client_response) { crm_ipcs_send_ack(client, id, flags, "ack", __FUNCTION__, __LINE__); } return 0; } else if (crm_str_eq(op, STONITH_OP_RELAY, TRUE)) { xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, request, LOG_TRACE); crm_notice("Peer %s has received a forwarded fencing request from %s to fence (%s) peer %s", stonith_our_uname, client ?
client->name : remote_peer, crm_element_value(dev, F_STONITH_ACTION), crm_element_value(dev, F_STONITH_TARGET)); if (initiate_remote_stonith_op(NULL, request, FALSE) != NULL) { rc = -EINPROGRESS; } } else if (crm_str_eq(op, STONITH_OP_FENCE, TRUE)) { if (remote_peer || stand_alone) { rc = stonith_fence(request); } else if (call_options & st_opt_manual_ack) { remote_fencing_op_t *rop = NULL; xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, request, LOG_TRACE); const char *target = crm_element_value(dev, F_STONITH_TARGET); crm_notice("Received manual confirmation that %s is fenced", target); rop = initiate_remote_stonith_op(client, request, TRUE); rc = stonith_manual_ack(request, rop); } else { const char *alternate_host = NULL; xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, request, LOG_TRACE); const char *target = crm_element_value(dev, F_STONITH_TARGET); const char *action = crm_element_value(dev, F_STONITH_ACTION); const char *device = crm_element_value(dev, F_STONITH_DEVICE); if (client) { int tolerance = 0; crm_notice("Client %s.%.8s wants to fence (%s) '%s' with device '%s'", client->name, client->id, action, target, device ? device : "(any)"); crm_element_value_int(dev, F_STONITH_TOLERANCE, &tolerance); if (stonith_check_fence_tolerance(tolerance, target, action)) { rc = 0; goto done; } } else { crm_notice("Peer %s wants to fence (%s) '%s' with device '%s'", remote_peer, action, target, device ? 
device : "(any)"); } alternate_host = check_alternate_host(target); if (alternate_host && client) { const char *client_id = NULL; crm_notice("Forwarding complex self fencing request to peer %s", alternate_host); if (client) { client_id = client->id; } else { client_id = crm_element_value(request, F_STONITH_CLIENTID); } /* Create a record of it, otherwise call_id will be 0 if we need to notify of failures */ create_remote_stonith_op(client_id, request, FALSE); crm_xml_add(request, F_STONITH_OPERATION, STONITH_OP_RELAY); crm_xml_add(request, F_STONITH_CLIENTID, client->id); send_cluster_message(crm_get_peer(0, alternate_host), crm_msg_stonith_ng, request, FALSE); rc = -EINPROGRESS; } else if (initiate_remote_stonith_op(client, request, FALSE) != NULL) { rc = -EINPROGRESS; } } } else if (crm_str_eq(op, STONITH_OP_FENCE_HISTORY, TRUE)) { rc = stonith_fence_history(request, &data); } else if (crm_str_eq(op, STONITH_OP_DEVICE_ADD, TRUE)) { const char *id = NULL; xmlNode *notify_data = create_xml_node(NULL, op); rc = stonith_device_register(request, &id, FALSE); crm_xml_add(notify_data, F_STONITH_DEVICE, id); crm_xml_add_int(notify_data, F_STONITH_ACTIVE, g_hash_table_size(device_list)); do_stonith_notify(call_options, op, rc, notify_data); free_xml(notify_data); } else if (crm_str_eq(op, STONITH_OP_DEVICE_DEL, TRUE)) { xmlNode *dev = get_xpath_object("//" F_STONITH_DEVICE, request, LOG_ERR); const char *id = crm_element_value(dev, XML_ATTR_ID); xmlNode *notify_data = create_xml_node(NULL, op); rc = stonith_device_remove(id, FALSE); crm_xml_add(notify_data, F_STONITH_DEVICE, id); crm_xml_add_int(notify_data, F_STONITH_ACTIVE, g_hash_table_size(device_list)); do_stonith_notify(call_options, op, rc, notify_data); free_xml(notify_data); } else if (crm_str_eq(op, STONITH_OP_LEVEL_ADD, TRUE)) { char *id = NULL; xmlNode *notify_data = create_xml_node(NULL, op); rc = stonith_level_register(request, &id); crm_xml_add(notify_data, F_STONITH_DEVICE, id); 
crm_xml_add_int(notify_data, F_STONITH_ACTIVE, g_hash_table_size(topology)); do_stonith_notify(call_options, op, rc, notify_data); free_xml(notify_data); free(id); } else if (crm_str_eq(op, STONITH_OP_LEVEL_DEL, TRUE)) { char *id = NULL; xmlNode *notify_data = create_xml_node(NULL, op); rc = stonith_level_remove(request, &id); crm_xml_add(notify_data, F_STONITH_DEVICE, id); crm_xml_add_int(notify_data, F_STONITH_ACTIVE, g_hash_table_size(topology)); do_stonith_notify(call_options, op, rc, notify_data); free_xml(notify_data); } else if (crm_str_eq(op, STONITH_OP_CONFIRM, TRUE)) { async_command_t *cmd = create_async_command(request); xmlNode *reply = stonith_construct_async_reply(cmd, NULL, NULL, 0); crm_xml_add(reply, F_STONITH_OPERATION, T_STONITH_NOTIFY); crm_notice("Broadcasting manual fencing confirmation for node %s", cmd->victim); send_cluster_message(NULL, crm_msg_stonith_ng, reply, FALSE); free_async_command(cmd); free_xml(reply); } else if(safe_str_eq(op, CRM_OP_RM_NODE_CACHE)) { int id = 0; const char *name = NULL; crm_element_value_int(request, XML_ATTR_ID, &id); name = crm_element_value(request, XML_ATTR_UNAME); reap_crm_member(id, name); return pcmk_ok; } else { crm_err("Unknown %s from %s", op, client ? client->name : remote_peer); crm_log_xml_warn(request, "UnknownOp"); } done: /* Always reply unless the request is in process still. 
* If in progress, a reply will happen async after the request * processing is finished */ if (rc != -EINPROGRESS) { crm_trace("Reply handling: %p %u %u %d %d %s", client, client?client->request_id:0, id, is_set(call_options, st_opt_sync_call), call_options, crm_element_value(request, F_STONITH_CALLOPTS)); if (is_set(call_options, st_opt_sync_call)) { CRM_ASSERT(client == NULL || client->request_id == id); } reply = stonith_construct_reply(request, output, data, rc); stonith_send_reply(reply, call_options, remote_peer, client_id); } free(output); free_xml(data); free_xml(reply); return rc; } static void handle_reply(crm_client_t * client, xmlNode * request, const char *remote_peer) { const char *op = crm_element_value(request, F_STONITH_OPERATION); if (crm_str_eq(op, STONITH_OP_QUERY, TRUE)) { process_remote_stonith_query(request); } else if (crm_str_eq(op, T_STONITH_NOTIFY, TRUE)) { process_remote_stonith_exec(request); } else if (crm_str_eq(op, STONITH_OP_FENCE, TRUE)) { /* Reply to a complex fencing op */ process_remote_stonith_exec(request); } else { crm_err("Unknown %s reply from %s", op, client ? client->name : remote_peer); crm_log_xml_warn(request, "UnknownOp"); } } void stonith_command(crm_client_t * client, uint32_t id, uint32_t flags, xmlNode * request, const char *remote_peer) { int call_options = 0; int rc = 0; gboolean is_reply = FALSE; /* Copy op for reporting. The original might get freed by handle_reply() * before we use it in crm_debug(): * handle_reply() * |- process_remote_stonith_exec() * |-- remote_op_done() * |--- handle_local_reply_and_notify() * |---- crm_xml_add(...F_STONITH_OPERATION...) * |--- free_xml(op->request) */ char *op = crm_element_value_copy(request, F_STONITH_OPERATION); if (get_xpath_object("//" T_STONITH_REPLY, request, LOG_DEBUG_3)) { is_reply = TRUE; } crm_element_value_int(request, F_STONITH_CALLOPTS, &call_options); crm_debug("Processing %s%s %u from %s (%16x)", op, is_reply ? " reply" : "", id, client ? 
client->name : remote_peer, call_options); if (is_set(call_options, st_opt_sync_call)) { CRM_ASSERT(client == NULL || client->request_id == id); } if (is_reply) { handle_reply(client, request, remote_peer); } else { rc = handle_request(client, id, flags, request, remote_peer); } crm_debug("Processed %s%s from %s: %s (%d)", op, is_reply ? " reply" : "", client ? client->name : remote_peer, rc > 0 ? "" : pcmk_strerror(rc), rc); free(op); } diff --git a/fencing/regression.py.in b/fencing/regression.py.in index da6d4dbbf3..b4e6f084fd 100644 --- a/fencing/regression.py.in +++ b/fencing/regression.py.in @@ -1,1081 +1,1158 @@ #!/usr/bin/python # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
import os import sys import subprocess import shlex import time def output_from_command(command): - test = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE, stderr=subprocess.PIPE) - test.wait() + test = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE, stderr=subprocess.PIPE) + test.wait() - return test.communicate()[0].split("\n") + return test.communicate()[0].split("\n") class Test: - def __init__(self, name, description, verbose = 0, with_cpg = 0): - self.name = name - self.description = description - self.cmds = [] - self.verbose = verbose + def __init__(self, name, description, verbose = 0, with_cpg = 0): + self.name = name + self.description = description + self.cmds = [] + self.verbose = verbose - self.result_txt = "" - self.cmd_tool_output = "" - self.result_exitcode = 0; + self.result_txt = "" + self.cmd_tool_output = "" + self.result_exitcode = 0; - self.stonith_options = "-s" - self.enable_corosync = 0 + self.stonith_options = "-s" + self.enable_corosync = 0 - if with_cpg: - self.stonith_options = "-c" - self.enable_corosync = 1 + if with_cpg: + self.stonith_options = "-c" + self.enable_corosync = 1 - self.stonith_process = None - self.stonith_output = "" - self.stonith_patterns = [] - self.negative_stonith_patterns = [] + self.stonith_process = None + self.stonith_output = "" + self.stonith_patterns = [] + self.negative_stonith_patterns = [] - self.executed = 0 + self.executed = 0 - rsc_classes = output_from_command("crm_resource --list-standards") + rsc_classes = output_from_command("crm_resource --list-standards") - def __new_cmd(self, cmd, args, exitcode, stdout_match = "", no_wait = 0, stdout_negative_match = "", kill=None): - self.cmds.append( - { - "cmd" : cmd, - "kill" : kill, - "args" : args, - "expected_exitcode" : exitcode, - "stdout_match" : stdout_match, - "stdout_negative_match" : stdout_negative_match, - "no_wait" : no_wait, - } - ) + def __new_cmd(self, cmd, args, exitcode, stdout_match = "", no_wait = 0, 
stdout_negative_match = "", kill=None): + self.cmds.append( + { + "cmd" : cmd, + "kill" : kill, + "args" : args, + "expected_exitcode" : exitcode, + "stdout_match" : stdout_match, + "stdout_negative_match" : stdout_negative_match, + "no_wait" : no_wait, + } + ) - def stop_pacemaker(self): - cmd = shlex.split("killall -9 -q pacemakerd") - test = subprocess.Popen(cmd, stdout=subprocess.PIPE) - test.wait() + def stop_pacemaker(self): + cmd = shlex.split("killall -9 -q pacemakerd") + test = subprocess.Popen(cmd, stdout=subprocess.PIPE) + test.wait() - def start_environment(self): - ### make sure we are in full control here ### - self.stop_pacemaker() + def start_environment(self): + ### make sure we are in full control here ### + self.stop_pacemaker() - cmd = shlex.split("killall -9 -q stonithd") - test = subprocess.Popen(cmd, stdout=subprocess.PIPE) - test.wait() + cmd = shlex.split("killall -9 -q stonithd") + test = subprocess.Popen(cmd, stdout=subprocess.PIPE) + test.wait() - if self.verbose: - self.stonith_options = self.stonith_options + " -V" - print "Starting stonithd with %s" % self.stonith_options + if self.verbose: + self.stonith_options = self.stonith_options + " -V" + print "Starting stonithd with %s" % self.stonith_options - if os.path.exists("/tmp/stonith-regression.log"): - os.remove('/tmp/stonith-regression.log') + if os.path.exists("/tmp/stonith-regression.log"): + os.remove('/tmp/stonith-regression.log') - self.stonith_process = subprocess.Popen( - shlex.split("@CRM_DAEMON_DIR@/stonithd %s -l /tmp/stonith-regression.log" % self.stonith_options)) + self.stonith_process = subprocess.Popen( + shlex.split("@CRM_DAEMON_DIR@/stonithd %s -l /tmp/stonith-regression.log" % self.stonith_options)) - time.sleep(1) - - def clean_environment(self): - if self.stonith_process: - self.stonith_process.terminate() - self.stonith_process.wait() - - self.stonith_output = "" - self.stonith_process = None - - f = open('/tmp/stonith-regression.log', 'r') - for line in 
f.readlines(): - self.stonith_output = self.stonith_output + line - - if self.verbose: - print "Daemon Output Start" - print self.stonith_output - print "Daemon Output End" - os.remove('/tmp/stonith-regression.log') - - def add_stonith_log_pattern(self, pattern): - self.stonith_patterns.append(pattern) - - def add_stonith_negative_log_pattern(self, pattern): - self.negative_stonith_patterns.append(pattern) - - def add_cmd(self, cmd, args): - self.__new_cmd(cmd, args, 0, "") - - def add_cmd_no_wait(self, cmd, args): - self.__new_cmd(cmd, args, 0, "", 1) - - def add_cmd_check_stdout(self, cmd, args, match, no_match = ""): - self.__new_cmd(cmd, args, 0, match, 0, no_match) - - def add_expected_fail_cmd(self, cmd, args, exitcode = 255): - self.__new_cmd(cmd, args, exitcode, "") - - def get_exitcode(self): - return self.result_exitcode - - def print_result(self, filler): - print "%s%s" % (filler, self.result_txt) - - def run_cmd(self, args): - cmd = shlex.split(args['args']) - cmd.insert(0, args['cmd']) - - if self.verbose: - print "\n\nRunning: "+" ".join(cmd) - test = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) - - if args['kill']: - if self.verbose: - print "Also running: "+args['kill'] - subprocess.Popen(shlex.split(args['kill'])) - - if args['no_wait'] == 0: - test.wait() - else: - return 0 - - output_res = test.communicate() - output = output_res[0] + output_res[1] - - if self.verbose: - print output - - if args['stdout_match'] != "" and output.count(args['stdout_match']) == 0: - test.returncode = -2 - print "STDOUT string '%s' was not found in cmd output: %s" % (args['stdout_match'], output) - - if args['stdout_negative_match'] != "" and output.count(args['stdout_negative_match']) != 0: - test.returncode = -2 - print "STDOUT string '%s' was found in cmd output: %s" % (args['stdout_negative_match'], output) - - return test.returncode; - - - def count_negative_matches(self, outline): - count = 0 - for line in 
self.negative_stonith_patterns: - if outline.count(line): - count = 1 - if self.verbose: - print "This pattern should not have matched = '%s" % (line) - return count - - def match_stonith_patterns(self): - negative_matches = 0 - cur = 0 - pats = self.stonith_patterns - total_patterns = len(self.stonith_patterns) - - if len(self.stonith_patterns) == 0: - return - - for line in self.stonith_output.split("\n"): - negative_matches = negative_matches + self.count_negative_matches(line) - if len(pats) == 0: - continue - cur = -1 - for p in pats: - cur = cur + 1 - if line.count(pats[cur]): - del pats[cur] - break - - if len(pats) > 0 or negative_matches: - if self.verbose: - for p in pats: - print "Pattern Not Matched = '%s'" % p - - self.result_txt = "FAILURE - '%s' failed. %d patterns out of %d not matched. %d negative matches." % (self.name, len(pats), total_patterns, negative_matches) - self.result_exitcode = -1 - - def run(self): - res = 0 - i = 1 - self.start_environment() - - if self.verbose: - print "\n--- START TEST - %s" % self.name - - self.result_txt = "SUCCESS - '%s'" % (self.name) - self.result_exitcode = 0 - for cmd in self.cmds: - res = self.run_cmd(cmd) - if res != cmd['expected_exitcode']: - print "Step %d FAILED - command returned %d, expected %d" % (i, res, cmd['expected_exitcode']) - self.result_txt = "FAILURE - '%s' failed at step %d. 
Command: %s %s" % (self.name, i, cmd['cmd'], cmd['args']) - self.result_exitcode = -1 - break - else: - if self.verbose: - print "Step %d SUCCESS" % (i) - i = i + 1 - self.clean_environment() - - if self.result_exitcode == 0: - self.match_stonith_patterns() - - print self.result_txt - if self.verbose: - print "--- END TEST - %s\n" % self.name - - self.executed = 1 - return res + time.sleep(1) + + def clean_environment(self): + if self.stonith_process: + self.stonith_process.terminate() + self.stonith_process.wait() + + self.stonith_output = "" + self.stonith_process = None + + f = open('/tmp/stonith-regression.log', 'r') + for line in f.readlines(): + self.stonith_output = self.stonith_output + line + + if self.verbose: + print "Daemon Output Start" + print self.stonith_output + print "Daemon Output End" + os.remove('/tmp/stonith-regression.log') + + def add_stonith_log_pattern(self, pattern): + self.stonith_patterns.append(pattern) + + def add_stonith_negative_log_pattern(self, pattern): + self.negative_stonith_patterns.append(pattern) + + def add_cmd(self, cmd, args): + self.__new_cmd(cmd, args, 0, "") + + def add_cmd_no_wait(self, cmd, args): + self.__new_cmd(cmd, args, 0, "", 1) + + def add_cmd_check_stdout(self, cmd, args, match, no_match = ""): + self.__new_cmd(cmd, args, 0, match, 0, no_match) + + def add_expected_fail_cmd(self, cmd, args, exitcode = 255): + self.__new_cmd(cmd, args, exitcode, "") + + def get_exitcode(self): + return self.result_exitcode + + def print_result(self, filler): + print "%s%s" % (filler, self.result_txt) + + def run_cmd(self, args): + cmd = shlex.split(args['args']) + cmd.insert(0, args['cmd']) + + if self.verbose: + print "\n\nRunning: "+" ".join(cmd) + test = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + + if args['kill']: + if self.verbose: + print "Also running: "+args['kill'] + subprocess.Popen(shlex.split(args['kill'])) + + if args['no_wait'] == 0: + test.wait() + else: + return 0 + + output_res = 
test.communicate() + output = output_res[0] + output_res[1] + + if self.verbose: + print output + + if args['stdout_match'] != "" and output.count(args['stdout_match']) == 0: + test.returncode = -2 + print "STDOUT string '%s' was not found in cmd output: %s" % (args['stdout_match'], output) + + if args['stdout_negative_match'] != "" and output.count(args['stdout_negative_match']) != 0: + test.returncode = -2 + print "STDOUT string '%s' was found in cmd output: %s" % (args['stdout_negative_match'], output) + + return test.returncode; + + + def count_negative_matches(self, outline): + count = 0 + for line in self.negative_stonith_patterns: + if outline.count(line): + count = 1 + if self.verbose: + print "This pattern should not have matched = '%s" % (line) + return count + + def match_stonith_patterns(self): + negative_matches = 0 + cur = 0 + pats = self.stonith_patterns + total_patterns = len(self.stonith_patterns) + + if len(self.stonith_patterns) == 0: + return + + for line in self.stonith_output.split("\n"): + negative_matches = negative_matches + self.count_negative_matches(line) + if len(pats) == 0: + continue + cur = -1 + for p in pats: + cur = cur + 1 + if line.count(pats[cur]): + del pats[cur] + break + + if len(pats) > 0 or negative_matches: + if self.verbose: + for p in pats: + print "Pattern Not Matched = '%s'" % p + + self.result_txt = "FAILURE - '%s' failed. %d patterns out of %d not matched. %d negative matches." % (self.name, len(pats), total_patterns, negative_matches) + self.result_exitcode = -1 + + def run(self): + res = 0 + i = 1 + self.start_environment() + + if self.verbose: + print "\n--- START TEST - %s" % self.name + + self.result_txt = "SUCCESS - '%s'" % (self.name) + self.result_exitcode = 0 + for cmd in self.cmds: + res = self.run_cmd(cmd) + if res != cmd['expected_exitcode']: + print "Step %d FAILED - command returned %d, expected %d" % (i, res, cmd['expected_exitcode']) + self.result_txt = "FAILURE - '%s' failed at step %d. 
Command: %s %s" % (self.name, i, cmd['cmd'], cmd['args']) + self.result_exitcode = -1 + break + else: + if self.verbose: + print "Step %d SUCCESS" % (i) + i = i + 1 + self.clean_environment() + + if self.result_exitcode == 0: + self.match_stonith_patterns() + + print self.result_txt + if self.verbose: + print "--- END TEST - %s\n" % self.name + + self.executed = 1 + return res class Tests: - def __init__(self, verbose = 0): - self.tests = [] - self.verbose = verbose - self.autogen_corosync_cfg = 0 - if not os.path.exists("/etc/corosync/corosync.conf"): - self.autogen_corosync_cfg = 1 - - def new_test(self, name, description, with_cpg = 0): - test = Test(name, description, self.verbose, with_cpg) - self.tests.append(test) - return test - - def print_list(self): - print "\n==== %d TESTS FOUND ====" % (len(self.tests)) - print "%35s - %s" % ("TEST NAME", "TEST DESCRIPTION") - print "%35s - %s" % ("--------------------", "--------------------") - for test in self.tests: - print "%35s - %s" % (test.name, test.description) - print "==== END OF LIST ====\n" - - - def start_corosync(self): - if self.verbose: - print "Starting corosync" - - test = subprocess.Popen("corosync", stdout=subprocess.PIPE) - test.wait() - time.sleep(10) - - def stop_corosync(self): - cmd = shlex.split("killall -9 -q corosync") - test = subprocess.Popen(cmd, stdout=subprocess.PIPE) - test.wait() - - def run_single(self, name): - for test in self.tests: - if test.name == name: - test.run() - break; - - def run_tests_matching(self, pattern): - for test in self.tests: - if test.name.count(pattern) != 0: - test.run() - - def run_cpg_only(self): - for test in self.tests: - if test.enable_corosync: - test.run() - - def run_no_cpg(self): - for test in self.tests: - if not test.enable_corosync: - test.run() - - def run_tests(self): - for test in self.tests: - test.run() - - def exit(self): - for test in self.tests: - if test.executed == 0: - continue - - if test.get_exitcode() != 0: - sys.exit(-1) - - 
sys.exit(0) - - def print_results(self): - failures = 0; - success = 0; - print "\n\n======= FINAL RESULTS ==========" - print "\n--- FAILURE RESULTS:" - for test in self.tests: - if test.executed == 0: - continue - - if test.get_exitcode() != 0: - failures = failures + 1 - test.print_result(" ") - else: - success = success + 1 - - if failures == 0: - print " None" - - print "\n--- TOTALS\n Pass:%d\n Fail:%d\n" % (success, failures) - def build_api_sanity_tests(self): - verbose_arg = "" - if self.verbose: - verbose_arg = "-V" - - test = self.new_test("standalone_low_level_api_test", "Sanity test client api in standalone mode.") - test.add_cmd("@CRM_DAEMON_DIR@/stonith-test", "-t %s" % (verbose_arg)) - - test = self.new_test("cpg_low_level_api_test", "Sanity test client api using mainloop and cpg.", 1) - test.add_cmd("@CRM_DAEMON_DIR@/stonith-test", "-m %s" % (verbose_arg)) - - def build_custom_timeout_tests(self): - # custom timeout without topology - test = self.new_test("cpg_custom_timeout_1", - "Verify per device timeouts work as expected without using topology.", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=1\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=4\"") - test.add_cmd("stonith_admin", "-F node3 -t 2") - # timeout is 2+1+4 = 7 - test.add_stonith_log_pattern("remote op timeout set to 7") - - # custom timeout _WITH_ topology - test = self.new_test("cpg_custom_timeout_2", - "Verify per device timeouts work as expected _WITH_ topology.", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=1\"") 
- test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=4000\"") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v false2") - test.add_cmd("stonith_admin", "-F node3 -t 2") - # timeout is 2+1+4000 = 4003 - test.add_stonith_log_pattern("remote op timeout set to 4003") - - def build_fence_merge_tests(self): - - ### Simple test that overlapping fencing operations get merged - test = self.new_test("cpg_custom_merge_single", - "Verify overlapping identical fencing operations are merged, no fencing levels used.", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" ") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd("stonith_admin", "-F node3 -t 10") - ### one merger will happen - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - ### the pattern below signifies that both the original and duplicate operation completed - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - - ### Test that multiple mergers occur - test = self.new_test("cpg_custom_merge_multiple", - "Verify multiple overlapping identical fencing operations are merged", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"delay=2\" -o \"pcmk_host_list=node3\" ") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - 
test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd("stonith_admin", "-F node3 -t 10") - ### 4 mergers should occur - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - ### the pattern below signifies that both the original and duplicate operation completed - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - - ### Test that multiple mergers occur with topologies used - test = self.new_test("cpg_custom_merge_with_topology", - "Verify multiple overlapping identical fencing operations are merged with fencing levels.", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" ") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false2") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - 
test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") - test.add_cmd("stonith_admin", "-F node3 -t 10") - ### 4 mergers should occur - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") - ### the pattern below signifies that both the original and duplicate operation completed - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - test.add_stonith_log_pattern("Operation off of node3 by") - - - test = self.new_test("cpg_custom_no_merge", - "Verify differing fencing operations are not merged", 1) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3 node2\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3 node2\" ") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3 node2\"") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false2") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") - test.add_cmd_no_wait("stonith_admin", "-F node2 -t 10") - test.add_cmd("stonith_admin", "-F node3 -t 10") - test.add_stonith_negative_log_pattern("Merging stonith action off for node node3 originating from client") - - def build_standalone_tests(self): - test_types = [ - { - "prefix" : "standalone" , - "use_cpg" : 0, - }, - { - "prefix" : "cpg" , - "use_cpg" : 1, - }, - ] - - # test what happens when all devices timeout - for test_type in 
test_types: - test = self.new_test("%s_fence_multi_device_failure" % test_type["prefix"], - "Verify that all devices timeout, a fencing failure is returned.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false3 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - if test_type["use_cpg"] == 1: - test.add_expected_fail_cmd("stonith_admin", "-F node3 -t 2", 194) - test.add_stonith_log_pattern("remote op timeout set to 6") - else: - test.add_expected_fail_cmd("stonith_admin", "-F node3 -t 2", 55) - - test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: ") - test.add_stonith_log_pattern("for host 'node3' with device 'false2' returned: ") - test.add_stonith_log_pattern("for host 'node3' with device 'false3' returned: ") - - # test what happens when multiple devices can fence a node, but the first device fails. 
- for test_type in test_types: - test = self.new_test("%s_fence_device_failure_rollover" % test_type["prefix"], - "Verify that when one fence device fails for a node, the others are tried.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-F node3 -t 2") - - if test_type["use_cpg"] == 1: - test.add_stonith_log_pattern("remote op timeout set to 6") - - # simple topology test for one device - for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_simple" % test_type["prefix"], - "Verify all fencing devices at a level are used.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v true") - test.add_cmd("stonith_admin", "-F node3 -t 2") - - test.add_stonith_log_pattern("remote op timeout set to 2") - test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") - - - # add topology, delete topology, verify fencing still works - for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_add_remove" % test_type["prefix"], - "Verify fencing occurrs after all topology levels are removed", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v true") - test.add_cmd("stonith_admin", "-d node3 -i 1") - test.add_cmd("stonith_admin", "-F node3 -t 2") - - test.add_stonith_log_pattern("remote op timeout set to 2") - 
test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") - - # test what happens when the first fencing level has multiple devices. - for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_device_fails" % test_type["prefix"], - "Verify if one device in a level fails, the other is tried.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R false -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true") - test.add_cmd("stonith_admin", "-F node3 -t 20") - - test.add_stonith_log_pattern("remote op timeout set to 40") - test.add_stonith_log_pattern("for host 'node3' with device 'false' returned: -201") - test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") - - # test what happens when the first fencing level fails. 
- for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_multi_level_fails" % test_type["prefix"], - "Verify if one level fails, the next leve is tried.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") - - test.add_cmd("stonith_admin", "-F node3 -t 3") - - test.add_stonith_log_pattern("remote op timeout set to 18") - test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: -201") - test.add_stonith_log_pattern("for host 'node3' with device 'false2' returned: -201") - test.add_stonith_log_pattern("for host 'node3' with device 'true3' returned: 0") - test.add_stonith_log_pattern("for host 'node3' with device 'true4' returned: 0") - - - # test what happens when the first fencing level had devices that no one has registered - for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_missing_devices" % test_type["prefix"], - "Verify 
topology can continue with missing devices.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") - - test.add_cmd("stonith_admin", "-F node3 -t 2") - - # Test what happens if multiple fencing levels are defined, and then the first one is removed. 
- for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - - test = self.new_test("%s_topology_level_removal" % test_type["prefix"], - "Verify level removal works.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") - test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") - - test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") - test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") - - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") - test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") - - # Now remove level 2, verify none of the devices in level two are hit. - test.add_cmd("stonith_admin", "-d node3 -i 2") - - test.add_cmd("stonith_admin", "-F node3 -t 20") - - test.add_stonith_log_pattern("remote op timeout set to 8") - test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: -201") - test.add_stonith_negative_log_pattern("for host 'node3' with device 'false2' returned: ") - test.add_stonith_log_pattern("for host 'node3' with device 'true3' returned: 0") - test.add_stonith_log_pattern("for host 'node3' with device 'true4' returned: 0") - - # test the stonith builds the correct list of devices that can fence a node. 
- for test_type in test_types: - test = self.new_test("%s_list_devices" % test_type["prefix"], - "Verify list of devices that can fence a node is correct", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - - test.add_cmd_check_stdout("stonith_admin", "-l node1 -V", "true2", "true1") - test.add_cmd_check_stdout("stonith_admin", "-l node1 -V", "true3", "true1") - - # simple test of device monitor - for test_type in test_types: - test = self.new_test("%s_monitor" % test_type["prefix"], - "Verify device is reachable", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") - - test.add_cmd("stonith_admin", "-Q true1") - test.add_cmd("stonith_admin", "-Q false1") - test.add_expected_fail_cmd("stonith_admin", "-Q true2", 237) - - # Verify monitor occurs for duration of timeout period on failure - for test_type in test_types: - test = self.new_test("%s_monitor_timeout" % test_type["prefix"], - "Verify monitor uses duration of timeout period given.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_monitor_fail -o \"pcmk_host_list=node3\"") - test.add_expected_fail_cmd("stonith_admin", "-Q true1 -t 5", 195) - test.add_stonith_log_pattern("Attempt 2 to execute") - - # Verify monitor occurs for duration of timeout period on failure, but stops at max retries - for test_type in test_types: - test = self.new_test("%s_monitor_timeout_max_retries" % test_type["prefix"], - "Verify monitor retries until max retry value or timeout is hit.", test_type["use_cpg"]) - 
test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_monitor_fail -o \"pcmk_host_list=node3\"") - test.add_expected_fail_cmd("stonith_admin", "-Q true1 -t 15",195) - test.add_stonith_log_pattern("Attempted to execute agent fence_dummy_monitor_fail (list) the maximum number of times") - - # simple register test - for test_type in test_types: - test = self.new_test("%s_register" % test_type["prefix"], - "Verify devices can be registered and un-registered", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") - - test.add_cmd("stonith_admin", "-Q true1") - - test.add_cmd("stonith_admin", "-D true1") - - test.add_expected_fail_cmd("stonith_admin", "-Q true1", 237) - - - # simple reboot test - for test_type in test_types: - test = self.new_test("%s_reboot" % test_type["prefix"], - "Verify devices can be rebooted", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") - - test.add_cmd("stonith_admin", "-B node3 -t 2") - - test.add_cmd("stonith_admin", "-D true1") - - test.add_expected_fail_cmd("stonith_admin", "-Q true1", 237) - - # test fencing history. 
- for test_type in test_types: - if test_type["use_cpg"] == 0: - continue - test = self.new_test("%s_fence_history" % test_type["prefix"], - "Verify last fencing operation is returned.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") - - test.add_cmd("stonith_admin", "-F node3 -t 2 -V") - - test.add_cmd_check_stdout("stonith_admin", "-H node3", "was able to turn off node node3", "") - - # simple test of dynamic list query - for test_type in test_types: - test = self.new_test("%s_dynamic_list_query" % test_type["prefix"], - "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_list") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_list") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_list") - - test.add_cmd_check_stdout("stonith_admin", "-l fake_port_1", "3 devices found") - - - # fence using dynamic list query - for test_type in test_types: - test = self.new_test("%s_fence_dynamic_list_query" % test_type["prefix"], - "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_list") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_list") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_list") - - test.add_cmd("stonith_admin", "-F fake_port_1 -t 5 -V"); - - # simple test of query using status action - for test_type in test_types: - test = self.new_test("%s_status_query" % test_type["prefix"], - "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") - - 
test.add_cmd_check_stdout("stonith_admin", "-l fake_port_1", "3 devices found") - - # test what happens when no reboot action is advertised - for test_type in test_types: - test = self.new_test("%s_no_reboot_support" % test_type["prefix"], - "Verify reboot action defaults to off when no reboot action is advertised by agent.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_no_reboot -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-B node1 -t 5 -V"); - test.add_stonith_log_pattern("does not advertise support for 'reboot', performing 'off'") - test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); - - # make sure reboot is used when reboot action is advertised - for test_type in test_types: - test = self.new_test("%s_with_reboot_support" % test_type["prefix"], - "Verify reboot action can be used when metadata advertises it.", test_type["use_cpg"]) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") - test.add_cmd("stonith_admin", "-B node1 -t 5 -V"); - test.add_stonith_negative_log_pattern("does not advertise support for 'reboot', performing 'off'") - test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); - - def build_nodeid_tests(self): - our_uname = output_from_command("uname -n") - if our_uname: - our_uname = our_uname[0] - - ### verify nodeid is supplied when nodeid is in the metadata parameters - test = self.new_test("cpg_supply_nodeid", - "Verify nodeid is given when fence agent has nodeid as parameter", 1) - - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) - test.add_stonith_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) - - ### verify nodeid is _NOT_ supplied when nodeid is not in the metadata parameters - test = 
self.new_test("cpg_do_not_supply_nodeid", - "Verify nodeid is _NOT_ given when fence agent does not have nodeid as parameter", 1) - - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) - test.add_stonith_negative_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) - - ### verify nodeid use doesn't explode standalone mode - test = self.new_test("standalone_do_not_supply_nodeid", - "Verify nodeid in metadata parameter list doesn't kill standalone mode", 0) - - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) - test.add_stonith_negative_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) - - - def build_unfence_tests(self): - our_uname = output_from_command("uname -n") - if our_uname: - our_uname = our_uname[0] - - ### verify unfencing using automatic unfencing - test = self.new_test("cpg_unfence_required_1", - "Verify require unfencing on all devices when automatic=true in agent's metadata", 1) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) - # both devices should be executed - test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); - test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); - - - ### verify unfencing using automatic unfencing fails if any of the required agents fail - test = self.new_test("cpg_unfence_required_2", - "Verify require unfencing on all devices when automatic=true in agent's metadata", 1) - test.add_cmd("stonith_admin", "-R true1 -a 
fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=fail\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_expected_fail_cmd("stonith_admin", "-U %s -t 6" % (our_uname), 143) - - ### verify unfencing using automatic devices with topology - test = self.new_test("cpg_unfence_required_3", - "Verify require unfencing on all devices even when required devices are at different topology levels", 1) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) - test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) - test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); - test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); - - - ### verify unfencing using automatic devices with topology - test = self.new_test("cpg_unfence_required_4", - "Verify all required devices are executed even with topology levels fail.", 1) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true4 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % 
(our_uname)) - test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R false3 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R false4 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 1 -v false1" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v false2" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v false3" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v true3" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 3 -v false4" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 4 -v true4" % (our_uname)) - test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) - test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); - test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); - test.add_stonith_log_pattern("with device 'true3' returned: 0 (OK)"); - test.add_stonith_log_pattern("with device 'true4' returned: 0 (OK)"); - - ### verify unfencing using on_target device - test = self.new_test("cpg_unfence_on_target_1", - "Verify unfencing with on_target = true", 1) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) - test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) - test.add_stonith_log_pattern("(on) to be executed on the target node") - - - ### verify failure of unfencing using on_target device - test = self.new_test("cpg_unfence_on_target_2", - "Verify failure unfencing with on_target = true", 1) - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake_1234\"" % (our_uname)) - 
test.add_expected_fail_cmd("stonith_admin", "-U node_fake_1234 -t 3", 237) - test.add_stonith_log_pattern("(on) to be executed on the target node") - - - ### verify unfencing using on_target device with topology - test = self.new_test("cpg_unfence_on_target_3", - "Verify unfencing with on_target = true using topology", 1) - - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) - - test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) - test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) - - test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) - test.add_stonith_log_pattern("(on) to be executed on the target node") - - ### verify unfencing using on_target device with topology fails when victim node doesn't exist - test = self.new_test("cpg_unfence_on_target_4", - "Verify unfencing failure with on_target = true using topology", 1) - - test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake\"" % (our_uname)) - test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake\"" % (our_uname)) - - test.add_cmd("stonith_admin", "-r node_fake -i 1 -v true1") - test.add_cmd("stonith_admin", "-r node_fake -i 2 -v true2") - - test.add_expected_fail_cmd("stonith_admin", "-U node_fake -t 3", 237) - test.add_stonith_log_pattern("(on) to be executed on the target node") - - - def setup_environment(self, use_corosync): - if self.autogen_corosync_cfg and use_corosync: - corosync_conf = (""" + def __init__(self, verbose = 0): + self.tests = [] + self.verbose = verbose + self.autogen_corosync_cfg = 0 + if not os.path.exists("/etc/corosync/corosync.conf"): + self.autogen_corosync_cfg = 1 + + def new_test(self, name, description, with_cpg = 0): + test = Test(name, description, 
self.verbose, with_cpg) + self.tests.append(test) + return test + + def print_list(self): + print "\n==== %d TESTS FOUND ====" % (len(self.tests)) + print "%35s - %s" % ("TEST NAME", "TEST DESCRIPTION") + print "%35s - %s" % ("--------------------", "--------------------") + for test in self.tests: + print "%35s - %s" % (test.name, test.description) + print "==== END OF LIST ====\n" + + + def start_corosync(self): + if self.verbose: + print "Starting corosync" + + test = subprocess.Popen("corosync", stdout=subprocess.PIPE) + test.wait() + time.sleep(10) + + def stop_corosync(self): + cmd = shlex.split("killall -9 -q corosync") + test = subprocess.Popen(cmd, stdout=subprocess.PIPE) + test.wait() + + def run_single(self, name): + for test in self.tests: + if test.name == name: + test.run() + break; + + def run_tests_matching(self, pattern): + for test in self.tests: + if test.name.count(pattern) != 0: + test.run() + + def run_cpg_only(self): + for test in self.tests: + if test.enable_corosync: + test.run() + + def run_no_cpg(self): + for test in self.tests: + if not test.enable_corosync: + test.run() + + def run_tests(self): + for test in self.tests: + test.run() + + def exit(self): + for test in self.tests: + if test.executed == 0: + continue + + if test.get_exitcode() != 0: + sys.exit(-1) + + sys.exit(0) + + def print_results(self): + failures = 0; + success = 0; + print "\n\n======= FINAL RESULTS ==========" + print "\n--- FAILURE RESULTS:" + for test in self.tests: + if test.executed == 0: + continue + + if test.get_exitcode() != 0: + failures = failures + 1 + test.print_result(" ") + else: + success = success + 1 + + if failures == 0: + print " None" + + print "\n--- TOTALS\n Pass:%d\n Fail:%d\n" % (success, failures) + def build_api_sanity_tests(self): + verbose_arg = "" + if self.verbose: + verbose_arg = "-V" + + test = self.new_test("standalone_low_level_api_test", "Sanity test client api in standalone mode.") + test.add_cmd("@CRM_DAEMON_DIR@/stonith-test", 
"-t %s" % (verbose_arg)) + + test = self.new_test("cpg_low_level_api_test", "Sanity test client api using mainloop and cpg.", 1) + test.add_cmd("@CRM_DAEMON_DIR@/stonith-test", "-m %s" % (verbose_arg)) + + def build_custom_timeout_tests(self): + # custom timeout without topology + test = self.new_test("cpg_custom_timeout_1", + "Verify per device timeouts work as expected without using topology.", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=1\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=4\"") + test.add_cmd("stonith_admin", "-F node3 -t 2") + # timeout is 2+1+4 = 7 + test.add_stonith_log_pattern("remote op timeout set to 7") + + # custom timeout _WITH_ topology + test = self.new_test("cpg_custom_timeout_2", + "Verify per device timeouts work as expected _WITH_ topology.", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=1\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\" -o \"pcmk_off_timeout=4000\"") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v false2") + test.add_cmd("stonith_admin", "-F node3 -t 2") + # timeout is 2+1+4000 = 4003 + test.add_stonith_log_pattern("remote op timeout set to 4003") + + def build_fence_merge_tests(self): + + ### Simple test that overlapping fencing operations get merged + test = self.new_test("cpg_custom_merge_single", + "Verify overlapping identical fencing operations are merged, no fencing 
levels used.", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" ") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd("stonith_admin", "-F node3 -t 10") + ### one merger will happen + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + ### the pattern below signifies that both the original and duplicate operation completed + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + + ### Test that multiple mergers occur + test = self.new_test("cpg_custom_merge_multiple", + "Verify multiple overlapping identical fencing operations are merged", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"delay=2\" -o \"pcmk_host_list=node3\" ") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd("stonith_admin", "-F node3 -t 10") + ### 4 mergers should occur + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + ### 
the pattern below signifies that both the original and duplicate operation completed + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + + ### Test that multiple mergers occur with topologies used + test = self.new_test("cpg_custom_merge_with_topology", + "Verify multiple overlapping identical fencing operations are merged with fencing levels.", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\" ") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false2") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd_no_wait("stonith_admin", "-F node3 -t 10") + test.add_cmd("stonith_admin", "-F node3 -t 10") + ### 4 mergers should occur + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + test.add_stonith_log_pattern("Merging stonith action off for node node3 originating from client") + ### the pattern below signifies that both the original and duplicate operation completed + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off 
of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + test.add_stonith_log_pattern("Operation off of node3 by") + + + test = self.new_test("cpg_custom_no_merge", + "Verify differing fencing operations are not merged", 1) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3 node2\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3 node2\" ") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3 node2\"") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false2") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true1") + test.add_cmd_no_wait("stonith_admin", "-F node2 -t 10") + test.add_cmd("stonith_admin", "-F node3 -t 10") + test.add_stonith_negative_log_pattern("Merging stonith action off for node node3 originating from client") + + def build_standalone_tests(self): + test_types = [ + { + "prefix" : "standalone" , + "use_cpg" : 0, + }, + { + "prefix" : "cpg" , + "use_cpg" : 1, + }, + ] + + # test what happens when all devices timeout + for test_type in test_types: + test = self.new_test("%s_fence_multi_device_failure" % test_type["prefix"], + "Verify that all devices timeout, a fencing failure is returned.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false3 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + if test_type["use_cpg"] == 1: + test.add_expected_fail_cmd("stonith_admin", "-F node3 -t 2", 194) + test.add_stonith_log_pattern("remote op timeout set to 6") + else: + 
test.add_expected_fail_cmd("stonith_admin", "-F node3 -t 2", 55) + + test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: ") + test.add_stonith_log_pattern("for host 'node3' with device 'false2' returned: ") + test.add_stonith_log_pattern("for host 'node3' with device 'false3' returned: ") + + # test what happens when multiple devices can fence a node, but the first device fails. + for test_type in test_types: + test = self.new_test("%s_fence_device_failure_rollover" % test_type["prefix"], + "Verify that when one fence device fails for a node, the others are tried.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-F node3 -t 2") + + if test_type["use_cpg"] == 1: + test.add_stonith_log_pattern("remote op timeout set to 6") + + # simple topology test for one device + for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_simple" % test_type["prefix"], + "Verify all fencing devices at a level are used.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v true") + test.add_cmd("stonith_admin", "-F node3 -t 2") + + test.add_stonith_log_pattern("remote op timeout set to 2") + test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") + + + # add topology, delete topology, verify fencing still works + for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_add_remove" % test_type["prefix"], + "Verify fencing occurs after all 
topology levels are removed", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v true") + test.add_cmd("stonith_admin", "-d node3 -i 1") + test.add_cmd("stonith_admin", "-F node3 -t 2") + + test.add_stonith_log_pattern("remote op timeout set to 2") + test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") + + # test what happens when the first fencing level has multiple devices. + for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_device_fails" % test_type["prefix"], + "Verify if one device in a level fails, the other is tried.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R false -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true") + test.add_cmd("stonith_admin", "-F node3 -t 20") + + test.add_stonith_log_pattern("remote op timeout set to 40") + test.add_stonith_log_pattern("for host 'node3' with device 'false' returned: -201") + test.add_stonith_log_pattern("for host 'node3' with device 'true' returned: 0") + + # test what happens when the first fencing level fails. 
+ for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_multi_level_fails" % test_type["prefix"], + "Verify if one level fails, the next level is tried.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") + + test.add_cmd("stonith_admin", "-F node3 -t 3") + + test.add_stonith_log_pattern("remote op timeout set to 18") + test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: -201") + test.add_stonith_log_pattern("for host 'node3' with device 'false2' returned: -201") + test.add_stonith_log_pattern("for host 'node3' with device 'true3' returned: 0") + test.add_stonith_log_pattern("for host 'node3' with device 'true4' returned: 0") + + + # test what happens when the first fencing level had devices that no one has registered + for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_missing_devices" % test_type["prefix"], + "Verify 
topology can continue with missing devices.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") + + test.add_cmd("stonith_admin", "-F node3 -t 2") + + # Test what happens if multiple fencing levels are defined, and then the first one is removed. 
+ for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + + test = self.new_test("%s_topology_level_removal" % test_type["prefix"], + "Verify level removal works.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true4 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd("stonith_admin", "-r node3 -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node3 -i 1 -v true1") + + test.add_cmd("stonith_admin", "-r node3 -i 2 -v true2") + test.add_cmd("stonith_admin", "-r node3 -i 2 -v false2") + + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true3") + test.add_cmd("stonith_admin", "-r node3 -i 3 -v true4") + + # Now remove level 2, verify none of the devices in level two are hit. + test.add_cmd("stonith_admin", "-d node3 -i 2") + + test.add_cmd("stonith_admin", "-F node3 -t 20") + + test.add_stonith_log_pattern("remote op timeout set to 8") + test.add_stonith_log_pattern("for host 'node3' with device 'false1' returned: -201") + test.add_stonith_negative_log_pattern("for host 'node3' with device 'false2' returned: ") + test.add_stonith_log_pattern("for host 'node3' with device 'true3' returned: 0") + test.add_stonith_log_pattern("for host 'node3' with device 'true4' returned: 0") + + # test the stonith builds the correct list of devices that can fence a node. 
+ for test_type in test_types: + test = self.new_test("%s_list_devices" % test_type["prefix"], + "Verify list of devices that can fence a node is correct", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + + test.add_cmd_check_stdout("stonith_admin", "-l node1 -V", "true2", "true1") + test.add_cmd_check_stdout("stonith_admin", "-l node1 -V", "true3", "true1") + + # simple test of device monitor + for test_type in test_types: + test = self.new_test("%s_monitor" % test_type["prefix"], + "Verify device is reachable", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=node3\"") + + test.add_cmd("stonith_admin", "-Q true1") + test.add_cmd("stonith_admin", "-Q false1") + test.add_expected_fail_cmd("stonith_admin", "-Q true2", 237) + + # Verify monitor occurs for duration of timeout period on failure + for test_type in test_types: + test = self.new_test("%s_monitor_timeout" % test_type["prefix"], + "Verify monitor uses duration of timeout period given.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_monitor_fail -o \"pcmk_host_list=node3\"") + test.add_expected_fail_cmd("stonith_admin", "-Q true1 -t 5", 195) + test.add_stonith_log_pattern("Attempt 2 to execute") + + # Verify monitor occurs for duration of timeout period on failure, but stops at max retries + for test_type in test_types: + test = self.new_test("%s_monitor_timeout_max_retries" % test_type["prefix"], + "Verify monitor retries until max retry value or timeout is hit.", test_type["use_cpg"]) + 
test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_monitor_fail -o \"pcmk_host_list=node3\"") + test.add_expected_fail_cmd("stonith_admin", "-Q true1 -t 15",195) + test.add_stonith_log_pattern("Attempted to execute agent fence_dummy_monitor_fail (list) the maximum number of times") + + # simple register test + for test_type in test_types: + test = self.new_test("%s_register" % test_type["prefix"], + "Verify devices can be registered and un-registered", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") + + test.add_cmd("stonith_admin", "-Q true1") + + test.add_cmd("stonith_admin", "-D true1") + + test.add_expected_fail_cmd("stonith_admin", "-Q true1", 237) + + + # simple reboot test + for test_type in test_types: + test = self.new_test("%s_reboot" % test_type["prefix"], + "Verify devices can be rebooted", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") + + test.add_cmd("stonith_admin", "-B node3 -t 2") + + test.add_cmd("stonith_admin", "-D true1") + + test.add_expected_fail_cmd("stonith_admin", "-Q true1", 237) + + # test fencing history. 
+ for test_type in test_types: + if test_type["use_cpg"] == 0: + continue + test = self.new_test("%s_fence_history" % test_type["prefix"], + "Verify last fencing operation is returned.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node3\"") + + test.add_cmd("stonith_admin", "-F node3 -t 2 -V") + + test.add_cmd_check_stdout("stonith_admin", "-H node3", "was able to turn off node node3", "") + + # simple test of dynamic list query + for test_type in test_types: + test = self.new_test("%s_dynamic_list_query" % test_type["prefix"], + "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_list") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_list") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_list") + + test.add_cmd_check_stdout("stonith_admin", "-l fake_port_1", "3 devices found") + + + # fence using dynamic list query + for test_type in test_types: + test = self.new_test("%s_fence_dynamic_list_query" % test_type["prefix"], + "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_list") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_list") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_list") + + test.add_cmd("stonith_admin", "-F fake_port_1 -t 5 -V"); + + # simple test of query using status action + for test_type in test_types: + test = self.new_test("%s_status_query" % test_type["prefix"], + "Verify dynamic list of fencing devices can be retrieved.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_check=status\"") + + 
test.add_cmd_check_stdout("stonith_admin", "-l fake_port_1", "3 devices found") + + # test what happens when no reboot action is advertised + for test_type in test_types: + test = self.new_test("%s_no_reboot_support" % test_type["prefix"], + "Verify reboot action defaults to off when no reboot action is advertised by agent.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_no_reboot -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-B node1 -t 5 -V"); + test.add_stonith_log_pattern("does not advertise support for 'reboot', performing 'off'") + test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); + + # make sure reboot is used when reboot action is advertised + for test_type in test_types: + test = self.new_test("%s_with_reboot_support" % test_type["prefix"], + "Verify reboot action can be used when metadata advertises it.", test_type["use_cpg"]) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=node1 node2 node3\"") + test.add_cmd("stonith_admin", "-B node1 -t 5 -V"); + test.add_stonith_negative_log_pattern("does not advertise support for 'reboot', performing 'off'") + test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); + + def build_nodeid_tests(self): + our_uname = output_from_command("uname -n") + if our_uname: + our_uname = our_uname[0] + + ### verify nodeid is supplied when nodeid is in the metadata parameters + test = self.new_test("cpg_supply_nodeid", + "Verify nodeid is given when fence agent has nodeid as parameter", 1) + + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) + test.add_stonith_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) + + ### verify nodeid is _NOT_ supplied when nodeid is not in the metadata parameters + test = 
self.new_test("cpg_do_not_supply_nodeid", + "Verify nodeid is _NOT_ given when fence agent does not have nodeid as parameter", 1) + + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) + test.add_stonith_negative_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) + + ### verify nodeid use doesn't explode standalone mode + test = self.new_test("standalone_do_not_supply_nodeid", + "Verify nodeid in metadata parameter list doesn't kill standalone mode", 0) + + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-F %s -t 3" % (our_uname)) + test.add_stonith_negative_log_pattern("For stonith action (off) for victim %s, adding nodeid" % (our_uname)) + + + def build_unfence_tests(self): + our_uname = output_from_command("uname -n") + if our_uname: + our_uname = our_uname[0] + + ### verify unfencing using automatic unfencing + test = self.new_test("cpg_unfence_required_1", + "Verify require unfencing on all devices when automatic=true in agent's metadata", 1) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) + # both devices should be executed + test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); + test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); + + + ### verify unfencing using automatic unfencing fails if any of the required agents fail + test = self.new_test("cpg_unfence_required_2", + "Verify require unfencing on all devices when automatic=true in agent's metadata", 1) + test.add_cmd("stonith_admin", "-R true1 -a 
fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=fail\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_expected_fail_cmd("stonith_admin", "-U %s -t 6" % (our_uname), 143) + + ### verify unfencing using automatic devices with topology + test = self.new_test("cpg_unfence_required_3", + "Verify require unfencing on all devices even when required devices are at different topology levels", 1) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) + test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) + test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); + test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); + + + ### verify unfencing using automatic devices with topology + test = self.new_test("cpg_unfence_required_4", + "Verify all required devices are executed even when topology levels fail.", 1) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true3 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true4 -a fence_dummy_automatic_unfence -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R false1 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % 
(our_uname)) + test.add_cmd("stonith_admin", "-R false2 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R false3 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R false4 -a fence_dummy -o \"mode=fail\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 1 -v false1" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v false2" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v false3" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v true3" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 3 -v false4" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 4 -v true4" % (our_uname)) + test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) + test.add_stonith_log_pattern("with device 'true1' returned: 0 (OK)"); + test.add_stonith_log_pattern("with device 'true2' returned: 0 (OK)"); + test.add_stonith_log_pattern("with device 'true3' returned: 0 (OK)"); + test.add_stonith_log_pattern("with device 'true4' returned: 0 (OK)"); + + ### verify unfencing using on_target device + test = self.new_test("cpg_unfence_on_target_1", + "Verify unfencing with on_target = true", 1) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s\"" % (our_uname)) + test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) + test.add_stonith_log_pattern("(on) to be executed on the target node") + + + ### verify failure of unfencing using on_target device + test = self.new_test("cpg_unfence_on_target_2", + "Verify failure unfencing with on_target = true", 1) + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake_1234\"" % (our_uname)) + 
test.add_expected_fail_cmd("stonith_admin", "-U node_fake_1234 -t 3", 237) + test.add_stonith_log_pattern("(on) to be executed on the target node") + + + ### verify unfencing using on_target device with topology + test = self.new_test("cpg_unfence_on_target_3", + "Verify unfencing with on_target = true using topology", 1) + + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node3\"" % (our_uname)) + + test.add_cmd("stonith_admin", "-r %s -i 1 -v true1" % (our_uname)) + test.add_cmd("stonith_admin", "-r %s -i 2 -v true2" % (our_uname)) + + test.add_cmd("stonith_admin", "-U %s -t 3" % (our_uname)) + test.add_stonith_log_pattern("(on) to be executed on the target node") + + ### verify unfencing using on_target device with topology fails when victim node doesn't exist + test = self.new_test("cpg_unfence_on_target_4", + "Verify unfencing failure with on_target = true using topology", 1) + + test.add_cmd("stonith_admin", "-R true1 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake\"" % (our_uname)) + test.add_cmd("stonith_admin", "-R true2 -a fence_dummy -o \"mode=pass\" -o \"pcmk_host_list=%s node_fake\"" % (our_uname)) + + test.add_cmd("stonith_admin", "-r node_fake -i 1 -v true1") + test.add_cmd("stonith_admin", "-r node_fake -i 2 -v true2") + + test.add_expected_fail_cmd("stonith_admin", "-U node_fake -t 3", 237) + test.add_stonith_log_pattern("(on) to be executed on the target node") + + def build_remap_tests(self): + test = self.new_test("cpg_remap_simple", + "Verify sequential topology reboot is remapped to all-off-then-all-on", 1) + test.add_cmd("stonith_admin", + """-R true1 -a fence_dummy -o "mode=pass" -o "pcmk_host_list=node_fake" """ + """-o "pcmk_off_timeout=1" -o "pcmk_reboot_timeout=10" """) + test.add_cmd("stonith_admin", + """-R true2 -a fence_dummy -o "mode=pass" -o 
"pcmk_host_list=node_fake" """ + """-o "pcmk_off_timeout=2" -o "pcmk_reboot_timeout=20" """) + test.add_cmd("stonith_admin", "-r node_fake -i 1 -v true1 -v true2") + test.add_cmd("stonith_admin", "-B node_fake -t 5") + test.add_stonith_log_pattern("Remapping multiple-device reboot of node_fake") + # timeout should be sum of off timeouts (1+2=3), not reboot timeouts (10+20=30) + test.add_stonith_log_pattern("remote op timeout set to 3 for fencing of node node_fake") + test.add_stonith_log_pattern("perform op off node_fake with true1") + test.add_stonith_log_pattern("perform op off node_fake with true2") + test.add_stonith_log_pattern("Remapped off of node_fake complete, remapping to on") + # fence_dummy sets "on" as an on_target action + test.add_stonith_log_pattern("Ignoring true1 'on' failure (no capable peers) for node_fake") + test.add_stonith_log_pattern("Ignoring true2 'on' failure (no capable peers) for node_fake") + test.add_stonith_log_pattern("Undoing remap of reboot of node_fake") + + test = self.new_test("cpg_remap_automatic", + "Verify remapped topology reboot skips automatic 'on'", 1) + test.add_cmd("stonith_admin", + """-R true1 -a fence_dummy_automatic_unfence """ + """-o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", + """-R true2 -a fence_dummy_automatic_unfence """ + """-o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", "-r node_fake -i 1 -v true1 -v true2") + test.add_cmd("stonith_admin", "-B node_fake -t 5") + test.add_stonith_log_pattern("Remapping multiple-device reboot of node_fake") + test.add_stonith_log_pattern("perform op off node_fake with true1") + test.add_stonith_log_pattern("perform op off node_fake with true2") + test.add_stonith_log_pattern("Remapped off of node_fake complete, remapping to on") + test.add_stonith_log_pattern("Undoing remap of reboot of node_fake") + test.add_stonith_negative_log_pattern("perform op on node_fake with") + 
test.add_stonith_negative_log_pattern("'on' failure") + + test = self.new_test("cpg_remap_complex_1", + "Verify remapped topology reboot in second level works if non-remapped first level fails", 1) + test.add_cmd("stonith_admin", """-R false1 -a fence_dummy -o "mode=fail" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R true1 -a fence_dummy -o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R true2 -a fence_dummy -o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", "-r node_fake -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node_fake -i 2 -v true1 -v true2") + test.add_cmd("stonith_admin", "-B node_fake -t 5") + test.add_stonith_log_pattern("perform op reboot node_fake with false1") + test.add_stonith_log_pattern("Remapping multiple-device reboot of node_fake") + test.add_stonith_log_pattern("perform op off node_fake with true1") + test.add_stonith_log_pattern("perform op off node_fake with true2") + test.add_stonith_log_pattern("Remapped off of node_fake complete, remapping to on") + test.add_stonith_log_pattern("Ignoring true1 'on' failure (no capable peers) for node_fake") + test.add_stonith_log_pattern("Ignoring true2 'on' failure (no capable peers) for node_fake") + test.add_stonith_log_pattern("Undoing remap of reboot of node_fake") + + test = self.new_test("cpg_remap_complex_2", + "Verify remapped topology reboot failure in second level proceeds to third level", 1) + test.add_cmd("stonith_admin", """-R false1 -a fence_dummy -o "mode=fail" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R false2 -a fence_dummy -o "mode=fail" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R true1 -a fence_dummy -o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R true2 -a fence_dummy -o "mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", """-R true3 -a fence_dummy -o 
"mode=pass" -o "pcmk_host_list=node_fake" """) + test.add_cmd("stonith_admin", "-r node_fake -i 1 -v false1") + test.add_cmd("stonith_admin", "-r node_fake -i 2 -v true1 -v false2 -v true3") + test.add_cmd("stonith_admin", "-r node_fake -i 3 -v true2") + test.add_cmd("stonith_admin", "-B node_fake -t 5") + test.add_stonith_log_pattern("perform op reboot node_fake with false1") + test.add_stonith_log_pattern("Remapping multiple-device reboot of node_fake") + test.add_stonith_log_pattern("perform op off node_fake with true1") + test.add_stonith_log_pattern("perform op off node_fake with false2") + test.add_stonith_log_pattern("Attempted to execute agent fence_dummy (off) the maximum number of times") + test.add_stonith_log_pattern("Undoing remap of reboot of node_fake") + test.add_stonith_log_pattern("perform op reboot node_fake with true2") + test.add_stonith_negative_log_pattern("node_fake with true3") + + def setup_environment(self, use_corosync): + if self.autogen_corosync_cfg and use_corosync: + corosync_conf = (""" totem { version: 2 crypto_cipher: none crypto_hash: none nodeid: 101 secauth: off interface { ttl: 1 ringnumber: 0 mcastport: 6666 mcastaddr: 226.94.1.1 bindnetaddr: 127.0.0.1 } } logging { debug: off fileline: off to_syslog: no to_stderr: no syslog_facility: daemon timestamp: on to_logfile: yes logfile: /var/log/corosync.log logfile_priority: info } """) - os.system("cat <<-END >>/etc/corosync/corosync.conf\n%s\nEND" % (corosync_conf)) + os.system("cat <<-END >>/etc/corosync/corosync.conf\n%s\nEND" % (corosync_conf)) - if use_corosync: - ### make sure we are in control ### - self.stop_corosync() - self.start_corosync() + if use_corosync: + ### make sure we are in control ### + self.stop_corosync() + self.start_corosync() - monitor_fail_agent = ("""#!/usr/bin/python + monitor_fail_agent = ("""#!/usr/bin/python import sys def main(): for line in sys.stdin.readlines(): if line.count("monitor") > 0: sys.exit(-1); sys.exit(-1) if __name__ == "__main__": 
main() """) - dynamic_list_agent = ("""#!/usr/bin/python + dynamic_list_agent = ("""#!/usr/bin/python import sys def main(): for line in sys.stdin.readlines(): if line.count("list") > 0: print "fake_port_1" sys.exit(0) if line.count("off") > 0: sys.exit(0) sys.exit(-1) if __name__ == "__main__": main() """) - os.system("cat <<-END >>/usr/sbin/fence_dummy_list\n%s\nEND" % (dynamic_list_agent)) - os.system("chmod 711 /usr/sbin/fence_dummy_list") + os.system("cat <<-END >>/usr/sbin/fence_dummy_list\n%s\nEND" % (dynamic_list_agent)) + os.system("chmod 711 /usr/sbin/fence_dummy_list") - os.system("cat <<-END >>/usr/sbin/fence_dummy_monitor_fail\n%s\nEND" % (monitor_fail_agent)) - os.system("chmod 711 /usr/sbin/fence_dummy_monitor_fail") + os.system("cat <<-END >>/usr/sbin/fence_dummy_monitor_fail\n%s\nEND" % (monitor_fail_agent)) + os.system("chmod 711 /usr/sbin/fence_dummy_monitor_fail") - os.system("cp /usr/share/pacemaker/tests/cts/fence_dummy /usr/sbin/fence_dummy") + os.system("cp /usr/share/pacemaker/tests/cts/fence_dummy /usr/sbin/fence_dummy") - # modifies dummy agent to do require unfencing - os.system("cat /usr/share/pacemaker/tests/cts/fence_dummy | sed 's/on_target=/automatic=/g' > /usr/sbin/fence_dummy_automatic_unfence"); - os.system("chmod 711 /usr/sbin/fence_dummy_automatic_unfence") + # modifies dummy agent to do require unfencing + os.system("cat /usr/share/pacemaker/tests/cts/fence_dummy | sed 's/on_target=/automatic=/g' > /usr/sbin/fence_dummy_automatic_unfence"); + os.system("chmod 711 /usr/sbin/fence_dummy_automatic_unfence") - # modifies dummy agent to not advertise reboot - os.system("cat /usr/share/pacemaker/tests/cts/fence_dummy | sed 's/^.*.*//g' > /usr/sbin/fence_dummy_no_reboot"); - os.system("chmod 711 /usr/sbin/fence_dummy_no_reboot") + # modifies dummy agent to not advertise reboot + os.system("cat /usr/share/pacemaker/tests/cts/fence_dummy | sed 's/^.*.*//g' > /usr/sbin/fence_dummy_no_reboot"); + os.system("chmod 711 
/usr/sbin/fence_dummy_no_reboot") - def cleanup_environment(self, use_corosync): - if use_corosync: - self.stop_corosync() + def cleanup_environment(self, use_corosync): + if use_corosync: + self.stop_corosync() - if self.verbose and os.path.exists('/var/log/corosync.log'): - print "Corosync output" - f = open('/var/log/corosync.log', 'r') - for line in f.readlines(): - print line.strip() - os.remove('/var/log/corosync.log') + if self.verbose and os.path.exists('/var/log/corosync.log'): + print "Corosync output" + f = open('/var/log/corosync.log', 'r') + for line in f.readlines(): + print line.strip() + os.remove('/var/log/corosync.log') - if self.autogen_corosync_cfg: - os.system("rm -f /etc/corosync/corosync.conf") + if self.autogen_corosync_cfg: + os.system("rm -f /etc/corosync/corosync.conf") - os.system("rm -f /usr/sbin/fence_dummy_monitor_fail") - os.system("rm -f /usr/sbin/fence_dummy_list") - os.system("rm -f /usr/sbin/fence_dummy") - os.system("rm -f /usr/sbin/fence_dummy_automatic_unfence") - os.system("rm -f /usr/sbin/fence_dummy_no_reboot") + os.system("rm -f /usr/sbin/fence_dummy_monitor_fail") + os.system("rm -f /usr/sbin/fence_dummy_list") + os.system("rm -f /usr/sbin/fence_dummy") + os.system("rm -f /usr/sbin/fence_dummy_automatic_unfence") + os.system("rm -f /usr/sbin/fence_dummy_no_reboot") class TestOptions: - def __init__(self): - self.options = {} - self.options['list-tests'] = 0 - self.options['run-all'] = 1 - self.options['run-only'] = "" - self.options['run-only-pattern'] = "" - self.options['verbose'] = 0 - self.options['invalid-arg'] = "" - self.options['cpg-only'] = 0 - self.options['no-cpg'] = 0 - self.options['show-usage'] = 0 - - def build_options(self, argv): - args = argv[1:] - skip = 0 - for i in range(0, len(args)): - if skip: - skip = 0 - continue - elif args[i] == "-h" or args[i] == "--help": - self.options['show-usage'] = 1 - elif args[i] == "-l" or args[i] == "--list-tests": - self.options['list-tests'] = 1 - elif args[i] == 
"-V" or args[i] == "--verbose": - self.options['verbose'] = 1 - elif args[i] == "-n" or args[i] == "--no-cpg": - self.options['no-cpg'] = 1 - elif args[i] == "-c" or args[i] == "--cpg-only": - self.options['cpg-only'] = 1 - elif args[i] == "-r" or args[i] == "--run-only": - self.options['run-only'] = args[i+1] - skip = 1 - elif args[i] == "-p" or args[i] == "--run-only-pattern": - self.options['run-only-pattern'] = args[i+1] - skip = 1 - - def show_usage(self): - print "usage: " + sys.argv[0] + " [options]" - print "If no options are provided, all tests will run" - print "Options:" - print "\t [--help | -h] Show usage" - print "\t [--list-tests | -l] Print out all registered tests." - print "\t [--cpg-only | -c] Only run tests that require corosync." - print "\t [--no-cpg | -n] Only run tests that do not require corosync" - print "\t [--run-only | -r 'testname'] Run a specific test" - print "\t [--verbose | -V] Verbose output" - print "\t [--run-only-pattern | -p 'string'] Run only tests containing the string value" - print "\n\tExample: Run only the test 'start_top'" - print "\t\t python ./regression.py --run-only start_stop" - print "\n\tExample: Run only the tests with the string 'systemd' present in them" - print "\t\t python ./regression.py --run-only-pattern systemd" + def __init__(self): + self.options = {} + self.options['list-tests'] = 0 + self.options['run-all'] = 1 + self.options['run-only'] = "" + self.options['run-only-pattern'] = "" + self.options['verbose'] = 0 + self.options['invalid-arg'] = "" + self.options['cpg-only'] = 0 + self.options['no-cpg'] = 0 + self.options['show-usage'] = 0 + + def build_options(self, argv): + args = argv[1:] + skip = 0 + for i in range(0, len(args)): + if skip: + skip = 0 + continue + elif args[i] == "-h" or args[i] == "--help": + self.options['show-usage'] = 1 + elif args[i] == "-l" or args[i] == "--list-tests": + self.options['list-tests'] = 1 + elif args[i] == "-V" or args[i] == "--verbose": + self.options['verbose'] 
= 1 + elif args[i] == "-n" or args[i] == "--no-cpg": + self.options['no-cpg'] = 1 + elif args[i] == "-c" or args[i] == "--cpg-only": + self.options['cpg-only'] = 1 + elif args[i] == "-r" or args[i] == "--run-only": + self.options['run-only'] = args[i+1] + skip = 1 + elif args[i] == "-p" or args[i] == "--run-only-pattern": + self.options['run-only-pattern'] = args[i+1] + skip = 1 + + def show_usage(self): + print "usage: " + sys.argv[0] + " [options]" + print "If no options are provided, all tests will run" + print "Options:" + print "\t [--help | -h] Show usage" + print "\t [--list-tests | -l] Print out all registered tests." + print "\t [--cpg-only | -c] Only run tests that require corosync." + print "\t [--no-cpg | -n] Only run tests that do not require corosync" + print "\t [--run-only | -r 'testname'] Run a specific test" + print "\t [--verbose | -V] Verbose output" + print "\t [--run-only-pattern | -p 'string'] Run only tests containing the string value" + print "\n\tExample: Run only the test 'start_top'" + print "\t\t python ./regression.py --run-only start_stop" + print "\n\tExample: Run only the tests with the string 'systemd' present in them" + print "\t\t python ./regression.py --run-only-pattern systemd" def main(argv): - o = TestOptions() - o.build_options(argv) - - use_corosync = 1 - - tests = Tests(o.options['verbose']) - tests.build_standalone_tests() - tests.build_custom_timeout_tests() - tests.build_api_sanity_tests() - tests.build_fence_merge_tests() - tests.build_unfence_tests() - tests.build_nodeid_tests() - - if o.options['list-tests']: - tests.print_list() - sys.exit(0) - elif o.options['show-usage']: - o.show_usage() - sys.exit(0) - - print "Starting ..." 
- - if o.options['no-cpg']: - use_corosync = 0 - - tests.setup_environment(use_corosync) - - if o.options['run-only-pattern'] != "": - tests.run_tests_matching(o.options['run-only-pattern']) - tests.print_results() - elif o.options['run-only'] != "": - tests.run_single(o.options['run-only']) - tests.print_results() - elif o.options['no-cpg']: - tests.run_no_cpg() - tests.print_results() - elif o.options['cpg-only']: - tests.run_cpg_only() - tests.print_results() - else: - tests.run_tests() - tests.print_results() - - tests.cleanup_environment(use_corosync) - tests.exit() + o = TestOptions() + o.build_options(argv) + + use_corosync = 1 + + tests = Tests(o.options['verbose']) + tests.build_standalone_tests() + tests.build_custom_timeout_tests() + tests.build_api_sanity_tests() + tests.build_fence_merge_tests() + tests.build_unfence_tests() + tests.build_nodeid_tests() + tests.build_remap_tests() + + if o.options['list-tests']: + tests.print_list() + sys.exit(0) + elif o.options['show-usage']: + o.show_usage() + sys.exit(0) + + print "Starting ..." 
+ + if o.options['no-cpg']: + use_corosync = 0 + + tests.setup_environment(use_corosync) + + if o.options['run-only-pattern'] != "": + tests.run_tests_matching(o.options['run-only-pattern']) + tests.print_results() + elif o.options['run-only'] != "": + tests.run_single(o.options['run-only']) + tests.print_results() + elif o.options['no-cpg']: + tests.run_no_cpg() + tests.print_results() + elif o.options['cpg-only']: + tests.run_cpg_only() + tests.print_results() + else: + tests.run_tests() + tests.print_results() + + tests.cleanup_environment(use_corosync) + tests.exit() if __name__=="__main__": - main(sys.argv) + main(sys.argv) diff --git a/fencing/remote.c b/fencing/remote.c index eaac1fa39e..2c00b5fa4a 100644 --- a/fencing/remote.c +++ b/fencing/remote.c @@ -1,1900 +1,2065 @@ /* * Copyright (C) 2009 Andrew Beekhof * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public * License along with this library; if not, write to the Free Software * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define TIMEOUT_MULTIPLY_FACTOR 1.2 /* When one stonithd queries its peers for devices able to handle a fencing * request, each peer will reply with a list of such devices available to it. 
* Each reply will be parsed into a st_query_result_t, with each device's * information kept in a device_properties_t. */ typedef struct device_properties_s { /* Whether access to this device has been verified */ gboolean verified; /* The remaining members are indexed by the operation's "phase" */ /* Whether this device has been executed in each phase */ gboolean executed[3]; /* Whether this device is disallowed from executing in each phase */ gboolean disallowed[3]; /* Action-specific timeout for each phase */ int custom_action_timeout[3]; /* Action-specific maximum random delay for each phase */ int delay_max[3]; } device_properties_t; typedef struct st_query_result_s { /* Name of peer that sent this result */ char *host; /* Only try peers for non-topology based operations once */ gboolean tried; /* Number of entries in the devices table */ int ndevices; /* Devices available to this host that are capable of fencing the target */ GHashTable *devices; } st_query_result_t; GHashTable *remote_op_list = NULL; void call_remote_stonith(remote_fencing_op_t * op, st_query_result_t * peer); static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int rc, int dup); extern xmlNode *stonith_create_op(int call_id, const char *token, const char *op, xmlNode * data, int call_options); static void report_timeout_period(remote_fencing_op_t * op, int op_timeout); static int get_op_total_timeout(const remote_fencing_op_t *op, const st_query_result_t *chosen_peer); static gint sort_strings(gconstpointer a, gconstpointer b) { return strcmp(a, b); } static void free_remote_query(gpointer data) { if (data) { st_query_result_t *query = data; crm_trace("Free'ing query result from %s", query->host); g_hash_table_destroy(query->devices); free(query->host); free(query); } } struct peer_count_data { const remote_fencing_op_t *op; gboolean verified_only; int count; }; /*! 
* \internal * \brief Increment a counter if a device has not been executed yet * * \param[in] key Device ID (ignored) * \param[in] value Device properties * \param[in] user_data Peer count data */ static void count_peer_device(gpointer key, gpointer value, gpointer user_data) { device_properties_t *props = (device_properties_t*)value; struct peer_count_data *data = user_data; if (!props->executed[data->op->phase] && (!data->verified_only || props->verified)) { ++(data->count); } } /*! * \internal * \brief Check the number of available devices in a peer's query results * * \param[in] op Operation that results are for * \param[in] peer Peer to count * \param[in] verified_only Whether to count only verified devices * * \return Number of devices available to peer that were not already executed */ static int count_peer_devices(const remote_fencing_op_t *op, const st_query_result_t *peer, gboolean verified_only) { struct peer_count_data data; data.op = op; data.verified_only = verified_only; data.count = 0; if (peer) { g_hash_table_foreach(peer->devices, count_peer_device, &data); } return data.count; } /*! * \internal * \brief Search for a device in a query result * * \param[in] op Operation that result is for * \param[in] peer Query result for a peer * \param[in] device Device ID to search for * * \return Device properties if found, NULL otherwise */ static device_properties_t * find_peer_device(const remote_fencing_op_t *op, const st_query_result_t *peer, const char *device) { device_properties_t *props = g_hash_table_lookup(peer->devices, device); return (props && !props->executed[op->phase] && !props->disallowed[op->phase])? props : NULL; } /*! 
* \internal * \brief Find a device in a peer's device list and mark it as executed * * \param[in] op Operation that peer result is for * \param[in,out] peer Peer with results to search * \param[in] device ID of device to mark as done * \param[in] verified_devices_only Only consider verified devices * * \return TRUE if device was found and marked, FALSE otherwise */ static gboolean grab_peer_device(const remote_fencing_op_t *op, st_query_result_t *peer, const char *device, gboolean verified_devices_only) { device_properties_t *props = find_peer_device(op, peer, device); if ((props == NULL) || (verified_devices_only && !props->verified)) { return FALSE; } crm_trace("Removing %s from %s (%d remaining)", device, peer->host, count_peer_devices(op, peer, FALSE)); props->executed[op->phase] = TRUE; return TRUE; } /* * \internal * \brief Free the list of required devices for a particular phase * * \param[in,out] op Operation to modify * \param[in] phase Phase to modify */ static void free_required_list(remote_fencing_op_t *op, enum st_remap_phase phase) { if (op->required_list[phase]) { g_list_free_full(op->required_list[phase], free); op->required_list[phase] = NULL; } } static void clear_remote_op_timers(remote_fencing_op_t * op) { if (op->query_timer) { g_source_remove(op->query_timer); op->query_timer = 0; } if (op->op_timer_total) { g_source_remove(op->op_timer_total); op->op_timer_total = 0; } if (op->op_timer_one) { g_source_remove(op->op_timer_one); op->op_timer_one = 0; } } static void free_remote_op(gpointer data) { remote_fencing_op_t *op = data; crm_trace("Free'ing op %s for %s", op->id, op->target); crm_log_xml_debug(op->request, "Destroying"); clear_remote_op_timers(op); free(op->id); free(op->action); free(op->target); free(op->client_id); free(op->client_name); free(op->originator); if (op->query_results) { g_list_free_full(op->query_results, free_remote_query); } if (op->request) { free_xml(op->request); op->request = NULL; } if (op->devices_list) { 
g_list_free_full(op->devices_list, free); op->devices_list = NULL; } free_required_list(op, st_phase_requested); free_required_list(op, st_phase_off); free_required_list(op, st_phase_on); free(op); } +/* + * \internal + * \brief Return an operation's originally requested action (before any remap) + * + * \param[in] op Operation to check + * + * \return Operation's original action + */ +static const char * +op_requested_action(const remote_fencing_op_t *op) +{ + return ((op->phase > st_phase_requested)? "reboot" : op->action); +} + +/* + * \internal + * \brief Remap a "reboot" operation to the "off" phase + * + * \param[in,out] op Operation to remap + */ +static void +op_phase_off(remote_fencing_op_t *op) +{ + crm_info("Remapping multiple-device reboot of %s (%s) to off", + op->target, op->id); + op->phase = st_phase_off; + + /* Happily, "off" and "on" are shorter than "reboot", so we can reuse the + * memory allocation at each phase. + */ + strcpy(op->action, "off"); +} + +/*! + * \internal + * \brief Advance a remapped reboot operation to the "on" phase + * + * \param[in,out] op Operation to remap + */ +static void +op_phase_on(remote_fencing_op_t *op) +{ + GListPtr iter = NULL; + + crm_info("Remapped off of %s complete, remapping to on for %s.%.8s", + op->target, op->client_name, op->id); + op->phase = st_phase_on; + strcpy(op->action, "on"); + + /* Any devices that are required for "on" will be automatically executed by + * the cluster when the node next joins, so we skip them here. + */ + for (iter = op->required_list[op->phase]; iter != NULL; iter = iter->next) { + GListPtr match = g_list_find_custom(op->devices_list, iter->data, + sort_strings); + + if (match) { + op->devices_list = g_list_remove(op->devices_list, match->data); + } + } + + /* We know this level will succeed, because phase 1 completed successfully + * and we ignore any errors from phase 2. 
So we can free the required list, + * which will keep them from being executed after the device list is done. + */ + free_required_list(op, op->phase); + + /* Rewind device list pointer */ + op->devices = op->devices_list; +} + +/*! + * \internal + * \brief Reset a remapped reboot operation + * + * \param[in,out] op Operation to reset + */ +static void +undo_op_remap(remote_fencing_op_t *op) +{ + if (op->phase > 0) { + crm_info("Undoing remap of reboot of %s for %s.%.8s", + op->target, op->client_name, op->id); + op->phase = st_phase_requested; + strcpy(op->action, "reboot"); + } +} + static xmlNode * create_op_done_notify(remote_fencing_op_t * op, int rc) { xmlNode *notify_data = create_xml_node(NULL, T_STONITH_NOTIFY_FENCE); crm_xml_add_int(notify_data, "state", op->state); crm_xml_add_int(notify_data, F_STONITH_RC, rc); crm_xml_add(notify_data, F_STONITH_TARGET, op->target); crm_xml_add(notify_data, F_STONITH_ACTION, op->action); crm_xml_add(notify_data, F_STONITH_DELEGATE, op->delegate); crm_xml_add(notify_data, F_STONITH_REMOTE_OP_ID, op->id); crm_xml_add(notify_data, F_STONITH_ORIGIN, op->originator); crm_xml_add(notify_data, F_STONITH_CLIENTID, op->client_id); crm_xml_add(notify_data, F_STONITH_CLIENTNAME, op->client_name); return notify_data; } static void bcast_result_to_peers(remote_fencing_op_t * op, int rc) { static int count = 0; xmlNode *bcast = create_xml_node(NULL, T_STONITH_REPLY); xmlNode *notify_data = create_op_done_notify(op, rc); count++; crm_trace("Broadcasting result to peers"); crm_xml_add(bcast, F_TYPE, T_STONITH_NOTIFY); crm_xml_add(bcast, F_SUBTYPE, "broadcast"); crm_xml_add(bcast, F_STONITH_OPERATION, T_STONITH_NOTIFY); crm_xml_add_int(bcast, "count", count); add_message_xml(bcast, F_STONITH_CALLDATA, notify_data); send_cluster_message(NULL, crm_msg_stonith_ng, bcast, FALSE); free_xml(notify_data); free_xml(bcast); return; } static void handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc) { xmlNode 
*notify_data = NULL; xmlNode *reply = NULL; if (op->notify_sent == TRUE) { /* nothing to do */ return; } /* Do notification with a clean data object */ notify_data = create_op_done_notify(op, rc); crm_xml_add_int(data, "state", op->state); crm_xml_add(data, F_STONITH_TARGET, op->target); crm_xml_add(data, F_STONITH_OPERATION, op->action); reply = stonith_construct_reply(op->request, NULL, data, rc); crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate); /* Send fencing OP reply to local client that initiated fencing */ do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE); /* bcast to all local clients that the fencing operation happend */ do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data); /* mark this op as having notify's already sent */ op->notify_sent = TRUE; free_xml(reply); free_xml(notify_data); } static void handle_duplicates(remote_fencing_op_t * op, xmlNode * data, int rc) { GListPtr iter = NULL; for (iter = op->duplicates; iter != NULL; iter = iter->next) { remote_fencing_op_t *other = iter->data; if (other->state == st_duplicate) { /* Ie. it hasn't timed out already */ other->state = op->state; crm_debug("Peforming duplicate notification for %s@%s.%.8s = %s", other->client_name, other->originator, other->id, pcmk_strerror(rc)); remote_op_done(other, data, rc, TRUE); } else { crm_err("Skipping duplicate notification for %s@%s - %d", other->client_name, other->originator, other->state); } } } /*! * \internal * \brief Finalize a remote operation. * * \description This function has two code paths. * * Path 1. This node is the owner of the operation and needs * to notify the cpg group via a broadcast as to the operation's * results. * * Path 2. The cpg broadcast is received. All nodes notify their local * stonith clients the operation results. * * So, The owner of the operation first notifies the cluster of the result, * and once that cpg notify is received back it notifies all the local clients. 
* * Nodes that are passive watchers of the operation will receive the * broadcast and only need to notify their local clients the operation finished. * * \param op, The fencing operation to finalize * \param data, The xml msg reply (if present) of the last delegated fencing * operation. * \param dup, Is this operation a duplicate, if so treat it a little differently * making sure the broadcast is not sent out. */ static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int rc, int dup) { int level = LOG_ERR; const char *subt = NULL; xmlNode *local_data = NULL; op->completed = time(NULL); clear_remote_op_timers(op); + undo_op_remap(op); if (op->notify_sent == TRUE) { crm_err("Already sent notifications for '%s of %s by %s' (for=%s@%s.%.8s, state=%d): %s", op->action, op->target, op->delegate ? op->delegate : "", op->client_name, op->originator, op->id, op->state, pcmk_strerror(rc)); goto remote_op_done_cleanup; } if (!op->delegate && data && rc != -ENODEV && rc != -EHOSTUNREACH) { xmlNode *ndata = get_xpath_object("//@" F_STONITH_DELEGATE, data, LOG_TRACE); if(ndata) { op->delegate = crm_element_value_copy(ndata, F_STONITH_DELEGATE); } else { op->delegate = crm_element_value_copy(data, F_ORIG); } } if (data == NULL) { data = create_xml_node(NULL, "remote-op"); local_data = data; } /* Tell everyone the operation is done, we will continue * with doing the local notifications once we receive * the broadcast back. */ subt = crm_element_value(data, F_SUBTYPE); if (dup == FALSE && safe_str_neq(subt, "broadcast")) { /* Defer notification until the bcast message arrives */ bcast_result_to_peers(op, rc); goto remote_op_done_cleanup; } if (rc == pcmk_ok || dup) { level = LOG_NOTICE; } else if (safe_str_neq(op->originator, stonith_our_uname)) { level = LOG_NOTICE; } do_crm_log(level, "Operation %s of %s by %s for %s@%s.%.8s: %s", op->action, op->target, op->delegate ? 
op->delegate : "", op->client_name, op->originator, op->id, pcmk_strerror(rc)); handle_local_reply_and_notify(op, data, rc); if (dup == FALSE) { handle_duplicates(op, data, rc); } /* Free non-essential parts of the record * Keep the record around so we can query the history */ if (op->query_results) { g_list_free_full(op->query_results, free_remote_query); op->query_results = NULL; } if (op->request) { free_xml(op->request); op->request = NULL; } remote_op_done_cleanup: free_xml(local_data); } static gboolean remote_op_watchdog_done(gpointer userdata) { remote_fencing_op_t *op = userdata; op->op_timer_one = 0; crm_notice("Remote %s operation on %s for %s.%8s assumed complete", op->action, op->target, op->client_name, op->id); op->state = st_done; remote_op_done(op, NULL, pcmk_ok, FALSE); return FALSE; } static gboolean remote_op_timeout_one(gpointer userdata) { remote_fencing_op_t *op = userdata; op->op_timer_one = 0; crm_notice("Remote %s operation on %s for %s.%8s timed out", op->action, op->target, op->client_name, op->id); call_remote_stonith(op, NULL); return FALSE; } static gboolean remote_op_timeout(gpointer userdata) { remote_fencing_op_t *op = userdata; op->op_timer_total = 0; if (op->state == st_done) { crm_debug("Action %s (%s) for %s (%s) already completed", op->action, op->id, op->target, op->client_name); return FALSE; } crm_debug("Action %s (%s) for %s (%s) timed out", op->action, op->id, op->target, op->client_name); + + if (op->phase == st_phase_on) { + /* A remapped reboot operation timed out in the "on" phase, but the + * "off" phase completed successfully, so quit trying any further + * devices, and return success. 
+ */ + remote_op_done(op, NULL, pcmk_ok, FALSE); + return FALSE; + } + op->state = st_failed; remote_op_done(op, NULL, -ETIME, FALSE); return FALSE; } static gboolean remote_op_query_timeout(gpointer data) { remote_fencing_op_t *op = data; op->query_timer = 0; if (op->state == st_done) { crm_debug("Operation %s for %s already completed", op->id, op->target); } else if (op->state == st_exec) { crm_debug("Operation %s for %s already in progress", op->id, op->target); } else if (op->query_results) { crm_debug("Query %s for %s complete: %d", op->id, op->target, op->state); call_remote_stonith(op, NULL); } else { crm_debug("Query %s for %s timed out: %d", op->id, op->target, op->state); if (op->op_timer_total) { g_source_remove(op->op_timer_total); op->op_timer_total = 0; } remote_op_timeout(op); } return FALSE; } static gboolean topology_is_empty(stonith_topology_t *tp) { int i; if (tp == NULL) { return TRUE; } for (i = 0; i < ST_LEVEL_MAX; i++) { if (tp->levels[i] != NULL) { return FALSE; } } return TRUE; } /* * \internal * \brief Add a device to the required list for a particular phase * * \param[in,out] op Operation to modify * \param[in] phase Phase to modify * \param[in] device Device ID to add */ static void add_required_device(remote_fencing_op_t *op, enum st_remap_phase phase, const char *device) { GListPtr match = g_list_find_custom(op->required_list[phase], device, sort_strings); if (!match) { op->required_list[phase] = g_list_prepend(op->required_list[phase], strdup(device)); } } /* * \internal * \brief Remove a device from the required list for the current phase * * \param[in,out] op Operation to modify * \param[in] device Device ID to remove */ static void remove_required_device(remote_fencing_op_t *op, const char *device) { GListPtr match = g_list_find_custom(op->required_list[op->phase], device, sort_strings); if (match) { op->required_list[op->phase] = g_list_remove(op->required_list[op->phase], match->data); } } /* deep copy the device list */ static 
void set_op_device_list(remote_fencing_op_t * op, GListPtr devices) { GListPtr lpc = NULL; if (op->devices_list) { g_list_free_full(op->devices_list, free); op->devices_list = NULL; } for (lpc = devices; lpc != NULL; lpc = lpc->next) { op->devices_list = g_list_append(op->devices_list, strdup(lpc->data)); } op->devices = op->devices_list; } stonith_topology_t * find_topology_for_host(const char *host) { stonith_topology_t *tp = g_hash_table_lookup(topology, host); if(tp == NULL) { int status = 1; regex_t r_patt; GHashTableIter tIter; crm_trace("Testing %d topologies for a match", g_hash_table_size(topology)); g_hash_table_iter_init(&tIter, topology); while (g_hash_table_iter_next(&tIter, NULL, (gpointer *) & tp)) { if (regcomp(&r_patt, tp->node, REG_EXTENDED)) { crm_info("Bad regex '%s' for fencing level", tp->node); } else { status = regexec(&r_patt, host, 0, NULL, 0); regfree(&r_patt); } if (status == 0) { crm_notice("Matched %s with %s", host, tp->node); break; } crm_trace("No match for %s with %s", host, tp->node); tp = NULL; } } return tp; } /*! * \internal * \brief Set fencing operation's device list to target's next topology level * * \param[in,out] op Remote fencing operation to modify * * \return pcmk_ok if successful, target was not specified (i.e. 
queries) or * target has no topology, or -EINVAL if no more topology levels to try */ static int stonith_topology_next(remote_fencing_op_t * op) { stonith_topology_t *tp = NULL; if (op->target) { /* Queries don't have a target set */ tp = find_topology_for_host(op->target); } if (topology_is_empty(tp)) { return pcmk_ok; } set_bit(op->call_options, st_opt_topology); + /* This is a new level, so undo any remapping left over from previous */ + undo_op_remap(op); + do { op->level++; } while (op->level < ST_LEVEL_MAX && tp->levels[op->level] == NULL); if (op->level < ST_LEVEL_MAX) { crm_trace("Attempting fencing level %d for %s (%d devices) - %s@%s.%.8s", op->level, op->target, g_list_length(tp->levels[op->level]), op->client_name, op->originator, op->id); set_op_device_list(op, tp->levels[op->level]); + + if (g_list_next(op->devices_list) && safe_str_eq(op->action, "reboot")) { + /* A reboot has been requested for a topology level with multiple + * devices. Instead of rebooting the devices sequentially, we will + * turn them all off, then turn them all on again. (Think about + * switched power outlets for redundant power supplies.) + */ + op_phase_off(op); + } return pcmk_ok; } crm_notice("All fencing options to fence %s for %s@%s.%.8s failed", op->target, op->client_name, op->originator, op->id); return -EINVAL; } /*! * \brief Check to see if this operation is a duplicate of another in flight * operation. If so merge this operation into the inflight operation, and mark * it as a duplicate. 
*/ static void merge_duplicates(remote_fencing_op_t * op) { GHashTableIter iter; remote_fencing_op_t *other = NULL; time_t now = time(NULL); g_hash_table_iter_init(&iter, remote_op_list); while (g_hash_table_iter_next(&iter, NULL, (void **)&other)) { crm_node_t *peer = NULL; + const char *other_action = op_requested_action(other); if (other->state > st_exec) { /* Must be in-progress */ continue; } else if (safe_str_neq(op->target, other->target)) { /* Must be for the same node */ continue; - } else if (safe_str_neq(op->action, other->action)) { - crm_trace("Must be for the same action: %s vs. ", op->action, other->action); + } else if (safe_str_neq(op->action, other_action)) { + crm_trace("Must be for the same action: %s vs. %s", + op->action, other_action); continue; } else if (safe_str_eq(op->client_name, other->client_name)) { crm_trace("Must be for different clients: %s", op->client_name); continue; } else if (safe_str_eq(other->target, other->originator)) { crm_trace("Can't be a suicide operation: %s", other->target); continue; } peer = crm_get_peer(0, other->originator); if(fencing_peer_active(peer) == FALSE) { crm_notice("Failing stonith action %s for node %s originating from %s@%s.%.8s: Originator is dead", other->action, other->target, other->client_name, other->originator, other->id); other->state = st_failed; continue; } else if(other->total_timeout > 0 && now > (other->total_timeout + other->created)) { crm_info("Stonith action %s for node %s originating from %s@%s.%.8s is too old: %d vs. %d + %d", other->action, other->target, other->client_name, other->originator, other->id, now, other->created, other->total_timeout); continue; } /* There is another in-flight request to fence the same host * Piggyback on that instead. If it fails, so do we. 
*/ other->duplicates = g_list_append(other->duplicates, op); if (other->total_timeout == 0) { crm_trace("Making a best-guess as to the timeout used"); other->total_timeout = op->total_timeout = TIMEOUT_MULTIPLY_FACTOR * get_op_total_timeout(op, NULL); } crm_notice ("Merging stonith action %s for node %s originating from client %s.%.8s with identical request from %s@%s.%.8s (%ds)", op->action, op->target, op->client_name, op->id, other->client_name, other->originator, other->id, other->total_timeout); report_timeout_period(op, other->total_timeout); op->state = st_duplicate; } } static uint32_t fencing_active_peers(void) { uint32_t count = 0; crm_node_t *entry; GHashTableIter gIter; g_hash_table_iter_init(&gIter, crm_peer_cache); while (g_hash_table_iter_next(&gIter, NULL, (void **)&entry)) { if(fencing_peer_active(entry)) { count++; } } return count; } int stonith_manual_ack(xmlNode * msg, remote_fencing_op_t * op) { xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, msg, LOG_ERR); op->state = st_done; op->completed = time(NULL); op->delegate = strdup("a human"); crm_notice("Injecting manual confirmation that %s is safely off/down", crm_element_value(dev, F_STONITH_TARGET)); remote_op_done(op, msg, pcmk_ok, FALSE); /* Replies are sent via done_cb->stonith_send_async_reply()->do_local_reply() */ return -EINPROGRESS; } /*! * \internal * \brief Create a new remote stonith op * \param client, he local stonith client id that initaited the operation * \param request, The request from the client that started the operation * \param peer, Is this operation owned by another stonith peer? Operations * owned by other peers are stored on all the stonith nodes, but only the * owner executes the operation. All the nodes get the results to the operation * once the owner finishes executing it. 
*/ void * create_remote_stonith_op(const char *client, xmlNode * request, gboolean peer) { remote_fencing_op_t *op = NULL; xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, request, LOG_TRACE); int call_options = 0; if (remote_op_list == NULL) { remote_op_list = g_hash_table_new_full(crm_str_hash, g_str_equal, NULL, free_remote_op); } /* If this operation is owned by another node, check to make * sure we haven't already created this operation. */ if (peer && dev) { const char *op_id = crm_element_value(dev, F_STONITH_REMOTE_OP_ID); CRM_CHECK(op_id != NULL, return NULL); op = g_hash_table_lookup(remote_op_list, op_id); if (op) { crm_debug("%s already exists", op_id); return op; } } op = calloc(1, sizeof(remote_fencing_op_t)); crm_element_value_int(request, F_STONITH_TIMEOUT, (int *)&(op->base_timeout)); if (peer && dev) { op->id = crm_element_value_copy(dev, F_STONITH_REMOTE_OP_ID); } else { op->id = crm_generate_uuid(); } g_hash_table_replace(remote_op_list, op->id, op); CRM_LOG_ASSERT(g_hash_table_lookup(remote_op_list, op->id) != NULL); crm_trace("Created %s", op->id); op->state = st_query; op->replies_expected = fencing_active_peers(); op->action = crm_element_value_copy(dev, F_STONITH_ACTION); op->originator = crm_element_value_copy(dev, F_STONITH_ORIGIN); op->delegate = crm_element_value_copy(dev, F_STONITH_DELEGATE); /* May not be set */ op->created = time(NULL); if (op->originator == NULL) { /* Local or relayed request */ op->originator = strdup(stonith_our_uname); } CRM_LOG_ASSERT(client != NULL); if (client) { op->client_id = strdup(client); } op->client_name = crm_element_value_copy(request, F_STONITH_CLIENTNAME); op->target = crm_element_value_copy(dev, F_STONITH_TARGET); op->request = copy_xml(request); /* TODO: Figure out how to avoid this */ crm_element_value_int(request, F_STONITH_CALLOPTS, &call_options); op->call_options = call_options; crm_element_value_int(request, F_STONITH_CALLID, (int *)&(op->client_callid)); crm_trace("%s new stonith 
op: %s - %s of %s for %s", (peer && dev) ? "Recorded" : "Generated", op->id, op->action, op->target, op->client_name); if (op->call_options & st_opt_cs_nodeid) { int nodeid = crm_atoi(op->target, NULL); crm_node_t *node = crm_get_peer(nodeid, NULL); /* Ensure the conversion only happens once */ op->call_options &= ~st_opt_cs_nodeid; if (node && node->uname) { free(op->target); op->target = strdup(node->uname); } else { crm_warn("Could not expand nodeid '%s' into a host name (%p)", op->target, node); } } /* check to see if this is a duplicate operation of another in-flight operation */ merge_duplicates(op); return op; } remote_fencing_op_t * initiate_remote_stonith_op(crm_client_t * client, xmlNode * request, gboolean manual_ack) { int query_timeout = 0; xmlNode *query = NULL; const char *client_id = NULL; remote_fencing_op_t *op = NULL; if (client) { client_id = client->id; } else { client_id = crm_element_value(request, F_STONITH_CLIENTID); } CRM_LOG_ASSERT(client_id != NULL); op = create_remote_stonith_op(client_id, request, FALSE); op->owner = TRUE; if (manual_ack) { crm_notice("Initiating manual confirmation for %s: %s", op->target, op->id); return op; } CRM_CHECK(op->action, return NULL); if (stonith_topology_next(op) != pcmk_ok) { op->state = st_failed; } switch (op->state) { case st_failed: crm_warn("Initiation of remote operation %s for %s: failed (%s)", op->action, op->target, op->id); remote_op_done(op, NULL, -EINVAL, FALSE); return op; case st_duplicate: crm_info("Initiating remote operation %s for %s: %s (duplicate)", op->action, op->target, op->id); return op; default: crm_notice("Initiating remote operation %s for %s: %s (%d)", op->action, op->target, op->id, op->state); } query = stonith_create_op(op->client_callid, op->id, STONITH_OP_QUERY, NULL, op->call_options); crm_xml_add(query, F_STONITH_REMOTE_OP_ID, op->id); crm_xml_add(query, F_STONITH_TARGET, op->target); - crm_xml_add(query, F_STONITH_ACTION, op->action); + crm_xml_add(query, 
F_STONITH_ACTION, op_requested_action(op)); crm_xml_add(query, F_STONITH_ORIGIN, op->originator); crm_xml_add(query, F_STONITH_CLIENTID, op->client_id); crm_xml_add(query, F_STONITH_CLIENTNAME, op->client_name); crm_xml_add_int(query, F_STONITH_TIMEOUT, op->base_timeout); send_cluster_message(NULL, crm_msg_stonith_ng, query, FALSE); free_xml(query); query_timeout = op->base_timeout * TIMEOUT_MULTIPLY_FACTOR; op->query_timer = g_timeout_add((1000 * query_timeout), remote_op_query_timeout, op); return op; } enum find_best_peer_options { /*! Skip checking the target peer for capable fencing devices */ FIND_PEER_SKIP_TARGET = 0x0001, /*! Only check the target peer for capable fencing devices */ FIND_PEER_TARGET_ONLY = 0x0002, /*! Skip peers and devices that are not verified */ FIND_PEER_VERIFIED_ONLY = 0x0004, }; static st_query_result_t * find_best_peer(const char *device, remote_fencing_op_t * op, enum find_best_peer_options options) { GListPtr iter = NULL; gboolean verified_devices_only = (options & FIND_PEER_VERIFIED_ONLY) ? 
TRUE : FALSE; if (!device && is_set(op->call_options, st_opt_topology)) { return NULL; } for (iter = op->query_results; iter != NULL; iter = iter->next) { st_query_result_t *peer = iter->data; crm_trace("Testing result from %s for %s with %d devices: %d %x", peer->host, op->target, peer->ndevices, peer->tried, options); if ((options & FIND_PEER_SKIP_TARGET) && safe_str_eq(peer->host, op->target)) { continue; } if ((options & FIND_PEER_TARGET_ONLY) && safe_str_neq(peer->host, op->target)) { continue; } if (is_set(op->call_options, st_opt_topology)) { if (grab_peer_device(op, peer, device, verified_devices_only)) { return peer; } } else if ((peer->tried == FALSE) && count_peer_devices(op, peer, verified_devices_only)) { /* No topology: Use the current best peer */ crm_trace("Simple fencing"); return peer; } } return NULL; } static st_query_result_t * stonith_choose_peer(remote_fencing_op_t * op) { const char *device = NULL; st_query_result_t *peer = NULL; uint32_t active = fencing_active_peers(); do { if (op->devices) { device = op->devices->data; crm_trace("Checking for someone to fence (%s) %s with %s", op->action, op->target, device); } else { crm_trace("Checking for someone to fence (%s) %s", op->action, op->target); } /* Best choice is a peer other than the target with verified access */ peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET|FIND_PEER_VERIFIED_ONLY); if (peer) { crm_trace("Found verified peer %s for %s", peer->host, device?device:""); return peer; } if(op->query_timer != 0 && op->replies < QB_MIN(op->replies_expected, active)) { crm_trace("Waiting before looking for unverified devices to fence %s", op->target); return NULL; } /* If no other peer has verified access, next best is unverified access */ peer = find_best_peer(device, op, FIND_PEER_SKIP_TARGET); if (peer) { crm_trace("Found best unverified peer %s", peer->host); return peer; } /* If no other peer can do it, last option is self-fencing * (which is never allowed for the "on" phase of a 
remapped reboot) */ if (op->phase != st_phase_on) { peer = find_best_peer(device, op, FIND_PEER_TARGET_ONLY); if (peer) { crm_trace("%s will fence itself", peer->host); return peer; } } /* Try the next fencing level if there is one (unless we're in the "on" * phase of a remapped "reboot", because we ignore errors in that case) */ } while ((op->phase != st_phase_on) && is_set(op->call_options, st_opt_topology) && stonith_topology_next(op) == pcmk_ok); crm_notice("Couldn't find anyone to fence (%s) %s with %s", op->action, op->target, (device? device : "any device")); return NULL; } static int get_device_timeout(const remote_fencing_op_t *op, const st_query_result_t *peer, const char *device) { device_properties_t *props; if (!peer || !device) { return op->base_timeout; } props = g_hash_table_lookup(peer->devices, device); if (!props) { return op->base_timeout; } return (props->custom_action_timeout[op->phase]? props->custom_action_timeout[op->phase] : op->base_timeout) + props->delay_max[op->phase]; } struct timeout_data { const remote_fencing_op_t *op; const st_query_result_t *peer; int total_timeout; }; /*! * \internal * \brief Add timeout to a total if device has not been executed yet * * \param[in] key GHashTable key (device ID) * \param[in] value GHashTable value (device properties) * \param[in] user_data Timeout data */ static void add_device_timeout(gpointer key, gpointer value, gpointer user_data) { const char *device_id = key; device_properties_t *props = value; struct timeout_data *timeout = user_data; if (!props->executed[timeout->op->phase] && !props->disallowed[timeout->op->phase]) { timeout->total_timeout += get_device_timeout(timeout->op, timeout->peer, device_id); } } static int get_peer_timeout(const remote_fencing_op_t *op, const st_query_result_t *peer) { struct timeout_data timeout; timeout.op = op; timeout.peer = peer; timeout.total_timeout = 0; g_hash_table_foreach(peer->devices, add_device_timeout, &timeout); return (timeout.total_timeout? 
timeout.total_timeout : op->base_timeout); } static int get_op_total_timeout(const remote_fencing_op_t *op, const st_query_result_t *chosen_peer) { int total_timeout = 0; stonith_topology_t *tp = find_topology_for_host(op->target); if (is_set(op->call_options, st_opt_topology) && tp) { int i; GListPtr device_list = NULL; GListPtr iter = NULL; /* Yep, this looks scary, nested loops all over the place. * Here is what is going on. * Loop1: Iterate through fencing levels. * Loop2: If a fencing level has devices, loop through each device * Loop3: For each device in a fencing level, see what peer owns it * and what that peer has reported the timeout is for the device. */ for (i = 0; i < ST_LEVEL_MAX; i++) { if (!tp->levels[i]) { continue; } for (device_list = tp->levels[i]; device_list; device_list = device_list->next) { for (iter = op->query_results; iter != NULL; iter = iter->next) { const st_query_result_t *peer = iter->data; if (find_peer_device(op, peer, device_list->data)) { total_timeout += get_device_timeout(op, peer, device_list->data); break; } } /* End Loop3: match device with peer that owns device, find device's timeout period */ } /* End Loop2: iterate through devices at a specific level */ } /*End Loop1: iterate through fencing levels */ } else if (chosen_peer) { total_timeout = get_peer_timeout(op, chosen_peer); } else { total_timeout = op->base_timeout; } return total_timeout ? total_timeout : op->base_timeout; } static void report_timeout_period(remote_fencing_op_t * op, int op_timeout) { GListPtr iter = NULL; xmlNode *update = NULL; const char *client_node = NULL; const char *client_id = NULL; const char *call_id = NULL; if (op->call_options & st_opt_sync_call) { /* There is no reason to report the timeout for a syncronous call. It * is impossible to use the reported timeout to do anything when the client * is blocking for the response. This update is only important for * async calls that require a callback to report the results in. 
*/ return; } else if (!op->request) { return; } crm_trace("Reporting timeout for %s.%.8s", op->client_name, op->id); client_node = crm_element_value(op->request, F_STONITH_CLIENTNODE); call_id = crm_element_value(op->request, F_STONITH_CALLID); client_id = crm_element_value(op->request, F_STONITH_CLIENTID); if (!client_node || !call_id || !client_id) { return; } if (safe_str_eq(client_node, stonith_our_uname)) { /* The client is connected to this node, send the update direclty to them */ do_stonith_async_timeout_update(client_id, call_id, op_timeout); return; } /* The client is connected to another node, relay this update to them */ update = stonith_create_op(op->client_callid, op->id, STONITH_OP_TIMEOUT_UPDATE, NULL, 0); crm_xml_add(update, F_STONITH_REMOTE_OP_ID, op->id); crm_xml_add(update, F_STONITH_CLIENTID, client_id); crm_xml_add(update, F_STONITH_CALLID, call_id); crm_xml_add_int(update, F_STONITH_TIMEOUT, op_timeout); send_cluster_message(crm_get_peer(0, client_node), crm_msg_stonith_ng, update, FALSE); free_xml(update); for (iter = op->duplicates; iter != NULL; iter = iter->next) { remote_fencing_op_t *dup = iter->data; crm_trace("Reporting timeout for duplicate %s.%.8s", dup->client_name, dup->id); report_timeout_period(iter->data, op_timeout); } } /* * \internal * \brief Advance an operation to the next device in its topology * * \param[in,out] op Operation to advance * \param[in] device ID of device just completed * \param[in] msg XML reply that contained device result (if available) * \param[in] rc Return code of device's execution */ static void advance_op_topology(remote_fencing_op_t *op, const char *device, xmlNode *msg, int rc) { /* Advance to the next device at this topology level, if any */ if (op->devices) { op->devices = op->devices->next; } /* If this device was required, it's not anymore */ remove_required_device(op, device); /* If there are no more devices at this topology level, * run through any required devices not already executed */ if 
(op->devices == NULL) { op->devices = op->required_list[op->phase]; } + if ((op->devices == NULL) && (op->phase == st_phase_off)) { + /* We're done with this level and with required devices, but we had + * remapped "reboot" to "off", so start over with "on". If any devices + * need to be turned back on, op->devices will be non-NULL after this. + */ + op_phase_on(op); + } + if (op->devices) { /* Necessary devices remain, so execute the next one */ crm_trace("Next for %s on behalf of %s@%s (rc was %d)", op->target, op->originator, op->client_name, rc); call_remote_stonith(op, NULL); } else { - /* We're done with all devices, so finalize operation */ + /* We're done with all devices and phases, so finalize operation */ crm_trace("Marking complex fencing op for %s as complete", op->target); op->state = st_done; remote_op_done(op, msg, rc, FALSE); } } void call_remote_stonith(remote_fencing_op_t * op, st_query_result_t * peer) { const char *device = NULL; int timeout = op->base_timeout; crm_trace("State for %s.%.8s: %s %d", op->target, op->client_name, op->id, op->state); if (peer == NULL && !is_set(op->call_options, st_opt_topology)) { peer = stonith_choose_peer(op); } if (!op->op_timer_total) { int total_timeout = get_op_total_timeout(op, peer); op->total_timeout = TIMEOUT_MULTIPLY_FACTOR * total_timeout; op->op_timer_total = g_timeout_add(1000 * op->total_timeout, remote_op_timeout, op); report_timeout_period(op, op->total_timeout); crm_info("Total remote op timeout set to %d for fencing of node %s for %s.%.8s", total_timeout, op->target, op->client_name, op->id); } if (is_set(op->call_options, st_opt_topology) && op->devices) { /* Ignore any peer preference, they might not have the device we need */ /* When using topology, stonith_choose_peer() removes the device from * further consideration, so be sure to calculate timeout beforehand */ peer = stonith_choose_peer(op); device = op->devices->data; timeout = get_device_timeout(op, peer, device); } if (peer) { int 
timeout_one = 0; xmlNode *remote_op = stonith_create_op(op->client_callid, op->id, STONITH_OP_FENCE, NULL, 0); crm_xml_add(remote_op, F_STONITH_REMOTE_OP_ID, op->id); crm_xml_add(remote_op, F_STONITH_TARGET, op->target); crm_xml_add(remote_op, F_STONITH_ACTION, op->action); crm_xml_add(remote_op, F_STONITH_ORIGIN, op->originator); crm_xml_add(remote_op, F_STONITH_CLIENTID, op->client_id); crm_xml_add(remote_op, F_STONITH_CLIENTNAME, op->client_name); crm_xml_add_int(remote_op, F_STONITH_TIMEOUT, timeout); crm_xml_add_int(remote_op, F_STONITH_CALLOPTS, op->call_options); if (device) { timeout_one = TIMEOUT_MULTIPLY_FACTOR * get_device_timeout(op, peer, device); crm_info("Requesting that %s perform op %s %s with %s for %s (%ds)", peer->host, op->action, op->target, device, op->client_name, timeout_one); crm_xml_add(remote_op, F_STONITH_DEVICE, device); crm_xml_add(remote_op, F_STONITH_MODE, "slave"); } else { timeout_one = TIMEOUT_MULTIPLY_FACTOR * get_peer_timeout(op, peer); crm_info("Requesting that %s perform op %s %s for %s (%ds, %ds)", peer->host, op->action, op->target, op->client_name, timeout_one, stonith_watchdog_timeout_ms); crm_xml_add(remote_op, F_STONITH_MODE, "smart"); } op->state = st_exec; if (op->op_timer_one) { g_source_remove(op->op_timer_one); } if(stonith_watchdog_timeout_ms > 0 && device && safe_str_eq(device, "watchdog")) { crm_notice("Waiting %ds for %s to self-fence (%s) for %s.%.8s (%p)", stonith_watchdog_timeout_ms/1000, op->target, op->action, op->client_name, op->id, device); op->op_timer_one = g_timeout_add(stonith_watchdog_timeout_ms, remote_op_watchdog_done, op); /* TODO check devices to verify watchdog will be in use */ } else if(stonith_watchdog_timeout_ms > 0 && safe_str_eq(peer->host, op->target) && safe_str_neq(op->action, "on")) { crm_notice("Waiting %ds for %s to self-fence (%s) for %s.%.8s (%p)", stonith_watchdog_timeout_ms/1000, op->target, op->action, op->client_name, op->id, device); op->op_timer_one = 
g_timeout_add(stonith_watchdog_timeout_ms, remote_op_watchdog_done, op); } else { op->op_timer_one = g_timeout_add((1000 * timeout_one), remote_op_timeout_one, op); } send_cluster_message(crm_get_peer(0, peer->host), crm_msg_stonith_ng, remote_op, FALSE); peer->tried = TRUE; free_xml(remote_op); return; + } else if (op->phase == st_phase_on) { + /* A remapped "on" cannot be executed, but the node was already + * turned off successfully, so ignore the error and continue. + */ + crm_warn("Ignoring %s 'on' failure (no capable peers) for %s after successful 'off'", + device, op->target); + advance_op_topology(op, device, NULL, pcmk_ok); + return; + } else if (op->owner == FALSE) { crm_err("Fencing (%s) of %s for %s is not ours to control", op->action, op->target, op->client_name); } else if (op->query_timer == 0) { /* We've exhausted all available peers */ crm_info("No remaining peers capable of fencing (%s) %s for %s (%d)", op->target, op->action, op->client_name, op->state); CRM_LOG_ASSERT(op->state < st_done); remote_op_timeout(op); } else if(op->replies >= op->replies_expected || op->replies >= fencing_active_peers()) { int rc = -EHOSTUNREACH; /* if the operation never left the query state, * but we have all the expected replies, then no devices * are available to execute the fencing operation. 
*/ if(stonith_watchdog_timeout_ms && (device == NULL || safe_str_eq(device, "watchdog"))) { crm_notice("Waiting %ds for %s to self-fence (%s) for %s.%.8s (%p)", stonith_watchdog_timeout_ms/1000, op->target, op->action, op->client_name, op->id, device); op->op_timer_one = g_timeout_add(stonith_watchdog_timeout_ms, remote_op_watchdog_done, op); return; } if (op->state == st_query) { crm_info("None of the %d peers have devices capable of fencing (%s) %s for %s (%d)", op->replies, op->action, op->target, op->client_name, op->state); rc = -ENODEV; } else { crm_info("None of the %d peers are capable of fencing (%s) %s for %s (%d)", op->replies, op->action, op->target, op->client_name, op->state); } op->state = st_failed; remote_op_done(op, NULL, rc, FALSE); } else if (device) { crm_info("Waiting for additional peers capable of fencing (%s) %s with %s for %s.%.8s", op->action, op->target, device, op->client_name, op->id); } else { crm_info("Waiting for additional peers capable of fencing (%s) %s for %s%.8s", op->action, op->target, op->client_name, op->id); } } /*! * \internal * \brief Comparison function for sorting query results * * \param[in] a GList item to compare * \param[in] b GList item to compare * * \return Per the glib documentation, "a negative integer if the first value * comes before the second, 0 if they are equal, or a positive integer * if the first value comes after the second." */ static gint sort_peers(gconstpointer a, gconstpointer b) { const st_query_result_t *peer_a = a; const st_query_result_t *peer_b = b; return (peer_b->ndevices - peer_a->ndevices); } /*! 
* \internal * \brief Determine if all the devices in the topology are found or not */ static gboolean all_topology_devices_found(remote_fencing_op_t * op) { GListPtr device = NULL; GListPtr iter = NULL; device_properties_t *match = NULL; stonith_topology_t *tp = NULL; gboolean skip_target = FALSE; int i; tp = find_topology_for_host(op->target); if (!tp) { return FALSE; } if (safe_str_eq(op->action, "off") || safe_str_eq(op->action, "reboot")) { /* Don't count the devices on the target node if we are killing * the target node. */ skip_target = TRUE; } for (i = 0; i < ST_LEVEL_MAX; i++) { for (device = tp->levels[i]; device; device = device->next) { match = NULL; for (iter = op->query_results; iter && !match; iter = iter->next) { st_query_result_t *peer = iter->data; if (skip_target && safe_str_eq(peer->host, op->target)) { continue; } match = find_peer_device(op, peer, device->data); } if (!match) { return FALSE; } } } return TRUE; } /* * \internal * \brief Parse action-specific device properties from XML * * \param[in] msg XML element containing the properties * \param[in] peer Name of peer that sent XML (for logs) * \param[in] device Device ID (for logs) - * \param[in] action Action the properties relate to + * \param[in] action Action the properties relate to (for logs) * \param[in] phase Phase the properties relate to * \param[in,out] props Device properties to update */ static void parse_action_specific(xmlNode *xml, const char *peer, const char *device, const char *action, remote_fencing_op_t *op, enum st_remap_phase phase, device_properties_t *props) { int required; props->custom_action_timeout[phase] = 0; crm_element_value_int(xml, F_STONITH_ACTION_TIMEOUT, &props->custom_action_timeout[phase]); if (props->custom_action_timeout[phase]) { crm_trace("Peer %s with device %s returned %s action timeout %d", peer, device, action, props->custom_action_timeout[phase]); } props->delay_max[phase] = 0; crm_element_value_int(xml, F_STONITH_DELAY_MAX, 
&props->delay_max[phase]); if (props->delay_max[phase]) { crm_trace("Peer %s with device %s returned maximum of random delay %d for %s", peer, device, props->delay_max[phase], action); } required = 0; crm_element_value_int(xml, F_STONITH_DEVICE_REQUIRED, &required); if (required) { /* If the action is marked as required, add the device to the * operation's list of required devices for this phase. We use this * for unfencing when executing a topology. In phase 0 (requested * action) or phase 1 (remapped "off"), required devices get executed * regardless of their topology level; in phase 2 (remapped "on"), * required devices are not attempted, because the cluster will * execute them automatically later. */ crm_trace("Peer %s requires device %s to execute for action %s", peer, device, action); add_required_device(op, phase, device); } + + /* If a reboot is remapped to off+on, it's possible that a node is allowed + * to perform one action but not another. + */ + if (crm_is_true(crm_element_value(xml, F_STONITH_ACTION_DISALLOWED))) { + props->disallowed[phase] = TRUE; + crm_trace("Peer %s is disallowed from executing %s for device %s", + peer, action, device); + } } /* * \internal * \brief Parse one device's properties from peer's XML query reply * * \param[in] xml XML node containing device properties * \param[in,out] op Operation that query and reply relate to * \param[in,out] result Peer's results * \param[in] device ID of device being parsed */ static void add_device_properties(xmlNode *xml, remote_fencing_op_t *op, st_query_result_t *result, const char *device) { + xmlNode *child; int verified = 0; device_properties_t *props = calloc(1, sizeof(device_properties_t)); /* Add a new entry to this result's devices list */ CRM_ASSERT(props != NULL); g_hash_table_insert(result->devices, strdup(device), props); /* Peers with verified (monitored) access will be preferred */ crm_element_value_int(xml, F_STONITH_DEVICE_VERIFIED, &verified); if (verified) { crm_trace("Peer %s 
has confirmed a verified device %s", result->host, device); props->verified = TRUE; } /* Parse action-specific device properties */ - parse_action_specific(xml, result->host, device, op->action, + parse_action_specific(xml, result->host, device, op_requested_action(op), op, st_phase_requested, props); + for (child = __xml_first_child(xml); child != NULL; child = __xml_next(child)) { + /* Replies for "reboot" operations will include the action-specific + * values for "off" and "on" in child elements, just in case the reboot + * winds up getting remapped. + */ + if (safe_str_eq(ID(child), "off")) { + parse_action_specific(child, result->host, device, "off", + op, st_phase_off, props); + } else if (safe_str_eq(ID(child), "on")) { + parse_action_specific(child, result->host, device, "on", + op, st_phase_on, props); + } + } } /* * \internal * \brief Parse a peer's XML query reply and add it to operation's results * * \param[in,out] op Operation that query and reply relate to * \param[in] host Name of peer that sent this reply * \param[in] ndevices Number of devices expected in reply * \param[in] xml XML node containing device list * * \return Newly allocated result structure with parsed reply */ static st_query_result_t * add_result(remote_fencing_op_t *op, const char *host, int ndevices, xmlNode *xml) { st_query_result_t *result = calloc(1, sizeof(st_query_result_t)); xmlNode *child; CRM_CHECK(result != NULL, return NULL); result->host = strdup(host); result->devices = g_hash_table_new_full(crm_str_hash, g_str_equal, free, free); /* Each child element describes one capable device available to the peer */ for (child = __xml_first_child(xml); child != NULL; child = __xml_next(child)) { const char *device = ID(child); if (device) { add_device_properties(child, op, result, device); } } result->ndevices = g_hash_table_size(result->devices); CRM_CHECK(ndevices == result->ndevices, crm_err("Query claimed to have %d devices but %d found", ndevices, result->ndevices)); 
op->query_results = g_list_insert_sorted(op->query_results, result, sort_peers); return result; } /* * \internal * \brief Handle a peer's reply to our fencing query * * Parse a query result from XML and store it in the remote operation * table, and when enough replies have been received, issue a fencing request. * * \param[in] msg XML reply received * * \return pcmk_ok on success, -errno on error * * \note See initiate_remote_stonith_op() for how the XML query was initially * formed, and stonith_query() for how the peer formed its XML reply. */ int process_remote_stonith_query(xmlNode * msg) { int ndevices = 0; gboolean host_is_target = FALSE; gboolean have_all_replies = FALSE; const char *id = NULL; const char *host = NULL; remote_fencing_op_t *op = NULL; st_query_result_t *result = NULL; uint32_t replies_expected; xmlNode *dev = get_xpath_object("//@" F_STONITH_REMOTE_OP_ID, msg, LOG_ERR); CRM_CHECK(dev != NULL, return -EPROTO); id = crm_element_value(dev, F_STONITH_REMOTE_OP_ID); CRM_CHECK(id != NULL, return -EPROTO); dev = get_xpath_object("//@" F_STONITH_AVAILABLE_DEVICES, msg, LOG_ERR); CRM_CHECK(dev != NULL, return -EPROTO); crm_element_value_int(dev, F_STONITH_AVAILABLE_DEVICES, &ndevices); op = g_hash_table_lookup(remote_op_list, id); if (op == NULL) { crm_debug("Unknown or expired remote op: %s", id); return -EOPNOTSUPP; } replies_expected = QB_MIN(op->replies_expected, fencing_active_peers()); if ((++op->replies >= replies_expected) && (op->state == st_query)) { have_all_replies = TRUE; } host = crm_element_value(msg, F_ORIG); host_is_target = safe_str_eq(host, op->target); crm_info("Query result %d of %d from %s for %s/%s (%d devices) %s", op->replies, replies_expected, host, op->target, op->action, ndevices, id); if (ndevices > 0) { result = add_result(op, host, ndevices, dev); } if (is_set(op->call_options, st_opt_topology)) { /* If we start the fencing before all the topology results are in, * it is possible fencing levels will be skipped because of 
the missing * query results. */ if (op->state == st_query && all_topology_devices_found(op)) { /* All the query results are in for the topology, start the fencing ops. */ crm_trace("All topology devices found"); call_remote_stonith(op, result); } else if (have_all_replies) { crm_info("All topology query replies have arrived, continuing (%d expected/%d received) ", replies_expected, op->replies); call_remote_stonith(op, NULL); } } else if (op->state == st_query) { int nverified = count_peer_devices(op, result, TRUE); /* We have a result for a non-topology fencing op that looks promising, * go ahead and start fencing before query timeout */ if (result && (host_is_target == FALSE) && nverified) { /* we have a verified device living on a peer that is not the target */ crm_trace("Found %d verified devices", nverified); call_remote_stonith(op, result); } else if (have_all_replies) { crm_info("All query replies have arrived, continuing (%d expected/%d received) ", replies_expected, op->replies); call_remote_stonith(op, NULL); } else { crm_trace("Waiting for more peer results before launching fencing operation"); } } else if (result && (op->state == st_done)) { crm_info("Discarding query result from %s (%d devices): Operation is in state %d", result->host, result->ndevices, op->state); } return pcmk_ok; } /* * \internal * \brief Handle a peer's reply to a fencing request * * Parse a fencing reply from XML, and either finalize the operation * or attempt another device as appropriate. 
* * \param[in] msg XML reply received * * \return pcmk_ok on success, -errno on error */ int process_remote_stonith_exec(xmlNode * msg) { int rc = 0; const char *id = NULL; const char *device = NULL; remote_fencing_op_t *op = NULL; xmlNode *dev = get_xpath_object("//@" F_STONITH_REMOTE_OP_ID, msg, LOG_ERR); CRM_CHECK(dev != NULL, return -EPROTO); id = crm_element_value(dev, F_STONITH_REMOTE_OP_ID); CRM_CHECK(id != NULL, return -EPROTO); dev = get_xpath_object("//@" F_STONITH_RC, msg, LOG_ERR); CRM_CHECK(dev != NULL, return -EPROTO); crm_element_value_int(dev, F_STONITH_RC, &rc); device = crm_element_value(dev, F_STONITH_DEVICE); if (remote_op_list) { op = g_hash_table_lookup(remote_op_list, id); } if (op == NULL && rc == pcmk_ok) { /* Record successful fencing operations */ const char *client_id = crm_element_value(dev, F_STONITH_CLIENTID); op = create_remote_stonith_op(client_id, dev, TRUE); } if (op == NULL) { /* Could be for an event that began before we started */ /* TODO: Record the op for later querying */ crm_info("Unknown or expired remote op: %s", id); return -EOPNOTSUPP; } if (op->devices && device && safe_str_neq(op->devices->data, device)) { crm_err ("Received outdated reply for device %s (instead of %s) to %s node %s. Operation already timed out at remote level.", device, op->devices->data, op->action, op->target); return rc; } if (safe_str_eq(crm_element_value(msg, F_SUBTYPE), "broadcast")) { crm_debug("Marking call to %s for %s on behalf of %s@%s.%.8s: %s (%d)", op->action, op->target, op->client_name, op->id, op->originator, pcmk_strerror(rc), rc); if (rc == pcmk_ok) { op->state = st_done; } else { op->state = st_failed; } remote_op_done(op, msg, rc, FALSE); return pcmk_ok; } else if (safe_str_neq(op->originator, stonith_our_uname)) { /* If this isn't a remote level broadcast, and we are not the * originator of the operation, we should not be receiving this msg. 
*/ crm_err ("%s received non-broadcast fencing result for operation it does not own (device %s targeting %s)", stonith_our_uname, device, op->target); return rc; } if (is_set(op->call_options, st_opt_topology)) { const char *device = crm_element_value(msg, F_STONITH_DEVICE); crm_notice("Call to %s for %s on behalf of %s@%s: %s (%d)", device, op->target, op->client_name, op->originator, pcmk_strerror(rc), rc); /* We own the op, and it is complete. broadcast the result to all nodes * and notify our local clients. */ if (op->state == st_done) { remote_op_done(op, msg, rc, FALSE); return rc; } + if ((op->phase == 2) && (rc != pcmk_ok)) { + /* A remapped "on" failed, but the node was already turned off + * successfully, so ignore the error and continue. + */ + crm_warn("Ignoring %s 'on' failure (exit code %d) for %s after successful 'off'", + device, rc, op->target); + rc = pcmk_ok; + } + if (rc == pcmk_ok) { /* An operation completed successfully. Try another device if * necessary, otherwise mark the operation as done. */ advance_op_topology(op, device, msg, rc); return rc; } else { /* This device failed, time to try another topology level. If no other * levels are available, mark this operation as failed and report results. */ if (stonith_topology_next(op) != pcmk_ok) { op->state = st_failed; remote_op_done(op, msg, rc, FALSE); return rc; } } } else if (rc == pcmk_ok && op->devices == NULL) { crm_trace("All done for %s", op->target); op->state = st_done; remote_op_done(op, msg, rc, FALSE); return rc; } else if (rc == -ETIME && op->devices == NULL) { /* If the operation timed out don't bother retrying other peers. 
          */
         op->state = st_failed;
         remote_op_done(op, msg, rc, FALSE);
         return rc;
     } else {
         /* fall-through and attempt other fencing action using another peer */
     }
     /* Retry on failure */
     crm_trace("Next for %s on behalf of %s@%s (rc was %d)", op->target, op->originator,
               op->client_name, rc);
     call_remote_stonith(op, NULL);
     return rc;
 }
 int
 stonith_fence_history(xmlNode * msg, xmlNode ** output)
 {
     int rc = 0;
     const char *target = NULL;
     xmlNode *dev = get_xpath_object("//@" F_STONITH_TARGET, msg, LOG_TRACE);
     if (dev) {
         int options = 0;
         target = crm_element_value(dev, F_STONITH_TARGET);
         crm_element_value_int(msg, F_STONITH_CALLOPTS, &options);
         if (target && (options & st_opt_cs_nodeid)) {
             int nodeid = crm_atoi(target, NULL);
             crm_node_t *node = crm_get_peer(nodeid, NULL);
             if (node) {
                 target = node->uname;
             }
         }
     }
     crm_trace("Looking for operations on %s in %p", target, remote_op_list);
     *output = create_xml_node(NULL, F_STONITH_HISTORY_LIST);
     if (remote_op_list) {
         GHashTableIter iter;
         remote_fencing_op_t *op = NULL;
         g_hash_table_iter_init(&iter, remote_op_list);
         while (g_hash_table_iter_next(&iter, NULL, (void **)&op)) {
             xmlNode *entry = NULL;
             if (target && strcmp(op->target, target) != 0) {
                 continue;
             }
             rc = 0;
             crm_trace("Attaching op %s", op->id);
             entry = create_xml_node(*output, STONITH_OP_EXEC);
             crm_xml_add(entry, F_STONITH_TARGET, op->target);
             crm_xml_add(entry, F_STONITH_ACTION, op->action);
             crm_xml_add(entry, F_STONITH_ORIGIN, op->originator);
             crm_xml_add(entry, F_STONITH_DELEGATE, op->delegate);
             crm_xml_add(entry, F_STONITH_CLIENTNAME, op->client_name);
             crm_xml_add_int(entry, F_STONITH_DATE, op->completed);
             crm_xml_add_int(entry, F_STONITH_STATE, op->state);
         }
     }
     return rc;
 }
 gboolean
 stonith_check_fence_tolerance(int tolerance, const char *target, const char *action)
 {
     GHashTableIter iter;
     time_t now = time(NULL);
     remote_fencing_op_t *rop = NULL;
     crm_trace("tolerance=%d, remote_op_list=%p", tolerance, remote_op_list);
     if (tolerance <= 0 || !remote_op_list || target == NULL ||
         action == NULL) {
         return FALSE;
     }
     g_hash_table_iter_init(&iter, remote_op_list);
     while (g_hash_table_iter_next(&iter, NULL, (void **)&rop)) {
         if (strcmp(rop->target, target) != 0) {
             continue;
         } else if (rop->state != st_done) {
             continue;
+        /* We don't have to worry about remapped reboots here
+         * because if state is done, any remapping has been undone
+         */
         } else if (strcmp(rop->action, action) != 0) {
             continue;
         } else if ((rop->completed + tolerance) < now) {
             continue;
         }
         crm_notice("Target %s was fenced (%s) less than %ds ago by %s on behalf of %s",
                    target, action, tolerance, rop->delegate, rop->originator);
         return TRUE;
     }
     return FALSE;
 }
diff --git a/include/crm/fencing/internal.h b/include/crm/fencing/internal.h
index a6f58b12eb..a59151b333 100644
--- a/include/crm/fencing/internal.h
+++ b/include/crm/fencing/internal.h
@@ -1,132 +1,134 @@
 /*
  * Copyright (C) 2011 Andrew Beekhof
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
  * License as published by the Free Software Foundation; either
  * version 2 of the License, or (at your option) any later version.
  *
  * This software is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  * General Public License for more details.
  *
  * You should have received a copy of the GNU General Public
  * License along with this library; if not, write to the Free Software
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
  */
 #ifndef STONITH_NG_INTERNAL__H
 # define STONITH_NG_INTERNAL__H
 # include
 # include
 struct stonith_action_s;
 typedef struct stonith_action_s stonith_action_t;
 stonith_action_t *stonith_action_create(const char *agent,
                                         const char *_action,
                                         const char *victim,
                                         uint32_t victim_nodeid,
                                         int timeout,
                                         GHashTable * device_args, GHashTable * port_map);
 GPid stonith_action_execute_async(stonith_action_t * action,
                                   void *userdata,
                                   void (*done) (GPid pid, int rc, const char *output,
                                                 gpointer user_data));
 int stonith_action_execute(stonith_action_t * action, int *agent_result, char **output);
 gboolean is_redhat_agent(const char *agent);
 xmlNode *create_level_registration_xml(const char *node, int level,
                                        stonith_key_value_t * device_list);
 xmlNode *create_device_registration_xml(const char *id, const char *namespace, const char *agent,
                                         stonith_key_value_t * params, const char *rsc_provides);
 # define ST_LEVEL_MAX 10
 # define F_STONITH_CLIENTID "st_clientid"
 # define F_STONITH_CALLOPTS "st_callopt"
 # define F_STONITH_CALLID "st_callid"
 # define F_STONITH_CALLDATA "st_calldata"
 # define F_STONITH_OPERATION "st_op"
 # define F_STONITH_TARGET "st_target"
 # define F_STONITH_REMOTE_OP_ID "st_remote_op"
 # define F_STONITH_RC "st_rc"
 /*! Timeout period per a device execution */
 # define F_STONITH_TIMEOUT "st_timeout"
 # define F_STONITH_TOLERANCE "st_tolerance"
 /*! Action specific timeout period returned in query of fencing devices. */
 # define F_STONITH_ACTION_TIMEOUT "st_action_timeout"
+/*! Host in query result is not allowed to run this action */
+# define F_STONITH_ACTION_DISALLOWED "st_action_disallowed"
 /*! Maximum of random fencing delay for a device */
 # define F_STONITH_DELAY_MAX "st_delay_max"
 /*!
 Has this device been verified using a monitor type
  * operation (monitor, list, status) */
 # define F_STONITH_DEVICE_VERIFIED "st_monitor_verified"
 /*! device is required for this action */
 # define F_STONITH_DEVICE_REQUIRED "st_required"
 /*! number of available devices in query result */
 # define F_STONITH_AVAILABLE_DEVICES "st-available-devices"
 # define F_STONITH_CALLBACK_TOKEN "st_async_id"
 # define F_STONITH_CLIENTNAME "st_clientname"
 # define F_STONITH_CLIENTNODE "st_clientnode"
 # define F_STONITH_NOTIFY_TYPE "st_notify_type"
 # define F_STONITH_NOTIFY_ACTIVATE "st_notify_activate"
 # define F_STONITH_NOTIFY_DEACTIVATE "st_notify_deactivate"
 # define F_STONITH_DELEGATE "st_delegate"
 /*! The node initiating the stonith operation. If an operation
  * is relayed, this is the last node the operation lands on. When
  * in standalone mode, origin is the client's id that originated the
  * operation. */
 # define F_STONITH_ORIGIN "st_origin"
 # define F_STONITH_HISTORY_LIST "st_history"
 # define F_STONITH_DATE "st_date"
 # define F_STONITH_STATE "st_state"
 # define F_STONITH_LEVEL "st_level"
 # define F_STONITH_ACTIVE "st_active"
 # define F_STONITH_DEVICE "st_device_id"
 # define F_STONITH_ACTION "st_device_action"
 # define F_STONITH_MODE "st_mode"
 # define T_STONITH_NG "stonith-ng"
 # define T_STONITH_REPLY "st-reply"
 /*! For async operations, an event from the server containing
  * the total amount of time the server is allowing for the operation
  * to take place is returned to the client.
  */
 # define T_STONITH_TIMEOUT_VALUE "st-async-timeout-value"
 # define T_STONITH_NOTIFY "st_notify"
 # define STONITH_ATTR_ARGMAP "pcmk_arg_map"
 # define STONITH_ATTR_HOSTARG "pcmk_host_argument"
 # define STONITH_ATTR_HOSTMAP "pcmk_host_map"
 # define STONITH_ATTR_HOSTLIST "pcmk_host_list"
 # define STONITH_ATTR_HOSTCHECK "pcmk_host_check"
 # define STONITH_ATTR_DELAY_MAX "pcmk_delay_max"
 # define STONITH_ATTR_ACTION_OP "action"
 # define STONITH_OP_EXEC "st_execute"
 # define STONITH_OP_TIMEOUT_UPDATE "st_timeout_update"
 # define STONITH_OP_QUERY "st_query"
 # define STONITH_OP_FENCE "st_fence"
 # define STONITH_OP_RELAY "st_relay"
 # define STONITH_OP_CONFIRM "st_confirm"
 # define STONITH_OP_DEVICE_ADD "st_device_register"
 # define STONITH_OP_DEVICE_DEL "st_device_remove"
 # define STONITH_OP_DEVICE_METADATA "st_device_metadata"
 # define STONITH_OP_FENCE_HISTORY "st_fence_history"
 # define STONITH_OP_LEVEL_ADD "st_level_add"
 # define STONITH_OP_LEVEL_DEL "st_level_remove"
 # define stonith_channel "st_command"
 # define stonith_channel_callback "st_callback"
 # define STONITH_WATCHDOG_AGENT "#watchdog"
 #endif
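
Reviewer note: the tolerance check in stonith_check_fence_tolerance(), earlier in this diff, can be read in isolation as the sketch below. It is not the Pacemaker implementation. The names struct sketch_op, enum sketch_state, and sketch_check_fence_tolerance are hypothetical, a plain array stands in for the remote_op_list hash table, and the current time is passed in as a parameter (the real function calls time(NULL)) so the logic can be exercised deterministically.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical, simplified stand-ins for the op state and remote_fencing_op_t. */
enum sketch_state { SKETCH_PENDING, SKETCH_DONE, SKETCH_FAILED };

struct sketch_op {
    const char *target;       /* node that was fenced */
    const char *action;       /* e.g. "off" or "reboot" */
    enum sketch_state state;  /* only SKETCH_DONE ops count */
    time_t completed;         /* completion time, in seconds */
};

/* Return true if 'target' was successfully fenced with 'action' no more than
 * 'tolerance' seconds before 'now'; mirrors the chain of "continue" checks
 * in the diff above. */
static bool
sketch_check_fence_tolerance(const struct sketch_op *ops, size_t n_ops,
                             int tolerance, const char *target,
                             const char *action, time_t now)
{
    if (tolerance <= 0 || ops == NULL || target == NULL || action == NULL) {
        return false;
    }
    for (size_t i = 0; i < n_ops; i++) {
        const struct sketch_op *op = &ops[i];

        if (strcmp(op->target, target) != 0) {
            continue;           /* different node */
        } else if (op->state != SKETCH_DONE) {
            continue;           /* pending or failed ops never satisfy the check */
        } else if (strcmp(op->action, action) != 0) {
            continue;           /* different fencing action */
        } else if ((op->completed + tolerance) < now) {
            continue;           /* completed too long ago */
        }
        return true;
    }
    return false;
}
```

Passing the clock in is purely a choice made for this sketch; it lets a caller probe the boundary (completed + tolerance versus now) without waiting on real time.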