diff --git a/man/sbd.8.pod.in b/man/sbd.8.pod.in index ff89c82..e4ad5f8 100644 --- a/man/sbd.8.pod.in +++ b/man/sbd.8.pod.in @@ -1,675 +1,675 @@ =head1 NAME sbd - STONITH Block Device daemon =head1 SYNOPSIS sbd <-d F> [options] C =head1 SUMMARY SBD provides a node fencing mechanism (Shoot the other node in the head, STONITH) for Pacemaker-based clusters through the exchange of messages via shared block storage such as for example a SAN, iSCSI, FCoE. This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers, and it can be used as a STONITH mechanism in all configurations that have reliable shared storage. SBD can also be used without any shared storage. In this mode, the watchdog device will be used to reset the node if it loses quorum, if any monitored daemon is lost and not recovered or if Pacemaker decides that the node requires fencing. The F binary implements both the daemon that watches the message slots as well as the management tool for interacting with the block storage device(s). This mode of operation is specified via the C parameter; some of these modes take additional parameters. To use SBD with shared storage, you must first C the messaging layout on one to three block devices. Second, configure F to list those devices (and possibly adjust other options), and restart the cluster stack on each node to ensure that C is started. Third, configure the C fencing resource in the Pacemaker CIB. Each of these steps is documented in more detail below the description of the command options. C can only be used as root. =head2 GENERAL OPTIONS =over =item B<-d> F Specify the block device(s) to be used. If you have more than one, specify this option up to three times. This parameter is mandatory for all modes, since SBD always needs a block device to interact with. This man page uses F, F, and F as example device names for brevity. 
However, in your production environment, you should instead always refer to them by using the long, stable device name (e.g., F). =item B<-v|-vv|-vvv> Enable verbose|debug|debug-library logging (optional) =item B<-h> Display a concise summary of C options. =item B<-n> I Set local node name; defaults to C. This should not need to be set. =item B<-R> Do B<NOT> enable realtime priority. By default, C runs at realtime priority, locks itself into memory, and also acquires highest IO priority to protect itself against interference from other processes on the system. This is a debugging-only option. =item B<-I> I Async IO timeout (defaults to 3 seconds, optional). You should not need to adjust this unless your IO setup is really very slow. (In daemon mode, the watchdog is refreshed when the majority of devices could be read within this time.) =back =head2 create Example usage: sbd -d /dev/sdc2 -d /dev/sdd3 create If you specify the I command, sbd will write a metadata header to the device(s) specified and also initialize the messaging slots for up to 255 nodes. B: This command will not prompt for confirmation. Roughly the first megabyte of the specified block device(s) will be overwritten immediately and without backup. This command accepts a few options to adjust the default timings that are written to the metadata (to ensure they are identical across all nodes accessing the device). =over =item B<-1> I Set watchdog timeout to N seconds. This depends mostly on your storage latency; the majority of devices must be successfully read within this time, or else the node will self-fence. If your sbd device(s) reside on a multipath setup or iSCSI, this should be the time required to detect a path failure. You may be able to reduce this if your device outages are independent, or if you are using the Pacemaker integration. =item B<-2> I Set slot allocation timeout to N seconds. You should not need to tune this. =item B<-3> I Set daemon loop timeout to N seconds.
You should not need to tune this. =item B<-4> I Set I timeout to N seconds. This should be twice the I timeout. This is the time after which a message written to a node's slot will be considered delivered. (Or long enough for the node to detect that it needed to self-fence.) This also affects the I in Pacemaker's CIB; see below. =back =head2 list Example usage: # sbd -d /dev/sda1 list 0 hex-0 clear 1 hex-7 clear 2 hex-9 clear List all allocated slots on device, and messages. You should see all cluster nodes that have ever been started against this device. Nodes that are currently running should have a I state; nodes that have been fenced, but not yet restarted, will show the appropriate fencing message. =head2 dump Example usage: # sbd -d /dev/sda1 dump ==Dumping header on disk /dev/sda1 Header version : 2 Number of slots : 255 Sector size : 512 Timeout (watchdog) : 15 Timeout (allocate) : 2 Timeout (loop) : 1 Timeout (msgwait) : 30 ==Header on disk /dev/sda1 is dumped Dump meta-data header from device. =head2 watch Example usage: sbd -d /dev/sdc2 -d /dev/sdd3 -P watch This command will make C start in daemon mode. It will constantly monitor the message slot of the local node for incoming messages, reachability, and optionally take Pacemaker's state into account. C B<must> be started on boot before the cluster stack! See below for enabling this according to your boot environment. The options for this mode are rarely specified directly on the command line, but most frequently set via F. It also constantly monitors connectivity to the storage device, and self-fences in case the partition becomes unreachable, guaranteeing that it does not disconnect from fencing messages. A node slot is automatically allocated on the device(s) the first time the daemon starts watching the device; hence, manual allocation is not usually required. If a watchdog is used together with the C as is strongly recommended, the watchdog is activated at initial start of the sbd daemon.
The watchdog is refreshed every time the majority of SBD devices has been successfully read. Using a watchdog provides additional protection against C crashing. If the Pacemaker integration is activated, C will B<not> self-fence if device majority is lost, if: =over =item 1. The partition the node is in is still quorate according to the CIB; =item 2. it is still quorate according to Corosync's node count; =item 3. the node itself is considered online and healthy by Pacemaker. =back This allows C to survive temporary outages of the majority of devices. However, while the cluster is in such a degraded state, it can neither successfully fence nor be shut down cleanly (as taking the cluster below the quorum threshold will immediately cause all remaining nodes to self-fence). In short, it will not tolerate any further faults. Please repair the system before continuing. There is one C process that acts as a master to which all watchers report; one per device to monitor the node's slot; and, optionally, one that handles the Pacemaker integration. =over =item B<-W> Enable or disable use of the system watchdog to protect against the sbd processes failing and the node being left in an undefined state. Specify this once to enable, twice to disable. Defaults to I. =item B<-w> F This can be used to override the default watchdog device used and should not usually be necessary. =item B<-p> F This option can be used to specify a pidfile for the main sbd process. =item B<-F> I Number of failures before a failing servant process will not be restarted immediately until the dampening delay has expired. If set to zero, servants will be restarted immediately and indefinitely. If set to one, a failed servant will be restarted once every B<-t> seconds. If set to a different value, the servant will be restarted that many times within the dampening period and then delay. Defaults to I<1>. =item B<-t> I Dampening delay before faulty servants are restarted.
Combined with C<-F 1>, the most logical way to tune the restart frequency of servant processes. Default is 5 seconds. If set to zero, processes will be restarted indefinitely and immediately. =item B<-P> Enable Pacemaker integration which checks Pacemaker quorum and node health. Specify this once to enable, twice to disable. Defaults to I. =item B<-S> I Set the start mode. (Defaults to I<0>.) If this is set to zero, sbd will always start up unconditionally, regardless of whether the node was previously fenced or not. If set to one, sbd will only start if the node was previously shutdown cleanly (as indicated by an exit request message in the slot), or if the slot is empty. A reset, crashdump, or power-off request in any slot will halt the start up. This is useful to prevent nodes from rejoining if they were faulty. The node must be manually "unfenced" by sending an empty message to it: sbd -d /dev/sda1 message node1 clear =item B<-s> I Set the start-up wait time for devices. (Defaults to I<120>.) Dynamic block devices such as iSCSI might not be fully initialized and present yet. This allows one to set a timeout for waiting for devices to appear on start-up. If set to 0, start-up will be aborted immediately if no devices are available. =item B<-Z> Enable trace mode. B Specifying this once will turn all reboots or power-offs, be they caused by self-fence decisions or messages, into a crashdump. Specifying this twice will just log them but not continue running. =item B<-T> By default, the daemon will set the watchdog timeout as specified in the device metadata. However, this does not work for every watchdog device. In this case, you must manually ensure that the watchdog timeout used by the system correctly matches the SBD settings, and then specify this option to allow C to continue with start-up. =item B<-5> I Warn if the time interval for tickling the watchdog exceeds this many seconds. 
Since the node is unable to log the watchdog expiry (it reboots immediately without a chance to write its logs to disk), this is very useful for getting an indication that the watchdog timeout is too short for the IO load of the system. -Default is 3 seconds, set to zero to disable. +Default is about 3/5 of watchdog timeout, set to zero to disable. =item B<-C> I Watchdog timeout to set before crashdumping. If SBD is set to crashdump instead of reboot - either via the trace mode settings or the I fencing agent's parameter -, SBD will adjust the watchdog timeout to this setting before triggering the dump. Otherwise, the watchdog might trigger and prevent a successful crashdump from ever being written. Set to zero (= default) to disable. =item B<-r> I Actions to be executed when the watchers don't timely report to the sbd master process or one of the watchers detects that the master process has died. Set timeout-action to comma-separated combination of noflush|flush plus reboot|crashdump|off. If just one of both is given the other stays at the default. This doesn't affect actions like off, crashdump, reboot explicitly triggered via message slots. And it does as well not configure the action a watchdog would trigger should it run off (there is no generic interface). Defaults to flush,reboot. =back =head2 allocate Example usage: sbd -d /dev/sda1 allocate node1 Explicitly allocates a slot for the specified node name. This should rarely be necessary, as every node will automatically allocate itself a slot the first time it starts up on watch mode. =head2 message Example usage: sbd -d /dev/sda1 message node1 test Writes the specified message to node's slot. This is rarely done directly, but rather abstracted via the C fencing agent configured as a cluster resource. Supported message types are: =over =item test This only generates a log message on the receiving node and can be used to check if SBD is seeing the device. 
Note that this could overwrite a fencing request sent by the cluster, so should not be used during production. =item reset Reset the target upon receipt of this message. =item off Power-off the target. =item crashdump Cause the target node to crashdump. =item exit This will make the C daemon exit cleanly on the target. You should B<not> send this message manually; this is handled properly during shutdown of the cluster stack. Manually stopping the daemon means the node is unprotected! =item clear This message indicates that no real message has been sent to the node. You should not set this manually; C will clear the message slot automatically during start-up, and setting this manually could overwrite a fencing message by the cluster. =back =head2 query-watchdog Example usage: sbd query-watchdog Check for available watchdog devices and print some info. B: This command will arm the watchdog during query, and if your watchdog refuses disarming (for example, if its kernel module has the 'nowayout' parameter set) this will reset your system. =head2 test-watchdog Example usage: sbd test-watchdog [-w /dev/watchdog3] Test specified watchdog device (/dev/watchdog by default). B: This command will arm the watchdog and have your system reset in case your watchdog is working properly! If issued from an interactive session, it will prompt for confirmation. =head1 Base system configuration =head2 Configure a watchdog It is highly recommended that you configure your Linux system to load a watchdog driver with hardware assistance (as is available on most modern systems), such as I, I, or others. As a fall-back, you can use the I module. No other software must access the watchdog timer; it can only be accessed by one process at any given time. Some hardware vendors ship systems management software that uses the watchdog for system resets (e.g., HP ASR daemon). Such software has to be disabled if the watchdog is to be used by SBD.
=head2 Choosing and initializing the block device(s) First, you have to decide if you want to use one, two, or three devices. If you are using multiple ones, they should reside on independent storage setups. Putting all three of them on the same logical unit, for example, would not provide any additional redundancy. The SBD device can be connected via Fibre Channel, Fibre Channel over Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of network-based quorum server; the advantage is that it does not require a smart host at your third location, just block storage. The SBD partitions themselves B<must not> be mirrored (via MD, DRBD, or the storage layer itself), since this could result in a split-mirror scenario. Nor can they reside on cLVM2 volume groups, since they must be accessed by the cluster stack before it has started the cLVM2 daemons; hence, these should be either raw partitions or logical units on (multipath) storage. The block device(s) must be accessible from all nodes. (While it is not necessary that they share the same path name on all nodes, this is considered a very good idea.) SBD will only use about one megabyte per device, so you can easily create a small partition, or very small logical units. (The size of the SBD device depends on the block size of the underlying device. Thus, 1MB is fine on plain SCSI devices and SAN storage with 512 byte blocks. On the IBM s390x architecture in particular, disks default to 4k blocks, and thus require roughly 4MB.) The number of devices will affect the operation of SBD as follows: =over =item One device In its most simple implementation, you use one device only. This is appropriate for clusters where all your data is on the same shared storage (with internal redundancy) anyway; the SBD device does not introduce an additional single point of failure then. If the SBD device is not accessible, the daemon will fail to start and inhibit startup of cluster services.
=item Two devices This configuration is a trade-off, primarily aimed at environments where host-based mirroring is used, but no third storage device is available. SBD will not commit suicide if it loses access to one mirror leg; this allows the cluster to continue to function even in the face of one outage. However, SBD will not fence the other side while only one mirror leg is available, since it does not have enough knowledge to detect an asymmetric split of the storage. So it will not be able to automatically tolerate a second failure while one of the storage arrays is down. (Though you can use the appropriate crm command to acknowledge the fence manually.) It will not start unless both devices are accessible on boot. =item Three devices In this most reliable and recommended configuration, SBD will only self-fence if more than one device is lost; hence, this configuration is resilient against temporary single device outages (be it due to failures or maintenance). Fencing messages can still be successfully relayed if at least two devices remain accessible. This configuration is appropriate for more complex scenarios where storage is not confined to a single array. For example, host-based mirroring solutions could have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI. It will only start if at least two devices are accessible on boot. =back After you have chosen the devices and created the appropriate partitions and perhaps multipath alias names to ease management, use the C command described above to initialize the SBD metadata on them. =head3 Sharing the block device(s) between multiple clusters It is possible to share the block devices between multiple clusters, provided the total number of nodes accessing them does not exceed I<255> nodes, and they all must share the same SBD timeouts (since these are part of the metadata). If you are using multiple devices this can reduce the setup overhead required. 
However, you should B<not> share devices between clusters in different security domains. =head2 Configure SBD to start on boot On systems using C, the C or C system start-up scripts must handle starting or stopping C as required before starting the rest of the cluster stack. For C, sbd simply has to be enabled using systemctl enable sbd.service The daemon is brought online on each node before corosync and Pacemaker are started, and terminated only after all other cluster components have been shut down - ensuring that cluster resources are never activated without SBD supervision. =head2 Configuration via sysconfig The system instance of C is configured via F. In this file, you must specify the device(s) used, as well as any options to pass to the daemon: SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1" SBD_PACEMAKER="true" C will fail to start if no C is specified. See the installed template or the section on configuration via environment for more options that can be configured here. In general, configuration done via parameters takes precedence over the configuration from the configuration file. =head2 Configuration via environment =over @environment_section@ =back =head2 Testing the sbd installation After a restart of the cluster stack on this node, you can now try sending a test message to it as root, from this or any other node: sbd -d /dev/sda1 message node1 test The node will acknowledge the receipt of the message in the system logs: Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2 This confirms that SBD is indeed up and running on the node, and that it is ready to receive messages. Make B<sure> that F is identical on all cluster nodes, and that all cluster nodes are running the daemon. =head1 Pacemaker CIB integration =head2 Fencing resource Pacemaker can only interact with SBD to issue a node fence if there is a configured fencing resource.
This should be a primitive, not a clone, as follows: primitive fencing-sbd stonith:external/sbd \ params pcmk_delay_max=30 This will automatically use the same devices as configured in F. While you should not configure this as a clone (as Pacemaker will register the fencing device on each node automatically), the I setting enables a random fencing delay, which ensures that, if a split-brain did occur in a two-node cluster, one of the nodes has a better chance to survive, avoiding double fencing. SBD also supports turning the reset request into a crash request, which may be helpful for debugging if you have kernel crashdumping configured; then, every fence request will cause the node to dump core. You can enable this via the C parameter on the fencing resource. This is B<not> recommended for production use, but only for debugging phases. =head2 General cluster properties You must also enable STONITH in general, and set the STONITH timeout to be at least twice the I timeout you have configured, to allow enough time for the fencing message to be delivered. If your I timeout is 60 seconds, this is a possible configuration: property stonith-enabled="true" property stonith-timeout="120s" B: if I is too low for I and the system overhead, sbd will never be able to successfully complete a fence request. This will create a fencing loop. Note that the sbd fencing agent will try to detect this and automatically extend the I setting to a reasonable value, on the assumption that sbd modifying your configuration is preferable to not fencing. =head1 Management tasks =head2 Recovering from temporary SBD device outage If you have multiple devices, failure of a single device is not immediately fatal. C will try to restart the monitor for the device every 5 seconds by default. However, you can tune this via the options to the I command.
In case you wish to immediately force a restart of all currently disabled monitor processes, you can send a I to the SBD I process. =head1 LICENSE Copyright (C) 2008-2013 Lars Marowsky-Bree This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. For details see the GNU General Public License at http://www.gnu.org/licenses/gpl-2.0.html (version 2) and/or http://www.gnu.org/licenses/gpl.html (the newest as per "any later"). diff --git a/src/sbd-common.c b/src/sbd-common.c index 96f4ead..2c9fc24 100644 --- a/src/sbd-common.c +++ b/src/sbd-common.c @@ -1,1220 +1,1220 @@ /* * Copyright (C) 2013 Lars Marowsky-Bree * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, write to the Free Software Foundation, Inc., * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/ #include "sbd.h" #include #include #ifdef __GLIBC__ #include #endif #include #include #include #include #include #include #include #ifdef _POSIX_MEMLOCK # include #endif /* Tunable defaults: */ unsigned long timeout_watchdog = SBD_WATCHDOG_TIMEOUT_DEFAULT; int timeout_msgwait = 2 * SBD_WATCHDOG_TIMEOUT_DEFAULT; -unsigned long timeout_watchdog_warn = 3; +unsigned long timeout_watchdog_warn = calculate_timeout_watchdog_warn(SBD_WATCHDOG_TIMEOUT_DEFAULT); int timeout_allocate = 2; int timeout_loop = 1; int timeout_io = 3; int timeout_startup = 120; int watchdog_use = 1; int watchdog_set_timeout = 1; unsigned long timeout_watchdog_crashdump = 0; int skip_rt = 0; int debug = 0; int debug_mode = 0; char *watchdogdev = NULL; bool watchdogdev_is_default = false; char * local_uname; /* Global, non-tunable variables: */ int sector_size = 0; int watchdogfd = -1; int servant_health = 0; /*const char *devname;*/ const char *cmdname; void usage(void) { fprintf(stderr, "Shared storage fencing tool.\n" "Syntax:\n" " %s \n" "Options:\n" "-d Block device to use (mandatory; can be specified up to 3 times)\n" "-h Display this help.\n" "-n Set local node name; defaults to uname -n (optional)\n" "\n" "-R Do NOT enable realtime priority (debugging only)\n" "-W Use watchdog (recommended) (watch only)\n" "-w Specify watchdog device (optional) (watch only)\n" "-T Do NOT initialize the watchdog timeout (watch only)\n" "-S <0|1> Set start mode if the node was previously fenced (watch only)\n" "-p Write pidfile to the specified path (watch only)\n" "-v|-vv|-vvv Enable verbose|debug|debug-library logging (optional)\n" "\n" "-1 Set watchdog timeout to N seconds (optional, create only)\n" "-2 Set slot allocation timeout to N seconds (optional, create only)\n" "-3 Set daemon loop timeout to N seconds (optional, create only)\n" "-4 Set msgwait timeout to N seconds (optional, create only)\n" "-5 Warn if loop latency exceeds threshold (optional, watch only)\n" " (default is 3, set to 0 to 
disable)\n" "-C Watchdog timeout to set before crashdumping\n" " (def: 0s = disable gracefully, optional)\n" "-I Async IO read timeout (defaults to 3 * loop timeout, optional)\n" "-s Timeout to wait for devices to become available (def: 120s)\n" "-t Dampening delay before faulty servants are restarted (optional)\n" " (default is 5, set to 0 to disable)\n" "-F # of failures before a servant is considered faulty (optional)\n" " (default is 1, set to 0 to disable)\n" "-P Check Pacemaker quorum and node health (optional, watch only)\n" "-Z Enable trace mode. WARNING: UNSAFE FOR PRODUCTION!\n" "-r Set timeout-action to comma-separated combination of\n" " noflush|flush plus reboot|crashdump|off (default is flush,reboot)\n" "Commands:\n" #if SUPPORT_SHARED_DISK "create initialize N slots on - OVERWRITES DEVICE!\n" "list List all allocated slots on device, and messages.\n" "dump Dump meta-data header from device.\n" "allocate \n" " Allocate a slot for node (optional)\n" "message (test|reset|off|crashdump|clear|exit)\n" " Writes the specified message to node's slot.\n" #endif "watch Loop forever, monitoring own slot\n" "query-watchdog Check for available watchdog-devices and print some info\n" "test-watchdog Test the watchdog-device selected.\n" " Attention: This will arm the watchdog and have your system reset\n" " in case your watchdog is working properly!\n" , cmdname); } static int watchdog_init_interval_fd(int wdfd, int timeout) { if (ioctl(wdfd, WDIOC_SETTIMEOUT, &timeout) < 0) { cl_perror( "WDIOC_SETTIMEOUT" ": Failed to set watchdog timer to %u seconds.", timeout); cl_log(LOG_CRIT, "Please validate your watchdog configuration!"); cl_log(LOG_CRIT, "Choose a different watchdog driver or specify -T to skip this if you are completely sure."); return -1; } return 0; } int watchdog_init_interval(void) { if (watchdogfd < 0) { return 0; } if (watchdog_set_timeout == 0) { cl_log(LOG_INFO, "NOT setting watchdog timeout on explicit user request!"); return 0; } if 
(watchdog_init_interval_fd(watchdogfd, timeout_watchdog) < 0) { return -1; } cl_log(LOG_INFO, "Set watchdog timeout to %u seconds.", (int) timeout_watchdog); return 0; } static int watchdog_tickle_fd(int wdfd, char *wddev) { if (write(wdfd, "", 1) != 1) { cl_perror("Watchdog write failure: %s!", wddev); return -1; } return 0; } int watchdog_tickle(void) { if (watchdogfd >= 0) { return watchdog_tickle_fd(watchdogfd, watchdogdev); } return 0; } static int watchdog_init_fd(char *wddev, int timeout) { int wdfd; wdfd = open(wddev, O_WRONLY); if (wdfd >= 0) { if (((timeout >= 0) && (watchdog_init_interval_fd(wdfd, timeout) < 0)) || (watchdog_tickle_fd(wdfd, wddev) < 0)) { close(wdfd); return -1; } } else { cl_perror("Cannot open watchdog device '%s'", wddev); return -1; } return wdfd; } int watchdog_init(void) { if (watchdogfd < 0 && watchdogdev != NULL) { int timeout = timeout_watchdog; if (watchdog_set_timeout == 0) { cl_log(LOG_INFO, "NOT setting watchdog timeout on explicit user request!"); timeout = -1; } watchdogfd = watchdog_init_fd(watchdogdev, timeout); if (watchdogfd >= 0) { cl_log(LOG_NOTICE, "Using watchdog device '%s'", watchdogdev); if (watchdog_set_timeout) { cl_log(LOG_INFO, "Set watchdog timeout to %u seconds.", (int) timeout_watchdog); } } else { return -1; } } return 0; } static void watchdog_close_fd(int wdfd, char *wddev, bool disarm) { if (disarm) { int r; int flags = WDIOS_DISABLECARD;; /* Explicitly disarm it */ r = ioctl(wdfd, WDIOC_SETOPTIONS, &flags); if (r < 0) { cl_perror("Failed to disable hardware watchdog %s", wddev); } /* To be sure, use magic close logic, too */ for (;;) { if (write(wdfd, "V", 1) > 0) { break; } cl_perror("Cannot disable watchdog device %s", wddev); } } if (close(wdfd) < 0) { cl_perror("Watchdog close(%d) failed", wdfd); } } void watchdog_close(bool disarm) { if (watchdogfd < 0) { return; } watchdog_close_fd(watchdogfd, watchdogdev, disarm); watchdogfd = -1; } #define MAX_WATCHDOGS 64 #define SYS_CLASS_WATCHDOG 
"/sys/class/watchdog" #define SYS_CHAR_DEV_DIR "/sys/dev/char" #define WATCHDOG_NODEDIR "/dev/" #define WATCHDOG_NODEDIR_LEN 5 struct watchdog_list_item { dev_t dev; char *dev_node; char *dev_ident; char *dev_driver; struct watchdog_list_item *next; }; struct link_list_item { char *dev_node; char *link_name; struct link_list_item *next; }; static struct watchdog_list_item *watchdog_list = NULL; static int watchdog_list_items = 0; static void watchdog_populate_list(void) { dev_t watchdogs[MAX_WATCHDOGS + 1] = {makedev(10,130), 0}; int num_watchdogs = 1; struct dirent *entry; char entry_name[280]; DIR *dp; char buf[280] = ""; struct link_list_item *link_list = NULL; if (watchdog_list != NULL) { return; } /* get additional devices from /sys/class/watchdog */ dp = opendir(SYS_CLASS_WATCHDOG); if (dp) { while ((entry = readdir(dp))) { if (entry->d_type == DT_LNK) { FILE *file; snprintf(entry_name, sizeof(entry_name), SYS_CLASS_WATCHDOG "/%s/dev", entry->d_name); file = fopen(entry_name, "r"); if (file) { int major, minor; if (fscanf(file, "%d:%d", &major, &minor) == 2) { watchdogs[num_watchdogs++] = makedev(major, minor); } fclose(file); if (num_watchdogs == MAX_WATCHDOGS) { break; } } } } closedir(dp); } /* search for watchdog nodes in /dev */ dp = opendir(WATCHDOG_NODEDIR); if (dp) { /* first go for links and memorize them */ while ((entry = readdir(dp))) { if (entry->d_type == DT_LNK) { int len; snprintf(entry_name, sizeof(entry_name), WATCHDOG_NODEDIR "%s", entry->d_name); /* !realpath(entry_name, buf) unfortunately does a stat on * target so we can't really use it to check if links stay * within /dev without triggering e.g. AVC-logs (with * SELinux policy that just allows stat within /dev). * Without canonicalization that doesn't actually touch the * filesystem easily available introduce some limitations * for simplicity: * - just simple path without '..' * - just one level of symlinks (avoid e.g. 
loop-checking) */ len = readlink(entry_name, buf, sizeof(buf) - 1); if ((len < 1) || (len > sizeof(buf) - WATCHDOG_NODEDIR_LEN - 1)) { continue; } buf[len] = '\0'; if (buf[0] != '/') { memmove(&buf[WATCHDOG_NODEDIR_LEN], buf, len+1); memcpy(buf, WATCHDOG_NODEDIR, WATCHDOG_NODEDIR_LEN); len += WATCHDOG_NODEDIR_LEN; } if (strstr(buf, "/../") || strncmp(WATCHDOG_NODEDIR, buf, WATCHDOG_NODEDIR_LEN)) { continue; } else { /* just memorize to avoid statting the target - SELinux */ struct link_list_item *lli = calloc(1, sizeof(struct link_list_item)); lli->dev_node = strdup(buf); lli->link_name = strdup(entry_name); lli->next = link_list; link_list = lli; } } } rewinddir(dp); while ((entry = readdir(dp))) { if (entry->d_type == DT_CHR) { struct stat statbuf; snprintf(entry_name, sizeof(entry_name), WATCHDOG_NODEDIR "%s", entry->d_name); if(!stat(entry_name, &statbuf) && S_ISCHR(statbuf.st_mode)) { int i; for (i=0; idev = watchdogs[i]; wdg->dev_node = strdup(entry_name); wdg->next = watchdog_list; watchdog_list = wdg; watchdog_list_items++; if (wdfd >= 0) { struct watchdog_info ident; ident.identity[0] = '\0'; ioctl(wdfd, WDIOC_GETSUPPORT, &ident); watchdog_close_fd(wdfd, entry_name, true); if (ident.identity[0]) { wdg->dev_ident = strdup((char *) ident.identity); } } snprintf(entry_name, sizeof(entry_name), SYS_CHAR_DEV_DIR "/%d:%d/device/driver", major(watchdogs[i]), minor(watchdogs[i])); len = readlink(entry_name, buf, sizeof(buf) - 1); if (len > 0) { buf[len] = '\0'; wdg->dev_driver = strdup(basename(buf)); } else if ((wdg->dev_ident) && (strcmp(wdg->dev_ident, "Software Watchdog") == 0)) { wdg->dev_driver = strdup("softdog"); } /* create dupes if we have memorized links * to this node */ for (tmp_list = link_list; tmp_list; tmp_list = tmp_list->next) { if (!strcmp(tmp_list->dev_node, wdg->dev_node)) { struct watchdog_list_item *dupe_wdg = calloc(1, sizeof(struct watchdog_list_item)); /* as long as we never purge watchdog_list * there is no need to dupe strings */ 
*dupe_wdg = *wdg; dupe_wdg->dev_node = strdup(tmp_list->link_name); dupe_wdg->next = watchdog_list; watchdog_list = dupe_wdg; watchdog_list_items++; } /* for performance reasons we could remove * the link_list entry */ } break; } } } } } closedir(dp); } /* cleanup link list */ while (link_list) { struct link_list_item *tmp_list = link_list; link_list = link_list->next; free(tmp_list->dev_node); free(tmp_list->link_name); free(tmp_list); } } int watchdog_info(void) { struct watchdog_list_item *wdg; int wdg_cnt = 0; watchdog_populate_list(); printf("\nDiscovered %d watchdog devices:\n", watchdog_list_items); for (wdg = watchdog_list; wdg != NULL; wdg = wdg->next) { wdg_cnt++; printf("\n[%d] %s\nIdentity: %s\nDriver: %s\n", wdg_cnt, wdg->dev_node, wdg->dev_ident?wdg->dev_ident:"Error: Check if hogged by e.g. sbd-daemon!", wdg->dev_driver?wdg->dev_driver:""); if ((wdg->dev_driver) && (strcmp(wdg->dev_driver, "softdog") == 0)) { printf("CAUTION: Not recommended for use with sbd.\n"); } } return 0; } int watchdog_test(void) { int i; if ((watchdog_set_timeout == 0) || !watchdog_use) { printf("\nWatchdog is disabled - aborting test!!!\n"); return 0; } if (watchdogdev_is_default) { watchdog_populate_list(); if (watchdog_list_items > 1) { printf("\nError: Multiple watchdog devices discovered.\n" " Use -w or SBD_WATCHDOG_DEV to specify\n" " which device to reset the system with\n"); watchdog_info(); return -1; } } if ((isatty(fileno(stdin)))) { char buffer[16]; printf("\nWARNING: This operation is expected to force-reboot this system\n" " without following any shutdown procedures.\n\n" "Proceed? 
[NO/Proceed] "); if ((fgets(buffer, 16, stdin) == NULL) || strcmp(buffer, "Proceed\n")) { printf("\nAborting watchdog test!!!\n"); return 0; } printf("\n"); } printf("Initializing %s with a reset countdown of %d seconds ...\n", watchdogdev, (int) timeout_watchdog); if ((watchdog_init() < 0) || (watchdog_init_interval() < 0)) { printf("Failed to initialize watchdog!!!\n"); return -1; } printf("\n"); printf("NOTICE: The watchdog device is expected to reset the system\n" " in %d seconds. If system remains active beyond that time,\n" " watchdog may not be functional.\n\n", (int) timeout_watchdog); for (i=timeout_watchdog; i>1; i--) { printf("Reset countdown ... %d seconds\n", i); sleep(1); } for (i=2; i>0; i--) { printf("System expected to reset any moment ...\n"); sleep(1); } for (i=5; i>0; i--) { printf("System should have reset ...\n"); sleep(1); } printf("Error: The watchdog device has failed to reboot the system,\n" " and it may not be suitable for usage with sbd.\n"); /* test should trigger a reboot thus returning is actually bad */ return -1; } /* This duplicates some code from linux/ioprio.h since these are not included * even in linux-kernel-headers. Sucks. 
See also * /usr/src/linux/Documentation/block/ioprio.txt and ioprio_set(2) */ extern int sys_ioprio_set(int, int, int); int ioprio_set(int which, int who, int ioprio); inline int ioprio_set(int which, int who, int ioprio) { return syscall(__NR_ioprio_set, which, who, ioprio); } enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE, }; enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER, }; #define IOPRIO_BITS (16) #define IOPRIO_CLASS_SHIFT (13) #define IOPRIO_PRIO_MASK ((1UL << IOPRIO_CLASS_SHIFT) - 1) #define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT) #define IOPRIO_PRIO_DATA(mask) ((mask) & IOPRIO_PRIO_MASK) #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data) static void sbd_stack_hogger(unsigned char * inbuf, int kbytes) { unsigned char buf[1024]; if(kbytes <= 0) { return; } if (inbuf == NULL) { memset(buf, HOG_CHAR, sizeof(buf)); } else { memcpy(buf, inbuf, sizeof(buf)); } if (kbytes > 0) { sbd_stack_hogger(buf, kbytes-1); } return; } static void sbd_malloc_hogger(int kbytes) { int j; void**chunks; int chunksize = 1024; if(kbytes <= 0) { return; } /* * We could call mallopt(M_MMAP_MAX, 0) to disable it completely, * but we've already called mlockall() * * We could also call mallopt(M_TRIM_THRESHOLD, -1) to prevent malloc * from giving memory back to the system, but we've already called * mlockall(MCL_FUTURE), so there's no need. */ chunks = malloc(kbytes * sizeof(void *)); if (chunks == NULL) { cl_log(LOG_WARNING, "Could not preallocate chunk array"); return; } for (j=0; j < kbytes; ++j) { chunks[j] = malloc(chunksize); if (chunks[j] == NULL) { cl_log(LOG_WARNING, "Could not preallocate block %d", j); } else { memset(chunks[j], 0, chunksize); } } for (j=0; j < kbytes; ++j) { free(chunks[j]); } free(chunks); } static void sbd_memlock(int stackgrowK, int heapgrowK) { #ifdef _POSIX_MEMLOCK /* * We could call setrlimit(RLIMIT_MEMLOCK,...) 
with a large * number, but the mcp runs as root and mlock(2) says: * * Since Linux 2.6.9, no limits are placed on the amount of memory * that a privileged process may lock, and this limit instead * governs the amount of memory that an unprivileged process may * lock. */ if (mlockall(MCL_CURRENT|MCL_FUTURE) >= 0) { cl_log(LOG_INFO, "Locked ourselves in memory"); /* Now allocate some extra pages (MCL_FUTURE will ensure they stay around) */ sbd_malloc_hogger(heapgrowK); sbd_stack_hogger(NULL, stackgrowK); } else { cl_perror("Unable to lock ourselves into memory"); } #else cl_log(LOG_ERR, "Unable to lock ourselves into memory"); #endif } static int get_realtime_budget(void) { FILE *f; char fname[PATH_MAX]; int res = -1, lnum = 0, num; char *cgroup = NULL, *namespecs = NULL; snprintf(fname, PATH_MAX, "/proc/%jd/cgroup", (intmax_t)getpid()); f = fopen(fname, "rt"); if (f == NULL) { cl_log(LOG_WARNING, "Can't open cgroup file for pid=%jd", (intmax_t)getpid()); goto exit_res; } while( (num = fscanf(f, "%d:%m[^:]:%m[^\n]\n", &lnum, &namespecs, &cgroup)) !=EOF ) { if (namespecs && strstr(namespecs, "cpuacct")) { free(namespecs); break; } if (cgroup) { free(cgroup); cgroup = NULL; } if (namespecs) { free(namespecs); namespecs = NULL; } /* not to get stuck if format changes */ if ((num < 3) && ((fscanf(f, "%*[^\n]") == EOF) || (fscanf(f, "\n") == EOF))) { break; } } fclose(f); if (cgroup == NULL) { cl_log(LOG_WARNING, "Failed getting cgroup for pid=%jd", (intmax_t)getpid()); goto exit_res; } snprintf(fname, PATH_MAX, "/sys/fs/cgroup/cpu%s/cpu.rt_runtime_us", cgroup); f = fopen(fname, "rt"); if (f == NULL) { cl_log(LOG_WARNING, "cpu.rt_runtime_us existed for root-slice but " "doesn't for '%s'", cgroup); goto exit_res; } if (fscanf(f, "%d", &res) != 1) { cl_log(LOG_WARNING, "failed reading rt-budget from %s", fname); } else { cl_log(LOG_INFO, "slice='%s' has rt-budget=%d", cgroup, res); } fclose(f); exit_res: if (cgroup) { free(cgroup); } return res; } /* stolen from corosync */ 
static int sbd_move_to_root_cgroup(bool enforce_root_cgroup) { FILE *f; int res = -1; /* * /sys/fs/cgroup is hardcoded, because most of Linux distributions are now * using systemd and systemd uses hardcoded path of cgroup mount point. * * This feature is expected to be removed as soon as systemd gets support * for managing RT configuration. */ f = fopen("/sys/fs/cgroup/cpu/cpu.rt_runtime_us", "rt"); if (f == NULL) { cl_log(LOG_DEBUG, "cpu.rt_runtime_us doesn't exist -> " "system without cgroup or with disabled CONFIG_RT_GROUP_SCHED"); res = 0; goto exit_res; } fclose(f); if ((!enforce_root_cgroup) && (get_realtime_budget() > 0)) { cl_log(LOG_DEBUG, "looks as if we have rt-budget in the slice we are " "-> skip moving to root-slice"); res = 0; goto exit_res; } f = fopen("/sys/fs/cgroup/cpu/tasks", "w"); if (f == NULL) { cl_log(LOG_WARNING, "Can't open cgroups tasks file for writing"); goto exit_res; } if (fprintf(f, "%jd\n", (intmax_t)getpid()) <= 0) { cl_log(LOG_WARNING, "Can't write sbd pid into cgroups tasks file"); goto close_and_exit_res; } close_and_exit_res: if (fclose(f) != 0) { cl_log(LOG_WARNING, "Can't close cgroups tasks file"); goto exit_res; } exit_res: return (res); } void sbd_make_realtime(int priority, int stackgrowK, int heapgrowK) { if(priority < 0) { return; } do { #ifdef SCHED_RR if (move_to_root_cgroup) { sbd_move_to_root_cgroup(enforce_moving_to_root_cgroup); } { int pmin = sched_get_priority_min(SCHED_RR); int pmax = sched_get_priority_max(SCHED_RR); struct sched_param sp; int pcurrent; if (priority == 0) { priority = pmax; } else if (priority < pmin) { priority = pmin; } else if (priority > pmax) { priority = pmax; } if (sched_getparam(0, &sp) < 0) { cl_perror("Unable to get scheduler priority"); } else if ((pcurrent = sched_getscheduler(0)) < 0) { cl_perror("Unable to get scheduler policy"); } else if ((pcurrent == SCHED_RR) && (sp.sched_priority >= priority)) { cl_log(LOG_INFO, "Stay with priority (%d) for policy SCHED_RR", 
sp.sched_priority); break; } else { memset(&sp, 0, sizeof(sp)); sp.sched_priority = priority; if (sched_setscheduler(0, SCHED_RR, &sp) < 0) { cl_perror( "Unable to set scheduler policy to SCHED_RR priority %d", priority); } else { cl_log(LOG_INFO, "Scheduler policy is now SCHED_RR priority %d", priority); break; } } } #else cl_log(LOG_ERR, "System does not support updating the scheduler policy"); #endif #ifdef PRIO_PGRP if (setpriority(PRIO_PGRP, 0, INT_MIN) < 0) { cl_perror("Unable to raise the scheduler priority"); } else { cl_log(LOG_INFO, "Scheduler priority raised to the maximum"); } #else cl_log(LOG_ERR, "System does not support setting the scheduler priority"); #endif } while (0); sbd_memlock(stackgrowK, heapgrowK); } void maximize_priority(void) { if (skip_rt) { cl_log(LOG_INFO, "Not elevating to realtime (-R specified)."); return; } sbd_make_realtime(0, 256, 256); if (ioprio_set(IOPRIO_WHO_PROCESS, getpid(), IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 1)) != 0) { cl_perror("ioprio_set() call failed."); } } void sysrq_init(void) { FILE* procf; int c; procf = fopen("/proc/sys/kernel/sysrq", "r"); if (!procf) { cl_perror("cannot open /proc/sys/kernel/sysrq for read."); return; } if (fscanf(procf, "%d", &c) != 1) { cl_perror("Parsing sysrq failed"); c = 0; } fclose(procf); if (c == 1) return; /* 8 for debugging dumps of processes, 128 for reboot/poweroff */ c |= 136; procf = fopen("/proc/sys/kernel/sysrq", "w"); if (!procf) { cl_perror("cannot open /proc/sys/kernel/sysrq for writing"); return; } fprintf(procf, "%d", c); fclose(procf); return; } void sysrq_trigger(char t) { FILE *procf; procf = fopen("/proc/sysrq-trigger", "a"); if (!procf) { cl_perror("Opening sysrq-trigger failed."); return; } cl_log(LOG_INFO, "sysrq-trigger: %c\n", t); fprintf(procf, "%c\n", t); fclose(procf); return; } static void do_exit(char kind, bool do_flush) { /* TODO: Turn debug_mode into a bit field?
Delay + kdump for example */ const char *reason = NULL; if (kind == 'c') { cl_log(LOG_NOTICE, "Initiating kdump"); } else if (debug_mode == 1) { cl_log(LOG_WARNING, "Initiating kdump instead of panicking the node (debug mode)"); kind = 'c'; } if (debug_mode == 2) { cl_log(LOG_WARNING, "Shutting down SBD instead of panicking the node (debug mode)"); watchdog_close(true); exit(0); } if (debug_mode == 3) { /* Give the system some time to flush logs to disk before rebooting. */ cl_log(LOG_WARNING, "Delaying node panic by 10s (debug mode)"); watchdog_close(true); sync(); sleep(10); } switch(kind) { case 'b': reason = "reboot"; break; case 'c': reason = "crashdump"; break; case 'o': reason = "off"; break; default: reason = "unknown"; break; } cl_log(LOG_EMERG, "Rebooting system: %s", reason); if (do_flush) { sync(); } if (kind == 'c') { if (timeout_watchdog_crashdump) { if (timeout_watchdog != timeout_watchdog_crashdump) { timeout_watchdog = timeout_watchdog_crashdump; watchdog_init_interval(); } watchdog_close(false); } else { watchdog_close(true); } sysrq_trigger(kind); } else { watchdog_close(false); sysrq_trigger(kind); if (reboot((kind == 'o')?RB_POWER_OFF:RB_AUTOBOOT) < 0) { cl_perror("%s failed", (kind == 'o')?"Poweroff":"Reboot"); } } exit(1); } void do_crashdump(void) { do_exit('c', true); } void do_reset(void) { do_exit('b', true); } void do_off(void) { do_exit('o', true); } void do_timeout_action(void) { do_exit(timeout_sysrq_char, do_flush); } /* * Change directory to the directory our core file needs to go in * Call after you establish the userid you're running under. 
*/ int sbd_cdtocoredir(void) { int rc; static const char *dir = NULL; if (dir == NULL) { dir = CRM_CORE_DIR; } if ((rc=chdir(dir)) < 0) { int errsave = errno; cl_perror("Cannot chdir to [%s]", dir); errno = errsave; } return rc; } pid_t make_daemon(void) { pid_t pid; const char * devnull = "/dev/null"; pid = fork(); if (pid < 0) { cl_log(LOG_ERR, "%s: could not start daemon\n", cmdname); cl_perror("fork"); exit(1); }else if (pid > 0) { return pid; } qb_log_ctl(QB_LOG_STDERR, QB_LOG_CONF_ENABLED, QB_FALSE); /* This is the child; ensure privileges have not been lost. */ maximize_priority(); sysrq_init(); umask(022); close(0); (void)open(devnull, O_RDONLY); close(1); (void)open(devnull, O_WRONLY); close(2); (void)open(devnull, O_WRONLY); sbd_cdtocoredir(); return 0; } void sbd_get_uname(void) { struct utsname uname_buf; int i; if (uname(&uname_buf) < 0) { cl_perror("uname() failed?"); exit(1); } local_uname = strdup(uname_buf.nodename); for (i = 0; i < strlen(local_uname); i++) local_uname[i] = tolower(local_uname[i]); } #define FMT_MAX 256 void sbd_set_format_string(int method, const char *daemon) { int offset = 0; char fmt[FMT_MAX]; struct utsname res; switch(method) { case QB_LOG_STDERR: break; case QB_LOG_SYSLOG: if(daemon && strcmp(daemon, "sbd") != 0) { offset += snprintf(fmt + offset, FMT_MAX - offset, "%10s: ", daemon); } break; default: /* When logging to a file */ if (uname(&res) == 0) { offset += snprintf(fmt + offset, FMT_MAX - offset, "%%t [%d] %s %10s: ", getpid(), res.nodename, daemon); } else { offset += snprintf(fmt + offset, FMT_MAX - offset, "%%t [%d] %10s: ", getpid(), daemon); } } if (debug && method >= QB_LOG_STDERR) { offset += snprintf(fmt + offset, FMT_MAX - offset, "(%%-12f:%%5l %%g) %%-7p: %%n: "); } else { offset += snprintf(fmt + offset, FMT_MAX - offset, "%%g %%-7p: %%n: "); } if (method == QB_LOG_SYSLOG) { offset += snprintf(fmt + offset, FMT_MAX - offset, "%%b"); } else { offset += snprintf(fmt + offset, FMT_MAX - offset, "\t%%b"); } 
if(offset > 0) { qb_log_format_set(method, fmt); } } void notify_parent(void) { pid_t ppid; union sigval signal_value; memset(&signal_value, 0, sizeof(signal_value)); ppid = getppid(); if (ppid == 1) { /* Our parent died unexpectedly. Triggering * self-fence. */ cl_log(LOG_WARNING, "Our parent is dead."); do_timeout_action(); } switch (servant_health) { case pcmk_health_pending: case pcmk_health_shutdown: case pcmk_health_transient: DBGLOG(LOG_DEBUG, "Not notifying parent: state transient (%d)", servant_health); break; case pcmk_health_unknown: case pcmk_health_unclean: case pcmk_health_noquorum: DBGLOG(LOG_WARNING, "Notifying parent: UNHEALTHY (%d)", servant_health); sigqueue(ppid, SIG_PCMK_UNHEALTHY, signal_value); break; case pcmk_health_online: DBGLOG(LOG_DEBUG, "Notifying parent: healthy"); sigqueue(ppid, SIG_LIVENESS, signal_value); break; default: DBGLOG(LOG_WARNING, "Notifying parent: UNHEALTHY %d", servant_health); sigqueue(ppid, SIG_PCMK_UNHEALTHY, signal_value); break; } } void set_servant_health(enum pcmk_health state, int level, char const *format, ...) 
{ if (servant_health != state) { va_list ap; int len = 0; char *string = NULL; servant_health = state; va_start(ap, format); len = vasprintf (&string, format, ap); if(len > 0) { cl_log(level, "%s", string); } va_end(ap); free(string); } } bool sbd_is_disk(struct servants_list_item *servant) { if ((servant != NULL) && (servant->devname != NULL) && (servant->devname[0] == '/')) { return true; } return false; } bool sbd_is_cluster(struct servants_list_item *servant) { if ((servant != NULL) && (servant->devname != NULL) && (strcmp("cluster", servant->devname) == 0)) { return true; } return false; } bool sbd_is_pcmk(struct servants_list_item *servant) { if ((servant != NULL) && (servant->devname != NULL) && (strcmp("pcmk", servant->devname) == 0)) { return true; } return false; } diff --git a/src/sbd.h b/src/sbd.h index 3b6647c..ea37b4d 100644 --- a/src/sbd.h +++ b/src/sbd.h @@ -1,212 +1,217 @@ /* * Copyright (C) 2013 Lars Marowsky-Bree * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This software is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, write to the Free Software Foundation, Inc., * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
*/ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* signals reserved for multi-disk sbd */ #define SIG_LIVENESS (SIGRTMIN + 1) /* report liveness of the disk */ #define SIG_EXITREQ (SIGRTMIN + 2) /* exit request to inquisitor */ #define SIG_TEST (SIGRTMIN + 3) /* trigger self test */ #define SIG_RESTART (SIGRTMIN + 4) /* trigger restart of all failed disk */ #define SIG_PCMK_UNHEALTHY (SIGRTMIN + 5) /* FIXME: should add dynamic check of SIG_XX >= SIGRTMAX */ /* exit status for disk-servant */ #define EXIT_MD_SERVANT_IO_FAIL 20 #define EXIT_MD_SERVANT_REQUEST_RESET 21 #define EXIT_MD_SERVANT_REQUEST_SHUTOFF 22 #define EXIT_MD_SERVANT_REQUEST_CRASHDUMP 23 /* exit status for pcmk-servant */ #define EXIT_PCMK_SERVANT_GRACEFUL_SHUTDOWN 30 #define HOG_CHAR 0xff #define SECTOR_NAME_MAX 63 /* Sector data types */ struct sector_header_s { char magic[8]; unsigned char version; unsigned char slots; /* Caveat: stored in network byte-order */ uint32_t sector_size; uint32_t timeout_watchdog; uint32_t timeout_allocate; uint32_t timeout_loop; uint32_t timeout_msgwait; /* Minor version for extensions to the core data set: * compatible and optional values. 
*/ unsigned char minor_version; uuid_t uuid; /* 16 bytes */ }; struct sector_mbox_s { signed char cmd; char from[SECTOR_NAME_MAX+1]; }; struct sector_node_s { /* slots will be created with in_use == 0 */ char in_use; char name[SECTOR_NAME_MAX+1]; }; struct servants_list_item { const char* devname; pid_t pid; int restarts; int restart_blocked; int outdated; int first_start; struct timespec t_last, t_started; struct servants_list_item *next; }; struct sbd_context { int devfd; io_context_t ioctx; struct iocb io; }; enum pcmk_health { pcmk_health_unknown, pcmk_health_pending, pcmk_health_transient, pcmk_health_unclean, pcmk_health_shutdown, pcmk_health_online, pcmk_health_noquorum, }; void usage(void); int watchdog_init_interval(void); int watchdog_tickle(void); int watchdog_init(void); void sysrq_init(void); void watchdog_close(bool disarm); int watchdog_info(void); int watchdog_test(void); void sysrq_trigger(char t); void do_crashdump(void); void do_reset(void); void do_off(void); void do_timeout_action(void); pid_t make_daemon(void); void maximize_priority(void); void sbd_get_uname(void); void sbd_set_format_string(int method, const char *daemon); void notify_parent(void); /* Tunable defaults: */ extern unsigned long timeout_watchdog; extern unsigned long timeout_watchdog_warn; extern unsigned long timeout_watchdog_crashdump; extern int timeout_allocate; extern int timeout_loop; extern int timeout_msgwait; extern int timeout_io; extern int timeout_startup; extern int watchdog_use; extern int watchdog_set_timeout; extern int skip_rt; extern int debug; extern int debug_mode; extern char *watchdogdev; extern bool watchdogdev_is_default; extern char* local_uname; extern bool do_flush; extern char timeout_sysrq_char; extern bool move_to_root_cgroup; extern bool enforce_moving_to_root_cgroup; extern bool sync_resource_startup; /* Global, non-tunable variables: */ extern int sector_size; extern int watchdogfd; extern const char* cmdname; typedef int (*functionp_t)(const 
char* devname, int mode, const void* argp); int assign_servant(const char* devname, functionp_t functionp, int mode, const void* argp); #if SUPPORT_SHARED_DISK void open_any_device(struct servants_list_item *servants); int init_devices(struct servants_list_item *servants); int allocate_slots(const char *name, struct servants_list_item *servants); int list_slots(struct servants_list_item *servants); int ping_via_slots(const char *name, struct servants_list_item *servants); int dump_headers(struct servants_list_item *servants); unsigned long get_first_msgwait(struct servants_list_item *servants); int messenger(const char *name, const char *msg, struct servants_list_item *servants); int servant_md(const char *diskname, int mode, const void* argp); #endif int servant_pcmk(const char *diskname, int mode, const void* argp); int servant_cluster(const char *diskname, int mode, const void* argp); struct servants_list_item *lookup_servant_by_dev(const char *devname); struct servants_list_item *lookup_servant_by_pid(pid_t pid); int init_set_proc_title(int argc, char *argv[], char *envp[]); void set_proc_title(const char *fmt,...); #define cl_log(level, fmt, args...) qb_log_from_external_source( __func__, __FILE__, fmt, level, __LINE__, 0, ##args) # define cl_perror(fmt, args...) do { \ const char *err = strerror(errno); \ cl_log(LOG_ERR, fmt ": %s (%d)", ##args, err, errno); \ } while(0) #define DBGLOG(lvl, fmt, args...) do { \ if (debug > 0) cl_log(lvl, fmt, ##args); \ } while(0) extern int servant_health; void set_servant_health(enum pcmk_health state, int level, char const *format, ...) __attribute__ ((__format__ (__printf__, 3, 4))); bool sbd_is_disk(struct servants_list_item *servant); bool sbd_is_pcmk(struct servants_list_item *servant); bool sbd_is_cluster(struct servants_list_item *servant); + +#define calculate_timeout_watchdog_warn(timeout) \ + (timeout < 5 ? 2 : \ + (timeout < (ULONG_MAX / 3) ? 
\ + (((unsigned long) timeout) * 3 / 5) : (((unsigned long) timeout) / 5 * 3)))