
	diff --git a/man/sbd.8.pod b/man/sbd.8.pod
	index 3254a07..6968ad6 100644
	--- a/man/sbd.8.pod
	+++ b/man/sbd.8.pod
	@@ -1,607 +1,611 @@
	=head1 NAME

	sbd - STONITH Block Device daemon

	=head1 SYNOPSIS

	sbd <-d F</dev/...>> [options] C<command>

	=head1 SUMMARY

	SBD provides a node fencing mechanism (Shoot the other node in the head,
	STONITH) for Pacemaker-based clusters through the exchange of messages
	via shared block storage such as a SAN, iSCSI, or FCoE. This
	isolates the fencing mechanism from changes in firmware version or
	dependencies on specific firmware controllers, and it can be used as a
	STONITH mechanism in all configurations that have reliable shared
	storage.

	The F<sbd> binary implements both the daemon that watches the message
	slots as well as the management tool for interacting with the block
	storage device(s). This mode of operation is specified via the
	C<command> parameter; some of these modes take additional parameters.

	To use, you must first C<create> the messaging layout on one to three
	block devices. Second, configure F</etc/sysconfig/sbd> to list those
	devices (and possibly adjust other options), and restart the cluster
	stack on each node to ensure that C<sbd> is started. Third, configure
	the C<external/sbd> fencing resource in the Pacemaker CIB.

	Each of these steps is documented in more detail below the description
	of the command options.

	C<sbd> can only be used as root.

	=head2 GENERAL OPTIONS

	=over

	=item B<-d> F</dev/...>

	Specify the block device(s) to be used. If you have more than one,
	specify this option up to three times. This parameter is mandatory for
	all modes, since SBD always needs a block device to interact with.

	This man page uses F</dev/sda1>, F</dev/sdb1>, and F</dev/sdc1> as
	example device names for brevity. However, in your production
	environment, you should instead always refer to them by using the long,
	stable device name (e.g.,
	F</dev/disk/by-id/dm-uuid-part1-mpath-3600508b400105b5a0001500000250000>).

	=item B<-v>

	Enable some verbose debug logging.

	=item B<-h>

	Display a concise summary of C<sbd> options.

	=item B<-n> I<node>

	Set local node name; defaults to C<uname -n>. This should not need to be
	set.

	=item B<-R>

	Do B<not> enable realtime priority. By default, C<sbd> runs at realtime
	priority, locks itself into memory, and also acquires highest IO
	priority to protect itself against interference from other processes on
	the system. This is a debugging-only option.

	=item B<-I> I<N>

	Async IO timeout (defaults to 3 seconds, optional). You should not need
	to adjust this unless your IO setup is really very slow.

	(In daemon mode, the watchdog is refreshed when the majority of devices
	could be read within this time.)

	=back

	=head2 create

	Example usage:

	sbd -d /dev/sdc2 -d /dev/sdd3 create

	If you specify the I<create> command, sbd will write a metadata header
	to the device(s) specified and also initialize the messaging slots for
	up to 255 nodes.

	B<Warning>: This command will not prompt for confirmation. Roughly the
	first megabyte of the specified block device(s) will be overwritten
	immediately and without backup.

	This command accepts a few options to adjust the default timings that
	are written to the metadata (to ensure they are identical across all
	nodes accessing the device).

	=over

	=item B<-1> I<N>

	Set watchdog timeout to N seconds. This depends mostly on your storage
	latency; the majority of devices must be successfully read within this
	time, or else the node will self-fence.

	If your sbd device(s) reside on a multipath setup or iSCSI, this should
	be the time required to detect a path failure. You may be able to reduce
	this if your device outages are independent, or if you are using the
	Pacemaker integration.

	=item B<-2> I<N>

	Set slot allocation timeout to N seconds. You should not need to tune
	this.

	=item B<-3> I<N>

	Set daemon loop timeout to N seconds. You should not need to tune this.

	=item B<-4> I<N>

	Set I<msgwait> timeout to N seconds. This should be twice the I<watchdog>
	timeout. This is the time after which a message written to a node's slot
	will be considered delivered. (Or long enough for the node to detect
	that it needed to self-fence.)

	This also affects the I<stonith-timeout> in Pacemaker's CIB; see below.

	=back

	=head2 list

	Example usage:

	# sbd -d /dev/sda1 list
	0 hex-0 clear
	1 hex-7 clear
	2 hex-9 clear

	List all allocated slots on device, and messages. You should see all
	cluster nodes that have ever been started against this device. Nodes
	that are currently running should have a I<clear> state; nodes that have
	been fenced, but not yet restarted, will show the appropriate fencing
	message.

	=head2 dump

	Example usage:

	# sbd -d /dev/sda1 dump
	==Dumping header on disk /dev/sda1
	Header version : 2
	Number of slots : 255
	Sector size : 512
	Timeout (watchdog) : 15
	Timeout (allocate) : 2
	Timeout (loop) : 1
	Timeout (msgwait) : 30
	==Header on disk /dev/sda1 is dumped

	Dump meta-data header from device.

	=head2 watch

	Example usage:

	sbd -d /dev/sdc2 -d /dev/sdd3 -W -P watch

	This command will make C<sbd> start in daemon mode. It will constantly monitor
	the message slot of the local node for incoming messages and for reachability,
	and optionally take Pacemaker's state into account.

	C<sbd> B<must> be started on boot before the cluster stack! See below
	for enabling this according to your boot environment.

	The options for this mode are rarely specified directly on the
	command line, but are most frequently set via F</etc/sysconfig/sbd>.

	It also constantly monitors connectivity to the storage device, and
	self-fences in case the partition becomes unreachable, guaranteeing that it
	does not disconnect from fencing messages.

	A node slot is automatically allocated on the device(s) the first time
	the daemon starts watching the device; hence, manual allocation is not
	usually required.

	If a watchdog is used together with C<sbd>, as is strongly
	recommended, the watchdog is activated at initial start of the sbd
	daemon. The watchdog is refreshed every time the majority of SBD devices
	has been successfully read. Using a watchdog provides additional
	protection against C<sbd> crashing.

	If the Pacemaker integration is activated, C<sbd> will B<not> self-fence
	if device majority is lost, if:

	=over

	=item 1.

	The partition the node is in is still quorate according to the CIB;

	=item 2.

	it is still quorate according to Corosync's node count;

	=item 3.

	the node itself is considered online and healthy by Pacemaker.

	=back

	This allows C<sbd> to survive temporary outages of the majority of
	devices. However, while the cluster is in such a degraded state, it can
	neither successfully fence nor be shut down cleanly (as taking the
	cluster below the quorum threshold will immediately cause all remaining
	nodes to self-fence). In short, it will not tolerate any further faults.
	Please repair the system before continuing.
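
The conditions above can be read as a single decision. The following C sketch is illustrative only: `struct sbd_health` and `must_self_fence` are hypothetical names that do not appear in the sbd sources, and the flags are assumed to have been collected elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the inputs described above; not sbd's real types. */
struct sbd_health {
    bool device_majority;  /* majority of SBD devices was read in time */
    bool pcmk_enabled;     /* Pacemaker integration is active (-P) */
    bool cib_quorate;      /* 1. partition is quorate according to the CIB */
    bool corosync_quorate; /* 2. quorate according to Corosync's node count */
    bool node_healthy;     /* 3. Pacemaker considers the node online and healthy */
};

/* Returns true when the node would have to self-fence. */
bool must_self_fence(const struct sbd_health *h)
{
    assert(h != NULL);

    /* Device majority is intact: nothing to do. */
    if (h->device_majority)
        return false;

    /* Device majority is lost, but all three Pacemaker conditions hold. */
    if (h->pcmk_enabled && h->cib_quorate &&
        h->corosync_quorate && h->node_healthy)
        return false;

    return true;
}
```

Without the Pacemaker integration, losing device majority alone is enough to make this function return true.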

	There is one C<sbd> process that acts as a master to which all watchers
	report; one per device to monitor the node's slot; and, optionally, one
	that handles the Pacemaker integration.

	=over

	=item B<-W>

	-Enable use of the system watchdog. This is I<highly> recommended.
	+Enable or disable use of the system watchdog to protect against the sbd
	+processes failing and the node being left in an undefined state. Specify
	+this once to enable, twice to disable.
	+
	+Defaults to I<enabled>.

	=item B<-w> F</dev/watchdog>

	This can be used to override the default watchdog device used and should not
	usually be necessary.

	=item B<-p> F</var/run/sbd.pid>

	This option can be used to specify a pidfile for the main sbd process.

	=item B<-F> I<N>

	Number of failures after which a failing servant process is no longer restarted
	immediately, but only once the dampening delay has expired. If set to zero, servants
	will be restarted immediately and indefinitely. If set to one, a failed
	servant will be restarted once every B<-t> seconds. If set to a different
	value, the servant will be restarted that many times within the dampening
	period and then delay.

	Defaults to I<1>.

	=item B<-t> I<N>

	Dampening delay before faulty servants are restarted. Combined with C<-F 1>,
	the most logical way to tune the restart frequency of servant processes.
	Default is 5 seconds.

	If set to zero, processes will be restarted indefinitely and immediately.

	=item B<-P>

	Check Pacemaker quorum and node health.

	=item B<-S> I<N>

	Set the start mode. (Defaults to I<0>.)

	If this is set to zero, sbd will always start up unconditionally,
	regardless of whether the node was previously fenced or not.

	If set to one, sbd will only start if the node was previously shut down
	cleanly (as indicated by an exit request message in the slot), or if the
	slot is empty. A reset, crashdump, or power-off request in any slot will
	halt the start up.

	This is useful to prevent nodes from rejoining if they were faulty. The
	node must be manually "unfenced" by sending an empty message to it:

	sbd -d /dev/sda1 message node1 clear

	=item B<-s> I<N>

	Set the start-up wait time for devices. (Defaults to I<120>.)

	Dynamic block devices such as iSCSI might not be fully initialized and
	present yet. This option sets a timeout for waiting for devices to
	appear on start-up. If set to 0, start-up will be aborted immediately if
	no devices are available.

	=item B<-Z>

	Enable trace mode. B<Warning: this is unsafe for production, use at your
	own risk!> Specifying this once will turn all reboots or power-offs, be
	they caused by self-fence decisions or messages, into a crashdump.
	Specifying this twice will just log them but not continue running.

	=item B<-T>

	By default, the daemon will set the watchdog timeout as specified in the
	device metadata. However, this does not work for every watchdog device.
	In this case, you must manually ensure that the watchdog timeout used by
	the system correctly matches the SBD settings, and then specify this
	option to allow C<sbd> to continue with start-up.

	=item B<-5> I<N>

	Warn if the time interval for tickling the watchdog exceeds this many seconds.
	Since the node is unable to log the watchdog expiry (it reboots immediately
	without a chance to write its logs to disk), this is very useful for getting
	an indication that the watchdog timeout is too short for the IO load of the
	system.

	Default is 3 seconds, set to zero to disable.

	=item B<-C> I<N>

	Watchdog timeout to set before crashdumping. If SBD is set to crashdump
	instead of reboot (either via the trace mode settings or the I<external/sbd>
	fencing agent's parameter), SBD will adjust the watchdog timeout to this
	setting before triggering the dump. Otherwise, the watchdog might trigger and
	prevent a successful crashdump from ever being written.

	Defaults to 240 seconds. Set to zero to disable.

	=back

	=head2 allocate

	Example usage:

	sbd -d /dev/sda1 allocate node1

	Explicitly allocates a slot for the specified node name. This should
	rarely be necessary, as every node will automatically allocate itself a
	slot the first time it starts up in watch mode.

	=head2 message

	Example usage:

	sbd -d /dev/sda1 message node1 test

	Writes the specified message to node's slot. This is rarely done
	directly, but rather abstracted via the C<external/sbd> fencing agent
	configured as a cluster resource.

	Supported message types are:

	=over

	=item test

	This only generates a log message on the receiving node and can be used
	to check if SBD is seeing the device. Note that this could overwrite a
	fencing request sent by the cluster, so it should not be used in
	production.

	=item reset

	Reset the target upon receipt of this message.

	=item off

	Power-off the target.

	=item crashdump

	Cause the target node to crashdump.

	=item exit

	This will make the C<sbd> daemon exit cleanly on the target. You should
	B<not> send this message manually; this is handled properly during
	shutdown of the cluster stack. Manually stopping the daemon means the
	node is unprotected!

	=item clear

	This message indicates that no real message has been sent to the node.
	You should not set this manually; C<sbd> will clear the message slot
	automatically during start-up, and setting this manually could overwrite
	a fencing message by the cluster.

	=back

	=head1 Base system configuration

	=head2 Configure a watchdog

	It is highly recommended that you configure your Linux system to load a
	watchdog driver with hardware assistance (as is available on most modern
	systems), such as I<hpwdt>, I<iTCO_wdt>, or others. As a fall-back, you
	can use the I<softdog> module.

	No other software must access the watchdog timer; it can only be
	accessed by one process at any given time. Some hardware vendors ship
	systems management software that uses the watchdog for system resets
	(e.g., the HP ASR daemon). Such software has to be disabled if the watchdog
	is to be used by SBD.

	=head2 Choosing and initializing the block device(s)

	First, you have to decide if you want to use one, two, or three devices.

	If you are using multiple ones, they should reside on independent
	storage setups. Putting all three of them on the same logical unit, for
	example, would not provide any additional redundancy.

	The SBD device can be connected via Fibre Channel, Fibre Channel over
	Ethernet, or even iSCSI. Thus, an iSCSI target can become a sort-of
	network-based quorum server; the advantage is that it does not require
	a smart host at your third location, just block storage.

	The SBD partitions themselves B<must not> be mirrored (via MD,
	DRBD, or the storage layer itself), since this could result in a
	split-mirror scenario. Nor can they reside on cLVM2 volume groups, since
	they must be accessed by the cluster stack before it has started the
	cLVM2 daemons; hence, these should be either raw partitions or logical
	units on (multipath) storage.

	The block device(s) must be accessible from all nodes. (While it is not
	necessary that they share the same path name on all nodes, this is
	considered a very good idea.)

	SBD will only use about one megabyte per device, so you can easily
	create a small partition, or very small logical units. (The size of the
	SBD device depends on the block size of the underlying device. Thus, 1MB
	is fine on plain SCSI devices and SAN storage with 512 byte blocks. On
	the IBM s390x architecture in particular, disks default to 4k blocks,
	and thus require roughly 4MB.)

	The number of devices will affect the operation of SBD as follows:

	=over

	=item One device

	In its most simple implementation, you use one device only. This is
	appropriate for clusters where all your data is on the same shared
	storage (with internal redundancy) anyway; the SBD device does not
	introduce an additional single point of failure then.

	If the SBD device is not accessible, the daemon will fail to start and
	inhibit openais startup.

	=item Two devices

	This configuration is a trade-off, primarily aimed at environments where
	host-based mirroring is used, but no third storage device is available.

	SBD will not commit suicide if it loses access to one mirror leg; this
	allows the cluster to continue to function even in the face of one outage.

	However, SBD will not fence the other side while only one mirror leg is
	available, since it does not have enough knowledge to detect an asymmetric
	split of the storage. So it will not be able to automatically tolerate a
	second failure while one of the storage arrays is down. (Though you
	can use the appropriate crm command to acknowledge the fence manually.)

	It will not start unless both devices are accessible on boot.

	=item Three devices

	In this most reliable and recommended configuration, SBD will only
	self-fence if more than one device is lost; hence, this configuration is
	resilient against temporary single device outages (be it due to failures
	or maintenance). Fencing messages can still be successfully relayed if
	at least two devices remain accessible.

	This configuration is appropriate for more complex scenarios where
	storage is not confined to a single array. For example, host-based
	mirroring solutions could have one SBD per mirror leg (not mirrored
	itself), and an additional tie-breaker on iSCSI.

	It will only start if at least two devices are accessible on boot.

	=back
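
As a rough illustration of the three cases above, the sketch below encodes when a running node keeps going and when the daemon is allowed to start. The function names `survives_running` and `can_start` are hypothetical, and the rules are taken from the descriptions in this section rather than from the sbd sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch only: rules derived from the text above, not from sbd's code.
 * "configured" is the number of SBD devices (1 to 3), "readable" is how
 * many of them can currently be accessed. */

/* A running node does not need to self-fence over device loss. */
bool survives_running(int configured, int readable)
{
    assert(configured >= 1 && configured <= 3);
    assert(readable >= 0 && readable <= configured);

    if (configured == 3)
        return readable >= 2; /* tolerates the loss of one device */
    return readable >= 1;     /* one or two devices: one must remain */
}

/* The daemon is allowed to start on boot. */
bool can_start(int configured, int readable)
{
    assert(configured >= 1 && configured <= 3);
    assert(readable >= 0 && readable <= configured);

    if (configured == 3)
        return readable >= 2;      /* at least two of three */
    return readable == configured; /* one or two devices: all must be present */
}
```

The two functions differ only for the two-device case, where a running node tolerates the loss of one mirror leg but a starting node requires both.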

	After you have chosen the devices and created the appropriate partitions
	and perhaps multipath alias names to ease management, use the C<sbd create>
	command described above to initialize the SBD metadata on them.

	=head3 Sharing the block device(s) between multiple clusters

	It is possible to share the block devices between multiple clusters,
	provided the total number of nodes accessing them does not exceed I<255>
	and all of them share the same SBD timeouts (since these are
	part of the metadata).

	If you are using multiple devices this can reduce the setup overhead
	required. However, you should B<not> share devices between clusters in
	different security domains.

	=head2 Configure SBD to start on boot

	On systems using C<sysvinit>, the C<openais> or C<corosync> system
	start-up scripts must handle starting or stopping C<sbd> as required
	before starting the rest of the cluster stack.

	For C<systemd>, sbd simply has to be enabled using

	systemctl enable sbd.service

	The daemon is brought online on each node before Pacemaker is
	started, and terminated only after all other cluster components have
	been shut down - ensuring that cluster resources are never activated
	without SBD supervision.

	=head2 Configuration via sysconfig

	The system instance of C<sbd> is configured via F</etc/sysconfig/sbd>.
	In this file, you must specify the device(s) used, as well as any
	options to pass to the daemon:

	SBD_DEVICE="/dev/sda1;/dev/sdb1;/dev/sdc1"
	SBD_PACEMAKER="true"

	C<sbd> will fail to start if no C<SBD_DEVICE> is specified. See the
	installed template for more options that can be configured here.

	=head2 Testing the sbd installation

	After a restart of the cluster stack on this node, you can now try
	sending a test message to it as root, from this or any other node:

	sbd -d /dev/sda1 message node1 test

	The node will acknowledge the receipt of the message in the system logs:

	Aug 29 14:10:00 node1 sbd: [13412]: info: Received command test from node2

	This confirms that SBD is indeed up and running on the node, and that it
	is ready to receive messages.

	Make B<sure> that F</etc/sysconfig/sbd> is identical on all cluster
	nodes, and that all cluster nodes are running the daemon.

	=head1 Pacemaker CIB integration

	=head2 Fencing resource

	Pacemaker can only interact with SBD to issue a node fence if there is a
	configured fencing resource. This should be a primitive, not a clone, as
	follows:

	primitive fencing-sbd external/sbd \
	op start start-delay="15"

	This will automatically use the same devices as configured in
	F</etc/sysconfig/sbd>.

	While you should not configure this as a clone (as Pacemaker will start
	a fencing agent in each partition automatically), the I<start-delay>
	setting ensures that, if a split-brain does occur in a two-node
	cluster, the node that still needs to instantiate a fencing agent is
	slightly disadvantaged, which avoids fencing loops.

	SBD also supports turning the reset request into a crash request, which
	may be helpful for debugging if you have kernel crashdumping configured;
	then, every fence request will cause the node to dump core. You can
	enable this via the C<crashdump="true"> parameter on the fencing
	resource. This is B<not> recommended for production use, but only for
	debugging phases.

	=head2 General cluster properties

	You must also enable STONITH in general, and set the STONITH timeout to
	be at least twice the I<msgwait> timeout you have configured, to allow
	enough time for the fencing message to be delivered. If your I<msgwait>
	timeout is 60 seconds, this is a possible configuration:

	property stonith-enabled="true"
	property stonith-timeout="120s"

	B<Caution>: if I<stonith-timeout> is too low for I<msgwait> and the
	system overhead, sbd will never be able to successfully complete a fence
	request. This will create a fencing loop.
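
The timeout relationships stated in this man page (msgwait should be twice the watchdog timeout, and stonith-timeout at least twice msgwait) can be checked with a few lines of arithmetic. This is a minimal sketch; `struct sbd_timeouts` and `derive_timeouts` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical container for the three related timeouts, in seconds. */
struct sbd_timeouts {
    unsigned watchdog;    /* -1 on "create" */
    unsigned msgwait;     /* -4 on "create" */
    unsigned stonith_min; /* lower bound for Pacemaker's stonith-timeout */
};

/* Derive msgwait and the minimum stonith-timeout from the watchdog timeout,
 * following the factors of two given in the text above. */
struct sbd_timeouts derive_timeouts(unsigned watchdog_s)
{
    struct sbd_timeouts t;

    assert(watchdog_s > 0);
    t.watchdog = watchdog_s;
    t.msgwait = 2 * watchdog_s;
    t.stonith_min = 2 * t.msgwait;
    return t;
}
```

For the watchdog timeout of 15 seconds shown in the dump example, this yields a msgwait of 30 seconds and a stonith-timeout of at least 60 seconds.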

	=head1 Management tasks

	=head2 Recovering from temporary SBD device outage

	If you have multiple devices, failure of a single device is not immediately
	fatal. C<sbd> will retry to restart the monitor for the device every 5
	seconds by default. However, you can tune this via the options to the
	I<watch> command.

	In case you wish to immediately force a restart of all currently
	disabled monitor processes, you can send a I<SIGUSR1> to the SBD
	I<inquisitor> process.


	=head1 LICENSE

	Copyright (C) 2008-2013 Lars Marowsky-Bree

	This program is free software; you can redistribute it and/or
	modify it under the terms of the GNU General Public
	License as published by the Free Software Foundation; either
	version 2.1 of the License, or (at your option) any later version.

	This software is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	General Public License for more details.

	For details see the GNU General Public License at
	http://www.gnu.org/licenses/gpl.html

	diff --git a/src/sbd-common.c b/src/sbd-common.c
	index db35732..79c8890 100644
	--- a/src/sbd-common.c
	+++ b/src/sbd-common.c
	@@ -1,1054 +1,1054 @@
	/*
	* Copyright (C) 2013 Lars Marowsky-Bree <lmb@suse.com>
	*
	* This program is free software; you can redistribute it and/or
	* modify it under the terms of the GNU General Public
	* License as published by the Free Software Foundation; either
	* version 2.1 of the License, or (at your option) any later version.
	*
	* This software is distributed in the hope that it will be useful,
	* but WITHOUT ANY WARRANTY; without even the implied warranty of
	* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	* General Public License for more details.
	*
	* You should have received a copy of the GNU General Public
	* License along with this library; if not, write to the Free Software
	* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
	*/

	#include "sbd.h"

	/* These have to match the values in the header of the partition */
	static char sbd_magic[8] = "SBD_SBD_";
	static char sbd_version = 0x02;

	/* Tunable defaults: */
	unsigned long timeout_watchdog = 5;
	unsigned long timeout_watchdog_warn = 3;
	int timeout_allocate = 2;
	int timeout_loop = 1;
	int timeout_msgwait = 10;
	int timeout_io = 3;
	int timeout_startup = 120;

	-int watchdog_use = 0;
	+int watchdog_use = 1;
	int watchdog_set_timeout = 1;
	unsigned long timeout_watchdog_crashdump = 240;
	int skip_rt = 0;
	int check_pcmk = 0;
	int debug = 0;
	int debug_mode = 0;
	const char *watchdogdev = "/dev/watchdog";
	char * local_uname;

	/* Global, non-tunable variables: */
	int sector_size = 0;
	int watchdogfd = -1;

	/*const char *devname;*/
	const char *cmdname;

	void
	usage(void)
	{
	fprintf(stderr,
	"Shared storage fencing tool.\n"
	"Syntax:\n"
	" %s <options> <command> <cmdarguments>\n"
	"Options:\n"
	"-d <devname> Block device to use (mandatory; can be specified up to 3 times)\n"
	"-h Display this help.\n"
	"-n <node> Set local node name; defaults to uname -n (optional)\n"
	"\n"
	"-R Do NOT enable realtime priority (debugging only)\n"
	"-W Use watchdog (recommended) (watch only)\n"
	"-w <dev> Specify watchdog device (optional) (watch only)\n"
	"-T Do NOT initialize the watchdog timeout (watch only)\n"
	"-S <0|1> Set start mode if the node was previously fenced (watch only)\n"
	"-p <path> Write pidfile to the specified path (watch only)\n"
	"-v Enable some verbose debug logging (optional)\n"
	"\n"
	"-1 <N> Set watchdog timeout to N seconds (optional, create only)\n"
	"-2 <N> Set slot allocation timeout to N seconds (optional, create only)\n"
	"-3 <N> Set daemon loop timeout to N seconds (optional, create only)\n"
	"-4 <N> Set msgwait timeout to N seconds (optional, create only)\n"
	"-5 <N> Warn if loop latency exceeds threshold (optional, watch only)\n"
	" (default is 3, set to 0 to disable)\n"
	"-C <N> Watchdog timeout to set before crashdumping (def: 240s, optional)\n"
	"-I <N> Async IO read timeout (defaults to 3 * loop timeout, optional)\n"
	"-s <N> Timeout to wait for devices to become available (def: 120s)\n"
	"-t <N> Dampening delay before faulty servants are restarted (optional)\n"
	" (default is 5, set to 0 to disable)\n"
	"-F <N> # of failures before a servant is considered faulty (optional)\n"
	" (default is 1, set to 0 to disable)\n"
	"-P Check Pacemaker quorum and node health (optional, watch only)\n"
	"-Z Enable trace mode. WARNING: UNSAFE FOR PRODUCTION!\n"
	"Commands:\n"
	"create initialize N slots on <dev> - OVERWRITES DEVICE!\n"
	"list List all allocated slots on device, and messages.\n"
	"dump Dump meta-data header from device.\n"
	"watch Loop forever, monitoring own slot\n"
	"allocate <node>\n"
	" Allocate a slot for node (optional)\n"
	"message <node> (test|reset|off|clear|exit)\n"
	" Writes the specified message to node's slot.\n"
	, cmdname);
	}

	int
	watchdog_init_interval(void)
	{
	int timeout = timeout_watchdog;

	if (watchdogfd < 0) {
	return 0;
	}


	if (watchdog_set_timeout == 0) {
	cl_log(LOG_INFO, "NOT setting watchdog timeout on explicit user request!");
	return 0;
	}

	if (ioctl(watchdogfd, WDIOC_SETTIMEOUT, &timeout) < 0) {
	cl_perror( "WDIOC_SETTIMEOUT"
	": Failed to set watchdog timer to %u seconds.",
	timeout);
	cl_log(LOG_CRIT, "Please validate your watchdog configuration!");
	cl_log(LOG_CRIT, "Choose a different watchdog driver or specify -T to skip this if you are completely sure.");
	return -1;
	} else {
	cl_log(LOG_INFO, "Set watchdog timeout to %u seconds.",
	timeout);
	}
	return 0;
	}

	int
	watchdog_tickle(void)
	{
	if (watchdogfd >= 0) {
	if (write(watchdogfd, "", 1) != 1) {
	cl_perror("Watchdog write failure: %s!",
	watchdogdev);
	return -1;
	}
	}
	return 0;
	}

	int
	watchdog_init(void)
	{
	if (watchdogfd < 0 && watchdogdev != NULL) {
	watchdogfd = open(watchdogdev, O_WRONLY);
	if (watchdogfd >= 0) {
	cl_log(LOG_NOTICE, "Using watchdog device: %s",
	watchdogdev);
	if ((watchdog_init_interval() < 0)
	|| (watchdog_tickle() < 0)) {
	return -1;
	}
	}else{
	cl_perror("Cannot open watchdog device: %s",
	watchdogdev);
	return -1;
	}
	}
	return 0;
	}

	void
	watchdog_close(void)
	{
	if (watchdogfd >= 0) {
	if (write(watchdogfd, "V", 1) != 1) {
	cl_perror(
	"Watchdog write magic character failure: closing %s!",
	watchdogdev);
	}
	if (close(watchdogfd) < 0) {
	cl_perror("Watchdog close(2) failed.");
	}
	watchdogfd = -1;
	}
	}

	/* This duplicates some code from linux/ioprio.h since these are not included
	* even in linux-kernel-headers. Sucks. See also
	* /usr/src/linux/Documentation/block/ioprio.txt and ioprio_set(2) */
	extern int sys_ioprio_set(int, int, int);
	int ioprio_set(int which, int who, int ioprio);
	inline int ioprio_set(int which, int who, int ioprio)
	{
	return syscall(__NR_ioprio_set, which, who, ioprio);
	}

	enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
	};

	enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
	};

	#define IOPRIO_BITS (16)
	#define IOPRIO_CLASS_SHIFT (13)
	#define IOPRIO_PRIO_MASK ((1UL << IOPRIO_CLASS_SHIFT) - 1)

	#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
	#define IOPRIO_PRIO_DATA(mask) ((mask) & IOPRIO_PRIO_MASK)
	#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data)

	void
	maximize_priority(void)
	{
	if (skip_rt) {
	cl_log(LOG_INFO, "Not elevating to realtime (-R specified).");
	return;
	}

	cl_make_realtime(-1, 100, 256, 256);

	if (ioprio_set(IOPRIO_WHO_PROCESS, getpid(),
	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 1)) != 0) {
	cl_perror("ioprio_set() call failed.");
	}
	}

	void
	close_device(struct sbd_context *st)
	{
	close(st->devfd);
	free(st);
	}

	struct sbd_context *
	open_device(const char* devname, int loglevel)
	{
	struct sbd_context *st;

	if (!devname)
	return NULL;

	st = malloc(sizeof(struct sbd_context));
	if (!st)
	return NULL;
	memset(st, 0, sizeof(struct sbd_context));

	if (io_setup(1, &st->ioctx) != 0) {
	cl_perror("io_setup failed");
	free(st);
	return NULL;
	}

	st->devfd = open(devname, O_SYNC|O_RDWR|O_DIRECT);

	if (st->devfd == -1) {
	if (loglevel == LOG_DEBUG) {
	DBGLOG(loglevel, "Opening device %s failed.", devname);
	} else {
	cl_log(loglevel, "Opening device %s failed.", devname);
	}
	free(st);
	return NULL;
	}

	ioctl(st->devfd, BLKSSZGET, &sector_size);

	if (sector_size == 0) {
	cl_perror("Get sector size failed.\n");
	close_device(st);
	return NULL;
	}

	return st;
	}

	signed char
	cmd2char(const char *cmd)
	{
	if (strcmp("clear", cmd) == 0) {
	return SBD_MSG_EMPTY;
	} else if (strcmp("test", cmd) == 0) {
	return SBD_MSG_TEST;
	} else if (strcmp("reset", cmd) == 0) {
	return SBD_MSG_RESET;
	} else if (strcmp("off", cmd) == 0) {
	return SBD_MSG_OFF;
	} else if (strcmp("exit", cmd) == 0) {
	return SBD_MSG_EXIT;
	} else if (strcmp("crashdump", cmd) == 0) {
	return SBD_MSG_CRASHDUMP;
	}
	return -1;
	}

	void *
	sector_alloc(void)
	{
	void *x;

	x = valloc(sector_size);
	if (!x) {
	exit(1);
	}
	memset(x, 0, sector_size);

	return x;
	}

	const char*
	char2cmd(const char cmd)
	{
	switch (cmd) {
	case SBD_MSG_EMPTY:
	return "clear";
	break;
	case SBD_MSG_TEST:
	return "test";
	break;
	case SBD_MSG_RESET:
	return "reset";
	break;
	case SBD_MSG_OFF:
	return "off";
	break;
	case SBD_MSG_EXIT:
	return "exit";
	break;
	case SBD_MSG_CRASHDUMP:
	return "crashdump";
	break;
	default:
	return "undefined";
	break;
	}
	}

	static int
	sector_io(struct sbd_context *st, int sector, void *data, int rw)
	{
	struct timespec timeout;
	struct io_event event;
	struct iocb *ios[1] = { &st->io };
	long r;

	timeout.tv_sec = timeout_io;
	timeout.tv_nsec = 0;

	memset(&st->io, 0, sizeof(struct iocb));
	if (rw) {
	io_prep_pwrite(&st->io, st->devfd, data, sector_size, sector_size * sector);
	} else {
	io_prep_pread(&st->io, st->devfd, data, sector_size, sector_size * sector);
	}

	if (io_submit(st->ioctx, 1, ios) != 1) {
	cl_log(LOG_ERR, "Failed to submit IO request! (rw=%d)", rw);
	return -1;
	}

	errno = 0;
	r = io_getevents(st->ioctx, 1L, 1L, &event, &timeout);

	if (r < 0 ) {
	cl_log(LOG_ERR, "Failed to retrieve IO events (rw=%d)", rw);
	return -1;
	} else if (r < 1L) {
	cl_log(LOG_INFO, "Cancelling IO request due to timeout (rw=%d)", rw);
	r = io_cancel(st->ioctx, ios[0], &event);
	if (r) {
	DBGLOG(LOG_INFO, "Could not cancel IO request (rw=%d)", rw);
	/* Doesn't really matter, debugging information.
	*/
	}
	return -1;
	} else if (r > 1L) {
	cl_log(LOG_ERR, "More than one IO was returned (r=%ld)", r);
	return -1;
	}


	/* IO is happy */
	if (event.res == sector_size) {
	return 0;
	} else {
	cl_log(LOG_ERR, "Short IO (rw=%d, res=%lu, sector_size=%d)",
	rw, event.res, sector_size);
	return -1;
	}
	}

	int
	sector_write(struct sbd_context *st, int sector, void *data)
	{
	return sector_io(st, sector, data, 1);
	}

	int
	sector_read(struct sbd_context *st, int sector, void *data)
	{
	return sector_io(st, sector, data, 0);
	}

	int
	slot_read(struct sbd_context *st, int slot, struct sector_node_s *s_node)
	{
	return sector_read(st, SLOT_TO_SECTOR(slot), s_node);
	}

	int
	slot_write(struct sbd_context *st, int slot, struct sector_node_s *s_node)
	{
	return sector_write(st, SLOT_TO_SECTOR(slot), s_node);
	}

	int
	mbox_write(struct sbd_context *st, int mbox, struct sector_mbox_s *s_mbox)
	{
	return sector_write(st, MBOX_TO_SECTOR(mbox), s_mbox);
	}

	int
	mbox_read(struct sbd_context *st, int mbox, struct sector_mbox_s *s_mbox)
	{
	return sector_read(st, MBOX_TO_SECTOR(mbox), s_mbox);
	}

	int
	mbox_write_verify(struct sbd_context *st, int mbox, struct sector_mbox_s *s_mbox)
	{
	void *data;
	int rc = 0;

	if (sector_write(st, MBOX_TO_SECTOR(mbox), s_mbox) < 0)
	return -1;

	data = sector_alloc();
	if (sector_read(st, MBOX_TO_SECTOR(mbox), data) < 0) {
	rc = -1;
	goto out;
	}


	if (memcmp(s_mbox, data, sector_size) != 0) {
	cl_log(LOG_ERR, "Write verification failed!");
	rc = -1;
	goto out;
	}
	rc = 0;
	out:
	free(data);
	return rc;
	}

	int header_write(struct sbd_context *st, struct sector_header_s *s_header)
	{
	s_header->sector_size = htonl(s_header->sector_size);
	s_header->timeout_watchdog = htonl(s_header->timeout_watchdog);
	s_header->timeout_allocate = htonl(s_header->timeout_allocate);
	s_header->timeout_loop = htonl(s_header->timeout_loop);
	s_header->timeout_msgwait = htonl(s_header->timeout_msgwait);
	return sector_write(st, 0, s_header);
	}

	int
	header_read(struct sbd_context *st, struct sector_header_s *s_header)
	{
	if (sector_read(st, 0, s_header) < 0)
	return -1;

	s_header->sector_size = ntohl(s_header->sector_size);
	s_header->timeout_watchdog = ntohl(s_header->timeout_watchdog);
	s_header->timeout_allocate = ntohl(s_header->timeout_allocate);
	s_header->timeout_loop = ntohl(s_header->timeout_loop);
	s_header->timeout_msgwait = ntohl(s_header->timeout_msgwait);
	/* This sets the global defaults: */
	timeout_watchdog = s_header->timeout_watchdog;
	timeout_allocate = s_header->timeout_allocate;
	timeout_loop = s_header->timeout_loop;
	timeout_msgwait = s_header->timeout_msgwait;

	return 0;
	}
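
Review note: header_write() and header_read() above keep the numeric header fields in network byte order on disk, and header_write() converts the caller's struct in place, so the in-memory copy is left in on-disk order after the call. The following is only a minimal sketch of that round trip. The struct `demo_header`, its `uint32_t` field types, and both helper names are assumptions made for illustration; they are not the real `struct sector_header_s` or any sbd function.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct sector_header_s. */
struct demo_header {
	uint32_t sector_size;
	uint32_t timeout_watchdog;
	uint32_t timeout_msgwait;
};

/* Host -> network (big-endian) order, in place, as header_write() does. */
void demo_header_to_disk(struct demo_header *h)
{
	h->sector_size = htonl(h->sector_size);
	h->timeout_watchdog = htonl(h->timeout_watchdog);
	h->timeout_msgwait = htonl(h->timeout_msgwait);
}

/* Network -> host order, in place, as header_read() does. */
void demo_header_from_disk(struct demo_header *h)
{
	h->sector_size = ntohl(h->sector_size);
	h->timeout_watchdog = ntohl(h->timeout_watchdog);
	h->timeout_msgwait = ntohl(h->timeout_msgwait);
}
```

A caller that needs the host-order values after writing would have to convert back, or write from a copy.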

	int
	valid_header(const struct sector_header_s *s_header)
	{
	if (memcmp(s_header->magic, sbd_magic, sizeof(s_header->magic)) != 0) {
	cl_log(LOG_ERR, "Header magic does not match.");
	return -1;
	}
	if (s_header->version != sbd_version) {
	cl_log(LOG_ERR, "Header version does not match.");
	return -1;
	}
	if (s_header->sector_size != sector_size) {
	cl_log(LOG_ERR, "Header sector size does not match.");
	return -1;
	}
	return 0;
	}

	struct sector_header_s *
	header_get(struct sbd_context *st)
	{
	struct sector_header_s *s_header;
	s_header = sector_alloc();

	if (header_read(st, s_header) < 0) {
	cl_log(LOG_ERR, "Unable to read header from device %d", st->devfd);
	return NULL;
	}

	if (valid_header(s_header) < 0) {
	cl_log(LOG_ERR, "header on device %d is not valid.", st->devfd);
	return NULL;
	}

	/* cl_log(LOG_INFO, "Found version %d header with %d slots",
	s_header->version, s_header->slots); */

	return s_header;
	}

	int
	init_device(struct sbd_context *st)
	{
	struct sector_header_s *s_header;
	struct sector_node_s *s_node;
	struct sector_mbox_s *s_mbox;
	struct stat s;
	char uuid[37];
	int i;
	int rc = 0;

	s_header = sector_alloc();
	s_node = sector_alloc();
	s_mbox = sector_alloc();
	memcpy(s_header->magic, sbd_magic, sizeof(s_header->magic));
	s_header->version = sbd_version;
	s_header->slots = 255;
	s_header->sector_size = sector_size;
	s_header->timeout_watchdog = timeout_watchdog;
	s_header->timeout_allocate = timeout_allocate;
	s_header->timeout_loop = timeout_loop;
	s_header->timeout_msgwait = timeout_msgwait;

	s_header->minor_version = 1;
	uuid_generate(s_header->uuid);
	uuid_unparse_lower(s_header->uuid, uuid);

	fstat(st->devfd, &s);
	/* printf("st_size = %ld, st_blksize = %ld, st_blocks = %ld\n",
	s.st_size, s.st_blksize, s.st_blocks); */

	cl_log(LOG_INFO, "Creating version %d.%d header on device %d (uuid: %s)",
	s_header->version, s_header->minor_version,
	st->devfd, uuid);
	fprintf(stdout, "Creating version %d.%d header on device %d (uuid: %s)\n",
	s_header->version, s_header->minor_version,
	st->devfd, uuid);
	if (header_write(st, s_header) < 0) {
	rc = -1; goto out;
	}
	cl_log(LOG_INFO, "Initializing %d slots on device %d",
	s_header->slots,
	st->devfd);
	fprintf(stdout, "Initializing %d slots on device %d\n",
	s_header->slots,
	st->devfd);
	for (i=0;i < s_header->slots;i++) {
	if (slot_write(st, i, s_node) < 0) {
	rc = -1; goto out;
	}
	if (mbox_write(st, i, s_mbox) < 0) {
	rc = -1; goto out;
	}
	}

	out: free(s_node);
	free(s_header);
	free(s_mbox);
	return(rc);
	}

	/* Check if there already is a slot allocated to said name; returns the
	* slot number. If not found, returns -1.
	* This is necessary because slots might not be contiguous. */
	int
	slot_lookup(struct sbd_context *st, const struct sector_header_s *s_header, const char *name)
	{
	struct sector_node_s *s_node = NULL;
	int i;
	int rc = -1;

	if (!name) {
	cl_log(LOG_ERR, "slot_lookup(): No name specified.\n");
	goto out;
	}

	s_node = sector_alloc();

	for (i=0; i < s_header->slots; i++) {
	if (slot_read(st, i, s_node) < 0) {
	rc = -2; goto out;
	}
	if (s_node->in_use != 0) {
	if (strncasecmp(s_node->name, name,
	sizeof(s_node->name)) == 0) {
	DBGLOG(LOG_INFO, "%s owns slot %d", name, i);
	rc = i; goto out;
	}
	}
	}

	out: free(s_node);
	return rc;
	}

	int
	slot_unused(struct sbd_context *st, const struct sector_header_s *s_header)
	{
	struct sector_node_s *s_node;
	int i;
	int rc = -1;

	s_node = sector_alloc();

	for (i=0; i < s_header->slots; i++) {
	if (slot_read(st, i, s_node) < 0) {
	rc = -1; goto out;
	}
	if (s_node->in_use == 0) {
	rc = i; goto out;
	}
	}

	out: free(s_node);
	return rc;
	}


	int
	slot_allocate(struct sbd_context *st, const char *name)
	{
	struct sector_header_s *s_header = NULL;
	struct sector_node_s *s_node = NULL;
	struct sector_mbox_s *s_mbox = NULL;
	int i;
	int rc = 0;

	if (!name) {
	cl_log(LOG_ERR, "slot_allocate(): No name specified.\n");
	fprintf(stderr, "slot_allocate(): No name specified.\n");
	rc = -1; goto out;
	}

	s_header = header_get(st);
	if (!s_header) {
	rc = -1; goto out;
	}

	s_node = sector_alloc();
	s_mbox = sector_alloc();

	while (1) {
	i = slot_lookup(st, s_header, name);
	if ((i >= 0) || (i == -2)) {
	/* -1 is "no slot found", in which case we
	* proceed to allocate a new one.
	* -2 is "read error during lookup", in which
	* case we error out too
	* >= 0 is "slot already allocated" */
	rc = i; goto out;
	}

	i = slot_unused(st, s_header);
	if (i >= 0) {
	cl_log(LOG_INFO, "slot %d is unused - trying to own", i);
	fprintf(stdout, "slot %d is unused - trying to own\n", i);
	memset(s_node, 0, sizeof(*s_node));
	s_node->in_use = 1;
	strncpy(s_node->name, name, sizeof(s_node->name));
	if (slot_write(st, i, s_node) < 0) {
	rc = -1; goto out;
	}
	sleep(timeout_allocate);
	} else {
	cl_log(LOG_ERR, "No more free slots.");
	fprintf(stderr, "No more free slots.\n");
	rc = -1; goto out;
	}
	}

	out: free(s_node);
	free(s_header);
	free(s_mbox);
	return(rc);
	}

	int
	slot_list(struct sbd_context *st)
	{
	struct sector_header_s *s_header = NULL;
	struct sector_node_s *s_node = NULL;
	struct sector_mbox_s *s_mbox = NULL;
	int i;
	int rc = 0;

	s_header = header_get(st);
	if (!s_header) {
	rc = -1; goto out;
	}

	s_node = sector_alloc();
	s_mbox = sector_alloc();

	for (i=0; i < s_header->slots; i++) {
	if (slot_read(st, i, s_node) < 0) {
	rc = -1; goto out;
	}
	if (s_node->in_use > 0) {
	if (mbox_read(st, i, s_mbox) < 0) {
	rc = -1; goto out;
	}
	printf("%d\t%s\t%s\t%s\n",
	i, s_node->name, char2cmd(s_mbox->cmd),
	s_mbox->from);
	}
	}

	out: free(s_node);
	free(s_header);
	free(s_mbox);
	return rc;
	}

	int
	slot_msg(struct sbd_context *st, const char *name, const char *cmd)
	{
	struct sector_header_s *s_header = NULL;
	struct sector_mbox_s *s_mbox = NULL;
	int mbox;
	int rc = 0;
	char uuid[37];

	if (!name || !cmd) {
	cl_log(LOG_ERR, "slot_msg(): No recipient / cmd specified.\n");
	rc = -1; goto out;
	}

	s_header = header_get(st);
	if (!s_header) {
	rc = -1; goto out;
	}

	if (strcmp(name, "LOCAL") == 0) {
	name = local_uname;
	}

	if (s_header->minor_version > 0) {
	uuid_unparse_lower(s_header->uuid, uuid);
	cl_log(LOG_INFO, "Device UUID: %s", uuid);
	}

	mbox = slot_lookup(st, s_header, name);
	if (mbox < 0) {
	cl_log(LOG_ERR, "slot_msg(): No slot found for %s.", name);
	rc = -1; goto out;
	}

	s_mbox = sector_alloc();

	s_mbox->cmd = cmd2char(cmd);
	if (s_mbox->cmd < 0) {
	cl_log(LOG_ERR, "slot_msg(): Invalid command %s.", cmd);
	rc = -1; goto out;
	}

	strncpy(s_mbox->from, local_uname, sizeof(s_mbox->from)-1);

	cl_log(LOG_INFO, "Writing %s to node slot %s",
	cmd, name);
	if (mbox_write_verify(st, mbox, s_mbox) < 0) {
	rc = -1; goto out;
	}
	if (strcasecmp(cmd, "exit") != 0) {
	cl_log(LOG_INFO, "Messaging delay: %d",
	(int)timeout_msgwait);
	sleep(timeout_msgwait);
	}
	cl_log(LOG_INFO, "%s successfully delivered to %s",
	cmd, name);

	out: free(s_mbox);
	free(s_header);
	return rc;
	}

	int
	slot_ping(struct sbd_context *st, const char *name)
	{
	struct sector_header_s *s_header = NULL;
	struct sector_mbox_s *s_mbox = NULL;
	int mbox;
	int waited = 0;
	int rc = 0;

	if (!name) {
	cl_log(LOG_ERR, "slot_ping(): No recipient specified.\n");
	rc = -1; goto out;
	}

	s_header = header_get(st);
	if (!s_header) {
	rc = -1; goto out;
	}

	if (strcmp(name, "LOCAL") == 0) {
	name = local_uname;
	}

	mbox = slot_lookup(st, s_header, name);
	if (mbox < 0) {
	cl_log(LOG_ERR, "slot_ping(): No slot found for %s.", name);
	rc = -1; goto out;
	}

	s_mbox = sector_alloc();
	s_mbox->cmd = SBD_MSG_TEST;

	strncpy(s_mbox->from, local_uname, sizeof(s_mbox->from)-1);

	DBGLOG(LOG_DEBUG, "Pinging node %s", name);
	if (mbox_write(st, mbox, s_mbox) < 0) {
	rc = -1; goto out;
	}

	rc = -1;
	while (waited <= timeout_msgwait) {
	if (mbox_read(st, mbox, s_mbox) < 0)
	break;
	if (s_mbox->cmd != SBD_MSG_TEST) {
	rc = 0;
	break;
	}
	sleep(1);
	waited++;
	}

	if (rc == 0) {
	cl_log(LOG_DEBUG, "%s successfully pinged.", name);
	} else {
	cl_log(LOG_ERR, "%s failed to ping.", name);
	}

	out: free(s_mbox);
	free(s_header);
	return rc;
	}

	void
	sysrq_init(void)
	{
	FILE* procf;
	int c;
	procf = fopen("/proc/sys/kernel/sysrq", "r");
	if (!procf) {
	cl_perror("cannot open /proc/sys/kernel/sysrq for read.");
	return;
	}
	if (fscanf(procf, "%d", &c) != 1) {
	cl_perror("Parsing sysrq failed");
	c = 0;
	}
	fclose(procf);
	if (c == 1)
	return;
	/* 8 for debugging dumps of processes,
	128 for reboot/poweroff */
	c |= 136;
	procf = fopen("/proc/sys/kernel/sysrq", "w");
	if (!procf) {
	cl_perror("cannot open /proc/sys/kernel/sysrq for writing");
	return;
	}
	fprintf(procf, "%d", c);
	fclose(procf);
	return;
	}
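
Review note: sysrq_init() above leaves the setting alone when it reads 1 (which the kernel treats as "all sysrq functions enabled") and otherwise ORs in 136, which the comment breaks down as 8 plus 128. A small sketch of just that mask computation, separated from the /proc I/O, might look like the following. The enum constants and `demo_sysrq_mask()` are hypothetical names introduced here, not part of sbd.

```c
/* Bit values as given in the sysrq_init() comment. */
enum {
	DEMO_SYSRQ_ALL = 1,      /* 1: every sysrq function is enabled */
	DEMO_SYSRQ_DUMPS = 8,    /* debugging dumps of processes */
	DEMO_SYSRQ_REBOOT = 128  /* reboot/poweroff */
};

/* Return the value that would be written back to the sysrq setting. */
int demo_sysrq_mask(int current)
{
	if (current == DEMO_SYSRQ_ALL)
		return current;	/* already fully enabled, nothing to add */
	return current | DEMO_SYSRQ_DUMPS | DEMO_SYSRQ_REBOOT;
}
```

Keeping the computation in a pure function like this would make the bit logic testable without touching /proc/sys/kernel/sysrq.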

	void
	sysrq_trigger(char t)
	{
	FILE *procf;

	procf = fopen("/proc/sysrq-trigger", "a");
	if (!procf) {
	cl_perror("Opening sysrq-trigger failed.");
	return;
	}
	cl_log(LOG_INFO, "sysrq-trigger: %c\n", t);
	fprintf(procf, "%c\n", t);
	fclose(procf);
	return;
	}

	void
	do_crashdump(void)
	{
	if (timeout_watchdog_crashdump) {
	timeout_watchdog = timeout_watchdog_crashdump;
	watchdog_init_interval();
	watchdog_tickle();
	}
	sysrq_trigger('c');
	/* is it possible to reach the following line? */
	cl_reboot(5, "sbd is triggering crashdumping");
	exit(1);
	}

	void
	do_reset(void)
	{
	if (debug_mode == 1) {
	cl_log(LOG_ERR, "Request to suicide changed to kdump due to DEBUG MODE!");
	watchdog_close();
	sysrq_trigger('c');
	exit(0);
	} else if (debug_mode == 2) {
	cl_log(LOG_ERR, "Skipping request to suicide due to DEBUG MODE!");
	watchdog_close();
	exit(0);
	} else if (debug_mode == 3) {
	/* The idea is to give the system some time to flush
	* logs to disk before rebooting. */
	cl_log(LOG_ERR, "Delaying request to suicide by 10s due to DEBUG MODE!");
	watchdog_close();
	sync();
	sync();
	sleep(10);
	cl_log(LOG_ERR, "Debug mode is now becoming real ...");
	}
	sysrq_trigger('b');
	cl_reboot(5, "sbd is self-fencing (reset)");
	sleep(timeout_watchdog * 2);
	exit(1);
	}

	void
	do_off(void)
	{
	if (debug_mode == 1) {
	cl_log(LOG_ERR, "Request to power-off changed to kdump due to DEBUG MODE!");
	watchdog_close();
	sysrq_trigger('c');
	exit(0);
	} else if (debug_mode == 2) {
	cl_log(LOG_ERR, "Skipping request to power-off due to DEBUG MODE!");
	watchdog_close();
	exit(0);
	} else if (debug_mode == 3) {
	/* The idea is to give the system some time to flush
	* logs to disk before rebooting. */
	cl_log(LOG_ERR, "Delaying request to power-off by 10s due to DEBUG MODE!");
	watchdog_close();
	sync();
	sync();
	sleep(10);
	cl_log(LOG_ERR, "Debug mode is now becoming real ...");
	}
	sysrq_trigger('o');
	cl_reboot(5, "sbd is self-fencing (power-off)");
	sleep(timeout_watchdog * 2);
	exit(1);
	}

	pid_t
	make_daemon(void)
	{
	pid_t pid;
	const char * devnull = "/dev/null";

	pid = fork();
	if (pid < 0) {
	cl_log(LOG_ERR, "%s: could not start daemon\n",
	cmdname);
	cl_perror("fork");
	exit(1);
	}else if (pid > 0) {
	return pid;
	}

	cl_log_enable_stderr(FALSE);

	/* This is the child; ensure privileges have not been lost. */
	maximize_priority();
	sysrq_init();

	umask(022);
	close(0);
	(void)open(devnull, O_RDONLY);
	close(1);
	(void)open(devnull, O_WRONLY);
	close(2);
	(void)open(devnull, O_WRONLY);
	cl_cdtocoredir();
	return 0;
	}

	int
	header_dump(struct sbd_context *st)
	{
	struct sector_header_s *s_header;
	char uuid[37];

	s_header = header_get(st);
	if (s_header == NULL)
	return -1;

	printf("Header version : %u.%u\n", s_header->version,
	s_header->minor_version);
	if (s_header->minor_version > 0) {
	uuid_unparse_lower(s_header->uuid, uuid);
	printf("UUID : %s\n", uuid);
	}

	printf("Number of slots : %u\n", s_header->slots);
	printf("Sector size : %lu\n",
	(unsigned long)s_header->sector_size);
	printf("Timeout (watchdog) : %lu\n",
	(unsigned long)s_header->timeout_watchdog);
	printf("Timeout (allocate) : %lu\n",
	(unsigned long)s_header->timeout_allocate);
	printf("Timeout (loop) : %lu\n",
	(unsigned long)s_header->timeout_loop);
	printf("Timeout (msgwait) : %lu\n",
	(unsigned long)s_header->timeout_msgwait);
	return 0;
	}

	void
	sbd_get_uname(void)
	{
	struct utsname uname_buf;
	int i;

	if (uname(&uname_buf) < 0) {
	cl_perror("uname() failed?");
	exit(1);
	}

	local_uname = strdup(uname_buf.nodename);

	for (i = 0; i < strlen(local_uname); i++)
	local_uname[i] = tolower(local_uname[i]);
	}

	diff --git a/src/sbd-md.c b/src/sbd-md.c
	index 029947e..0a7278c 100644
	--- a/src/sbd-md.c
	+++ b/src/sbd-md.c
	@@ -1,1161 +1,1171 @@
	/*
	* Copyright (C) 2013 Lars Marowsky-Bree <lmb@suse.com>
	*
	* This program is free software; you can redistribute it and/or
	* modify it under the terms of the GNU General Public
	* License as published by the Free Software Foundation; either
	* version 2.1 of the License, or (at your option) any later version.
	*
	* This software is distributed in the hope that it will be useful,
	* but WITHOUT ANY WARRANTY; without even the implied warranty of
	* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	* General Public License for more details.
	*
	* You should have received a copy of the GNU General Public
	* License along with this library; if not, write to the Free Software
	* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
	*/

	#include "sbd.h"

	struct servants_list_item *servants_leader = NULL;

	static int servant_count = 0;
	static int servant_restart_interval = 5;
	static int servant_restart_count = 1;
	static int servant_inform_parent = 0;
	static int check_pcmk = 0;
	static int start_mode = 0;
	static char* pidfile = NULL;

	static void open_any_device(void);
	static int check_timeout_inconsistent(struct sector_header_s *hdr);

	int quorum_write(int good_servants)
	{
	return (good_servants > servant_count/2);
	}

	int quorum_read(int good_servants)
	{
	if (servant_count >= 3)
	return (good_servants > servant_count/2);
	else
	return (good_servants >= 1);
	}
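
Review note: quorum_write() and quorum_read() above differ only for one or two devices. Writes always need a strict majority, while reads accept a single healthy device when fewer than three are configured. The sketch below restates both rules with the device count passed as a parameter instead of the file-level `servant_count`; the `demo_` names and the extra parameter are assumptions for illustration only.

```c
/* Strict majority of devices, as in quorum_write(). */
int demo_quorum_write(int good_servants, int total)
{
	return good_servants > total / 2;
}

/* Majority with three or more devices, otherwise any one device,
 * as in quorum_read(). */
int demo_quorum_read(int good_servants, int total)
{
	if (total >= 3)
		return good_servants > total / 2;
	return good_servants >= 1;
}
```

With two devices and one of them healthy, the read rule is satisfied but the write rule is not; with three devices both rules require two healthy devices.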

	int assign_servant(const char* devname, functionp_t functionp, const void* argp)
	{
	pid_t pid = 0;
	int rc = 0;

	pid = fork();
	if (pid == 0) { /* child */
	maximize_priority();
	rc = (*functionp)(devname, argp);
	if (rc == -1)
	exit(1);
	else
	exit(0);
	} else if (pid != -1) { /* parent */
	return pid;
	} else {
	cl_log(LOG_ERR,"Failed to fork servant");
	exit(1);
	}
	}

	int init_devices()
	{
	int rc = 0;
	struct sbd_context *st;
	struct servants_list_item *s;

	for (s = servants_leader; s; s = s->next) {
	fprintf(stdout, "Initializing device %s\n",
	s->devname);
	st = open_device(s->devname, LOG_ERR);
	if (!st) {
	return -1;
	}
	rc = init_device(st);
	close_device(st);
	if (rc == -1) {
	fprintf(stderr, "Failed to init device %s\n", s->devname);
	return rc;
	}
	fprintf(stdout, "Device %s is initialized.\n", s->devname);
	}
	return 0;
	}

	int slot_msg_wrapper(const char* devname, const void* argp)
	{
	int rc = 0;
	struct sbd_context *st;
	const struct slot_msg_arg_t* arg = (const struct slot_msg_arg_t*)argp;

	st = open_device(devname, LOG_WARNING);
	if (!st)
	return -1;
	cl_log(LOG_INFO, "Delivery process handling %s",
	devname);
	rc = slot_msg(st, arg->name, arg->msg);
	close_device(st);
	return rc;
	}

	int slot_ping_wrapper(const char* devname, const void* argp)
	{
	int rc = 0;
	const char* name = (const char*)argp;
	struct sbd_context *st;

	st = open_device(devname, LOG_WARNING);
	if (!st)
	return -1;
	rc = slot_ping(st, name);
	close_device(st);
	return rc;
	}

	int allocate_slots(const char *name)
	{
	int rc = 0;
	struct sbd_context *st;
	struct servants_list_item *s;

	for (s = servants_leader; s; s = s->next) {
	fprintf(stdout, "Trying to allocate slot for %s on device %s.\n",
	name,
	s->devname);
	st = open_device(s->devname, LOG_WARNING);
	if (!st) {
	return -1;
	}
	rc = slot_allocate(st, name);
	close_device(st);
	if (rc < 0)
	return rc;
	fprintf(stdout, "Slot for %s has been allocated on %s.\n",
	name,
	s->devname);
	}
	return 0;
	}

	int list_slots()
	{
	int rc = 0;
	struct servants_list_item *s;
	struct sbd_context *st;

	for (s = servants_leader; s; s = s->next) {
	st = open_device(s->devname, LOG_WARNING);
	if (!st) {
	fprintf(stdout, "== disk %s unreadable!\n", s->devname);
	continue;
	}
	rc = slot_list(st);
	close_device(st);
	if (rc == -1) {
	fprintf(stdout, "== Slots on disk %s NOT dumped\n", s->devname);
	}
	}
	return 0;
	}

	int ping_via_slots(const char *name)
	{
	int sig = 0;
	pid_t pid = 0;
	int status = 0;
	int servants_finished = 0;
	sigset_t procmask;
	siginfo_t sinfo;
	struct servants_list_item *s;

	sigemptyset(&procmask);
	sigaddset(&procmask, SIGCHLD);
	sigprocmask(SIG_BLOCK, &procmask, NULL);

	for (s = servants_leader; s; s = s->next) {
	s->pid = assign_servant(s->devname, &slot_ping_wrapper, (const void*)name);
	}

	while (servants_finished < servant_count) {
	sig = sigwaitinfo(&procmask, &sinfo);
	if (sig == SIGCHLD) {
	while ((pid = wait(&status))) {
	if (pid == -1 && errno == ECHILD) {
	break;
	} else {
	s = lookup_servant_by_pid(pid);
	if (s) {
	servants_finished++;
	}
	}
	}
	}
	}
	return 0;
	}

	/* This is a bit hackish, but the easiest way to rewire all process
	* exits to send the desired signal to the parent. */
	void servant_exit(void)
	{
	pid_t ppid;
	union sigval signal_value;

	ppid = getppid();
	if (servant_inform_parent) {
	memset(&signal_value, 0, sizeof(signal_value));
	sigqueue(ppid, SIG_IO_FAIL, signal_value);
	}
	}

	int servant(const char *diskname, const void *argp)
	{
	struct sector_mbox_s *s_mbox = NULL;
	struct sector_node_s *s_node = NULL;
	struct sector_header_s *s_header = NULL;
	int mbox;
	int rc = 0;
	time_t t0, t1, latency;
	union sigval signal_value;
	sigset_t servant_masks;
	struct sbd_context *st;
	pid_t ppid;
	char uuid[37];
	const struct servants_list_item *s = argp;

	if (!diskname) {
	cl_log(LOG_ERR, "Empty disk name.");
	return -1;
	}

	cl_log(LOG_INFO, "Servant starting for device %s", diskname);

	/* Block most of the signals */
	sigfillset(&servant_masks);
	sigdelset(&servant_masks, SIGKILL);
	sigdelset(&servant_masks, SIGFPE);
	sigdelset(&servant_masks, SIGILL);
	sigdelset(&servant_masks, SIGSEGV);
	sigdelset(&servant_masks, SIGBUS);
	sigdelset(&servant_masks, SIGALRM);
	/* FIXME: check error */
	sigprocmask(SIG_SETMASK, &servant_masks, NULL);

	atexit(servant_exit);
	servant_inform_parent = 1;

	st = open_device(diskname, LOG_WARNING);
	if (!st) {
	return -1;
	}

	s_header = header_get(st);
	if (!s_header) {
	cl_log(LOG_ERR, "Not a valid header on %s", diskname);
	return -1;
	}

	if (check_timeout_inconsistent(s_header) < 0) {
	cl_log(LOG_ERR, "Timeouts on %s do not match first device",
	diskname);
	return -1;
	}

	if (s_header->minor_version > 0) {
	uuid_unparse_lower(s_header->uuid, uuid);
	cl_log(LOG_INFO, "Device %s uuid: %s", diskname, uuid);
	}

	mbox = slot_allocate(st, local_uname);
	if (mbox < 0) {
	cl_log(LOG_ERR,
	"No slot allocated, and automatic allocation failed for disk %s.",
	diskname);
	rc = -1;
	goto out;
	}
	s_node = sector_alloc();
	if (slot_read(st, mbox, s_node) < 0) {
	cl_log(LOG_ERR, "Unable to read node entry on %s",
	diskname);
	exit(1);
	}

	DBGLOG(LOG_INFO, "Monitoring slot %d on disk %s", mbox, diskname);
	if (s_header->minor_version == 0) {
	set_proc_title("sbd: watcher: %s - slot: %d", diskname, mbox);
	} else {
	set_proc_title("sbd: watcher: %s - slot: %d - uuid: %s",
	diskname, mbox, uuid);
	}

	s_mbox = sector_alloc();
	if (s->first_start) {
	if (start_mode > 0) {
	if (mbox_read(st, mbox, s_mbox) < 0) {
	cl_log(LOG_ERR, "mbox read failed during start-up in servant.");
	rc = -1;
	goto out;
	}
	if (s_mbox->cmd != SBD_MSG_EXIT &&
	s_mbox->cmd != SBD_MSG_EMPTY) {
	/* Not a clean stop. Abort start-up */
	cl_log(LOG_WARNING, "Found fencing message - aborting start-up. Manual intervention required!");
	ppid = getppid();
	sigqueue(ppid, SIG_EXITREQ, signal_value);
	rc = 0;
	goto out;
	}
	}
	DBGLOG(LOG_INFO, "First servant start - zeroing inbox");
	memset(s_mbox, 0, sizeof(*s_mbox));
	if (mbox_write(st, mbox, s_mbox) < 0) {
	rc = -1;
	goto out;
	}
	}

	memset(&signal_value, 0, sizeof(signal_value));

	while (1) {
	struct sector_header_s *s_header_retry = NULL;
	struct sector_node_s *s_node_retry = NULL;

	t0 = time(NULL);
	sleep(timeout_loop);

	ppid = getppid();

	if (ppid == 1) {
	/* Our parent died unexpectedly. Triggering
	* self-fence. */
	do_reset();
	}

	/* These attempts are, by definition, somewhat racy. If
	* the device is wiped out or corrupted between here and
	* us reading our mbox, there is nothing we can do about
	* that. But at least we tried. */
	s_header_retry = header_get(st);
	if (!s_header_retry) {
	cl_log(LOG_ERR, "No longer found a valid header on %s", diskname);
	exit(1);
	}
	if (memcmp(s_header, s_header_retry, sizeof(*s_header)) != 0) {
	cl_log(LOG_ERR, "Header on %s changed since start-up!", diskname);
	exit(1);
	}
	free(s_header_retry);

	s_node_retry = sector_alloc();
	if (slot_read(st, mbox, s_node_retry) < 0) {
	cl_log(LOG_ERR, "slot read failed in servant.");
	exit(1);
	}
	if (memcmp(s_node, s_node_retry, sizeof(*s_node)) != 0) {
	cl_log(LOG_ERR, "Node entry on %s changed since start-up!", diskname);
	exit(1);
	}
	free(s_node_retry);

	if (mbox_read(st, mbox, s_mbox) < 0) {
	cl_log(LOG_ERR, "mbox read failed in servant.");
	exit(1);
	}

	if (s_mbox->cmd > 0) {
	cl_log(LOG_INFO,
	"Received command %s from %s on disk %s",
	char2cmd(s_mbox->cmd), s_mbox->from, diskname);

	switch (s_mbox->cmd) {
	case SBD_MSG_TEST:
	memset(s_mbox, 0, sizeof(*s_mbox));
	mbox_write(st, mbox, s_mbox);
	sigqueue(ppid, SIG_TEST, signal_value);
	break;
	case SBD_MSG_RESET:
	do_reset();
	break;
	case SBD_MSG_OFF:
	do_off();
	break;
	case SBD_MSG_EXIT:
	sigqueue(ppid, SIG_EXITREQ, signal_value);
	break;
	case SBD_MSG_CRASHDUMP:
	do_crashdump();
	break;
	default:
	/* FIXME:
	An "unknown" message might result
	from a partial write.
	log it and clear the slot.
	*/
	cl_log(LOG_ERR, "Unknown message on disk %s",
	diskname);
	memset(s_mbox, 0, sizeof(*s_mbox));
	mbox_write(st, mbox, s_mbox);
	break;
	}
	}
	sigqueue(ppid, SIG_LIVENESS, signal_value);

	t1 = time(NULL);
	latency = t1 - t0;
	if (timeout_watchdog_warn && (latency > timeout_watchdog_warn)) {
	cl_log(LOG_WARNING,
	"Latency: %d exceeded threshold %d on disk %s",
	(int)latency, (int)timeout_watchdog_warn,
	diskname);
	} else if (debug) {
	DBGLOG(LOG_INFO, "Latency: %d on disk %s", (int)latency,
	diskname);
	}
	}
	out:
	free(s_mbox);
	close_device(st);
	if (rc == 0) {
	servant_inform_parent = 0;
	}
	return rc;
	}

	void recruit_servant(const char *devname, pid_t pid)
	{
	struct servants_list_item *s = servants_leader;
	struct servants_list_item *newbie;

	newbie = malloc(sizeof(*newbie));
	if (!newbie) {
	fprintf(stderr, "malloc failed in recruit_servant.\n");
	exit(1);
	}
	memset(newbie, 0, sizeof(*newbie));
	newbie->devname = strdup(devname);
	newbie->pid = pid;
	newbie->first_start = 1;

	if (!s) {
	servants_leader = newbie;
	} else {
	while (s->next)
	s = s->next;
	s->next = newbie;
	}

	servant_count++;
	}

	struct servants_list_item *lookup_servant_by_dev(const char *devname)
	{
	struct servants_list_item *s;

	for (s = servants_leader; s; s = s->next) {
	if (strncasecmp(s->devname, devname, strlen(s->devname)) == 0)
	break;
	}
	return s;
	}

	struct servants_list_item *lookup_servant_by_pid(pid_t pid)
	{
	struct servants_list_item *s;

	for (s = servants_leader; s; s = s->next) {
	if (s->pid == pid)
	break;
	}
	return s;
	}

	int check_all_dead(void)
	{
	struct servants_list_item *s;
	int r = 0;
	union sigval svalue;

	for (s = servants_leader; s; s = s->next) {
	if (s->pid != 0) {
	r = sigqueue(s->pid, 0, svalue);
	if (r == -1 && errno == ESRCH)
	continue;
	return 0;
	}
	}
	return 1;
	}


	void servant_start(struct servants_list_item *s)
	{
	int r = 0;
	union sigval svalue;

	if (s->pid != 0) {
	r = sigqueue(s->pid, 0, svalue);
	if ((r != -1 || errno != ESRCH))
	return;
	}
	s->restarts++;
	if (strcmp("pcmk",s->devname) == 0) {
	DBGLOG(LOG_INFO, "Starting Pacemaker servant");
	s->pid = assign_servant(s->devname, servant_pcmk, NULL);
	} else {
	DBGLOG(LOG_INFO, "Starting servant for device %s",
	s->devname);
	s->pid = assign_servant(s->devname, servant, s);
	}

	clock_gettime(CLOCK_MONOTONIC, &s->t_started);
	return;
	}

	void servants_start(void)
	{
	struct servants_list_item *s;

	for (s = servants_leader; s; s = s->next) {
	s->restarts = 0;
	servant_start(s);
	}
	}

	void servants_kill(void)
	{
	struct servants_list_item *s;
	union sigval svalue;

	for (s = servants_leader; s; s = s->next) {
	if (s->pid != 0)
	sigqueue(s->pid, SIGKILL, svalue);
	}
	}

	void open_any_device(void)
	{
	struct sector_header_s *hdr_cur = NULL;
	struct timespec t_0;
	int t_wait = 0;

	clock_gettime(CLOCK_MONOTONIC, &t_0);

	while (!hdr_cur && t_wait < timeout_startup) {
	struct timespec t_now;
	struct servants_list_item* s;

	for (s = servants_leader; s; s = s->next) {
	struct sbd_context *st = open_device(s->devname, LOG_DEBUG);
	if (!st)
	continue;
	hdr_cur = header_get(st);
	close_device(st);
	if (hdr_cur)
	break;
	}
	clock_gettime(CLOCK_MONOTONIC, &t_now);
	t_wait = t_now.tv_sec - t_0.tv_sec;
	if (!hdr_cur) {
	sleep(timeout_loop);
	}
	}

	if (hdr_cur) {
	timeout_watchdog = hdr_cur->timeout_watchdog;
	timeout_allocate = hdr_cur->timeout_allocate;
	timeout_loop = hdr_cur->timeout_loop;
	timeout_msgwait = hdr_cur->timeout_msgwait;
	} else {
	cl_log(LOG_ERR, "No devices were available at start-up within %i seconds.",
	timeout_startup);
	exit(1);
	}

	free(hdr_cur);
	return;
	}

	int check_timeout_inconsistent(struct sector_header_s *hdr)
	{
	if (timeout_watchdog != hdr->timeout_watchdog) {
	cl_log(LOG_WARNING, "watchdog timeout: %d versus %d on this device",
	(int)timeout_watchdog, (int)hdr->timeout_watchdog);
	return -1;
	}
	if (timeout_allocate != hdr->timeout_allocate) {
	cl_log(LOG_WARNING, "allocate timeout: %d versus %d on this device",
	(int)timeout_allocate, (int)hdr->timeout_allocate);
	return -1;
	}
	if (timeout_loop != hdr->timeout_loop) {
	cl_log(LOG_WARNING, "loop timeout: %d versus %d on this device",
	(int)timeout_loop, (int)hdr->timeout_loop);
	return -1;
	}
	if (timeout_msgwait != hdr->timeout_msgwait) {
	cl_log(LOG_WARNING, "msgwait timeout: %d versus %d on this device",
	(int)timeout_msgwait, (int)hdr->timeout_msgwait);
	return -1;
	}
	return 0;
	}

	inline void cleanup_servant_by_pid(pid_t pid)
	{
	struct servants_list_item* s;

	s = lookup_servant_by_pid(pid);
	if (s) {
	cl_log(LOG_WARNING, "Servant for %s (pid: %i) has terminated",
	s->devname, s->pid);
	s->pid = 0;
	} else {
	/* This most likely is a stray signal from somewhere, or
	* a SIGCHLD for a process that has previously
	* explicitly disconnected. */
	DBGLOG(LOG_INFO, "cleanup_servant: Nothing known about pid %i",
	pid);
	}
	}

	int inquisitor_decouple(void)
	{
	pid_t ppid = getppid();
	union sigval signal_value;

	/* During start-up, we only arm the watchdog once we've got
	* quorum at least once. */
	if (watchdog_use) {
	if (watchdog_init() < 0) {
	return -1;
	}
	}

	if (ppid > 1) {
	sigqueue(ppid, SIG_LIVENESS, signal_value);
	}
	return 0;
	}

	void inquisitor_child(void)
	{
	int sig, pid;
	sigset_t procmask;
	siginfo_t sinfo;
	int status;
	struct timespec timeout;
	int exiting = 0;
	int decoupled = 0;
	int pcmk_healthy = 0;
	int pcmk_override = 0;
	time_t latency;
	struct timespec t_last_tickle, t_now;
	struct servants_list_item* s;

	if (debug_mode) {
	cl_log(LOG_ERR, "DEBUG MODE IS ACTIVE - DO NOT RUN IN PRODUCTION!");
	}

	set_proc_title("sbd: inquisitor");

	if (pidfile) {
	if (cl_lock_pidfile(pidfile) < 0) {
	exit(1);
	}
	}

	sigemptyset(&procmask);
	sigaddset(&procmask, SIGCHLD);
	sigaddset(&procmask, SIG_LIVENESS);
	sigaddset(&procmask, SIG_EXITREQ);
	sigaddset(&procmask, SIG_TEST);
	sigaddset(&procmask, SIG_IO_FAIL);
	sigaddset(&procmask, SIG_PCMK_UNHEALTHY);
	sigaddset(&procmask, SIG_RESTART);
	sigaddset(&procmask, SIGUSR1);
	sigaddset(&procmask, SIGUSR2);
	sigprocmask(SIG_BLOCK, &procmask, NULL);

	/* We only want this to have an effect during watch right now;
	* pinging and fencing would be too confused */
	if (check_pcmk) {
	recruit_servant("pcmk", 0);
	servant_count--;
	}

	servants_start();

	timeout.tv_sec = timeout_loop;
	timeout.tv_nsec = 0;
	clock_gettime(CLOCK_MONOTONIC, &t_last_tickle);

	while (1) {
	int good_servants = 0;

	sig = sigtimedwait(&procmask, &sinfo, &timeout);

	clock_gettime(CLOCK_MONOTONIC, &t_now);

	if (sig == SIG_EXITREQ) {
	servants_kill();
	watchdog_close();
	exiting = 1;
	} else if (sig == SIGCHLD) {
	while ((pid = waitpid(-1, &status, WNOHANG))) {
	if (pid == -1 && errno == ECHILD) {
	break;
	} else {
	cleanup_servant_by_pid(pid);
	}
	}
	} else if (sig == SIG_PCMK_UNHEALTHY) {
	s = lookup_servant_by_pid(sinfo.si_pid);
	if (s && strcmp(s->devname, "pcmk") == 0) {
	if (pcmk_healthy != 0) {
	cl_log(LOG_WARNING, "Pacemaker health check: UNHEALTHY");
	}
	pcmk_healthy = 0;
	clock_gettime(CLOCK_MONOTONIC, &s->t_last);
	} else {
	cl_log(LOG_WARNING, "Ignoring SIG_PCMK_UNHEALTHY from unknown source");
	}
	} else if (sig == SIG_IO_FAIL) {
	s = lookup_servant_by_pid(sinfo.si_pid);
	if (s) {
	DBGLOG(LOG_INFO, "Servant for %s requests to be disowned",
	s->devname);
	cleanup_servant_by_pid(sinfo.si_pid);
	}
	} else if (sig == SIG_LIVENESS) {
	s = lookup_servant_by_pid(sinfo.si_pid);
	if (s) {
	if (strcmp(s->devname, "pcmk") == 0) {
	if (pcmk_healthy != 1) {
	cl_log(LOG_INFO, "Pacemaker health check: OK");
	}
	pcmk_healthy = 1;
	};
	s->first_start = 0;
	clock_gettime(CLOCK_MONOTONIC, &s->t_last);
	}
	} else if (sig == SIG_TEST) {
	} else if (sig == SIGUSR1) {
	if (exiting)
	continue;
	servants_start();
	}

	if (exiting) {
	if (check_all_dead()) {
	if (pidfile) {
	cl_unlock_pidfile(pidfile);
	}
	exit(0);
	} else
	continue;
	}

	good_servants = 0;
	for (s = servants_leader; s; s = s->next) {
	int age = t_now.tv_sec - s->t_last.tv_sec;

	if (!s->t_last.tv_sec)
	continue;

	if (age < (int)(timeout_io+timeout_loop)) {
	if (strcmp(s->devname, "pcmk") != 0) {
	good_servants++;
	}
	s->outdated = 0;
	} else if (!s->outdated) {
	if (strcmp(s->devname, "pcmk") == 0) {
	/* If the state is outdated, we
	* override the last reported
	* state */
	pcmk_healthy = 0;
	cl_log(LOG_WARNING, "Pacemaker state outdated (age: %d)",
	age);
	} else if (!s->restart_blocked) {
	cl_log(LOG_WARNING, "Servant for %s outdated (age: %d)",
	s->devname, age);
	}
	s->outdated = 1;
	}
	}

	if (quorum_read(good_servants) || pcmk_healthy) {
	if (!decoupled) {
	if (inquisitor_decouple() < 0) {
	servants_kill();
	exiting = 1;
	continue;
	} else {
	decoupled = 1;
	}
	}

	if (!quorum_read(good_servants)) {
	if (!pcmk_override) {
	cl_log(LOG_WARNING, "Majority of devices lost - surviving on pacemaker");
	pcmk_override = 1; /* Just to ensure the message is only logged once */
	}
	} else {
	pcmk_override = 0;
	}

	watchdog_tickle();
	clock_gettime(CLOCK_MONOTONIC, &t_last_tickle);
	}

	/* Note that this can actually be negative, since we set
	* last_tickle after we set now. */
	latency = t_now.tv_sec - t_last_tickle.tv_sec;
	if (timeout_watchdog && (latency > (int)timeout_watchdog)) {
	if (!decoupled) {
	/* We're still being watched by our
	* parent. We don't fence, but exit. */
	cl_log(LOG_ERR, "SBD: Not enough votes to proceed. Aborting start-up.");
	servants_kill();
	exiting = 1;
	continue;
	}
	if (debug_mode < 2) {
	/* At level 2 or above, we do nothing, but expect
	* things to eventually return to
	* normal. */
	do_reset();
	} else {
	cl_log(LOG_ERR, "SBD: DEBUG MODE: Would have fenced due to timeout!");
	}
	}
	if (timeout_watchdog_warn && (latency > (int)timeout_watchdog_warn)) {
	cl_log(LOG_WARNING,
	"Latency: No liveness for %d s exceeds threshold of %d s (healthy servants: %d)",
	(int)latency, (int)timeout_watchdog_warn, good_servants);
	}

	for (s = servants_leader; s; s = s->next) {
	int age = t_now.tv_sec - s->t_started.tv_sec;

	if (age > servant_restart_interval) {
	s->restarts = 0;
	s->restart_blocked = 0;
	}

	if (servant_restart_count
	&& (s->restarts >= servant_restart_count)
	&& !s->restart_blocked) {
	if (servant_restart_count > 1) {
	cl_log(LOG_WARNING, "Max retry count (%d) reached: not restarting servant for %s",
	(int)servant_restart_count, s->devname);
	}
	s->restart_blocked = 1;
	}

	if (!s->restart_blocked) {
	servant_start(s);
	}
	}
	}
	/* not reached */
	exit(0);
	}

	int inquisitor(void)
	{
	int sig, pid, inquisitor_pid;
	int status;
	sigset_t procmask;
	siginfo_t sinfo;

	/* Where's the best place for sysrq init ?*/
	sysrq_init();

	sigemptyset(&procmask);
	sigaddset(&procmask, SIGCHLD);
	sigaddset(&procmask, SIG_LIVENESS);
	sigprocmask(SIG_BLOCK, &procmask, NULL);

	open_any_device();

	inquisitor_pid = make_daemon();
	if (inquisitor_pid == 0) {
	inquisitor_child();
	}

	/* We're the parent. Wait for a happy signal from our child
	* before we proceed - we either get "SIG_LIVENESS" when the
	* inquisitor has completed the first successful round, or
	* ECHLD when it exits with an error. */

	while (1) {
	sig = sigwaitinfo(&procmask, &sinfo);
	if (sig == SIGCHLD) {
	while ((pid = waitpid(-1, &status, WNOHANG))) {
	if (pid == -1 && errno == ECHILD) {
	break;
	}
	/* We got here because the inquisitor
	* did not succeed. */
	return -1;
	}
	} else if (sig == SIG_LIVENESS) {
	/* Inquisitor started up properly. */
	return 0;
	} else {
	fprintf(stderr, "Nobody expected the spanish inquisition!\n");
	continue;
	}
	}
	/* not reached */
	return -1;
	}

	int messenger(const char *name, const char *msg)
	{
	int sig = 0;
	pid_t pid = 0;
	int status = 0;
	int servants_finished = 0;
	int successful_delivery = 0;
	sigset_t procmask;
	siginfo_t sinfo;
	struct servants_list_item *s;
	struct slot_msg_arg_t slot_msg_arg = {name, msg};

	sigemptyset(&procmask);
	sigaddset(&procmask, SIGCHLD);
	sigprocmask(SIG_BLOCK, &procmask, NULL);

	for (s = servants_leader; s; s = s->next) {
	s->pid = assign_servant(s->devname, &slot_msg_wrapper, &slot_msg_arg);
	}

	while (!(quorum_write(successful_delivery) ||
	(servants_finished == servant_count))) {
	sig = sigwaitinfo(&procmask, &sinfo);
	if (sig == SIGCHLD) {
	while ((pid = waitpid(-1, &status, WNOHANG))) {
	if (pid == -1 && errno == ECHILD) {
	break;
	} else {
	servants_finished++;
	if (WIFEXITED(status)
	&& WEXITSTATUS(status) == 0) {
	DBGLOG(LOG_INFO, "Process %d succeeded.",
	(int)pid);
	successful_delivery++;
	} else {
	cl_log(LOG_WARNING, "Process %d failed to deliver!",
	(int)pid);
	}
	}
	}
	}
	}
	if (quorum_write(successful_delivery)) {
	cl_log(LOG_INFO, "Message successfully delivered.");
	return 0;
	} else {
	cl_log(LOG_ERR, "Message is not delivered via more then a half of devices");
	return -1;
	}
	}

	int dump_headers(void)
	{
	int rc = 0;
	struct servants_list_item *s = servants_leader;
	struct sbd_context *st;

	for (s = servants_leader; s; s = s->next) {
	fprintf(stdout, "==Dumping header on disk %s\n", s->devname);
	st = open_device(s->devname, LOG_WARNING);
	if (!st) {
	fprintf(stdout, "== disk %s unreadable!\n", s->devname);
	continue;
	}

	rc = header_dump(st);
	close_device(st);

	if (rc == -1) {
	fprintf(stdout, "==Header on disk %s NOT dumped\n", s->devname);
	} else {
	fprintf(stdout, "==Header on disk %s is dumped\n", s->devname);
	}
	}
	return rc;
	}

	int main(int argc, char **argv, char **envp)
	{
	int exit_status = 0;
	int c;
	+ int w = 0;

	if ((cmdname = strrchr(argv[0], '/')) == NULL) {
	cmdname = argv[0];
	} else {
	++cmdname;
	}

	cl_log_set_entity(cmdname);
	cl_log_enable_stderr(0);
	cl_log_set_facility(LOG_DAEMON);

	sbd_get_uname();

	while ((c = getopt(argc, argv, "C:DPRTWZhvw:d:n:p:1:2:3:4:5:t:I:F:S:s:")) != -1) {
	switch (c) {
	case 'D':
	break;
	case 'Z':
	debug_mode++;
	cl_log(LOG_INFO, "Debug mode now at level %d", (int)debug_mode);
	break;
	case 'R':
	skip_rt = 1;
	cl_log(LOG_INFO, "Realtime mode deactivated.");
	break;
	case 'S':
	start_mode = atoi(optarg);
	cl_log(LOG_INFO, "Start mode set to: %d", (int)start_mode);
	break;
	case 's':
	timeout_startup = atoi(optarg);
	cl_log(LOG_INFO, "Start timeout set to: %d", (int)timeout_startup);
	break;
	case 'v':
	debug = 1;
	cl_log(LOG_INFO, "Verbose mode enabled.");
	break;
	case 'T':
	watchdog_set_timeout = 0;
	cl_log(LOG_INFO, "Setting watchdog timeout disabled; using defaults.");
	break;
	case 'W':
	- watchdog_use = 1;
	- cl_log(LOG_INFO, "Watchdog enabled.");
	+ w++;
	break;
	case 'w':
	watchdogdev = strdup(optarg);
	break;
	case 'd':
	recruit_servant(optarg, 0);
	break;
	case 'P':
	check_pcmk = 1;
	break;
	case 'n':
	local_uname = strdup(optarg);
	cl_log(LOG_INFO, "Overriding local hostname to %s", local_uname);
	break;
	case 'p':
	pidfile = strdup(optarg);
	cl_log(LOG_INFO, "pidfile set to %s", pidfile);
	break;
	case 'C':
	timeout_watchdog_crashdump = atoi(optarg);
	cl_log(LOG_INFO, "Setting crashdump watchdog timeout to %d",
	(int)timeout_watchdog_crashdump);
	break;
	case '1':
	timeout_watchdog = atoi(optarg);
	break;
	case '2':
	timeout_allocate = atoi(optarg);
	break;
	case '3':
	timeout_loop = atoi(optarg);
	break;
	case '4':
	timeout_msgwait = atoi(optarg);
	break;
	case '5':
	timeout_watchdog_warn = atoi(optarg);
	cl_log(LOG_INFO, "Setting latency warning to %d",
	(int)timeout_watchdog_warn);
	break;
	case 't':
	servant_restart_interval = atoi(optarg);
	cl_log(LOG_INFO, "Setting servant restart interval to %d",
	(int)servant_restart_interval);
	break;
	case 'I':
	timeout_io = atoi(optarg);
	cl_log(LOG_INFO, "Setting IO timeout to %d",
	(int)timeout_io);
	break;
	case 'F':
	servant_restart_count = atoi(optarg);
	cl_log(LOG_INFO, "Servant restart count set to %d",
	(int)servant_restart_count);
	break;
	case 'h':
	usage();
	return (0);
	default:
	exit_status = -2;
	goto out;
	break;
	}
	}
	-
	+
	+ if (w > 0) {
	+ watchdog_use = w % 2;
	+ }
	+
	+ if (watchdog_use) {
	+ cl_log(LOG_INFO, "Watchdog enabled.");
	+ } else {
	+ cl_log(LOG_INFO, "Watchdog disabled.");
	+ }
	+
	if (servant_count < 1 || servant_count > 3) {
	fprintf(stderr, "You must specify 1 to 3 devices via the -d option.\n");
	exit_status = -1;
	goto out;
	}

	/* There must at least be one command following the options: */
	if ((argc - optind) < 1) {
	fprintf(stderr, "Not enough arguments.\n");
	exit_status = -2;
	goto out;
	}

	if (init_set_proc_title(argc, argv, envp) < 0) {
	fprintf(stderr, "Allocation of proc title failed.\n");
	exit_status = -1;
	goto out;
	}

	maximize_priority();

	if (strcmp(argv[optind], "create") == 0) {
	exit_status = init_devices();
	} else if (strcmp(argv[optind], "dump") == 0) {
	exit_status = dump_headers();
	} else if (strcmp(argv[optind], "allocate") == 0) {
	exit_status = allocate_slots(argv[optind + 1]);
	} else if (strcmp(argv[optind], "list") == 0) {
	exit_status = list_slots();
	} else if (strcmp(argv[optind], "message") == 0) {
	exit_status = messenger(argv[optind + 1], argv[optind + 2]);
	} else if (strcmp(argv[optind], "ping") == 0) {
	exit_status = ping_via_slots(argv[optind + 1]);
	} else if (strcmp(argv[optind], "watch") == 0) {
	exit_status = inquisitor();
	} else {
	exit_status = -2;
	}

	out:
	if (exit_status < 0) {
	if (exit_status == -2) {
	usage();
	} else {
	fprintf(stderr, "sbd failed; please check the logs.\n");
	}
	return (1);
	}
	return (0);
	}
	diff --git a/src/sbd.sh b/src/sbd.sh
	index eb7d81e..343c5ff 100644
	--- a/src/sbd.sh
	+++ b/src/sbd.sh
	@@ -1,99 +1,99 @@
	#!/bin/bash
	#
	# Copyright (C) 2013 Lars Marowsky-Bree <lmb@suse.com>
	#
	# This program is free software; you can redistribute it and/or
	# modify it under the terms of the GNU General Public
	# License as published by the Free Software Foundation; either
	# version 2.1 of the License, or (at your option) any later version.
	#
	# This software is distributed in the hope that it will be useful,
	# but WITHOUT ANY WARRANTY; without even the implied warranty of
	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	# General Public License for more details.
	#
	# You should have received a copy of the GNU General Public
	# License along with this library; if not, write to the Free Software
	# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
	#

	SBD_CONFIG=/etc/sysconfig/sbd
	SBD_BIN="/usr/sbin/sbd"

	test -x $SBD_BIN || exit 1
	test -f $SBD_CONFIG || exit 1

	. $SBD_CONFIG

	unset LC_ALL; export LC_ALL
	unset LANGUAGE; export LANGUAGE

	: ${OCF_ROOT:=/usr/lib/ocf}
	: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
	. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

	# Construct commandline for some common options
	if [ -z "$SBD_DEVICE" ]; then
	echo "No sbd devices defined"
	exit 1
	fi
	SBD_DEVS=${SBD_DEVICE%;}
	SBD_DEVICE=${SBD_DEVS//;/ -d }

	: ${SBD_PIDFILE:=/var/run/sbd.pid}
	SBD_OPTS+=" -p $SBD_PIDFILE"
	: ${SBD_PACEMAKER:="true"}
	if ocf_is_true "$SBD_PACEMAKER" ; then
	SBD_OPTS+=" -P"
	fi
	: ${SBD_WATCHDOG:="true"}
	-if ocf_is_true "$SBD_WATCHDOG" ; then
	- SBD_OPTS+=" -W"
	+if ! ocf_is_true "$SBD_WATCHDOG" ; then
	+ SBD_OPTS+=" -W -W"
	fi
	if [ -n "$SBD_WATCHDOG_DEV" ]; then
	- SBD_OPTS+="-w $SBD_WATCHDOG_DEV"
	+ SBD_OPTS+=" -w $SBD_WATCHDOG_DEV"
	fi
	: ${SBD_STARTMODE:="always"}
	case "$SBD_STARTMODE" in
	always) SBD_OPTS+=" -S 0" ;;
	clean) SBD_OPTS+=" -S 1" ;;
	esac

	start() {
	if ! pidofproc -p $SBD_PIDFILE $SBD_BIN >/dev/null 2>&1 ; then
	if ! $SBD_BIN -d $SBD_DEVICE $SBD_OPTS watch ; then
	echo "SBD failed to start; aborting."
	exit 1
	fi
	else
	return 0
	fi
	}

	stop() {
	if ! $SBD_BIN -d $SBD_DEVICE -D $SBD_OPTS message LOCAL exit ; then
	echo "SBD failed to stop; aborting."
	exit 1
	fi
	while pidofproc -p $SBD_PIDFILE $SBD_BIN >/dev/null 2>&1 ; do
	sleep 1
	done
	}

	case "$1" in
	start|stop)
	$1 ;;
	*)
	echo "Usage: $0 (start|stop)"
	exit 1
	;;
	esac

	# TODO:
	# - Make openais init script call out to this script too
	# - How to handle the former "force-start" option?
	# force-start)
	# SBD_OPTS="$SBD_OPTS -S 0"
	# start
	# ;;
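
The `-W` handling changed by this patch counts how often the flag is given and keeps only the parity (`w % 2`), leaving the previous value untouched when the flag is absent. A minimal sketch of that rule follows. `resolve_watchdog_use` is a hypothetical helper name that does not exist in sbd, and the default value of `watchdog_use` is passed in as a parameter because it is not shown in this part of the diff.

```c
/* Hypothetical helper mirroring the patched logic in main().
 *
 * w       - number of times -W appeared on the command line
 * current - value watchdog_use had before option parsing
 *           (the real default is an assumption; it is not visible here)
 *
 * Returns the resulting watchdog_use value. */
static int resolve_watchdog_use(int w, int current)
{
	if (w > 0) {
		/* An odd count enables the watchdog, an even count disables it. */
		return w % 2;
	}
	/* -W was never given: keep whatever was set before. */
	return current;
}
```

Under this rule a single `-W` enables the watchdog, while the `-W -W` that `sbd.sh` now passes when `SBD_WATCHDOG` is not true disables it, regardless of the starting value.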
