.TH "QDisk" "5" "20 Feb 2007" "" "Cluster Quorum Disk"
.SH "NAME"
qdisk \- a disk-based quorum daemon for CMAN / Linux-Cluster
.SH "1. Overview"
.SH "1.1 Problem"
In some situations, it may be necessary or desirable to sustain
a majority node failure of a cluster without introducing the need for
asymmetric cluster configurations (e.g. client-server, or heavily-weighted
voting nodes).
.SH "1.2. Design Requirements"
* Ability to sustain 1..(n-1)/n simultaneous node failures, without the
danger of a simple network partition causing a split brain. That is, we
need to be able to ensure that the majority failure case is not merely
the result of a network partition.
* Ability to use external reasons for deciding which partition is the
quorate partition in a partitioned cluster. For example, a user may
have a service running on one node, and that node must always be the master
in the event of a network partition. Or, a node might lose all network
connectivity except the cluster communication path - in which case, a
user may wish that node to be evicted from the cluster.
* Integration with CMAN. CMAN must be able to operate both with and
without the quorum daemon. Linux-Cluster does not normally require a
quorum disk; introducing new requirements on the basic operation of
Linux-Cluster is not allowed.
* Data integrity. In order to recover from a majority failure, fencing
is required. The fencing subsystem is already provided by Linux-Cluster.
* Non-reliance on hardware or protocol specific methods (i.e. SCSI
reservations). This ensures the quorum disk algorithm can be used on the
widest range of hardware configurations possible.
* Little or no memory allocation after initialization. In critical paths
during failover, we do not want to risk being killed by the Linux OOM
killer because a memory allocation triggered a page fault during a
memory-pressure situation.
.SH "1.3. Hardware Considerations and Requirements"
.SH "1.3.1. Concurrent, Synchronous, Read/Write Access"
This quorum daemon requires a shared block device with concurrent read/write
access from all nodes in the cluster. The shared block device can be
a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI
target, or even GNBD. The quorum daemon uses O_DIRECT to write to the
device.
.SH "1.3.2. Bargain-basement JBODs need not apply"
There is a minimum performance requirement inherent when using disk-based
cluster quorum algorithms, so design your cluster accordingly. Using a
cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause
problems at the first load spike. Plan your loads accordingly; a node's
inability to write to the quorum disk in a timely manner will cause the
cluster to evict the node. Using host-RAID or multi-initiator parallel
SCSI configurations with the qdisk daemon is unlikely to work, and will
probably cause administrators a lot of frustration. That having been
said, because the timeouts are configurable, most hardware should work
if the timeouts are set high enough.
.SH "1.3.3. Fencing is Required"
In order to maintain data integrity under all failure scenarios, use of
this quorum daemon requires adequate fencing, preferably power-based
fencing. Watchdog timers and software-based solutions to reboot the node
internally, while possibly sufficient, are not considered 'fencing' for
the purposes of using the quorum disk.
.SH "1.4. Limitations"
* At this time, this daemon supports a maximum of 16 nodes. This is
primarily a scalability issue: As we increase the node count, we increase
the amount of synchronous I/O contention on the shared quorum disk.
* Cluster node IDs must be statically configured in cluster.conf and
must be numbered from 1..16 (there can be gaps, of course).
* Cluster node votes should be more or less equal.
* CMAN must be running before the qdisk program can operate at full
capacity. If CMAN is not running, qdisk will wait for it.
* CMAN's eviction timeout should be at least 2x the quorum daemon's
to give the quorum daemon adequate time to converge on a master during a
failure + load spike situation (see the worked example after this list).
* For 'all-but-one' failure operation, the total number of votes assigned
to the quorum device should be equal to or greater than the total number
of node-votes in the cluster. While it is possible to assign only one
(or a few) votes to the quorum device, the effects of doing so have not
been explored.
* For 'tiebreaker' operation in a two-node cluster, unset CMAN's two_node
flag (or set it to 0), set CMAN's expected votes to '3', set each node's
vote to '1', and set qdisk's vote count to '1' as well. This will allow
the cluster to operate if either both nodes are online, or if a single
node is online and its heuristics are passing (so the quorum device's vote
is counted); see the example in section 3.3.2.
* Currently, the quorum disk daemon is difficult to use with CLVM if
the quorum disk resides on a CLVM logical volume. CLVM requires a
quorate cluster to correctly operate, which introduces a chicken-and-egg
problem for starting the cluster: CLVM needs quorum, but the quorum daemon
needs CLVM (if and only if the quorum device lies on CLVM-managed storage).
One way to work around this is to *not* set the cluster's expected votes
to include the quorum daemon's votes. Bring all nodes online, and start
the quorum daemon *after* the whole cluster is running. This will allow
the expected votes to increase naturally.
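.br
As a worked example of the eviction-timeout guideline above (using the
values from the examples in section 3.3, and assuming qdiskd defaults
otherwise): with interval="1" and tko="10", the quorum daemon's failure
window is 1 * 10 = 10 seconds, so CMAN's eviction timeout should be at
least 20 seconds. Depending on the CMAN version in use, this is typically
the totem token timeout, which is expressed in milliseconds; consult
cman(5) for the exact parameter.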
.SH "2. Algorithms"
.SH "2.1. Heartbeating & Liveliness Determination"
Nodes update individual status blocks on the quorum disk at a user-
defined rate. Each write of a status block alters the timestamp, which
is what other nodes use to decide whether a node has hung or not. After
a user-defined number of 'misses' (that is, failures to update the
timestamp), a node is declared offline. After a certain number of 'hits'
(changed timestamp + "i am alive" state), the node is declared online.
The status block contains additional information, such as a bitmask of
the nodes that node believes are online. Some of this information is
used by the master, while some is recorded only for performance analysis
and may be used at a later time. The most important pieces of information
a node writes to its status block are:
.in 12
- Timestamp
.br
- Internal state (available / not available)
.br
- Score
.br
- Known max score (may be used in the future to detect invalid configurations)
.br
- Vote/bid messages
.br
- Other nodes it thinks are online
.in 0
.SH "2.2. Scoring & Heuristics"
The administrator can configure up to 10 purely arbitrary heuristics, and
must exercise caution in doing so. At least one administrator-
defined heuristic is required for operation, but it is generally a good
idea to have more than one heuristic. By default, only nodes scoring over
1/2 of the total maximum score will claim they are available via the
quorum disk, and a node (master or otherwise) whose score drops too low
will remove itself (usually, by rebooting).
The heuristics themselves can be any command executable by 'sh -c'. For
example, in early testing the following was used:
.ti 12
<\fBheuristic \fP\fIprogram\fP\fB="\fP[ -f /quorum ]\fB" \fP\fIscore\fP\fB="\fP10\fB" \fP\fIinterval\fP\fB="\fP2\fB"/>\fP
This is a literal sh-ism which tests for the existence of a file called
"/quorum". Without that file, the node would claim it was unavailable.
This is an awful example, and should never, ever be used in production,
but is provided as an illustration of what one could do.
Typically, the heuristics should be snippets of shell code or commands which
help determine a node's usefulness to the cluster or clients. Ideally, you
want to add traces for all of your network paths (e.g. check links, or
ping routers), and methods to detect availability of shared storage.
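For instance, a more realistic heuristic (shown here purely as an
illustration; the address 192.168.1.254 stands in for a router on one of
your own client-facing network paths) might ping that router once per
interval:
.ti 12
<heuristic program="ping 192.168.1.254 -c1 -t1" score="1" interval="2" tko="3"/>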
.SH "2.3. Master Election"
Only one master is present at any one time in the cluster, regardless of
how many partitions exist within the cluster itself. The master is
elected by a simple voting scheme in which the node with the lowest node
ID that believes it is capable of running (i.e. scores high enough) bids
for master status.
If the other nodes agree, it becomes the master. This algorithm is
run whenever no master is present.
If another node comes online with a lower node ID while a node is still
bidding for master status, it will rescind its bid and vote for the lower
node ID. If a master dies or a bidding node dies, the voting algorithm
is started over. The voting algorithm typically takes two passes to
complete.
Master deaths take marginally longer to recover from than non-master
deaths, because a new master must be elected before the old master can
be evicted & fenced.
.SH "2.4. Master Duties"
The master node decides which nodes are and are not in the master
partition, and handles the eviction of dead nodes (both via the quorum
disk and via the Linux-Cluster fencing system, using the cman_kill_node()
API).
.SH "2.5. How it All Ties Together"
When a master is present, and if the master believes a node to be online,
that node will advertise to CMAN that the quorum disk is available. The
master will only grant a node membership if:
.in 12
(a) CMAN believes the node to be online, and
.br
(b) that node has made enough consecutive, timely writes
.in 16
to the quorum disk, and
.in 12
(c) the node has a high enough score to consider itself online.
.in 0
.SH "3. Configuration"
.SH "3.1. The <quorumd> tag"
This tag is a child of the top-level <cluster> tag.
.in 8
\fB<quorumd\fP
.in 9
\fIinterval\fP\fB="\fP1\fB"\fP
.in 12
This is the frequency of read/write cycles, in seconds.
.in 9
\fItko\fP\fB="\fP10\fB"\fP
.in 12
This is the number of cycles a node must miss in order to be declared dead.
.in 9
\fItko_up\fP\fB="\fPX\fB"\fP
.in 12
This is the number of cycles a node must be seen in order to be declared
online. Default is \fBfloor(tko/3)\fP.
.in 9
\fIupgrade_wait\fP\fB="\fP2\fB"\fP
.in 12
This is the number of cycles a node must wait before initiating a bid
for master status after heuristic scoring becomes sufficient. The
default is 2. This cannot be set to 0, and should not exceed \fBtko\fP.
.in 9
\fImaster_wait\fP\fB="\fPX\fB"\fP
.in 12
This is the number of cycles a node must wait for votes before declaring
itself master after making a bid. Default is \fBfloor(tko/2)\fP.
This cannot be less than 2, must be greater than tko_up, and should not
exceed \fBtko\fP.
.in 9
\fIvotes\fP\fB="\fP3\fB"\fP
.in 12
This is the number of votes the quorum daemon advertises to CMAN when it
has a high enough score.
.in 9
\fIlog_level\fP\fB="\fP4\fB"\fP
.in 12
This controls the verbosity of the quorum daemon in the system logs.
0 = emergencies; 7 = debug.
.in 9
\fIlog_facility\fP\fB="\fPdaemon\fB"\fP
.in 12
This controls the syslog facility used by the quorum daemon when logging.
For a complete list of available facilities, see \fBsyslog.conf(5)\fP.
The default value for this is 'daemon'.
.in 9
\fIstatus_file\fP\fB="\fP/foo\fB"\fP
.in 12
Write internal states out to this file periodically ("-" = use stdout).
This is primarily used for debugging. The default value for this
attribute is undefined.
.in 9
\fImin_score\fP\fB="\fP3\fB"\fP
.in 12
The absolute minimum score for a node to consider itself "alive". If
omitted, or set to 0, the default function "floor((n+1)/2)" is used, where
\fIn\fP is the sum of all defined heuristics' \fIscore\fP attributes. This
value must never exceed the sum of the heuristic scores, or else the
quorum disk will never be available.
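For example, if three heuristics are defined with scores of 1, 2, and 3,
then \fIn\fP is 6 and the default minimum score is floor((6+1)/2) = 3.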
.in 9
\fIreboot\fP\fB="\fP1\fB"\fP
.in 12
If set to 0 (off), qdiskd will *not* reboot after a negative transition
as a result of a change in score (see section 2.2). The default for
this value is 1 (on).
.in 9
\fIallow_kill\fP\fB="\fP1\fB"\fP
.in 12
If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes it
believes are dead (as a result of their failing to write to the quorum
disk). The default for this value is 1 (on).
.in 9
\fIparanoid\fP\fB="\fP0\fB"\fP
.in 12
If set to 1 (on), qdiskd will watch internal timers and reboot the node
if it takes more than (interval * tko) seconds to complete a quorum disk
pass. The default for this value is 0 (off).
.in 9
\fIscheduler\fP\fB="\fPrr\fB"\fP
.in 12
Valid values are 'rr', 'fifo', and 'other'. Selects the scheduling queue
in the Linux kernel for operation of the main & score threads (does not
affect the heuristics; they are always run in the 'other' queue). Default
is 'rr'. See sched_setscheduler(2) for more details.
.in 9
\fIpriority\fP\fB="\fP1\fB"\fP
.in 12
Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid values
for 'other' are -20..20 inclusive. Sets the priority of the main & score
threads. The default value is 1 (in the RR and FIFO queues, higher numbers
denote higher priority; in OTHER, lower values denote higher priority).
.in 9
\fIstop_cman\fP\fB="\fP0\fB"\fP
.in 12
Ordinarily, cluster membership is left up to CMAN, not qdisk.
If this parameter is set to 1 (on), qdiskd will tell CMAN to leave the
cluster if it is unable to initialize the quorum disk during startup. This
can be used to prevent cluster participation by a node which has been
disconnected from the SAN. The default for this value is 0 (off).
.in 9
\fIuse_uptime\fP\fB="\fP1\fB"\fP
.in 12
If this parameter is set to 1 (on), qdiskd will use values from
/proc/uptime for internal timings. This is a bit less precise
than \fBgettimeofday(2)\fP, but the benefit is that changing the
system clock will not affect qdiskd's behavior - even if \fBparanoid\fP
is enabled. If set to 0, qdiskd will use \fBgettimeofday(2)\fP, which
is more precise. The default for this value is 1 (on / use uptime).
.in 9
\fIdevice\fP\fB="\fP/dev/sda1\fB"\fP
.in 12
This is the device the quorum daemon will use. This device must be the
same on all nodes.
.in 9
\fIlabel\fP\fB="\fPmylabel\fB"\fP
.in 12
This overrides the device field if present. If specified, the quorum
daemon will read /proc/partitions and check for qdisk signatures
on every block device found, comparing each device's label against the
specified label. This is useful in configurations where the block device
name differs on a per-node basis.
.in 8
\fB/>\fP
.in 0
.SH "3.2. The <heuristic> tag"
This tag is a child of the <quorumd> tag.
.in 8
\fB<heuristic\fP
.in 9
\fIprogram\fP\fB="\fP/test.sh\fB"\fP
.in 12
This is the program used to determine if this heuristic is alive. This
can be anything which may be executed by \fI/bin/sh -c\fP. A return
value of zero indicates success; anything else indicates failure. This
is required.
.in 9
\fIscore\fP\fB="\fP1\fB"\fP
.in 12
This is the weight of this heuristic. Be careful when determining scores
for heuristics. The default score for each heuristic is 1.
.in 9
\fIinterval\fP\fB="\fP2\fB"\fP
.in 12
This is the frequency (in seconds) at which we poll the heuristic. The
default interval for every heuristic is 2 seconds.
.in 9
\fItko\fP\fB="\fP1\fB"\fP
.in 12
After this many failed attempts to run the heuristic, it is considered DOWN,
and its score is removed. The default tko for each heuristic is 1, which
may be inadequate for things such as 'ping'.
.in 8
\fB/>\fP
.in 0
.SH "3.3. Examples"
.SH "3.3.1. 3 cluster nodes & 3 routers"
.in 8
<cman expected_votes="6" .../>
.br
<clusternodes>
.in 12
<clusternode name="node1" votes="1" ... />
.br
<clusternode name="node2" votes="1" ... />
.br
<clusternode name="node3" votes="1" ... />
.in 8
</clusternodes>
.br
<quorumd interval="1" tko="10" votes="3" label="testing">
.in 12
<heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
.br
<heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
.br
<heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
.br
.in 8
</quorumd>
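.in 0
In this example, the three node votes plus the quorum device's three votes
total the six expected votes. CMAN normally requires more than half of the
expected votes for quorum (here, 4), so a single surviving node (1 vote)
plus the quorum device (3 votes) keeps the cluster quorate. This matches
the 'all-but-one' guideline from section 1.4: the quorum device carries at
least as many votes as all of the nodes combined.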
.SH "3.3.2. 2 cluster nodes & 1 IP tiebreaker"
.in 8
<cman two_node="0" expected_votes="3" .../>
.br
<clusternodes>
.in 12
<clusternode name="node1" votes="1" ... />
.br
<clusternode name="node2" votes="1" ... />
.in 8
</clusternodes>
.br
<quorumd interval="1" tko="10" votes="1" label="testing">
.in 12
<heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
.br
.in 8
</quorumd>
.in 0
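Here, the expected votes are three (one per node, plus one for the quorum
device), so quorum requires two votes. The cluster therefore stays quorate
when either both nodes are online, or when a single node is online and its
heuristic is passing (so the quorum device's vote is counted). This is the
tiebreaker configuration described in section 1.4.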
.SH "3.4. Heuristic score considerations"
* Heuristic timeouts should be set high enough to allow the previous run
of a given heuristic to complete.
* Heuristic scripts returning anything except 0 as their return code
are considered failed.
* The worst case for improperly configured quorum heuristics is a race
to fence in which two partitions simultaneously try to kill each other.
.SH "3.5. Creating a quorum disk partition"
The mkqdisk utility can create a new quorum disk and list the quorum
disks visible to the local node; see
.B mkqdisk(8)
for more details.
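For example, a quorum partition matching the label used in the examples
above could be initialized once, from a single node, with something along
the following lines (the device name /dev/sda1 is only an illustration;
the exact options are documented in \fBmkqdisk(8)\fP):
.in 8
mkqdisk -c /dev/sda1 -l testing
.br
mkqdisk -L
.in 0
The first command writes the qdisk signature and label to the device; the
second lists the quorum disks visible to the local node.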
.SH "SEE ALSO"
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)