Improve fencing documentation
Open, Wishlist, Public

Assigned To

None

Authored By

	kgaillot
	Dec 4 2024, 5:12 PM

Description

The fencing chapter of Pacemaker Explained is overlong and goes beyond reference documentation. The "Configuring Fencing" section and various fencing considerations/examples should be moved to a new fencing chapter in Pacemaker Administration, and the problem in RHBZ#1666937 should be documented there.

For convenience, Pacemaker Explained's fencing chapter should list (and link to) cluster options related to fencing (stonith-enabled, startup-fencing, etc.). Parts of the pcmk_delay_max and pcmk_delay_base descriptions could move to a new section documenting how fencing delays work (also note that pcmk_delay_max applies after fencing has been initiated).

Document how Pacemaker uses required, on_target, and automatic from fence agent action metadata.

Document pcmk_{on,meta-data}_{action,timeout,retries} (verify the options actually work with meta-data, and update help too).

Document the issue that T483 is meant to solve.

Document what happens when a fence monitor fails.

Add a section documenting how retries work. This informal description from an email can be a starting point:

First, you have the device configuration. pcmk_reboot_retries (which
defaults to 2) says how many times a reboot will be tried for a single
request for that device before the fencer reports failure. There is a
hardcoded 1-second delay between those attempts.

Despite the name, it's actually "tries" not "retries", so
pcmk_reboot_retries=2 means at most 2 attempts will be made for a
single request for a single device.

pcmk_reboot_timeout specifies the timeout for a reboot *request*, not
*attempt*. So, if the first attempt uses up all the timeout (i.e.
hangs/gets stuck) there will be no further attempts as part of that
request.

And of course devices can be in a topology, so a single client request
can actually involve multiple fencing attempts on multiple devices.
Each device used follows the above pattern.
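
The per-device behavior described above can be sketched as a small simulation. This is only an illustration of the description in this email, not the fencer's actual code: the function and helper names are hypothetical, the 60-second timeout is an example value, and charging the 1-second delay against the request timeout is an assumption of the sketch.

```python
from typing import Callable, Tuple

# Hypothetical callback type: an attempt receives the seconds left in the
# request's budget and returns (succeeded, seconds_used).
Attempt = Callable[[float], Tuple[bool, float]]


def run_request(attempt: Attempt,
                tries: int = 2,                 # pcmk_reboot_retries ("tries", not "retries")
                request_timeout: float = 60.0,  # pcmk_reboot_timeout (example value)
                retry_delay: float = 1.0) -> Tuple[bool, int]:
    """Simulate one reboot request against one device.

    Returns (succeeded, attempts_made). The timeout covers the whole
    request, so an attempt that hangs leaves nothing for a second one.
    """
    remaining = request_timeout
    made = 0
    while made < tries and remaining > 0:
        made += 1
        ok, used = attempt(remaining)
        remaining -= min(used, remaining)
        if ok:
            return True, made
        # Assumption of this sketch: the delay between attempts also
        # counts against the request timeout.
        remaining -= retry_delay
    return False, made


def hangs(remaining: float) -> Tuple[bool, float]:
    """An attempt that gets stuck and uses up all the remaining time."""
    return False, remaining


def fails_quickly(remaining: float) -> Tuple[bool, float]:
    """An attempt that fails after 5 seconds."""
    return False, 5.0


def fails_then_succeeds() -> Attempt:
    """Build an attempt that fails once, then succeeds."""
    results = iter([(False, 5.0), (True, 5.0)])
    return lambda remaining: next(results)
```

With these helpers, a hung first attempt ends the request after one try, two quick failures exhaust both tries, and a quick failure followed by a success is reported as a success on the second try.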

Second, fencing can be initiated by the cluster itself, or externally.
External fencing is typically either by the admin running stonith_admin
on the command line (= pcs stonith fence), or by DLM.

If initiated externally, it's up to the external actor to decide
whether to re-attempt a failed request. Presumably you're not
interested in this case.

If initiated by the cluster itself, there is a cluster option
stonith-max-attempts (default 10) documented as "How many times fencing
can fail for a target before the cluster will no longer immediately
re-attempt it". The key word is immediately. The cluster will re-attempt
fencing as long as fencing is needed, but if it hits this limit (for a
given target node), it will *temporarily* give up -- until the next
transition. A transition occurs when an actionable event occurs (node
attribute change, resource monitor failure, etc.) or at least every
cluster-recheck-interval (default 15 minutes).

So -- yes, fencing loops forever if initiated by the cluster, but it
will pause for an indeterminate (but bounded) period of time if
stonith-max-attempts is reached for a given target. Independently of
that, a single such attempt may involve multiple requests to multiple
devices depending on the configuration, with a configurable number of
attempts for any given device with a 1-second delay between them.
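
The cluster-level decision described above can be summarized in a few lines. This is a minimal sketch of the rule as worded in this email, with hypothetical function and return-value names; it does not model what happens to the failure count after the pause, which the description does not specify.

```python
def next_step(failures: int,
              fencing_needed: bool = True,
              max_attempts: int = 10) -> str:  # stonith-max-attempts default
    """Decide what the cluster does next for one target node.

    failures is the number of failed fencing attempts so far for that
    target. The string values returned are illustrative labels only.
    """
    if not fencing_needed:
        return "done"
    if failures < max_attempts:
        # Below the limit: the cluster re-attempts fencing immediately.
        return "attempt-now"
    # Limit reached: temporarily give up until the next transition, which
    # comes with the next actionable event or, at the latest, after
    # cluster-recheck-interval (default 15 minutes).
    return "wait-for-next-transition"
```

For example, `next_step(9)` still returns "attempt-now", while `next_step(10)` returns "wait-for-next-transition".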

Event Timeline

kgaillot triaged this task as Wishlist priority. Dec 4 2024, 5:12 PM

kgaillot created this task.

kgaillot created this object with edit policy "Restricted Project (Project)".
