The fencing chapter of Pacemaker Explained is overlong and goes beyond reference documentation. The "Configuring Fencing" section and various fencing considerations/examples should be moved to a new fencing chapter in Pacemaker Administration, and the problem in RHBZ#1666937 should be documented there.
For convenience, Pacemaker Explained's fencing chapter should list (and link to) cluster options related to fencing (stonith-enabled, startup-fencing, etc.). Parts of the pcmk_delay_max and pcmk_delay_base descriptions could move to a new section documenting how fencing delays work (also note that pcmk_delay_max applies after fencing has been initiated).
Document how Pacemaker uses required, on_target, and automatic from fence agent action metadata.
Document pcmk_{on,meta-data}_{action,timeout,retries} (verify the options actually work with meta-data, and update help too).
Document the issue that T483 is meant to solve.
Document what happens when a fence monitor fails.
Add a section documenting how retries work. This informal description from an email can be a starting point:
First, you have the device configuration. pcmk_reboot_retries (which defaults to 2) says how many times a reboot will be tried for a single request for that device before the fencer reports failure. There is a hardcoded 1-second delay between those attempts. Despite the name, it's actually "tries" not "retries", so pcmk_reboot_retries=2 means at most 2 attempts will be made for a single request for a single device. pcmk_reboot_timeout specifies the timeout for a reboot *request*, not *attempt*. So, if the first attempt uses up all the timeout (i.e. hangs/gets stuck) there will be no further attempts as part of that request. And of course devices can be in a topology, so a single client request can actually involve multiple fencing attempts on multiple devices. Each device used follows the above pattern.
Second, fencing can be initiated by the cluster itself, or externally. External fencing is typically either by the admin running stonith_admin on the command line (= pcs stonith fence), or by DLM. If initiated externally, it's up to the external actor to decide whether to re-attempt a failed request. Presumably you're not interested in this case.
If initiated by the cluster itself, there is a cluster option stonith-max-attempts (default 10) documented as "How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it". The key word is immediately. The cluster will re-attempt fencing as long as fencing is needed, but if it hits this limit (for a given target node), it will *temporarily* give up -- until the next transition. A transition occurs when an actionable event occurs (node attribute change, resource monitor failure, etc.) or at least every cluster-recheck-interval (default 15 minutes). So -- yes fencing loops forever if initiated by the cluster, but it will pause for an indeterminate (but bounded) period of time if stonith-max-attempts is reached for a given target.
Independently of that, a single such attempt may involve multiple requests to multiple devices depending on the configuration, with a configurable number of attempts for any given device with a 1-second delay between them.
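The retry rules described above could be illustrated in the new section with a small model. The Python sketch below is not Pacemaker code and calls no Pacemaker API. The names Device, fence_request, and cluster_round are hypothetical, and it assumes that the 1-second delay between attempts counts against the request timeout, which should be verified against the fencer before anything like this is published.

```python
from dataclasses import dataclass
from typing import List, Tuple

RETRY_DELAY = 1.0  # the hardcoded 1-second delay between attempts


@dataclass
class Device:
    """Hypothetical stand-in for one fence device's configuration."""
    tries: int = 2         # models pcmk_reboot_retries (really "tries")
    timeout: float = 60.0  # models pcmk_reboot_timeout (per request); value is illustrative


def fence_request(device: Device,
                  outcomes: List[Tuple[float, bool]]) -> Tuple[bool, int]:
    """Simulate one reboot request against one device.

    outcomes[i] is (seconds attempt i would take, whether it succeeds).
    Returns (request succeeded, number of attempts made).
    """
    elapsed = 0.0
    attempts = 0
    for duration, ok in outcomes[:device.tries]:
        if attempts > 0:
            # Assumption: the delay is charged to the request timeout.
            elapsed += RETRY_DELAY
        remaining = device.timeout - elapsed
        if remaining <= 0:
            break
        attempts += 1
        if duration >= remaining:
            # The attempt hangs until the request times out, so there
            # is no time left for any further attempt.
            break
        elapsed += duration
        if ok:
            return True, attempts
    return False, attempts


def cluster_round(request_results: List[bool],
                  max_attempts: int = 10) -> Tuple[str, int]:
    """Model cluster-initiated fencing of one target within one transition.

    max_attempts models stonith-max-attempts. Returns (state, failures).
    """
    failures = 0
    for ok in request_results:
        if ok:
            return "fenced", failures
        failures += 1
        if failures >= max_attempts:
            # The cluster gives up only temporarily.
            return "paused until next transition", failures
    return "still retrying", failures
```

In this model, a device with the default two tries whose first attempt hangs for the whole timeout reports failure after one attempt, and ten consecutive failed requests for one target pause cluster-initiated fencing until the next transition.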