Currently, Pacemaker supports the on-fail operation attribute, the migration-threshold resource meta-attribute, and the start-failure-is-fatal cluster property to configure action failure handling. However, these options do not cover all of the recovery behaviors users want.
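For reference, here is roughly how the existing options are configured in the CIB today (a minimal sketch; the IPaddr2 resource and all IDs are illustrative):

    <!-- start-failure-is-fatal is a cluster-wide property -->
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="opt-sfif" name="start-failure-is-fatal" value="true"/>
      </cluster_property_set>
    </crm_config>

    <!-- migration-threshold is a resource meta-attribute;
         on-fail is a per-operation attribute -->
    <primitive id="myip" class="ocf" provider="heartbeat" type="IPaddr2">
      <meta_attributes id="myip-meta">
        <nvpair id="myip-mt" name="migration-threshold" value="3"/>
      </meta_attributes>
      <operations>
        <op id="myip-monitor-10s" name="monitor" interval="10s"
            on-fail="restart"/>
      </operations>
    </primitive>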
Proposed new interface (preliminary)
- <op> would take new options max-fail-ignore, max-fail-restart, and fail-escalation (see the configuration sketch after this list).
- The first max-fail-ignore failures on a node would be reported but ignored.
- If further failures occurred, the next max-fail-restart failures would be handled by attempting to restart the resource. As is the case currently, the resource would not be guaranteed to restart on the same node: the node would be chosen by the usual placement logic, and could be different if the configuration or conditions have changed since the resource was initially placed. For example, if the only reason the resource remained on its current node was stickiness, the stop would clear that stickiness, and the resource could start on another node.
- If the resource did stay on the same node and another failure occurred, the action specified by fail-escalation would be taken. This option would accept the current on-fail values, minus restart, plus a new value ban to force the resource off the node.
- The defaults would be chosen to preserve the current default behavior: max-fail-ignore would default to 0; max-fail-restart would default to 0 for start and stop and to INFINITY for other operations; and fail-escalation would default to block (or fence, when fencing is enabled) for stop and to ban for other operations.
- Examples of how the old options would translate to the new ones:
  - on-fail=ignore -> max-fail-ignore=INFINITY
  - migration-threshold=3 -> max-fail-restart=3
- Example of attempting 4 restarts then leaving the resource stopped: max-fail-restart=4 fail-escalation=stop
- Example of ignoring the first 2 failures then trying restart: max-fail-ignore=2 max-fail-restart=INFINITY
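Putting the pieces together, here is a hypothetical <op> under this proposal (these attributes do not exist in any release; names and semantics are preliminary, per the above) that ignores the first 2 failures, attempts up to 4 restarts on subsequent failures, and then bans the resource from the node:

    <op id="myip-monitor-10s" name="monitor" interval="10s"
        max-fail-ignore="2" max-fail-restart="4" fail-escalation="ban"/>

This ignore-then-restart-then-ban sequence is one of the recovery behaviors that has no direct equivalent with on-fail and migration-threshold today.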
Related issues
- "Clearing ... failure will wait until" (was: "Waiting for ... to complete before clearing ...") log message mentions the name of the operation being unpacked, but it may not be the failed op (fail count includes all ops); should we reword the logs, or should the op be taken into account?
Related reports
- RHBZ#1747563
- RHEL-7589
- RHEL-7635 (RHEL 9)