
New model for action failure handling
Open, Normal, Public

Assigned To
None
Authored By
kgaillot
Apr 25 2022, 5:57 PM

Description

Currently, Pacemaker supports the on-fail, migration-threshold, and start-failure-is-fatal options to configure how action failures are handled. However, these do not cover all of the recovery behaviors that users want.
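
For reference, a sketch of how the existing options are configured today (the resource name and IDs here are made-up examples):

  <primitive id="my-ip" class="ocf" provider="heartbeat" type="IPaddr2">
    <meta_attributes id="my-ip-meta">
      <!-- ban the resource from a node after 3 failures there -->
      <nvpair id="my-ip-migration-threshold" name="migration-threshold" value="3"/>
    </meta_attributes>
    <operations>
      <!-- restart (the current default) on monitor failure -->
      <op id="my-ip-monitor" name="monitor" interval="10s" on-fail="restart"/>
    </operations>
  </primitive>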

Proposed new interface (preliminary)

  • <op> would take new options max-fail-ignore, max-fail-restart, and fail-escalation.
    • The first max-fail-ignore failures on a node would be reported but ignored.
    • If further failures occurred, the next max-fail-restart failures would be handled by attempting to restart the resource. As is the case currently, the resource would not be guaranteed to restart on the same node -- the node would be chosen by the usual means, and could be a different one if the configuration or conditions have changed since the resource was initially placed. For example, if the resource remained on its current node only because of stickiness, the stop would clear that stickiness, and the resource could be started on another node.
    • If the resource did stay on the same node and another failure occurred, the handling specified by fail-escalation would be applied (see the decision-order sketch after this list). This option would accept the current on-fail values, except restart, plus a new value ban to force the resource off the node.
  • The defaults would be chosen to preserve the current behavior: max-fail-ignore would default to 0; max-fail-restart would default to 0 for stop and start, and INFINITY for other operations; and fail-escalation would default to block or fence for stop, and ban for other operations.
  • Examples of how the old options would translate to the new ones:
    • on-fail=ignore -> max-fail-ignore=INFINITY
    • migration-threshold=3 -> max-fail-restart=3
  • Example of attempting 4 restarts then leaving the resource stopped: max-fail-restart=4 fail-escalation=stop
  • Example of ignoring the first 2 failures then always trying restarts: max-fail-ignore=2 max-fail-restart=INFINITY (both examples are sketched as configuration after this list)
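
As a configuration sketch (assuming the new options would be set as <op> attributes like on-fail is today; the resource and op IDs are illustrative), the two examples above might look like:

  <operations>
    <!-- attempt 4 restarts, then leave the resource stopped -->
    <op id="rsc1-monitor" name="monitor" interval="10s"
        max-fail-restart="4" fail-escalation="stop"/>
  </operations>

  <operations>
    <!-- ignore the first 2 failures, then keep trying restarts -->
    <op id="rsc2-monitor" name="monitor" interval="10s"
        max-fail-ignore="2" max-fail-restart="INFINITY"/>
  </operations>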
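
A minimal sketch in C of the proposed decision order (hypothetical names and constants, not actual Pacemaker code):

  /* Stand-in for a configured value of INFINITY; Pacemaker caps scores at
   * 1000000 internally, so the sum below cannot overflow an int. */
  #define FAIL_INFINITY 1000000

  enum fail_handling {
      fail_handling_ignore,   /* report the failure but take no other action */
      fail_handling_restart,  /* stop the resource and re-place it as usual  */
      fail_handling_escalate, /* apply the configured fail-escalation action */
  };

  /* Choose the handling for the latest failure, given the resource's fail
   * count on this node and the configured per-operation thresholds. */
  static enum fail_handling
  choose_fail_handling(int fail_count, int max_fail_ignore, int max_fail_restart)
  {
      if (fail_count <= max_fail_ignore) {
          return fail_handling_ignore;
      }
      if (fail_count <= max_fail_ignore + max_fail_restart) {
          return fail_handling_restart;
      }
      return fail_handling_escalate;
  }

With the proposed defaults for a monitor (max-fail-ignore=0, max-fail-restart=INFINITY), this reduces to the current default of restarting on every failure.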

Related issues

  • "Clearing ... failure will wait until" (was: "Waiting for ... to complete before clearing ...") log message mentions the name of the operation being unpacked, but it may not be the failed op (fail count includes all ops); should we reword the logs, or should the op be taken into account?

Event Timeline

kgaillot triaged this task as Normal priority. Apr 25 2022, 5:57 PM
kgaillot created this task.
kgaillot updated the task description.
kgaillot changed the visibility from "All Users" to "Public (No Login Required)". Jan 9 2024, 10:30 AM