Change Details

Currently, Pacemaker supports ##on-fail##, ##migration-threshold##, and ##start-failure-is-fatal## to configure action failure handling. However, these options do not cover all of the recovery methods users want.

## Proposed new interface (preliminary)

* <op> would take new options ##max-fail-ignore##, ##max-fail-restart##, and ##fail-escalation##.
* The first ##max-fail-ignore## failures on a node would be reported but ignored.
* If further failures occurred, the next ##max-fail-restart## failures would be handled by attempting to restart the resource. As is the case currently, the resource is not guaranteed to be restarted on the same node: the node would be determined by the usual means, and could be a different node if configurations or conditions have changed since the resource was initially placed. For example, if stickiness was the only reason the resource remained on its current node, the stop will clear the stickiness, and the resource could be started on another node.
* If the resource did stay on the same node and another failure occurred, the handling specified by ##fail-escalation## would be taken. This option would accept the current ##on-fail## values except ##restart##, and would add ##ban## to force the resource off the node.
* The defaults would be chosen to keep the current default behavior: ##max-fail-ignore## would default to 0, ##max-fail-restart## would default to 0 for stop and start and INFINITY for other operations, and ##fail-escalation## would default to block or fence for stop, and ban for other operations.
* Examples of how the old options would translate to the new ones:
  * ##on-fail=ignore## -> ##max-fail-ignore=INFINITY##
  * ##migration-threshold=3## -> ##max-fail-restart=3##
* Example of attempting 4 restarts then leaving the resource stopped: ##max-fail-restart=4 fail-escalation=stop##
* Example of ignoring the first 2 failures then trying restart: ##max-fail-ignore=2 max-fail-restart=INFINITY##

## Related issues

* The "Clearing ... failure will wait until" log message (previously "Waiting for ... to complete before clearing ...") mentions the name of the operation being unpacked, but that may not be the failed operation, because the fail count includes all operations. Should we reword the logs, or should the operation be taken into account?

## Related reports

* [[https://issues.redhat.com/browse/RHEL-7589 | RHEL-7589]]
* [[https://bugzilla.redhat.com/show_bug.cgi?id=1747563 | RHBZ#1747563]]
* [[https://issues.redhat.com/browse/RHEL-7634 | RHEL-7634]] (RHEL 8)
* [[https://issues.redhat.com/browse/RHEL-7635 | RHEL-7635]] (RHEL 9)
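
The escalation sequence described under "Proposed new interface" can be illustrated with a short sketch. This is not Pacemaker code: the function name ##handle_failure## and its parameters are hypothetical, it models only the per-node failure count, and it assumes the resource stays on the same node between failures. The default values shown are the ones proposed for operations other than start and stop.

```python
import math

# Stand-in for Pacemaker's INFINITY score (assumption for this sketch).
INFINITY = math.inf


def handle_failure(fail_count, max_fail_ignore=0,
                   max_fail_restart=INFINITY, fail_escalation="ban"):
    """Return the handling for the fail_count-th failure (1-based) on a node.

    Hypothetical model of the proposed options; it ignores failure expiry
    and any node change that a restart might cause.
    """
    if fail_count < 1:
        raise ValueError("fail_count must be at least 1")
    if fail_escalation == "restart":
        # The proposal excludes "restart" as an escalation value.
        raise ValueError("fail-escalation may not be 'restart'")

    # The first max-fail-ignore failures are reported but ignored.
    if fail_count <= max_fail_ignore:
        return "ignore"
    # The next max-fail-restart failures trigger a restart attempt.
    if fail_count <= max_fail_ignore + max_fail_restart:
        return "restart"
    # Any further failure is handled as fail-escalation specifies.
    return fail_escalation


if __name__ == "__main__":
    # max-fail-restart=4 fail-escalation=stop
    for n in range(1, 6):
        print(n, handle_failure(n, max_fail_restart=4, fail_escalation="stop"))
```

Under this model, ##max-fail-restart=4 fail-escalation=stop## yields a restart attempt for each of the first four failures and ##stop## for the fifth, while ##max-fail-ignore=2 max-fail-restart=INFINITY## ignores two failures and attempts a restart for every failure after that.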

