
Fix: controller: Don't double-increment failcount for simulated failures

Description

Currently, a simulated failure (for example, from crm_resource --fail)
often causes the resource's failcount on the given node to be
incremented twice: once for each of two incoming events.

A "_last_0" update event often comes in because the previous last event
was a recurring monitor. A "_last_failure_0" update comes in because
there is now a more recent failure (the newly simulated one).

Events generated by the cluster have corresponding actions in the
transition graph. These actions can be marked as "confirmed" once
they've been processed the first time. However, this can't be done for
simulated events, which are not in the transition graph. They receive a
dummy transition ID of -1 and a dummy action number that generally
increases over time.
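
As a rough illustration of the distinction above, the following C sketch
classifies an event by its transition ID. The struct and function names
are hypothetical and chosen for this example only; they are not
Pacemaker's actual types or API.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of an event's transition information.
 * Field names are illustrative, not Pacemaker's actual types. */
struct event_key {
    int transition_id;  /* -1 for events simulated outside the cluster */
    int action_num;     /* dummy number for outside events */
};

/* An outside event has no action in the transition graph, so it cannot
 * be marked "confirmed" there. In this sketch it is identified solely
 * by the dummy transition ID of -1. */
static bool
is_outside_event(const struct event_key *key)
{
    return key->transition_id == -1;
}
```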

This commit stores a set of outside events that have been processed by
process_graph_event() for the current update diff, so that a second
update for the same event can be recognized and skipped. The set is
destroyed when the diff is completely processed, so that it doesn't grow
too large or risk generating a false match.
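
The approach can be sketched in C roughly as follows. This is a minimal
illustration, not the code from this commit: the fixed-size array, the
struct, and all function names are hypothetical, and it assumes that
both update events for one simulated failure carry the same dummy action
number.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_OUTSIDE_EVENTS 64   /* arbitrary bound for this sketch */

/* Hypothetical per-diff record of outside events already processed,
 * keyed by their dummy action number. */
struct outside_events {
    int action_nums[MAX_OUTSIDE_EVENTS];
    size_t count;
};

/* Record an outside event. Returns false if it was already recorded for
 * the current diff, true otherwise. If the table is full, the event is
 * treated as new but not stored. */
static bool
record_outside_event(struct outside_events *seen, int action_num)
{
    for (size_t i = 0; i < seen->count; i++) {
        if (seen->action_nums[i] == action_num) {
            return false;
        }
    }
    if (seen->count < MAX_OUTSIDE_EVENTS) {
        seen->action_nums[seen->count++] = action_num;
    }
    return true;
}

/* Forget all recorded events once the diff is completely processed. */
static void
clear_outside_events(struct outside_events *seen)
{
    seen->count = 0;
}

/* Handle one update event for a simulated failure: increment the
 * failcount only the first time the event is seen in this diff. */
static void
handle_simulated_failure(struct outside_events *seen, int action_num,
                         int *failcount)
{
    if (record_outside_event(seen, action_num)) {
        (*failcount)++;
    }
}
```

With this shape, a "_last_0" update and a "_last_failure_0" update for
the same simulated failure within one diff raise the failcount once, and
clearing the record after each diff keeps a later, unrelated event with
a reused number from being mistaken for a duplicate.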

A potential future improvement would be to add a similar safeguard for
foreign events (events with a different transitioner UUID) and
late-arriving events (events from the local transitioner that are no
longer in the current transition graph and thus can't be confirmed).

Closes T602

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Nov 18 2022, 10:08 PM
Parents
rP9bad702e1d62: Merge pull request #2952 from jonah2022/main
Tasks
Restricted Maniphest Task