Fix: controller: send error reply if can't initiate action via executor

Description

Previously, for local execution failures, the controller would just exit.
(This can be reproduced by sending a SIGSTOP to the executor, and forcing the
DC to schedule some action on the node.)

The DC would never see a result for the action, and would only see the
controller leave. If the controller stayed down, fencing would be scheduled (if
enabled and working), but it's more likely that pacemakerd will respawn the
controller quickly enough for it to rejoin and not get fenced. This is likely
to repeat in a loop if the executor remains unresponsive, and status displays
will simply show the action as perpetually pending.

Now, the controller tries to record the action failure before exiting, so
status displays can at least show that there is a failure.
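
The "record the failure, then exit" ordering described above can be sketched
roughly as follows. This is only an illustration under stated assumptions, not
the actual controller code: every name here (sketch_action, sketch_initiate,
and so on) is hypothetical, and the executor is simulated by a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes (not Pacemaker's real enums) */
enum sketch_result { SKETCH_PENDING, SKETCH_OK, SKETCH_ERROR };

/* Hypothetical stand-in for one action as shown in status displays */
struct sketch_action {
    const char *resource;
    const char *task;
    enum sketch_result result;
    char reason[64];
};

/* Simulated hand-off to the executor: fails when it is unresponsive */
static bool
sketch_executor_start(const struct sketch_action *action,
                      bool executor_responsive)
{
    (void) action;
    return executor_responsive;
}

/* Record a failed result so the action no longer appears as pending */
static void
sketch_record_failure(struct sketch_action *action, const char *reason)
{
    action->result = SKETCH_ERROR;
    snprintf(action->reason, sizeof(action->reason), "%s", reason);
}

/* Try to initiate an action.
 * Returns true if the action was handed off (its result arrives later),
 * or false if the caller should exit; in that case the failure has
 * already been recorded before returning.
 */
static bool
sketch_initiate(struct sketch_action *action, bool executor_responsive)
{
    assert(action != NULL);

    action->result = SKETCH_PENDING;
    action->reason[0] = '\0';

    if (sketch_executor_start(action, executor_responsive)) {
        return true;
    }

    /* Previously (per the description above) the exit happened here with
     * nothing recorded, leaving the action perpetually pending.
     */
    sketch_record_failure(action, "Could not initiate action via executor");
    return false;
}
```

A caller would invoke sketch_initiate() and exit only after it returns false,
by which point the action's result is SKETCH_ERROR rather than SKETCH_PENDING.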

We'll even schedule fencing if appropriate. Unfortunately, fencing is unlikely
to be attempted if the controller respawns: the failing node's controller will
block on the connection to the local executor if it's still unresponsive
(there does not seem to be any way to put a timeout on an IPC connection via
libqb), and the transition can't be attempted until the join process completes,
which it never will.

Details

Provenance
kgaillot authored on Aug 25 2021, 8:00 PM
Parents
rPce9a96cfddca: Log: controller: improve messages for failed resource agent actions