Fix: various: Correctly detect completion of systemd start/stop actions
5e4d0b6c916b (Unpublished)

Not On Permanent Ref: This commit is not an ancestor of any permanent ref.
This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref. It may have been part of a branch which was deleted.

Description

When systemd receives a StartUnit() or StopUnit() method call, it
returns almost immediately, as soon as a start/stop job is enqueued. A
successful return code does NOT indicate that the start/stop has
finished.

Previously, we worked around this in action_complete() with a hack that
scheduled a follow-up monitor after a successful start/stop method call,
which polled the service after 2 seconds to see whether it was actually
running. However, this was not a robust solution. Timing issues could
result in Pacemaker having an incorrect view of the resource's status or
prematurely declaring the action as failed.
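
The timing problem can be illustrated with a toy model. This is only a
sketch, not Pacemaker code: observed_after_poll() and its parameters are
hypothetical names, and the model assumes a single follow-up monitor at
a fixed delay.

```python
def observed_after_poll(job_duration: float, poll_delay: float = 2.0) -> str:
    """Return what a single follow-up monitor sees poll_delay seconds
    after the start method call returned.

    job_duration is how long the (hypothetical) systemd start job
    actually takes to finish.
    """
    # The poll only sees the service as running if the job has already
    # finished by the time the monitor fires.
    return "running" if job_duration <= poll_delay else "not running"
```

In this model a service that needs 5 seconds to start is reported as
"not running" by the 2-second poll, even though the start would have
succeeded.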

Now, we follow the best practice as documented in the systemd D-Bus API
doc (see StartUnit()):
https://www.freedesktop.org/software/systemd/man/latest/org.freedesktop.systemd1.html#Methods

After kicking off a systemd start/stop action, we make note of the job's
D-Bus object path. Then we register a D-Bus message filter that looks
for a JobRemoved signal whose bus path matches. This signal indicates
that the job has completed and includes its result. When we find the
matching signal, we set the action's result. We then remove the filter,
which causes the action to be finalized and freed. In the case of the
executor daemon, the action has a callback (action_complete()) that runs
during finalization and sets the executor's view of the action result.
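
The flow described above can be sketched as a small simulation. This is
not the Pacemaker implementation (which is C on top of libdbus): Bus,
Action, and watch_job() are hypothetical stand-ins. The only parts taken
from systemd are the JobRemoved fields (job ID, job object path, unit
name, result string).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class JobRemoved:
    """Fields carried by systemd's JobRemoved signal."""
    job_id: int
    job_path: str
    unit: str
    result: str  # e.g. "done", "failed", "timeout"


@dataclass
class Action:
    """Hypothetical stand-in for a pending start/stop action."""
    unit: str
    job_path: Optional[str] = None
    result: Optional[str] = None
    finalized: bool = False


class Bus:
    """Hypothetical stand-in for a D-Bus connection with message filters."""

    def __init__(self) -> None:
        # Maps each filter function to the callback run when it is removed.
        self.filters: Dict[Callable, Callable] = {}

    def add_filter(self, fn: Callable, on_remove: Callable) -> None:
        self.filters[fn] = on_remove

    def remove_filter(self, fn: Callable) -> None:
        # Removing a filter runs its cleanup callback, which is what
        # finalizes the action in this sketch.
        on_remove = self.filters.pop(fn)
        on_remove()

    def dispatch(self, signal: JobRemoved) -> None:
        # Iterate over a copy, since a filter may remove itself.
        for fn in list(self.filters):
            fn(signal)


def watch_job(bus: Bus, action: Action, job_path: str) -> None:
    """Remember the job's object path and wait for its JobRemoved signal."""
    action.job_path = job_path

    def finalize() -> None:
        action.finalized = True

    def job_removed_filter(signal: JobRemoved) -> None:
        if signal.job_path != action.job_path:
            return  # Some other job finished; keep waiting.
        action.result = signal.result
        bus.remove_filter(job_removed_filter)

    bus.add_filter(job_removed_filter, finalize)
```

Dispatching a JobRemoved for an unrelated job path leaves the action
pending. Only the signal whose path matches sets the result and
triggers finalization.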

Monitor actions still need much of the existing workaround code in
action_complete(), so we keep it for now. We bail out for start/stop
actions after setting the result as described above.

We also have to finalize the action explicitly if a start/stop fails due
to a "not installed" error, because the action_complete() callback never
gets called in that case. The unit doesn't truly exist, so no job gets
enqueued and thus no JobRemoved signal is received. We can't detect
this situation by checking whether the LoadUnit() method call fails,
because Pacemaker writes an override file that makes it look as if the
unit exists, so LoadUnit() returns success.
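
A minimal sketch of the "not installed" path, under stated assumptions:
NoSuchUnit, Action, finalize(), and start_unit() are hypothetical names,
and enqueue_job stands in for the StartUnit()/StopUnit() method call.
It shows only why explicit finalization is needed when no job exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class NoSuchUnit(Exception):
    """Hypothetical stand-in for a "unit not installed" D-Bus error."""


@dataclass
class Action:
    """Hypothetical stand-in for a pending start/stop action."""
    unit: str
    job_path: Optional[str] = None
    result: Optional[str] = None
    finalized: bool = False


def finalize(action: Action) -> None:
    action.finalized = True


def start_unit(action: Action, enqueue_job: Callable[[str], str]) -> bool:
    """Try to enqueue a job; return True if a JobRemoved signal will follow."""
    try:
        action.job_path = enqueue_job(action.unit)
    except NoSuchUnit:
        # No job was enqueued, so no JobRemoved signal will ever arrive.
        # Set the result and finalize here, or the action would hang.
        action.result = "not installed"
        finalize(action)
        return False
    # A job exists; finalization is left to the JobRemoved handling.
    return True
```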

Ref T25

Co-authored-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
clumens: Authored on Jan 16 2025, 4:21 PM
nrwahl2: Committed on Feb 18 2025, 11:47 PM