
Controller should handle executor communication problems gracefully
Open, Normal, Public

Assigned To
None
Authored By
kgaillot
Oct 1 2024, 1:58 PM
Tags
  • Restricted Project
  • Restricted Project
  • Restricted Project
Referenced Files
None

Description

If the controller needs the executor for some action but can't communicate with it (for example, because the executor has been sent SIGSTOP), the controller will exit, respawn, and block on the executor connection. The cluster will be unable to make progress because the join process cannot complete (the election timer eventually pops and the join is retried).

That behavior may have improved now that pacemakerd monitors IPC connectivity (it should kill and respawn the executor in this particular situation). However, the controller should still handle an unresponsive executor more gracefully.

In particular, the controller should use libqb's new async connection API when available.
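
As a rough illustration of the underlying idea (not the libqb API, and not existing Pacemaker code), the sketch below waits for a reply from a peer with a bounded timeout instead of blocking indefinitely. It uses only POSIX poll() and recv(). The helper name recv_with_timeout and the timeout handling are assumptions made for this example. A stopped executor would look like the "no data arrives" case.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a Pacemaker or libqb function): wait up to
 * timeout_ms for data on fd rather than blocking in recv() forever.
 *
 * Returns the number of bytes read, 0 if the peer closed the connection,
 * or a negative errno value (-ETIMEDOUT if nothing arrived in time).
 */
static ssize_t
recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int rc;

    /* Simplification: an interrupted poll() restarts with the full timeout */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while ((rc < 0) && (errno == EINTR));

    if (rc < 0) {
        return -errno;
    }
    if (rc == 0) {
        /* Peer is unresponsive; the caller decides whether to retry,
         * report the failure, or carry on with other work */
        return -ETIMEDOUT;
    }

    /* Data or end-of-file is pending, so this recv() will not block */
    n = recv(fd, buf, len, MSG_DONTWAIT);
    return (n < 0)? -errno : n;
}
```

With a pattern like this, a caller that gets -ETIMEDOUT can return to its main loop and keep servicing other events, rather than exiting or hanging on the connection.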

We may also want to consider being able to process scheduled fence actions during the join process if a node is unresponsive.

Related Objects

Status    Assigned    Task
Open      None

Event Timeline

kgaillot triaged this task as Normal priority. (Oct 1 2024, 1:58 PM)
kgaillot created this task.
kgaillot created this object with edit policy "Restricted Project (Project)".
kgaillot added a parent task: Restricted Maniphest Task. (Thu, Jan 2, 4:14 PM)