Controller should handle executor communication problems gracefully
Open, Normal, Public

Assigned To

None

Authored By

	kgaillot
	Oct 1 2024, 1:58 PM

Description

If the controller needs the executor for some action but can't communicate with it (for example, after SIGSTOP has been sent to the executor), the controller will exit, respawn, and block on the executor connection. The cluster will be unable to make progress because the join process cannot complete (the election timer eventually pops and the join is retried).

That behavior may have improved now that pacemakerd monitors IPC connectivity (it should kill and respawn the executor in this particular situation). However, the controller should still handle an unresponsive executor more gracefully.

In particular, the controller should use libqb's new async connection API when available.

We may also want to consider being able to process scheduled fence actions during the join process if a node is unresponsive.

Related Objects

		Status	Assigned	Task
				Restricted Maniphest Task
		Open	None	T887 Controller should handle executor communication problems gracefully

Event Timeline

kgaillot triaged this task as Normal priority. Oct 1 2024, 1:58 PM

kgaillot created this task.

kgaillot created this object with edit policy "Restricted Project (Project)".

kgaillot added a parent task: Restricted Maniphest Task. Jan 2 2025, 4:14 PM
