Fix: executor: return error for stonith probes if stonith connection was lost

Description

Previously, stonith probes could return only PCMK_OCF_OK (if the executor had
registered the device with the fencer) or PCMK_OCF_NOT_RUNNING (if the executor
had unregistered or not yet registered the device).

However, if the stonith connection is lost, the executor doesn't know whether
the device is still registered, and thus could be giving wrong information
back to the controller.

This fixes that by refactoring lrmd_rsc_t's stonith_started member from a
boolean (0 = not started, 1 = started) to a return code (pcmk_ok = started,
-ENODEV = not started, pcmk_err_generic = stonith connection lost).
stonith_rc2status() will map these to PCMK_OCF_OK, PCMK_OCF_NOT_RUNNING, or
PCMK_OCF_UNKNOWN_ERROR, respectively.
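
The three-way mapping described above can be sketched roughly as follows. This
is an illustration only, not the code from this commit: the sketch_* names are
hypothetical, the numeric values are stand-ins assumed for the example rather
than copied from Pacemaker's headers, and treating every unrecognized state as
an error is a choice made here, not something the commit message specifies.

```c
#include <errno.h>

/* Stand-ins for the OCF statuses named in the commit message.
 * The numeric values are assumptions made for this sketch. */
enum sketch_ocf_status {
    SKETCH_OCF_OK            = 0,  /* device registered with the fencer */
    SKETCH_OCF_UNKNOWN_ERROR = 1,  /* registration state cannot be trusted */
    SKETCH_OCF_NOT_RUNNING   = 7   /* device not (or no longer) registered */
};

/* Stand-ins for pcmk_ok and pcmk_err_generic (values assumed). */
#define SKETCH_PCMK_OK          0
#define SKETCH_PCMK_ERR_GENERIC 201

/* Map the per-resource stonith state to the result a probe would report. */
static int
sketch_probe_status(int stonith_state)
{
    switch (stonith_state) {
        case SKETCH_PCMK_OK:
            return SKETCH_OCF_OK;
        case -ENODEV:
            return SKETCH_OCF_NOT_RUNNING;
        default:
            /* Includes SKETCH_PCMK_ERR_GENERIC (connection lost) */
            return SKETCH_OCF_UNKNOWN_ERROR;
    }
}
```

The point of the third state is that a probe no longer has to pick between
"running" and "not running" when the executor cannot actually tell which one
is true.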

This ensures that probes after the connection is lost will fail, which is
especially important if the controller respawned at the same time as the fencer
and so didn't receive client notification of failed monitors.

As a consequence, if the executor loses its stonith connection, probes for
*all* stonith devices on that node will fail and require a stop on that node.
This may be unexpected for users accustomed to the old behavior, but it is
more correct.

Details

Provenance
kgaillot authored on Apr 10 2019, 1:43 PM
Parents
rP18b7542bc085: Merge pull request #1813 from kgaillot/fixes