Fix: libpe_status: Use pcmk_monitor_timeout as stonith start timeout

Description

A stonith start op times out when its own op timeout expires rather
than waiting for pcmk_monitor_timeout. The stonith monitor command that
the start op calls, however, is allowed the whole pcmk_monitor_timeout,
so it can keep running after the start op timeout expires. If the
monitor command does not complete before (start op timeout +
cluster-delay) expires, then the start action may be lost.

For both monitor and start, if an op times out, the "Result of" error
message says the timeout was equal to the meta timeout (default:
20000ms).

This patch sets the op timeout to the pcmk_monitor_timeout for stonith
start and probe actions.
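
The selection rule described above can be sketched roughly as follows.
This is an illustration only, not the code from the patch:
effective_timeout_ms() and its parameters are hypothetical names, a
probe is modeled as a monitor with interval 0, and a value <= 0 for
pcmk_monitor_timeout_ms is assumed to mean the parameter is unset.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a Pacemaker function): choose the timeout, in
 * milliseconds, that would be applied to an action on a resource.
 * Assumption: pcmk_monitor_timeout_ms <= 0 means "not configured". */
static int
effective_timeout_ms(bool is_stonith, const char *action,
                     unsigned int interval_ms, int op_timeout_ms,
                     int pcmk_monitor_timeout_ms)
{
    bool is_start = (strcmp(action, "start") == 0);
    /* A probe is treated here as a one-shot monitor (interval 0) */
    bool is_probe = (strcmp(action, "monitor") == 0) && (interval_ms == 0);

    if (is_stonith && (is_start || is_probe)
        && (pcmk_monitor_timeout_ms > 0)) {
        /* Stonith start and probe actions get the monitor command's limit */
        return pcmk_monitor_timeout_ms;
    }

    /* Everything else, including recurring monitors, keeps the op timeout */
    return op_timeout_ms;
}
```

With an op timeout of 20000ms and pcmk_monitor_timeout=40s, this sketch
returns 40000 for a stonith start or probe and 20000 for a recurring
stonith monitor or for any action on a non-stonith resource.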

It would be convenient to set the monitor timeout based on
pcmk_monitor_timeout during unpack_operation() too. That way, the
op->timeout gets set correctly in construct_op() later, and "Result of"
log messages show "timeout=40000ms" if pcmk_monitor_timeout=40s.
However, this can cause issues related to digest changes during rolling
upgrades. See discussion in
https://github.com/ClusterLabs/pacemaker/pull/2108.

Later, if we find a convenient way to pass the stonith action timeout
from the fencer to the executor and on to the controller dynamically, we
can remove the special handling from the scheduler. This gets the job
done for now.

Resolves: RHBZ#1856015

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Jul 12 2020, 4:21 AM
Parents
rP5abaa30c7e3a: Feature: libcrmcommon: Add fence_params capability