Fix: controller: confirm cancel of failed monitors
5470f1d9c776
Description

Usually, after a monitor has been cancelled in the executor, the
controller erases the corresponding lrm_rsc_op from the CIB, and the DC
confirms the cancel action via process_op_deletion() based on the CIB
diff.

But if a monitor has failed, its lrm_rsc_op is recorded as
"last_failure". When that monitor is cancelled, the lrm_rsc_op is
intentionally not erased from the CIB by erase_lrm_history_by_op().
As a result, the DC never gets a chance to confirm the cancel action
via process_op_deletion().

Previously, the cluster transition would get stuck waiting for the
remaining action timer to time out.

This commit fixes the issue by directly acknowledging the cancel action
in this case, which allows the DC to confirm it.
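
The decision described above can be sketched as a small standalone
model. This is only an illustration, not the controller code: the
struct, enum, and function names below are hypothetical, and the only
rule taken from this description is that a failed (last_failure) monitor
keeps its history entry when cancelled and therefore needs a direct
acknowledgement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a recurring monitor's history entry.
 * None of these names exist in Pacemaker; they are for illustration only. */
struct op_model {
    const char *key;  /* e.g. "stateful_monitor_10000" */
    bool failed;      /* true if the result is recorded as "last_failure" */
};

/* How the DC learns that a cancel has completed. */
enum confirm_path {
    CONFIRM_VIA_CIB_DIFF,   /* history entry erased; DC sees the deletion */
    CONFIRM_VIA_DIRECT_ACK  /* history entry kept; controller acks directly */
};

/* Assumption: models the rule stated above that a failed monitor's
 * history entry is deliberately kept when the monitor is cancelled. */
static bool history_would_be_erased(const struct op_model *op)
{
    assert(op != NULL);
    return !op->failed;
}

/* Pick the confirmation path for a cancelled monitor. */
static enum confirm_path cancel_confirm_path(const struct op_model *op)
{
    if (history_would_be_erased(op)) {
        /* The CIB diff will carry the deletion, so no extra ack is needed. */
        return CONFIRM_VIA_CIB_DIFF;
    }
    /* No deletion will ever appear in the CIB diff, so without a direct
     * acknowledgement the DC would wait until the action timer expires. */
    return CONFIRM_VIA_DIRECT_ACK;
}
```

In this model, a successful monitor that gets cancelled yields
CONFIRM_VIA_CIB_DIFF, while the failed slave-role monitor from the
reproducer would yield CONFIRM_VIA_DIRECT_ACK.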

This also moves the get_node_id() function into controld_utils.c for
common use.

Reproducer:

# Insert a 10s sleep in the monitor action of the RA
# /usr/lib/ocf/resource.d/pacemaker/Stateful:

 stateful_monitor() {
+    sleep 10
     stateful_check_state "master"

# Add a promotable clone resource:

crm configure primitive stateful ocf:pacemaker:Stateful \
        op monitor interval=5 role=Master \
        op monitor interval=10 role=Slave
crm configure clone p-clone stateful \
        meta promotable=true

# Wait for the resource instance to be started and promoted to master,
# and for the monitor for the master role to complete.

# Set is-managed=false for the promotable clone:
crm_resource --meta -p is-managed -v false -r p-clone

# Change the status of the master instance to slave and immediately
# force a refresh of it:
echo slave > /var/run/Stateful-stateful.state; crm_resource --refresh -r stateful --force

# Wait for the probe to complete and for the monitor for the slave role
# to be issued:
sleep 15

# While the monitor for the slave role is still in progress, change the
# status back to master:
echo master > /var/run/Stateful-stateful.state

# The monitor for the slave role returns an error. The cluster issues a
# monitor for the master role instead and tries to cancel the failed
# one for the slave role, but the cluster transition gets stuck. Only
# after the monitor timeout configured for the slave role plus
# cluster-delay has elapsed does the controller eventually report:

pacemaker-controld[21205] error: Node opensuse150 did not send cancel result (via controller) within 20000ms (action timeout plus cluster-delay)
pacemaker-controld[21205] error: [Action    1]: In-flight rsc op stateful_monitor_10000            on opensuse150 (priority: 0, waiting: none)
pacemaker-controld[21205] notice: Transition 6 aborted: Action lost