Fix: attrd: prevent leftover attributes of shutdown node in cib

Description

This commit prevents a cib_replace event from triggering a write-out of
attributes on a node that is requesting shutdown, so that no attributes
of the shutdown node are left over in the cib.
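
The guard described above can be modelled roughly as follows. This is a
minimal sketch in Python, not the actual attrd code: the class
`AttrdModel`, its methods, and the rule that any "shutdown" value other
than "0" counts as a shutdown request are all assumptions made for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttrdModel:
    """Toy model of an attribute daemon (hypothetical, not Pacemaker's types)."""

    node: str
    attrs: dict = field(default_factory=dict)    # attribute name -> value
    written: list = field(default_factory=list)  # snapshots of each write-out

    def requesting_shutdown(self) -> bool:
        # Assumption: a "shutdown" value that is set and not "0" means
        # this node has asked to be shut down.
        value = self.attrs.get("shutdown")
        return value is not None and value != "0"

    def on_cib_replaced(self) -> bool:
        """React to a cib_replace event; return True if attributes were written."""
        if self.requesting_shutdown():
            # The guard: a node on its way out must not write its
            # attributes (including "shutdown") back into the cib.
            return False
        self.written.append(dict(self.attrs))
        return True
```

With this guard, a node whose "shutdown" attribute is set ignores the
event, while any other node still refreshes its attributes as before.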

Race conditions were encountered on shutdown and startup of a node.
The affected Pacemaker is v1.1.18+, but I think the latest revision is
potentially affected too.

Node2 was shutting down. When crmd was stopped on node2, node1 erased
all the transient attributes of node2 from the cib:

Sep 21 08:33:49 [7849] node1       crmd:   notice: peer_update_callback:
Our peer on the DC (node2) is dead
Sep 21 08:33:49 [7844] node1        cib:     info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='node2']/transient_attributes: OK (rc=0,
origin=node1/crmd/114, version=0.649.144)

Then node1 became the DC and did a cib_replace:

Sep 21 08:33:49 [7849] node1       crmd:     info: update_dc:     Set DC
to node1 (3.1.0)
Sep 21 08:33:49 [7849] node1       crmd:     info: do_dc_join_finalize:
join-2: Syncing our CIB to the rest of the cluster
Sep 21 08:33:49 [7844] node1        cib:     info: cib_process_replace:
Replaced 0.649.144 with 0.649.144 from node1
Sep 21 08:33:49 [7844] node1        cib:     info: cib_process_request:
Completed cib_replace operation for section 'all': OK (rc=0,
origin=node1/crmd/126, version=0.649.144)

Meanwhile, the cib and attrd daemons on node2 had not received SIGTERM
yet and were still running. Attrd reacted to the cib_replace and wrote
all the node attributes back into the cib, including its "shutdown"
attribute:

Sep 21 08:33:49 [4444] node2      attrd:   notice: attrd_cib_replaced_cb:
Updating all attributes after cib_refresh_notify event
Sep 21 08:33:49 [4441] node2        cib:     info: cib_perform_op:
++
/cib/status/node_state[@id='14548837']/transient_attributes[@id='14548837']/instance_attributes[@id='status-14548837']:
<nvpair id="status-14548837-shutdown" name="shutdown"
value="1600677133"/>
Sep 21 08:33:49 [4444] node2      attrd:     info: attrd_cib_callback:
Update 1103 for shutdown[node2]=1600677133: OK (0)

Later, attrd received SIGTERM and shut down:

Sep 21 08:33:49 [4439] node2 pacemakerd:   notice: stop_child:    Stopping
attrd | sent signal 15 to process 4444
Sep 21 08:33:49 [4444] node2      attrd:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Sep 21 08:33:49 [4444] node2      attrd:     info: main:  Shutting down
attribute manager

When node2 started again, it cleared its node attributes from the cib,
but the cib of node1 had not joined yet at that point:

Sep 21 08:42:46 [4844] node2      attrd:     info: attrd_erase_attrs:
Clearing transient attributes from CIB |
xpath=//node_state[@uname='node2']/transient_attributes
Sep 21 08:42:46 [4841] node2        cib:     info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='node2']/transient_attributes: OK (rc=0,
origin=node2/attrd/2, version=0.649.0)

Then cib of node1 joined:

Sep 21 08:42:47 [4841] node2        cib:     info: pcmk_cpg_membership:
Node 14548836 joined group cib (counter=1.0, pid=0, unchecked for
rivals)

Soon the node attributes of node2 got back into the cib again when the
cib was synced from node1:

Sep 21 08:42:47 [4846] node2       crmd:     info: update_dc:     Set DC
to node1 (3.1.0)
Sep 21 08:42:48 [4841] node2        cib:     info: cib_process_replace:
Replaced 0.649.90 with 0.649.410 from node1
Sep 21 08:42:48 [4841] node2        cib:     info: cib_perform_op:
++
/cib/status/node_state[@id='14548837']/transient_attributes[@id='14548837']/instance_attributes[@id='status-14548837']:
<nvpair id="status-14548837-shutdown" name="shutdown"
value="1600677133"/>

As a result, the leftover "shutdown" attribute of node2 caused it to
shut down again:

Sep 21 08:42:51 [4846] node2       crmd:    error: handle_request:      We
didn't ask to be shut down, yet our DC is telling us to.
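
The sequence above can be replayed on a toy model to see why the
attribute survives the restart. The sketch below is only an
illustration under simplifying assumptions: each cib copy is a plain
dict, the write-back from attrd on node2 is assumed to land directly in
node1's copy, and the function `simulate` and its `guard` flag are
hypothetical names, not part of Pacemaker.

```python
def simulate(guard: bool) -> dict:
    """Replay the reported events; return node2's attributes after it rejoins.

    guard=True models a node that skips the write-out on cib_replace
    while it is requesting shutdown.
    """
    shutdown = {"shutdown": "1600677133"}
    cib_node1 = {"node2": dict(shutdown)}  # node1's copy of the status section
    attrd_node2 = dict(shutdown)           # attrd's in-memory values on node2

    # 1. crmd stops on node2; node1 erases node2's transient attributes.
    cib_node1["node2"] = {}

    # 2. node1 becomes DC and does a cib_replace. attrd on node2 is still
    #    running and, without the guard, writes its attributes back.
    if not guard:
        cib_node1["node2"].update(attrd_node2)

    # 3. attrd on node2 receives SIGTERM; node2 goes down.

    # 4. node2 restarts and clears its attributes, but only in its own
    #    copy, because the cib of node1 has not joined yet.
    cib_node2 = {"node2": {}}

    # 5. The cib of node1 joins and node2's copy is replaced from the DC.
    cib_node2 = {name: dict(attrs) for name, attrs in cib_node1.items()}
    return cib_node2["node2"]
```

Calling `simulate(False)` leaves the stale "shutdown" value in node2's
copy, which matches the logs above; `simulate(True)` returns an empty
set of attributes.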

Details

Provenance
gao-yan authored on Sep 25 2020, 9:47 AM
Parents
rP775afef8c6d6: Merge pull request #2171 from clumens/crm_resource-error