Fix: attrd: prevent leftover attributes of shutdown node in cib
This commit prevents a cib_replace event from triggering a write-out of
attributes on a node that is requesting shutdown, so that no attributes
of the shutdown node are left over in cib.
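The intended guard can be pictured with a minimal, self-contained sketch.
It is only an illustration under stated assumptions, not the actual attrd
code: struct attrd_state, write_all_attributes() and on_cib_replaced() are
hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for attrd's local state; not Pacemaker's real types. */
struct attrd_state {
    bool requesting_shutdown;  /* true once the local node has asked to shut down */
    int writes_triggered;      /* counts write-outs, for illustration only */
};

/* Hypothetical stand-in for writing all attributes out to the cib. */
static void write_all_attributes(struct attrd_state *state)
{
    state->writes_triggered++;
}

/* Sketch of a cib_replace handler: returns true if a write-out was triggered. */
static bool on_cib_replaced(struct attrd_state *state)
{
    assert(state != NULL);

    if (state->requesting_shutdown) {
        /* A write-out now would put back attributes that a peer just erased. */
        return false;
    }
    write_all_attributes(state);
    return true;
}
```

In this sketch a running node still refreshes its attributes after a
cib_replace, while a node that is requesting shutdown skips the write-out.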
The race conditions were encountered during shutdown and startup of a
node. The affected Pacemaker was v1.1.18+, but I think the latest
revision is potentially impacted too.
Node2 was shutting down. When crmd was stopped on node2, node1 erased
all the transient attributes of node2 from cib:
Sep 21 08:33:49 [7849] node1 crmd: notice: peer_update_callback: Our peer on the DC (node2) is dead
Sep 21 08:33:49 [7844] node1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/transient_attributes: OK (rc=0, origin=node1/crmd/114, version=0.649.144)
Then node1 became the DC and did a cib_replace:
Sep 21 08:33:49 [7849] node1 crmd: info: update_dc: Set DC to node1 (3.1.0)
Sep 21 08:33:49 [7849] node1 crmd: info: do_dc_join_finalize: join-2: Syncing our CIB to the rest of the cluster
Sep 21 08:33:49 [7844] node1 cib: info: cib_process_replace: Replaced 0.649.144 with 0.649.144 from node1
Sep 21 08:33:49 [7844] node1 cib: info: cib_process_request: Completed cib_replace operation for section 'all': OK (rc=0, origin=node1/crmd/126, version=0.649.144)
Meanwhile, the cib and attrd daemons on node2 hadn't received SIGTERM yet
and were still running. Attrd reacted to the cib_replace and wrote all
the node attributes back into cib, including its "shutdown" attribute:
Sep 21 08:33:49 [4444] node2 attrd: notice: attrd_cib_replaced_cb: Updating all attributes after cib_refresh_notify event
Sep 21 08:33:49 [4441] node2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='14548837']/transient_attributes[@id='14548837']/instance_attributes[@id='status-14548837']: <nvpair id="status-14548837-shutdown" name="shutdown" value="1600677133"/>
Sep 21 08:33:49 [4444] node2 attrd: info: attrd_cib_callback: Update 1103 for shutdown[node2]=1600677133: OK (0)
Later, attrd received SIGTERM and shut down:
Sep 21 08:33:49 [4439] node2 pacemakerd: notice: stop_child: Stopping attrd | sent signal 15 to process 4444
Sep 21 08:33:49 [4444] node2 attrd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
Sep 21 08:33:49 [4444] node2 attrd: info: main: Shutting down attribute manager
When node2 started again, it cleared its node attributes from cib, but
the cib of node1 hadn't joined yet at that point:
Sep 21 08:42:46 [4844] node2 attrd: info: attrd_erase_attrs: Clearing transient attributes from CIB | xpath=//node_state[@uname='node2']/transient_attributes
Sep 21 08:42:46 [4841] node2 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node2']/transient_attributes: OK (rc=0, origin=node2/attrd/2, version=0.649.0)
Then cib of node1 joined:
Sep 21 08:42:47 [4841] node2 cib: info: pcmk_cpg_membership: Node 14548836 joined group cib (counter=1.0, pid=0, unchecked for rivals)
Soon the node attributes of node2 got back into cib when the cib was
synced from node1:
Sep 21 08:42:47 [4846] node2 crmd: info: update_dc: Set DC to node1 (3.1.0)
Sep 21 08:42:48 [4841] node2 cib: info: cib_process_replace: Replaced 0.649.90 with 0.649.410 from node1
Sep 21 08:42:48 [4841] node2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='14548837']/transient_attributes[@id='14548837']/instance_attributes[@id='status-14548837']: <nvpair id="status-14548837-shutdown" name="shutdown" value="1600677133"/>
So the "leftover" "shutdown" attribute of node2 caused it to shut down
again:
Sep 21 08:42:51 [4846] node2 crmd: error: handle_request: We didn't ask to be shut down, yet our DC is telling us to.