Fix: pacemaker-attrd: wipe CIB along with memory

Previously, when the attribute manager purged a node, it would purge the
node's transient attributes only from memory, and assumed the controller
would purge them from the CIB. Now, the writer will purge them from the
CIB as well.

This happens by setting the values to NULL rather than dropping them
from memory. If there is a writer, it will also erase the node's entire
transient attributes section. If there is no writer, the next elected
writer will erase each value as part of its normal election-win
write-out. In either case, all nodes will drop the NULL values from
memory after getting the notification that the erasure succeeded.
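
The purge flow described above can be sketched as a toy model. This is
only an illustration under stated assumptions, not the attribute
manager's actual code: ToyCib, ToyAttrd, and all of their methods are
hypothetical names, and Python's None stands in for a NULL value.

```python
# Toy model only: ToyCib and ToyAttrd are hypothetical names, not
# Pacemaker APIs. None stands in for a NULL attribute value.

class ToyCib:
    """Stand-in for the CIB: node name -> {attribute name: value}."""

    def __init__(self):
        self.transient = {}

    def write(self, node, name, value):
        if value is None:
            # Writing a NULL value erases that single attribute
            self.transient.get(node, {}).pop(name, None)
        else:
            self.transient.setdefault(node, {})[name] = value

    def erase_node(self, node):
        # Erase the node's entire transient attributes section
        self.transient.pop(node, None)


class ToyAttrd:
    """Stand-in for one node's attribute manager."""

    def __init__(self, cib, is_writer=False):
        self.cib = cib
        self.is_writer = is_writer
        self.values = {}  # (node, attribute name) -> value or None

    def update(self, node, name, value):
        self.values[(node, name)] = value

    def purge_node(self, node):
        # Keep the entries but set them to NULL instead of dropping them
        for key in self.values:
            if key[0] == node:
                self.values[key] = None
        if self.is_writer:
            self.cib.erase_node(node)

    def win_election(self):
        # Election-win write-out: NULL values erase their CIB entries
        self.is_writer = True
        for (node, name), value in self.values.items():
            self.cib.write(node, name, value)

    def on_erase_confirmed(self, node):
        # Drop the node's NULL values once the erasure is confirmed
        self.values = {key: value for key, value in self.values.items()
                       if not (key[0] == node and value is None)}
```

In this model, every ToyAttrd instance calls purge_node() when a node is
purged, and calls on_erase_confirmed() only after the CIB erasure has
succeeded. An instance that later wins the election erases any leftover
CIB entries through win_election().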

This fixes a variety of timing issues when multiple nodes, including the
attribute writer, are shutting down. For example, if the writer leaves
before some other node, the DC wipes that other node's attributes from
the CIB when that other node leaves the controller process group (or
all other nodes do if the DC is the leaving node). If a new writer
(possibly even the node itself) is elected before the node's attribute
manager leaves the cluster layer, it will write the attributes back to
the CIB. Once the other node leaves the cluster layer, all attribute
managers remove its attributes from memory, but the values written back
are now "stuck" in the CIB.

As of this commit, the controller still erases the attributes from the CIB when
the node leaves the controller process group, which is redundant but doesn't
cause any new problems and will be corrected in an upcoming commit.

Note: This will cause an insignificant regression if backported to
Pacemaker 2. The Pacemaker 2 controller purges attributes from the CIB
for leaving DCs only if they are at version 1.1.13 or later, because
earlier DCs will otherwise get fenced after a clean shutdown. Since the
attribute manager doesn't know the DC or its version, the attributes
would now always be wiped, so leaving DCs older than 1.1.13 would get
fenced. That fencing would occur only in the highly unlikely situation
of a rolling upgrade from Pacemaker 2-supported versions 1.1.11 or
1.1.12, and the upgrade would still succeed without any negative impact
on resources.

Fixes T138