Low: libcrmcommon: Wait for replies to attrd update messages.

This fixes a bug that was affecting the Reattach ctslab test. The
problem was that unmanaged resources were stopping and starting after
the cluster was restarted as part of that test. This happened because the
master-stateful-1 attribute was not getting set on one cluster node,
causing the promotable resource to get demoted on that node and
promoted on another. Due to constraints, all other resources in the
group then followed the promotable resource to the new cluster node.
These stops and starts cause the test to fail.

The fix appears to be to wait for ACKs to attrd update messages.

I don't know exactly why the attribute fails to get set, but I have some
suspicions. From
examining the log files, sometimes crm_attribute (as run from the
Stateful resource) sends the IPC message to set the attribute, but
pacemaker-attrd never acknowledges that it has received or acted upon
the message. There is no "IPC Received" log message, nor does
additional logging even show that the dispatch function is called.
Sometimes, "IPC Received" is logged right after the message is logged as
having been sent, and sometimes it is logged much later. My suspicion
is that when the message isn't received, crm_attribute has sent the
message, disconnected from the server, and torn down the connection, all
before the server gets around to reading from it. At that point, I
suspect there is nothing left for the server to read.

Waiting for the ACK forces the client to wait for the server to read
before tearing down the connection, which appears to fix it.
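To illustrate the ordering that the ACK enforces, here is a minimal sketch in Python. It does not use Pacemaker's IPC layer or any of its APIs. The names `server`, `send_update`, and `demo`, the `ACK` token, and the payload value are all hypothetical stand-ins. A plain socket pair buffers sent data, so this does not reproduce the lost message. It only shows the client blocking until the peer has read the request before it closes the connection.

```python
import socket
import threading

ACK = b"ACK"  # hypothetical acknowledgement token, not Pacemaker's wire format


def server(conn, received):
    """Stand-in for the daemon: read one request, record it, acknowledge it."""
    data = conn.recv(1024)
    received.append(data)
    conn.sendall(ACK)
    conn.close()


def send_update(conn, payload, timeout=5.0):
    """Send payload, then block until the peer acknowledges it or we time out."""
    conn.settimeout(timeout)
    conn.sendall(payload)
    reply = b""
    while len(reply) < len(ACK):
        chunk = conn.recv(len(ACK) - len(reply))
        if not chunk:  # peer closed without a full ACK
            break
        reply += chunk
    return reply == ACK


def demo():
    client_end, server_end = socket.socketpair()
    received = []
    worker = threading.Thread(target=server, args=(server_end, received))
    worker.start()
    try:
        # The attribute value here is illustrative only.
        acked = send_update(client_end, b"master-stateful-1=10")
    finally:
        # Tear down only after the ACK (or a timeout), never right after sending.
        client_end.close()
    worker.join()
    return acked, received
```

Because `send_update` returns only once the ACK arrives, the server has necessarily read the request by the time the client closes its end.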

I don't know why we are only seeing this now, given that before my
crm_attribute IPC changes we were not waiting for ACKs either. I also
wonder whether we should be doing this everywhere, or whether there is a
deeper problem that needs to be solved.