Feature: libcrmcommon: Be more lenient in evicting IPC clients.
Each IPC connection has a message queue. If the client is unable to
process messages faster than the server is sending them, that queue
starts to back up. Pacemaker enforces a cap on the queue size, and
that's adjustable with the cluster-ipc-limit parameter. Once the queue
grows beyond that size, the client is assumed to be dead and is evicted
so it can be restarted and the queue resources freed.
However, it's possible that the client is not dead. On clusters with
very large numbers of resources (I've tried with 300, but fewer might
also cause problems), certain actions can happen that cause a spike in
IPC messages. In RHEL-76276, the action that causes this is moving
nodes in and out of standby. This spike in messages causes the server
to overwhelm the client, which is then evicted.
My multi-part IPC patches made this even worse: a CIB that is too
large for a single IPC message is now split across several, so there
are more messages than before.
This fix removes the cap on the queue size for pacemaker daemons.
As long as the server has been able to send messages to the
client, the client is still doing work and shouldn't be evicted. It may
just be processing messages slower than the server is sending them.
Note that this could lead the queue to grow without bound, eventually
crashing the server. For this reason, we're only allowing pacemaker
daemons to ignore the queue size limit.
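
The decision described above can be sketched roughly as follows. This is
only an illustration of the rule, not the code in this commit: the struct,
its fields, and sketch_should_evict() are hypothetical names invented for
the example, and "sent" is assumed to be the number of events the most
recent flush managed to deliver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a server-side IPC client record */
struct sketch_client {
    size_t queue_len;   /* events currently queued for this client */
    size_t queue_max;   /* cap, e.g. derived from cluster-ipc-limit */
    bool is_daemon;     /* true if the peer is a pacemaker daemon */
};

/* Return true if the client should be evicted after a flush attempt
 * that delivered `sent` events (assumed semantics, see lead-in).
 */
static bool
sketch_should_evict(const struct sketch_client *client, size_t sent)
{
    if (client->queue_len <= client->queue_max) {
        return false;   /* under the cap: nothing to do */
    }
    if (client->is_daemon && (sent > 0)) {
        return false;   /* daemon is slow but still making progress */
    }
    return true;        /* non-daemon over the cap, or no progress at all */
}
```

Under these assumptions, a daemon over the cap is kept only while at least
one event gets through, which matches the first caveat listed below.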
Potential problems with this approach:
- If the client is so busy that it can't receive even a single message that crm_ipcs_flush_events tries to send, it will still be evicted. However, the flush operation retries several times with a delay, giving the client time to finish what it's doing.
- We have timers all over the place with daemons waiting on replies. Because we are no longer simply evicting the clients, we may now see those timers expire, which will just lead to different problems. If so, those fixes would probably need to take place in the client code.
Fixes T38