Med: libcrmcommon: Retry on EAGAIN in crm_ipcs_flush_events.
712bb3f5354f

Description

If qb_ipcs_event_sendv returns -EAGAIN when flushing an event, we
immediately break out of the flushing while loop and won't try again
until either we send another event or the event flushing trigger fires
off. That could be up to 1.5 seconds from the last call.

In the meantime, the daemon could still be receiving messages,
processing them, generating events in response, and sticking those into
the send queue. It's possible that if each send gets an -EAGAIN,
replies will continue to back up in the queue faster than they are being
sent out. Over time, this can lead to whoever is sending the messages
giving up on the reply, eventually leading to a daemon aborting and
restarting.

I've seen this happen on a cluster where I generate 100 dummy resources
in a loop. controld will send repeated cib_query messages, each of
which based will receive almost immediately. As it sends responses
back, each one gets a -EAGAIN. This gradually slows things down until
eventually, some response takes longer than the 50s timeout that
controld sets. This causes it to log an error and restart.

Instead, if we just pause a few milliseconds and try again, it's likely
whatever caused the -EAGAIN will have cleared up and the reply will go
through. This keeps the queue from backing up. It's possible that
we need to wait a little bit longer, or try one or two more times, but
we really don't want to sleep in this function longer than is absolutely
necessary.
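
To make the idea concrete, here is a minimal sketch of a bounded
retry on -EAGAIN. It is not the code from this patch: send_fn,
send_with_retry, flaky_send, RETRY_DELAY_MS and MAX_RETRIES are all
hypothetical names, the callback only stands in for
qb_ipcs_event_sendv, and the delay and retry count are placeholder
values chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical callback standing in for qb_ipcs_event_sendv():
 * returns the number of bytes sent, or a negative errno such as -EAGAIN. */
typedef ssize_t (*send_fn)(void *conn, const void *buf, size_t len);

/* Placeholder values, not taken from the actual patch. */
#define RETRY_DELAY_MS 5
#define MAX_RETRIES    2

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

    nanosleep(&ts, NULL);
}

/* Send once; on -EAGAIN, pause briefly and retry, up to MAX_RETRIES times.
 * Any other result (success or a different error) is returned immediately. */
ssize_t send_with_retry(send_fn send, void *conn, const void *buf, size_t len)
{
    int attempt = 0;

    assert(send != NULL);
    for (;;) {
        ssize_t rc = send(conn, buf, len);

        if (rc != -EAGAIN || attempt >= MAX_RETRIES) {
            return rc;
        }
        attempt++;
        sleep_ms(RETRY_DELAY_MS);
    }
}

/* Test double: returns -EAGAIN a fixed number of times, then succeeds. */
struct flaky_conn {
    int eagain_left;
    int calls;
};

ssize_t flaky_send(void *conn, const void *buf, size_t len)
{
    struct flaky_conn *c = conn;

    (void) buf;
    c->calls++;
    if (c->eagain_left > 0) {
        c->eagain_left--;
        return -EAGAIN;
    }
    return (ssize_t) len;
}
```

Capping the retries bounds the worst-case time spent sleeping in the
flush path to MAX_RETRIES * RETRY_DELAY_MS. If the send still fails
after that, the caller gets -EAGAIN back and can leave the event
queued for the next flush, as it does today.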