Fix: controller: Avoid crash due to GSource double-free
And rename cib_retry_timer to controld_cib_retry_timer so that it
doesn't look like public CIB API.
This fixes a regression introduced by 676b547 and/or f91848e.
Issues have been observed occasionally during cts-lab runs. For example:
Jul 30 16:20:21 BadNews: 2025-07-30T16:19:46-0400 rhel9-ctslab-3 pacemaker-controld[54218]: error: crm_glib_handler: Forked child [73686] to record non-fatal assertion at logging.c:83 : Source ID 2307737477 was not found when attempting to remove it
Jul 30 16:20:21 BadNews: 2025-07-30T16:19:47-0400 rhel9-ctslab-3 pacemaker-controld[54218]: crit: GLib: Source ID 2307737477 was not found when attempting to remove it
Jul 30 16:20:23 BadNews: 2025-07-30T16:19:58-0400 rhel9-ctslab-3 pacemakerd[54211]: error: pacemaker-controld[54218] terminated with signal 6 (Aborted)
We believe this is happening because somehow we're assigning a new value
to cib_retry_timer before we run the sleep_timer callback for the first
one. So, for example:
- do_pe_invoke_callback() creates cib_retry_timer. Call it timer-1.
- do_pe_invoke_callback() creates cib_retry_timer. Call it timer-2. Now cib_retry_timer no longer points to timer-1.
- sleep_timer gets called for timer-1 . It calls mainloop_timer_del(cib_retry_timer). This destroys timer-2.
- sleep_timer gets called for timer-2. It calls mainloop_timer_del(cib_retry_timer). This tries to destroy timer-2 again, which fails.
Commit 676b547 removed a blocking wait, so there's probably a way for us
to re-enter do_pe_invoke_callback() before the first wait expires.
So I think we should be NULL-ing cib_retry_timer after we delete it, and
also making sure it's currently NULL before we assign to it in
do_pe_invoke_callback(). We can do this by returning early from
do_pe_invoke(). That way we don't lose the reference to our existing
timer.
I hope the controller doesn't end up in a busy-loop of just trying to
invoke the scheduler over and over until the cib_retry_timer expires. If
so, then we'd need to modify whatever code is setting A_PE_INVOKE, or
something like that, to try to get the controller to do other useful
work.
Even if we're incorrect about the cause of the crashes, this commit
seems like a good practice.
Closes T990
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>