HomeClusterLabs Projects

Fix: controller: Avoid crash due to GSource double-free

Description

Fix: controller: Avoid crash due to GSource double-free

And rename cib_retry_timer to controld_cib_retry_timer so that it
doesn't look like public CIB API.

This fixes a regression introduced by 676b547 and/or f91848e.

Issues have been observed occasionally during cts-lab runs. For example:

Jul 30 16:20:21 BadNews: 2025-07-30T16:19:46-0400 rhel9-ctslab-3 pacemaker-controld[54218]: error: crm_glib_handler: Forked child [73686] to record non-fatal assertion at logging.c:83 : Source ID 2307737477 was not found when attempting to remove it
Jul 30 16:20:21 BadNews: 2025-07-30T16:19:47-0400 rhel9-ctslab-3 pacemaker-controld[54218]: crit: GLib: Source ID 2307737477 was not found when attempting to remove it
Jul 30 16:20:23 BadNews: 2025-07-30T16:19:58-0400 rhel9-ctslab-3 pacemakerd[54211]: error: pacemaker-controld[54218] terminated with signal 6 (Aborted)

We believe this is happening because somehow we're assigning a new value
to cib_retry_timer before we run the sleep_timer callback for the first
one. So, for example:

  1. do_pe_invoke_callback() creates cib_retry_timer. Call it timer-1.
  2. do_pe_invoke_callback() creates cib_retry_timer. Call it timer-2. Now cib_retry_timer no longer points to timer-1.
  3. sleep_timer gets called for timer-1 . It calls mainloop_timer_del(cib_retry_timer). This destroys timer-2.
  4. sleep_timer gets called for timer-2. It calls mainloop_timer_del(cib_retry_timer). This tries to destroy timer-2 again, which fails.

Commit 676b547 removed a blocking wait, so there's probably a way for us
to re-enter do_pe_invoke_callback() before the first wait expires.

So I think we should be NULL-ing cib_retry_timer after we delete it, and
also making sure it's currently NULL before we assign to it in
do_pe_invoke_callback(). We can do this by returning early from
do_pe_invoke(). That way we don't lose the reference to our existing
timer.

I hope the controller doesn't end up in a busy-loop of just trying to
invoke the scheduler over and over until the cib_retry_timer expires. If
so, then we'd need to modify whatever code is setting A_PE_INVOKE, or
something like that, to try to get the controller to do other useful
work.

Even if we're incorrect about the cause of the crashes, this commit
seems like a good practice.

Closes T990

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2Authored on Thu, Jul 31, 1:33 AM
Parents
rP459d0563e340: Merge pull request #3926 from nrwahl2/nrwahl2-cibadmin
Branches
Unknown
Tags
Unknown
Tasks
T990: Bug in removing glib source