Page MenuHomeClusterLabs Projects

crm_mon in interactive mode segfaults during cluster shutdown
Closed (Released)Public

Description

The backtrace is as follows:

#0  0x00007fd23dbe35c0 stonith_send_notification (libstonithd.so.26 + 0xa5c0)
#1  0x00007fd23d7bb810 g_list_foreach (libglib-2.0.so.0 + 0x56810)
#2  0x00007fd23dbdeead foreach_notify_entry (libstonithd.so.26 + 0x5ead)
#3  0x00007fd23dbdefd1 stonith_connection_destroy (libstonithd.so.26 + 0x5fd1)
#4  0x00007fd23db90c12 mainloop_gio_destroy (libcrmcommon.so.34 + 0x32c12)
#5  0x00007fd23d7bb4b5 g_source_callback_unref (libglib-2.0.so.0 + 0x564b5)
#6  0x00007fd23d7bf9a4 g_source_destroy_internal (libglib-2.0.so.0 + 0x5a9a4)
#7  0x00007fd23d7c0f79 g_main_context_dispatch_unlocked.lto_priv.0 (libglib-2.0.so.0 + 0x5bf79)
#8  0x00007fd23d81bdd8 g_main_context_iterate_unlocked.isra.0 (libglib-2.0.so.0 + 0xb6dd8)
#9  0x00007fd23d7c2447 g_main_loop_run (libglib-2.0.so.0 + 0x5d447)
#10 0x000055c909b97a96 main (crm_mon + 0x5a96)
#11 0x00007fd23d8d614a __libc_start_call_main (libc.so.6 + 0x2814a)
#12 0x00007fd23d8d620b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2820b)
#13 0x000055c909b97ff5 _start (crm_mon + 0x5ff5)

This is consistently reproducible using the current main branch. I haven't established a known good version yet, but I don't believe this has always happened.

Event Timeline

nrwahl2 triaged this task as Unbreak Now! priority.Jan 15 2024, 2:56 PM
nrwahl2 created this task.
nrwahl2 created this object with edit policy "Restricted Project (Project)".

I've reproduced this on 2.1.0, 2.1.2, and 2.1.7, so it doesn't appear to be a regression. Additionally, it only happens at shutdown, so there's no tangible impact except that it won't reconnect when the cluster starts back... which is not great.

Still, I have no idea how we haven't noticed this before. This especially should have come up while I worked on T15. I feel certain I've seen crm_mon handle the cluster stopping in the past without a segfault.

I need to verify this, but here's what I think is happening:

  1. Mainloop destroys stonith API connection, calling stonith_connection_destroy() on the stonith client.
  2. The stonith client sends out T_STONITH_NOTIFY_DISCONNECT notifications via foreach_notify_entry() -> stonith_send_notification().
  3. One notification triggers crm_mon.c:mon_st_callback_display() (or event()), which calls mon_cib_connection_destroy().
  4. mon_cib_connection_destroy() calls stonith_api_delete(), which calls stonith_api_free() and destroys the entire client object.
  5. Control returns to stonith_send_notification() and foreach_notify_entry().
  6. One of the further attempts to access members of the stonith client object causes a segfault, since the object and all the members that it owns have been freed.

Assuming that is the case...

I'm thinking about adding a rudimentary reference count to the stonith_t object in the private member. We're already doing that for the notify list with notify_refcnt.

The idea would be to initialize the ref count to 1 in stonith_api_new(), increment it somewhere (maybe in foreach_notify_entry() like we do with notify_refcnt), and then decrement it in stonith_api_free(). If the resulting ref count is 0, then stonith_api_free() can go ahead and destroy the object. Otherwise, it sets a flag that says "call stonith_api_free() again and destroy the stonith_t object at the next opportunity, when the ref count is 0." I'd just have to figure out exactly when it becomes safe to make that automatic second call to stonith_api_free().

I don't really like this. It's more convoluted than I'd prefer. However, it might be the simplest *general* solution. If we just reorganize the crm_mon code that frees the stonith client, we might run into this again in some other context.

The problem is the notify mechanism. The stonith_t object sends out synchronous notifications. The notify client might be the owner of the stonith_t object, and it may need to free it in response to the notification. This is unsafe because control will then return to the foreach_notify_entry(), and something will try to access the stonith_t object again. We could make it safe by marking the stonith_t object for later destruction if something else is still using it, rather than freeing it immediately.

An alternative might be to find a way to reorganize crm_mon as mentioned above, so that this doesn't happen, and then to exercise strict discipline in the future. But these callback sequences get convoluted fast, so it's easy to overlook some possibility. Plus the complexity of crm_mon itself...

On second thought, although the ref count might be the easiest way to fix this, it would be totally reasonable to say "no function that's registered as a stonith notify callback should ever free the stonith_t object."

Ugh, I just dread restructuring crm_mon. So many pieces that don't make sense on the surface. "Why is X here?" "Why do we have to free the entire stonith_t object and create a new one, instead of just reconnecting (and maybe re-registering notifications if the old ones went away)?" Etc.

Not reproducible on RHEL 9.3 with the latest packages. Also not reproducible on Fedora 37.

This issue begins (in the Red Hat/Fedora ecosystem) on Fedora 38.

I have not been able to identify changes in gcc default settings, if that would even influence this. Maybe something in the kernel, IDK.

On Fedora 38/39, the issue starts with commit bc91cc5 (Pacemaker 2.1.0). Previously, mon_cib_connection_destroy() removed all stonith client notifications and disconnected the client, but it didn't delete the client object. As of that commit, it also deletes the client object.

So when control returns to st_client.c:foreach_notify_entry(), the stonith client object and its members been freed, and we hit a segmentation fault when accessing one of the entries.

I don't know if we'll want to backport a fix, and if so how far. RHEL isn't affected currently, and the same probably goes for SLES. Maybe Debian too. By the time RHEL gets whatever non-Pacemaker package introduces the segfault in F38/F39, Pacemaker will be patched. More up-to-date distros like Fedora will be hit, but hopefully they can upgrade Pacemaker to the fixed version.

kgaillot changed the task status from Merged to Released.Aug 8 2024, 7:13 PM