Fix: tools: crm_mon segfaults when fencer connection is lost

This is easiest to observe when Pacemaker is stopping.

When crm_mon is running in interactive mode (the default) and the
cluster is stopped, crm_mon crashes with a segmentation fault. This is a
regression that was introduced in Pacemaker 2.1.0 by commit bc91cc5.

However, for some reason the crash doesn't happen on all platforms. In
particular, I can reproduce the crash on Fedora 38 and 39, but not on
RHEL 9.3 or Fedora 37. This is independent of the Pacemaker version.

The cause is a use-after-free. In detail, it is as follows:

1. crm_mon registers a notification via its stonith API client for
   disconnect events. This notification will call either
   mon_st_callback_event() or mon_st_callback_display(), depending on
   the CLI options. Both of these callbacks call
   mon_cib_connection_destroy() for disconnect notifications, so it
   doesn't matter which one is used.
2. When the fencer connection is lost, the mainloop calls the stonith
   API client's destroy callback (stonith_connection_destroy()).
3. stonith_connection_destroy() sets the state to stonith_disconnected
   and calls foreach_notify_entry(..., stonith_send_notification, blob),
   where blob contains a disconnect notification.
4. foreach_notify_entry() loops over all the registered notify entries,
   calling stonith_send_notification(entry, blob) for each notify entry.
5. For each notify client that's subscribed to disconnect
   notifications, stonith_send_notification() calls the registered
   callback function.
6. Based on the registration in step (1), stonith_send_notification()
   synchronously calls mon_st_callback_event()/display() for crm_mon.
7. mon_st_callback_event()/display() calls
   mon_cib_connection_destroy().
8. mon_cib_connection_destroy() calls stonith_api_delete(), which frees
   the stonith API client and its members, including the notification
   table.
9. Control returns to stonith_send_notification() and then back to
   foreach_notify_entry().
10. foreach_notify_entry() moves to the next entry in the list. But the
    entire list was freed in step (8). So when it tries to access a
    member of one of the entries, we get a segmentation fault.
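
The call chain above can be modeled with a small standalone sketch. This
is not Pacemaker code: struct client, client_dispatch(), on_disconnect()
and run_demo() are hypothetical stand-ins for the stonith API client,
foreach_notify_entry() and the crm_mon callbacks. It shows one way a
callback can ask for deletion without freeing the list that is still
being walked. It is an illustration of the hazard, not necessarily the
change made by this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are Pacemaker types. */
struct client;
typedef void (*notify_fn)(struct client *c);

struct notify_entry {
    notify_fn cb;
    struct notify_entry *next;
};

struct client {
    struct notify_entry *entries;  /* registered notify entries */
    bool delete_requested;         /* set by a callback that wants the client gone */
    int calls;                     /* number of callbacks invoked so far */
};

/* Prepend a notify entry; returns false if allocation fails. */
static bool client_add_notify(struct client *c, notify_fn cb)
{
    struct notify_entry *e = malloc(sizeof(*e));

    if (e == NULL) {
        return false;
    }
    e->cb = cb;
    e->next = c->entries;
    c->entries = e;
    return true;
}

/* Free the client and every entry (the role stonith_api_delete() plays). */
static void client_free(struct client *c)
{
    struct notify_entry *e = c->entries;

    while (e != NULL) {
        struct notify_entry *next = e->next;

        free(e);
        e = next;
    }
    free(c);
}

/* Disconnect callback. Calling client_free(c) here would free the list
 * that client_dispatch() is still walking, which is the bug described
 * in steps (8) through (10). Record the request instead. */
static void on_disconnect(struct client *c)
{
    c->calls++;
    c->delete_requested = true;
}

/* Walk every entry (the role foreach_notify_entry() plays). Returns
 * true if some callback asked for the client to be deleted. */
static bool client_dispatch(struct client *c)
{
    assert(c != NULL);
    for (struct notify_entry *e = c->entries; e != NULL; e = e->next) {
        e->cb(c);
    }
    return c->delete_requested;
}

/* Register two callbacks, dispatch once, then free after the walk.
 * Returns the number of callbacks invoked, or -1 on allocation failure. */
int run_demo(void)
{
    struct client *c = malloc(sizeof(*c));
    int calls;

    if (c == NULL) {
        return -1;
    }
    c->entries = NULL;
    c->delete_requested = false;
    c->calls = 0;

    if (!client_add_notify(c, on_disconnect)
        || !client_add_notify(c, on_disconnect)) {
        client_free(c);
        return -1;
    }

    (void) client_dispatch(c);
    calls = c->calls;
    client_free(c);  /* safe: no walk over the entries is in progress */
    return calls;
}
```

If on_disconnect() called client_free() directly, the e->next read in
client_dispatch() would touch freed memory on the next iteration.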

Commit bc91cc5 introduced the regression by deleting the stonith API
client in mon_cib_connection_destroy(). Prior to that,
mon_cib_connection_destroy() only disconnected the client and marked its
notify entries for removal.

I audited the other uses of stonith_api_delete() in crm_mon and
elsewhere, and I believe they're safe in the sense that they're never
called while we're processing stonith notify callbacks. A function
should never be allowed to call stonith_api_delete() if the stonith API
client might be sending out notifications. If there are more
notifications in the table, attempts to access them will be a
use-after-free.
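
The rule in the preceding paragraph could in principle be enforced
rather than audited. The sketch below is purely hypothetical and does
not reflect anything in the Pacemaker API: struct guarded_client,
guarded_notify() and guarded_delete() are made-up names. The client
tracks whether a notification is in flight and refuses deletion while
one is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical client that knows when it is sending notifications. */
struct guarded_client {
    int dispatch_depth;                     /* > 0 while a callback is running */
    void (*cb)(struct guarded_client *c);   /* single registered callback */
    bool refused;                           /* records a refused delete attempt */
};

/* Free the client unless a notification is in flight.
 * Returns true if the client was freed. */
static bool guarded_delete(struct guarded_client *c)
{
    if (c->dispatch_depth > 0) {
        return false;  /* caller must retry after dispatch finishes */
    }
    free(c);
    return true;
}

/* Invoke the callback with the in-flight counter raised. */
static void guarded_notify(struct guarded_client *c)
{
    c->dispatch_depth++;
    if (c->cb != NULL) {
        c->cb(c);
    }
    c->dispatch_depth--;
}

/* Callback that tries to delete the client mid-notification. */
static void deleting_callback(struct guarded_client *c)
{
    c->refused = !guarded_delete(c);
}

/* Returns a bitmask: 1 if the in-callback delete was refused, plus 2 if
 * the delete after dispatch succeeded; -1 on allocation failure. */
int run_guard_demo(void)
{
    struct guarded_client *c = malloc(sizeof(*c));
    bool refused;
    bool deleted;

    if (c == NULL) {
        return -1;
    }
    c->dispatch_depth = 0;
    c->cb = deleting_callback;
    c->refused = false;

    guarded_notify(c);
    refused = c->refused;        /* read before the client is freed */
    deleted = guarded_delete(c); /* no notification in flight now */
    return (refused ? 1 : 0) + (deleted ? 2 : 0);
}
```

A counter rather than a boolean keeps the guard correct if a callback
triggers a nested notification.
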

Fixes T751

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>