Fix: fencer: Prevent double g_source_remove of op_timer_one

Description

QE observed a rarely reproducible core dump in the fencer during
Pacemaker shutdown, in which we try to g_source_remove() an op timer
that's already been removed.

free_stonith_remote_op_list()
-> g_hash_table_destroy()
-> g_hash_table_remove_all_nodes()
-> clear_remote_op_timers()
-> g_source_remove()
-> crm_glib_handler()
-> "Source ID 190 was not found when attempting to remove it"

The likely cause is that request_peer_fencing() doesn't set
op->op_timer_one to 0 after calling g_source_remove() on it, so if that
op is still in the stonith_remote_op_list at shutdown with the same
timer, clear_remote_op_timers() tries to remove the source for
op_timer_one again.

There are only five locations that call g_source_remove() on a
remote_fencing_op_t timer.

  • Three of them are in clear_remote_op_timers(), which first 0-checks the timer and then sets it to 0 after g_source_remove().
  • One is in remote_op_query_timeout(), which does the same.
  • The last is the one we fix here in request_peer_fencing().
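The check-remove-reset pattern described above can be sketched as follows. This is an illustration only, not the actual fencer code: demo_op_t, demo_source_add(), demo_source_remove(), and the two stop_timer_*() helpers are hypothetical names, and the tiny source table is a stand-in for GLib's main-loop source registry so that the sketch compiles without GLib. Like g_source_remove(), the stand-in reports a failure when asked to remove an ID that is no longer registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for GLib's source registry (assumption: illustration only).
 * Index = source ID; ID 0 is never handed out, mirroring GLib, where a
 * valid source ID is always greater than 0. */
#define DEMO_MAX_SOURCES 16
static bool demo_active[DEMO_MAX_SOURCES];
static unsigned int demo_next_id = 1;
static unsigned int demo_failed_removals = 0;

/* Register a new fake source and return its ID */
static unsigned int demo_source_add(void)
{
    assert(demo_next_id < DEMO_MAX_SOURCES);
    demo_active[demo_next_id] = true;
    return demo_next_id++;
}

/* Remove a fake source; fail (and count the failure) if the ID is unknown */
static bool demo_source_remove(unsigned int id)
{
    if ((id == 0) || (id >= DEMO_MAX_SOURCES) || !demo_active[id]) {
        demo_failed_removals++;
        fprintf(stderr, "Source ID %u was not found when attempting to "
                        "remove it\n", id);
        return false;
    }
    demo_active[id] = false;
    return true;
}

/* Hypothetical, pared-down stand-in for remote_fencing_op_t */
typedef struct {
    unsigned int op_timer_one;
} demo_op_t;

/* Problem pattern: the source is removed but the stale ID stays in the op,
 * so any later cleanup that sees a nonzero ID removes it a second time. */
static void stop_timer_stale(demo_op_t *op)
{
    if (op->op_timer_one != 0) {
        demo_source_remove(op->op_timer_one);
    }
}

/* Safe pattern: 0-check, remove, then reset to 0 so later cleanup skips it */
static void stop_timer_reset(demo_op_t *op)
{
    if (op->op_timer_one != 0) {
        demo_source_remove(op->op_timer_one);
        op->op_timer_one = 0;
    }
}
```

Calling stop_timer_stale() twice on the same op triggers one failed removal on the second call, while calling stop_timer_reset() twice is harmless because the second call sees 0 and does nothing.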

I don't know all the conditions of QE's test scenario at this point.
What I do know:

  • have-watchdog=true
  • stonith-watchdog-timeout=10
  • no explicit topology
  • fence agent script is missing for the configured fence device
  • requested fencing of one node
  • cluster shutdown

Fixes RHBZ2166967

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Feb 3 2023, 3:08 PM
Parents
rP11c15a89fafa: Merge pull request #2904 from waltdisgrace/t516