Fix: lrmd: cancel currently pending STONITH op if stonithd connection is lost

Description

The currently pending op is moved from rsc->pending_ops to rsc->active
(if it is asynchronous). Therefore, that also needs to be cleaned up if
the stonithd connection fails. Otherwise, the resource gets stuck forever
on an op that will never complete.

Example, interacting with the (long fixed) bug mentioned in pull/334:

  1. lrmd gets a start action, which becomes rsc->active
  2. stonithd times out on the action
  3. stonithd gets SIGTERMed due to the aforementioned bug
  4. lrmd notices and attempts to clean up the pending ops, but misses rsc->active
  5. start times out in crmd, attempts to recover
  6. lrmd gets a stop action, gets put into rsc->pending_ops
  7. lrmd never runs the action since rsc->active is non-NULL
  8. stop times out in crmd and the host gets STONITHed due to a failed stop (even though a stonith resource stop is basically a no-op!)
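
The sequence above can be modelled with a small, self-contained C sketch. This is not the actual lrmd code: the `toy_*` types and functions are hypothetical stand-ins, and the only things taken from the commit message are the two fields `pending_ops` and `active` and the rule that a queued op is not dispatched while `active` is non-NULL. The `clear_active` flag switches between the old cleanup (queue only) and the fixed cleanup (queue plus in-flight op).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not Pacemaker's real types. */
#define TOY_MAX_PENDING 8

typedef struct {
    const char *action;  /* e.g. "start" or "stop" */
    bool failed;         /* set when the op is cancelled on connection loss */
} toy_op_t;

typedef struct {
    toy_op_t *pending_ops[TOY_MAX_PENDING]; /* queued, not yet dispatched */
    size_t n_pending;
    toy_op_t *active;                       /* async op in flight, or NULL */
} toy_rsc_t;

/* Append an op to the queue; returns false if the queue is full. */
static bool toy_queue(toy_rsc_t *rsc, toy_op_t *op)
{
    assert(rsc != NULL && op != NULL);
    if (rsc->n_pending == TOY_MAX_PENDING) {
        return false;
    }
    rsc->pending_ops[rsc->n_pending++] = op;
    return true;
}

/* Move the next queued op to active, unless another op is still in flight.
 * Returns true if an op was dispatched. */
static bool toy_dispatch(toy_rsc_t *rsc)
{
    assert(rsc != NULL);
    if (rsc->active != NULL || rsc->n_pending == 0) {
        return false;
    }
    rsc->active = rsc->pending_ops[0];
    for (size_t i = 1; i < rsc->n_pending; i++) {
        rsc->pending_ops[i - 1] = rsc->pending_ops[i];
    }
    rsc->n_pending--;
    return true;
}

/* Cleanup when the stonithd connection is lost. With clear_active == false
 * only the queue is failed (the old behaviour); with clear_active == true
 * the in-flight op is failed and cleared as well (the behaviour the fix
 * describes). */
static void toy_connection_lost(toy_rsc_t *rsc, bool clear_active)
{
    assert(rsc != NULL);
    for (size_t i = 0; i < rsc->n_pending; i++) {
        rsc->pending_ops[i]->failed = true;
    }
    rsc->n_pending = 0;
    if (clear_active && rsc->active != NULL) {
        rsc->active->failed = true;
        rsc->active = NULL;
    }
}

/* Replay steps 1-7 and report whether the stop op ever gets dispatched. */
static bool toy_stop_runs(bool clear_active)
{
    toy_op_t start = { "start", false };
    toy_op_t stop = { "stop", false };
    toy_rsc_t rsc = { { NULL }, 0, NULL };

    toy_queue(&rsc, &start);
    toy_dispatch(&rsc);                      /* step 1: start becomes active */
    toy_connection_lost(&rsc, clear_active); /* steps 3-4: stonithd is gone */
    toy_queue(&rsc, &stop);                  /* step 6: stop is queued */
    return toy_dispatch(&rsc);               /* step 7: does stop run? */
}
```

In this model, `toy_stop_runs(false)` leaves the start op in `active` and the stop is never dispatched, which corresponds to step 7; `toy_stop_runs(true)` clears `active` during cleanup, so the stop can proceed.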

Details

Provenance
Hector Martin <marcan@marcan.st>, authored on Jun 11 2015, 4:03 AM
Parents
rPf3a69d9a54e0: Merge pull request #718 from kgaillot/cleanup
Branches
Unknown
Tags
Unknown