[links] stabilize latency calculation when nodes are not responsive

Description

The following scenario is more of a corner case than the norm, but
this change handles it better:

  1. 2-node cluster (corosync) (node A and node B)
  2. kill -stop $(pidof corosync) on node A
  3. node B will continue to send ping packets to node A
  4. node A is accumulating those ping packets in the kernel network socket
  5. wait some seconds and unpause node A
  6. node A will start processing the ping packets in the queue and send pong replies to node B
  7. node B will see an extreme increase in latency due to those "obsolete" ping/pong packets
  8. node B, as latency increases, will take longer and longer to notice that node A is down, because pong_timeout is adjusted for latency (an adjustment required for the initial cluster spike).
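
As a rough illustration of steps 3 to 7, the sketch below models the latencies node B would measure once node A resumes. It is a simplified Python model, not kronosnet code: the function name, the millisecond units, and the fixed ping interval are assumptions made for the example.

```python
def stale_pong_latencies(pause_ms, ping_interval_ms, normal_rtt_ms):
    """Latencies (ms) node B measures for pings queued while node A was stopped.

    Hypothetical model: node B sends one ping every ping_interval_ms starting
    at t=0, node A is stopped for the whole interval [0, pause_ms) and answers
    every queued ping as soon as it resumes.
    """
    latencies = []
    for sent_at in range(0, pause_ms, ping_interval_ms):
        # The ping sat in node A's socket buffer until t = pause_ms,
        # then the pong took the usual round-trip time to come back.
        latencies.append((pause_ms - sent_at) + normal_rtt_ms)
    return latencies


# Node A paused for 5 seconds, one ping per second, 2 ms normal round trip.
latencies = stale_pong_latencies(pause_ms=5000, ping_interval_ms=1000, normal_rtt_ms=2)
print(latencies)  # [5002, 4002, 3002, 2002, 1002]
```

Under these assumptions the oldest queued ping produces a latency sample roughly equal to the whole pause, which is the spike described in step 7.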

the solution:

  1. Use average latency instead of latency_max to calculate pong_timeout_adj. Average latency will go down again over time, while latency_max is never reset.
  2. RX thread will filter out all pong packets that have a higher latency than the currently configured pong_timeout. This barrier should have been in place all along.
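
The two points above can be sketched with a small model. This is only an illustration in Python, not the kronosnet implementation: the class, its method, and the use of a plain cumulative mean for the average are assumptions, and the real averaging formula may differ.

```python
class LinkLatency:
    """Hypothetical per-link latency tracker (all values in milliseconds)."""

    def __init__(self, pong_timeout_ms):
        self.pong_timeout_ms = pong_timeout_ms
        self.latency_max = 0
        self.latency_avg = 0.0
        self.samples = 0

    def on_pong(self, latency_ms):
        """Record a pong; return False if it was filtered out as stale."""
        # Point 2: drop pongs slower than the configured pong_timeout.
        if latency_ms > self.pong_timeout_ms:
            return False
        self.samples += 1
        # Assumption: cumulative mean, updated incrementally.
        self.latency_avg += (latency_ms - self.latency_avg) / self.samples
        # latency_max only ever grows; nothing resets it.
        self.latency_max = max(self.latency_max, latency_ms)
        return True


link = LinkLatency(pong_timeout_ms=2000)
for sample in [2, 2, 2, 2]:                    # healthy link
    link.on_pong(sample)
for sample in [5002, 4002, 3002, 2002, 1002]:  # stale pongs after unpause
    link.on_pong(sample)                       # only 1002 passes the filter
print(link.latency_max, link.latency_avg)      # 1002 202.0
for sample in [2, 2, 2, 2, 2]:                 # link is healthy again
    link.on_pong(sample)
# latency_max is still 1002, while latency_avg has dropped back towards 2.
```

In this model the filter caps how large a single sample can be, and the average recovers as fresh samples arrive (point 1), whereas a timeout adjustment based on latency_max would stay inflated.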

this solution reduces the latency spike on node B to a perfectly
reasonable level, and it will all eventually stabilize over time:
as more latency samples are collected, the average latency goes back down.

Please be aware that, with this change, using a pong_timeout smaller
than the link latency will simply cause the link to be marked down.

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>

Details

Provenance
fabbione authored on Sep 6 2019, 1:05 AM
Parents
rKeba56bb9a905: [handle] Set thread stack size on create