Med: scheduler: Don't always fence online remote nodes.
Let's assume you have a cluster configured as follows:
- Three nodes, plus one Pacemaker Remote node.
- At least two NICs on each node.
- Multiple layers of fencing, including fence_kdump.
- The timeout for fence_kdump is set higher on the real nodes than it is on the remote node.
- A resource is configured that can only be run on the remote node.
Now, let's assume that the node running the connection resource for the
remote node is disconnected from the rest of the cluster. In testing,
this disconnection was done by bringing one network interface down.
Due to the fence timeouts, the following things will occur:
- The node whose interface was brought down will split off into its own cluster partition without quorum, while the other two nodes maintain quorum.
- The partition with quorum will restart the remote node's connection resource on another real node in the partition.
- Fencing of the node by itself will be initiated. However, due to the long fence_kdump timeout, that node will continue to make decisions regarding resources until fencing completes.
- The node by itself will re-assign resources, including the remote connection resource, which it will assign back to itself.
- The node by itself will decide to fence the remote node, which will hit the "in our membership" clause of pe_can_fence. This is because remote nodes are marked as online when they are assigned, not when they are actually running.
- When the fence_kdump timeout expires, the node by itself will fence the remote node. This fencing succeeds because there is still a secondary network connection it can use, causing the remote node to reboot and resulting in a loss of service.
- The node by itself will then be fenced.
The bug to me seems to be that the remote node is marked as online
when it isn't yet. I think with that changed, all the other remote
fencing related code would then work as intended. However, it probably
has to remain as-is in order to schedule resources on the remote node -
resources probably can't be assigned to an offline node. Making changes
in pe_can_fence seems like the least invasive way to deal with this
problem.
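
To make the mechanism concrete, here is a minimal toy model of the
check described above. This is not Pacemaker's actual pe_can_fence
code; the struct, fields, and helper names are invented for
illustration:

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of scheduler state; real Pacemaker types differ */
    struct toy_node {
        const char *name;
        bool is_remote;           /* Pacemaker Remote node? */
        bool connection_assigned; /* connection resource assigned somewhere */
        bool connection_active;   /* connection actually established */
    };

    /* Toy version of the "in our membership" clause: a remote node
     * counts as a member as soon as its connection resource is
     * assigned, ignoring connection_active -- which is the behavior
     * described above as the root of the problem */
    static bool toy_can_fence(const struct toy_node *n, bool have_quorum)
    {
        if (n->is_remote && n->connection_assigned) {
            return true;  /* treated as "in our membership" */
        }
        return have_quorum;
    }

    int main(void)
    {
        struct toy_node remote = { "remote1", true, true, false };

        /* A quorumless partition still passes the gate because the
         * remote node's connection resource has merely been assigned */
        printf("can fence remote1: %s\n",
               toy_can_fence(&remote, false) ? "yes" : "no");
        return 0;
    }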
I also think this has probably been here for a very long time -
perhaps always - but we just haven't seen it due to the number of things
that have to be configured before it can show up. In particular, the
fencing timeouts and secondary network connection are what allow this
behavior to happen.
I can't think of a good reason why a node without quorum would ever want
to fence a remote node, especially if the connection resource has been
moved to the partition with quorum.
My fix here therefore is just to test whether there is another node the
connection resource could have been moved to, and if so, not fence the
remote node.
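
As a rough sketch of that test, extending the toy model above (again,
invented names and scoring, not the actual patch):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy model: nodes the connection resource is allowed to run on,
     * with placement scores; a negative score means "not allowed" */
    struct toy_allowed_node {
        const char *name;
        int score;
    };

    /* The fix, in toy form: a node without quorum should not fence a
     * remote node if the connection resource could have been moved to
     * some other node, presumably one in the partition with quorum */
    static bool toy_should_fence_remote(const char *local_node,
                                        const struct toy_allowed_node *allowed,
                                        size_t n_allowed, bool have_quorum)
    {
        if (have_quorum) {
            return true;
        }
        for (size_t i = 0; i < n_allowed; i++) {
            if ((allowed[i].score >= 0)
                && (strcmp(allowed[i].name, local_node) != 0)) {
                return false;  /* another node could host the connection */
            }
        }
        return true;  /* nowhere else to go, so fencing may still apply */
    }

    int main(void)
    {
        const struct toy_allowed_node allowed[] = {
            { "node1", 0 }, { "node2", 0 }, { "node3", 0 },
        };

        /* node1 is alone without quorum, but node2 and node3 could
         * host the connection resource, so node1 skips the fencing */
        printf("fence remote node: %s\n",
               toy_should_fence_remote("node1", allowed, 3, false)
               ? "yes" : "no");
        return 0;
    }

In Pacemaker itself, the set of candidate nodes would presumably come
from the scheduler's view of where the connection resource is allowed
to run, rather than a flat array like this.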
Fixes T978
Fixes RHEL-84018