Fix: libpacemaker: Don't shuffle clone instances unnecessarily

Description

Currently, clone instances may be shuffled under certain conditions,
causing unnecessary resource downtime when an instance is moved away
from the node where it is currently running.

For example, this can happen when a stopped promotable instance is
scheduled to promote and the stickiness is lower than the promotion
score (see the clone-recover-no-shuffle-7 test). Instance 0 gets
assigned first and goes to the node where the promotion will happen. If
instance 0 is already running on a different node, it must stop there
before it can start on the new node. Another instance may then start in
its place.

The fix is to assign an instance to its current node during the early
assignment phase, if that node is going to receive any instance at all.
If the node will receive an instance, it should receive its current
instance.

The approach is described in detail in the code comments.

Previously, if instance 0 was running on node1 and got assigned to node2
during the early assignment phase (due to node2 having a higher score),
we backed out and immediately gave up on assigning instance 0 early.

Now, we increment a "number of instances reserved" counter, as well as
the parent's counter of instances assigned to node2. We then try again
to assign instance 0 to node1. If node2 already has the max allowed
number of instances, then it will be marked unavailable for this round.
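
The reserve-and-retry idea above can be illustrated with a toy model. This
is only a sketch under stated assumptions, not the libpacemaker code: the
function name assign_early, the plain score dictionary, the tie-break in
favor of the current node, and the use of "available" (instances still
left to place) as the bound on reservations are all hypothetical. The
sketch also applies reservations to a local copy of the counts, so they
last only for this one instance's round.

```python
from typing import Dict, Optional


def assign_early(current: str,
                 scores: Dict[str, int],
                 counts: Dict[str, int],
                 max_per_node: int,
                 available: int) -> Optional[str]:
    """Toy model: try to keep one instance on its current node.

    scores       -- hypothetical per-node preference (higher wins)
    counts       -- instances already assigned to each node by the parent
    max_per_node -- cap on instances per node
    available    -- instances still left to place (assumed reservation bound)
    """
    working = dict(counts)  # reservations only affect this round
    reserved = 0            # "number of instances reserved" counter

    while reserved < available:
        # Nodes already at the per-node cap are unavailable for this round.
        candidates = [n for n in scores if working.get(n, 0) < max_per_node]
        if not candidates:
            return None  # nowhere to go; leave it to the normal assignment

        # Highest score wins; a tie goes to the current node.
        chosen = max(candidates, key=lambda n: (scores[n], n == current))
        if chosen == current:
            return current  # the instance stays put, so no downtime

        # A preferred node exists: reserve one slot there for some other
        # instance, then try again for the current node.
        reserved += 1
        working[chosen] = working.get(chosen, 0) + 1

    return None  # every remaining instance is reserved for preferred nodes
```

With instance 0 running on node1, scores of {"node1": 10, "node2": 50}, a
per-node limit of 1, and two instances left to place, the first pass
prefers node2, reserves a slot there, and the second pass returns node1.
With only one instance left to place, the function returns None: node1
would not receive any instance, so there is nothing to keep in place.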

Fixes T489
Fixes RHBZ#1931023

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Jun 22 2023, 1:40 AM
Parents
rP0d3faea124cd: Test: scheduler: Update tests after new stop_if_fail argument