galera: Fix automatic recovery when a cluster was not gracefully stopped

Description

When selecting a bootstrap node, the Galera resource agent primarily
depends on the safe_to_bootstrap flag in grastate.dat. If none of the
nodes has this flag set to 1, the functions detect_last_commit() and
detect_first_master() provide recovery logic that selects the bootstrap
node based on each node's last commit, as obtained from grastate.dat or
'mysqld_safe --wsrep-recover'.
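
As a rough illustration of this fallback, the sketch below picks the
node with the highest last commit. It is not the agent's code: the
function name pick_bootstrap_node and the "node:seqno" argument format
are hypothetical, and how the agent actually gathers and compares the
values is not shown here.

```shell
# Hypothetical helper, not part of the galera resource agent.
# Arguments: one "node:seqno" pair per node, e.g. "node1:120".
# Prints the node with the highest seqno; on a tie the first one wins.
pick_bootstrap_node() {
    best_node=""
    best_seqno=""
    for pair in "$@"; do
        node=${pair%%:*}    # text before the first ':'
        seqno=${pair##*:}   # text after the last ':'
        if [ -z "$best_seqno" ] || [ "$seqno" -gt "$best_seqno" ]; then
            best_node=$node
            best_seqno=$seqno
        fi
    done
    echo "$best_node"
}
```

For example, 'pick_bootstrap_node node1:120 node2:135 node3:131' would
select node2.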

Fix 65f35e9172407e64ded90f29ea8fc0dfca9643e3 introduced a problem for
this recovery logic. If a whole cluster is not gracefully stopped,
grastate.dat on every node contains "safe_to_bootstrap: 0" and
"seqno: -1". Function detect_safe_to_bootstrap() then considers each
node with this seqno unsuitable for bootstrapping and clears the
safe_to_bootstrap attribute. Nonetheless, the functions
detect_last_commit() and detect_first_master() still find a bootstrap
node by relying on the recovery logic. However, when the promote
operation is invoked, function galera_promote() runs the check
'"$(get_safe_to_bootstrap)" = "0"', which fails because the attribute
was cleared. This prevents the code from writing "safe_to_bootstrap: 1"
into grastate.dat of the selected node to mark it as a bootstrap node.
The end result is that Galera refuses to be started on this node and
therefore the whole cluster remains down.
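
The failure mode can be reproduced in isolation with a simplified
sketch. Here get_safe_to_bootstrap is a stub that stands in for the
agent's attribute lookup, and SAFE_TO_BOOTSTRAP_ATTR and
mark_bootstrap_node are hypothetical names introduced only for this
example; the real galera_promote() does considerably more.

```shell
# Stub for the agent's attribute lookup (assumption): prints the stored
# value, or nothing when the attribute has been cleared.
SAFE_TO_BOOTSTRAP_ATTR=""
get_safe_to_bootstrap() {
    printf '%s' "$SAFE_TO_BOOTSTRAP_ATTR"
}

# Simplified promote step (hypothetical): rewrite the flag in the given
# grastate.dat only when the attribute currently reads "0".
mark_bootstrap_node() {
    grastate=$1
    if [ "$(get_safe_to_bootstrap)" = "0" ]; then
        sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$grastate" \
            > "$grastate.tmp" && mv "$grastate.tmp" "$grastate"
    fi
}
```

With the attribute cleared, the comparison is '"" = "0"', so the file
keeps "safe_to_bootstrap: 0". Once the attribute holds "0", the same
call rewrites the flag to 1.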

The patch fixes the problem by adjusting detect_safe_to_bootstrap() to
accept the combination of "safe_to_bootstrap: 0" and "seqno: -1" and
allow a node with this state to potentially become a bootstrap node.
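
A minimal sketch of the adjusted rule follows. It is an illustration
under stated assumptions, not the patched function: classify_grastate
is a hypothetical name, the result is printed instead of being stored
as a cluster attribute, and the handling of "safe_to_bootstrap: 1"
combined with "seqno: -1" (treated as unusable here) is an assumption
that the description above does not confirm.

```shell
# Hypothetical classifier, not the agent's detect_safe_to_bootstrap().
# Prints the value the safe_to_bootstrap attribute would take ("0" or
# "1"), or nothing when the attribute would be cleared.
classify_grastate() {
    grastate=$1
    flag=$(sed -n 's/^safe_to_bootstrap:[[:space:]]*//p' "$grastate")
    seqno=$(sed -n 's/^seqno:[[:space:]]*//p' "$grastate")

    case "$flag" in
        0)
            # Kept even when seqno is -1, so that a later promote can
            # still flip the flag to 1 on the selected node.
            echo 0
            ;;
        1)
            # Assumption: a set flag without a valid seqno is not trusted.
            if [ -n "$seqno" ] && [ "$seqno" != "-1" ]; then
                echo 1
            fi
            ;;
    esac
}
```

Under these assumptions, a grastate.dat left behind by an ungraceful
shutdown ("safe_to_bootstrap: 0", "seqno: -1") yields 0 rather than an
empty result, so the promote check can succeed.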

Details

Provenance
Petr Pavlu <petr.pavlu@suse.com> Authored on Aug 26 2020, 7:59 AM
Parents
rR104bb45bd2c7: Merge pull request #1549 from schubergphilis/master