ClusterLabs Projects

Validate stonith-watchdog-timeout appropriately and default to 0 on invalid values
Open, High, Public

Assigned To
Authored By
nrwahl2
Jan 11 2024, 12:45 AM
Tags
  • Restricted Project
Description

See discussion starting at https://github.com/ClusterLabs/pacemaker/pull/3311#discussion_r1446742520.

Currently, the stonith-watchdog-timeout cluster option has two defaults:

  • 0 ("disabled") if unset
  • -1 ("auto-calculate") if the configured value is invalid (that is, a garbage string that can't be parsed as a timeout)
    • All negative and invalid values are treated the same as -1. We say -1 for convenience.

This has a couple of unfortunate and closely related side effects.

  • We can't validate the value at all. Any value not parsable as a timeout by crm_get_msec() is treated as -1 and causes the timeout to be auto-calculated based on the SBD_WATCHDOG_TIMEOUT environment variable. Some configurations might rely on this behavior, so we can't begin validating values until a compatibility break. In effect, everything is allowed and has meaning (invalid means "negative", which means "auto-calculate").
  • The default value in the documentation and the options array is wrong and can't be made correct. The default value is supposed to be the same regardless of whether the option is completely unset or the option is set to an invalid value. For stonith-watchdog-timeout, these two cases yield different default values.
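The two-default behavior can be sketched as follows (Python pseudocode, not Pacemaker's actual C implementation; the parsing here is a crude stand-in for crm_get_msec() with simplified unit handling):

```python
def parse_timeout_ms(value):
    """Crude stand-in for crm_get_msec(): returns milliseconds, or None
    when the string is not parsable as a timeout (units simplified)."""
    for suffix, factor in (("ms", 1), ("s", 1000), ("", 1000)):
        if value.endswith(suffix):
            number = value[:-len(suffix)] if suffix else value
            try:
                return int(float(number) * factor)
            except ValueError:
                continue
    return None

def effective_swt(configured):
    """Current behavior: two different defaults depending on *why* no
    usable value is present."""
    if configured is None:
        return 0        # option unset: watchdog fencing disabled
    parsed = parse_timeout_ms(configured)
    if parsed is None or parsed < 0:
        return -1       # garbage or negative: auto-calculate
    return parsed
```

Note how an unparsable string and an explicit -1 are indistinguishable at the end of this path, which is exactly why validation can't be introduced without a compatibility break.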

The proposal is to make 0 (disabled) the unconditional default value at Pacemaker 3.0.0 and to reject values that aren't parsable as timeouts. We can either continue letting negative values mean "auto-calculate" or define a string like "auto" for that purpose. The latter is clearer but would require an XSL transform for existing configs. (Would keeping "negative == auto-calculate" also require an XSL transform of garbage values to -1?)
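Under the proposal, the sketch would collapse to a single default plus an error path. A hedged rendering (hypothetical helper, Python rather than Pacemaker's C; the "auto" keyword shown is only one of the two options discussed):

```python
def effective_swt_proposed(configured):
    """Proposed 3.0.0 behavior: 0 is the only default, unparsable values
    are rejected, and auto-calculation (if kept) needs an explicit opt-in."""
    if configured is None:
        return 0                  # single, documented default
    if configured == "auto":      # hypothetical explicit keyword
        return -1                 # internal auto-calculate marker
    try:
        # crude stand-in for crm_get_msec(); "10s" -> 10000
        ms = int(float(configured.rstrip("s")) * 1000)
    except ValueError:
        raise ValueError(f"invalid stonith-watchdog-timeout: {configured!r}")
    if ms < 0:
        raise ValueError("negative values no longer mean auto-calculate")
    return ms
```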

Note: pcmk__valid_stonith_watchdog_timeout() looks like a validation function, but it can never return false. It can only pass or exit with a fatal error; it exits only when the stonith-watchdog-timeout value is incompatible with the SBD_WATCHDOG_TIMEOUT value.
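Its shape is roughly "pass or die", something like this hedged Python rendering (the real function is C, and the exact compatibility condition is an assumption here, not a transcription):

```python
import sys

def valid_swt_or_die(swt_ms, sbd_watchdog_timeout_s):
    """Sketch of pcmk__valid_stonith_watchdog_timeout()'s shape: it can
    never return False; an incompatible value is fatal for the node.
    The comparison used below is an assumption, not the real condition."""
    if 0 < swt_ms <= sbd_watchdog_timeout_s * 1000:
        sys.exit("stonith-watchdog-timeout is not compatible with "
                 "SBD_WATCHDOG_TIMEOUT")
    return True
```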

Event Timeline

nrwahl2 triaged this task as Normal priority. Jan 11 2024, 12:45 AM
nrwahl2 created this task.
nrwahl2 created this object with edit policy "Restricted Project (Project)".
kgaillot moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Jan 24 2024, 1:04 PM
kgaillot raised the priority of this task from Normal to High.

I guess there are more dimensions to this problem:

  • First, there is no guarantee that SBD_WATCHDOG_TIMEOUT has the same value on all nodes - hence pcmk__valid_stonith_watchdog_timeout() checks whether stonith-watchdog-timeout works for a given node; a node where it doesn't has to shut down resources immediately, or better, never start any.
  • Second, the cluster has to agree somehow on a common numeric timeout value. It is crucial that the node that actually waits out a watchdog fencing uses a value at least as high as the one derived from SBD_WATCHDOG_TIMEOUT on the target node.
  • Third, for some time now we have supported clusters where some nodes support watchdog fencing and others don't. Otherwise, watchdog fencing could only be enabled if every node had a proper hardware watchdog usable by SBD. This gives you the freedom to decide per node whether you want watchdog fencing, for whatever reason - but it also means the automatic calculation cannot be done on all nodes.
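The safety condition in the second point can be written as a small check (hypothetical representation, assuming the per-node derived values were somehow known cluster-wide):

```python
def watchdog_fencing_is_safe(observer_timeout_ms, target_nodes):
    """The node that waits out a watchdog fencing must wait at least as
    long as the timeout derived from SBD_WATCHDOG_TIMEOUT on any node it
    might have to fence. target_nodes maps node name -> derived ms
    (None for nodes without watchdog fencing support)."""
    return all(observer_timeout_ms >= derived
               for derived in target_nodes.values()
               if derived is not None)
```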

Thus the concept as currently implemented is dangerous/broken anyway. So we should either drop the feature or go for something more complex.
The only real approach that comes to my mind is that each node, when it joins the cluster, somehow communicates the value it would derive locally, and some internal representation of stonith-watchdog-timeout is increased cluster-wide. But can we be sure that we've already seen a node before we fence it? Probably not.
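The join-time idea could look roughly like this (hypothetical, not Pacemaker code; it deliberately ignores the "node not yet seen" problem noted above):

```python
def raise_cluster_swt(current_ms, joining_node_derived_ms):
    """Sketch: each joining node reports the value it would derive
    locally, and the cluster-wide internal value only ever increases."""
    return max(current_ms, joining_node_derived_ms)
```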

Could we pass the feature to high-level tooling somehow? When SBD is enabled, pcs already does some sanity checking (e.g. existence of watchdog devices on the nodes). It also creates a fairly minimal SBD configuration in /etc/sysconfig/sbd with SBD_WATCHDOG_TIMEOUT (there is no way around it being set to the same value on all nodes), so we might as well have it set stonith-watchdog-timeout to a reasonable value on request (or even by default).
If the configuration is changed afterwards on a per-node basis, validation would be done on each node as it is now. (I mean "validate" in the sense of pcmk__valid_stonith_watchdog_timeout().)

Revisiting this, we should check whether a negative stonith-watchdog-timeout is used by pcs or documented anywhere downstream. If not, just drop the auto-calculation feature and require nonnegative values.

If we do want to keep auto-calculation, this task can just be about using 0 for invalid values (for 3.0.0), and we can separate a new task for improving the auto-calculation (possibly using a new special node attribute set when the controller joins the cluster).

Another possible improvement is that instead of a bad value causing the node to immediately shut down, it could set a new special node attribute that would cause the scheduler to treat it as being in standby.

Revisiting this, we should check whether a negative stonith-watchdog-timeout is used by pcs or documented anywhere downstream. If not, just drop the auto-calculation feature and require nonnegative values.

It doesn't look like pcs specifically uses a negative SWT... at a glance it might not even accept it (haven't tested). The validator function requires the property to be greater than _get_local_sbd_watchdog_timeout().

return validate.ValidatorAll(
    [
        _StonithWatchdogTimeoutValidator(
            "stonith-watchdog-timeout",
            _get_local_sbd_watchdog_timeout(),
            severity=severity,
        )
    ]
)

Negative values don't seem to be documented in the RHEL HA Add-on product docs. I can't speak for SLES or elsewhere right now.

Bah. It's not in the product docs but it is in the design guidance article.

Since Pacemaker 2.0.0, setting this value to -1 means derive stonith-watchdog-timeout from SBD_WATCHDOG_TIMEOUT automatically. Red Hat does not recommend using this feature as there is a danger of split-brain when SBD_WATCHDOG_TIMEOUT is set inconsistently with different nodes.

https://access.redhat.com/articles/2941601

Let's just drop auto-calculation and use 0 for negative or invalid values.