See discussion starting at https://github.com/ClusterLabs/pacemaker/pull/3311#discussion_r1446742520.
Currently, the stonith-watchdog-timeout cluster option has two defaults:
- 0 ("disabled") if unset
- -1 ("auto-calculate") if the configured value is invalid (that is, a garbage string that can't be parsed as a timeout)
- All negative and invalid values are treated the same as -1. We say -1 for convenience.
This has a couple of unfortunate and closely related side effects.
- We can't validate the value at all. Any value not parsable as a timeout by crm_get_msec() is treated as -1 and causes the timeout to be auto-calculated based on the SBD_WATCHDOG_TIMEOUT environment variable. Some configurations might rely on this behavior, so we can't begin validating values until a compatibility break. In effect, everything is allowed and has meaning (invalid means "negative", which means "auto-calculate").
- The default value in the documentation and the options array is wrong and can't be made correct. The default value is supposed to be the same regardless of whether the option is completely unset or the option is set to an invalid value. For stonith-watchdog-timeout, these two cases yield different default values.
The proposal is to make 0 (disabled) the unconditional default value at Pacemaker 3.0.0 and to reject values that aren't parsable as timeouts. We can either continue letting negative values mean "auto-calculate" or define a string like "auto" for that purpose. The latter is clearer but would require an XSL transform for existing configs. (Would keeping "negative == auto-calculate" also require that an XSL transform from garbage values to -1?)
Note: pcmk__valid_stonith_watchdog_timeout() looks like a validation function, but it can never return false. It can only pass or exit with a fatal error; it exits only when the stonith-watchdog-timeout value is incompatible with the SBD_WATCHDOG_TIMEOUT value.