
Using SBD With Pacemaker

Overview

SBD (Storage-Based Death) allows Pacemaker to use a watchdog device to halt a node.

While its name derives from using shared storage for coordination, only a watchdog is required; shared storage is optional.

This document uses pcs (RHEL) commands as examples, but the concepts should apply to any distribution.

SBD with resource recovery has been supported since Pacemaker 1.1.13, but newer versions are recommended.

Configuration

Watchdog-based self-fencing without resource recovery

With this configuration, sbd watches for corosync and/or pacemaker failures and panics the node for safety, but the cluster will not recover resources elsewhere as long as the node's state is unknown (that is, until the node has rejoined or an administrator has run stonith_admin --confirm):

  • Ensure you have a watchdog device, and configure it as SBD_WATCHDOG_DEV in /etc/sysconfig/sbd. While it is possible to specify a software watchdog, software watchdogs rely on a correctly functioning operating system, and thus are unreliable for fencing purposes. Always use a hardware watchdog device in production. (Many server motherboards have them built in.)
  • Ensure that the sbd daemon is running on a node before starting the cluster services. The best approach is generally to enable it to start at boot. (The cluster can't manage the sbd daemon as a cluster resource.) There are two flavors of SBD: sbd for cluster nodes, and sbd_remote for Pacemaker Remote nodes. Here we use sbd as an example; for Pacemaker Remote nodes, replace sbd with sbd_remote:
systemctl enable --now sbd

To disable:

systemctl disable --now sbd
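The watchdog setting from the first step can be applied with a short script. This is a minimal sketch, not part of the original instructions: it assumes /dev/watchdog is your hardware watchdog device (yours may differ) and that the setting may be commented out in /etc/sysconfig/sbd; the SBD_CONF override exists only so the snippet can be tried against a copy of the file.

```shell
# Point SBD at the watchdog device, uncommenting the setting if present,
# appending it otherwise. Assumptions: /dev/watchdog is the device;
# SBD_CONF lets you test against a copy instead of the live file.
conf="${SBD_CONF:-/etc/sysconfig/sbd}"
if grep -q '^#\?SBD_WATCHDOG_DEV=' "$conf"; then
    sed -i 's@^#\?SBD_WATCHDOG_DEV=.*@SBD_WATCHDOG_DEV=/dev/watchdog@' "$conf"
else
    echo 'SBD_WATCHDOG_DEV=/dev/watchdog' >> "$conf"
fi
grep '^SBD_WATCHDOG_DEV=' "$conf"   # confirm the active setting
```

The `\?` in the grep and sed patterns is a GNU extension to basic regular expressions, which is fine on RHEL and similar distributions.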

Watchdog-based self-fencing with resource recovery

With this configuration, in addition to the basic functionality, the remaining cluster will assume services are stopped after a specified amount of time and recover them:

  • With watchdog-only SBD, the cluster must have true quorum. Thus, it can only be used in a cluster with three or more nodes, or a two-node cluster with external quorum (such as corosync using qdevice with a third node).
  • Configure the basic setup on every node as described above.
  • Select a recovery interval (in seconds) that is greater than the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd.
  • Enable cluster recovery, replacing ${interval} with the desired number of seconds:
pcs property set stonith-watchdog-timeout=${interval}

Critical: Do not set stonith-watchdog-timeout until sbd is configured and running on every node (including Pacemaker Remote nodes).

To disable:

pcs property set stonith-watchdog-timeout=0

Critical: Do not stop sbd or sbd_remote on any node until stonith-watchdog-timeout has been unset/deleted.
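Choosing the recovery interval can be scripted. A sketch under stated assumptions: the 5-second fallback matches sbd's shipped default for SBD_WATCHDOG_TIMEOUT, and doubling the timeout is a common rule of thumb rather than a requirement (the requirement above is only that the interval be greater than the timeout); SBD_CONF is an override for testing, not an sbd variable.

```shell
# Compute a candidate recovery interval from the configured watchdog timeout.
# Assumptions: 5s fallback matches sbd's default; doubling is a rule of
# thumb; SBD_CONF lets the snippet run against a copy of the config file.
conf="${SBD_CONF:-/etc/sysconfig/sbd}"
wdt=$(sed -n 's/^SBD_WATCHDOG_TIMEOUT=//p' "$conf")
interval=$(( ${wdt:-5} * 2 ))
echo "$interval"
```

Feed the result to the pcs property set command above as the value of ${interval}.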

Storage-based self-fencing with resource recovery

With this configuration, in addition to the above functionality, the cluster will use shared storage as a disk-based poison pill:

  • Configure SBD with the device used as shared storage, replacing whatever with the actual shared block device:
sed -i 's@^#\?SBD_DEVICE=.*@SBD_DEVICE=/dev/whatever@' /etc/sysconfig/sbd
  • Complete the basic setup and the resource-recovery setup as described above.
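Before the cluster can use the disk as a poison pill, the device must be initialized with SBD metadata. A sketch using the sbd command-line tool, where /dev/whatever again stands in for your actual shared block device; note that create overwrites the start of that device:

```shell
# Initialize the shared device with SBD metadata (destructive to the device),
# then read the header back to verify the stored timeouts.
sbd -d /dev/whatever create
sbd -d /dev/whatever dump
```

Run create once, from any one node; dump can be run on each node afterward to confirm they all see the same metadata.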


Last Author: kgaillot
Last Edited: Jan 10 2024, 6:15 PM