diff --git a/daemons/pacemakerd/pacemaker.service.in b/daemons/pacemakerd/pacemaker.service.in
index b128ddcb3d..0363a2259c 100644
--- a/daemons/pacemakerd/pacemaker.service.in
+++ b/daemons/pacemakerd/pacemaker.service.in
@@ -1,97 +1,97 @@
 [Unit]
 Description=Pacemaker High Availability Cluster Manager
 Documentation=man:pacemakerd
 Documentation=https://clusterlabs.org/pacemaker/doc/
 
 # DefaultDependencies takes care of sysinit.target,
 # basic.target, and shutdown.target
 
 # We need networking to bind to a network address. It is recommended not to
 # use Wants or Requires with network.target, and not to use
 # network-online.target for server daemons.
 After=network.target
 
 # Time syncs can make the clock jump backward, which messes with logging
 # and failure timestamps, so wait until it's done.
 After=time-sync.target
 
 # Managing systemd resources requires DBus.
 After=dbus.service
 Wants=dbus.service
 
 # Some OCF resources may have dependencies that aren't managed by the cluster;
 # these must be started before Pacemaker and stopped after it. The
 # resource-agents package provides this target, which lets system administrators
 # add drop-ins for those dependencies.
 After=resource-agents-deps.target
 Wants=resource-agents-deps.target
 
 After=syslog.service
 After=rsyslog.service
 After=corosync.service
 Requires=corosync.service
 
 [Install]
 WantedBy=multi-user.target
 
 [Service]
 Type=simple
 KillMode=process
 NotifyAccess=main
 EnvironmentFile=-@CONFIGDIR@/pacemaker
 EnvironmentFile=-@CONFIGDIR@/sbd
 SuccessExitStatus=100
 
-ExecStart=@sbindir@/pacemakerd -f
+ExecStart=@sbindir@/pacemakerd
 
 # Systemd v227 and above can limit the number of processes spawned by a
 # service. That is a bad idea for an HA cluster resource manager, so disable it
 # by default. The administrator can create a local override if they really want
 # a limit. If your systemd version does not support TasksMax, and you want to
 # get rid of the resulting log warnings, comment out this option.
 TasksMax=infinity
 
 # If pacemakerd doesn't stop, it's probably waiting on a cluster
 # resource. Sending -KILL will just get the node fenced.
 SendSIGKILL=no
 
 # If we ever hit the StartLimitInterval/StartLimitBurst limit, and the
 # admin wants to stop the cluster while pacemakerd is not running, it
 # might be a good idea to enable the ExecStopPost directive below.
 #
 # However, the node will likely end up being fenced as a result, so it's
 # not enabled by default.
 #
 # ExecStopPost=/usr/bin/killall -TERM pacemaker-attrd pacemaker-based \
 #              pacemaker-controld pacemaker-execd pacemaker-fenced \
 #              pacemaker-schedulerd
 
 # If you want Corosync to stop whenever Pacemaker is stopped,
 # uncomment the next line too:
 #
 # ExecStopPost=/bin/sh -c 'pidof pacemaker-controld || killall -TERM corosync'
 
 # Pacemaker will restart along with Corosync if Corosync is stopped while
 # Pacemaker is running.
 # In this case, if you would rather always have the node fenced than have
 # Pacemaker restart, uncomment the ExecStopPost line below.
 #
 # ExecStopPost=/bin/sh -c 'pidof corosync || \
 #              /usr/bin/systemctl --no-block stop pacemaker'
 
 # When the service functions properly, it will wait to exit until all resources
 # have been stopped on the local node, and potentially across all nodes that
 # are shutting down. The default of 30min should cover most typical cluster
 # configurations, but it may need an increase to adapt to local conditions
 # (e.g. a large, clustered database could conceivably take longer to stop).
 TimeoutStopSec=30min
 TimeoutStartSec=60s
 
 # Restart options include: no, on-success, on-failure, on-abort or always
 Restart=on-failure
 
 # crm_perror() writes directly to stderr, so ignore it here
 # to avoid double-logging with the wrong format
 StandardError=null
diff --git a/doc/sphinx/Clusters_from_Scratch/verification.rst b/doc/sphinx/Clusters_from_Scratch/verification.rst
index 9d647f81a0..b7fa20ea7e 100644
--- a/doc/sphinx/Clusters_from_Scratch/verification.rst
+++ b/doc/sphinx/Clusters_from_Scratch/verification.rst
@@ -1,215 +1,215 @@
 Start and Verify Cluster
 ------------------------
 
 Start the Cluster
 #################
 
 Now that corosync is configured, it is time to start the cluster.
 The command below will start corosync and pacemaker on both nodes
 in the cluster. If you are issuing the start command from a different
 node than the one you ran the ``pcs host auth`` command on earlier, you
 must authenticate on the current node you are logged into before you will
 be allowed to start the cluster.
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# pcs cluster start --all
     pcmk-1: Starting Cluster...
     pcmk-2: Starting Cluster...
 
 .. NOTE::
 
     An alternative to using the ``pcs cluster start --all`` command
     is to issue either of the following command sequences on each node
     in the cluster separately:
 
     .. code-block:: none
 
         # pcs cluster start
         Starting Cluster...
 
     or
 
     .. code-block:: none
 
         # systemctl start corosync.service
         # systemctl start pacemaker.service
 
 .. IMPORTANT::
 
     In this example, we are not enabling the corosync and pacemaker services
     to start at boot. If a cluster node fails or is rebooted, you will need to
     run ``pcs cluster start`` with the node's name (or with ``--all``) to start
     the cluster on it. While you could enable the services to start at boot,
     requiring a manual start of cluster services gives you the opportunity to
     do a post-mortem investigation of a node failure before returning it to
     the cluster.
 
 Verify Corosync Installation
 ############################
 
 First, use ``corosync-cfgtool`` to check whether cluster communication is working:
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# corosync-cfgtool -s
     Printing link status.
     Local node ID 1
     LINK ID 0
             addr    = 192.168.122.101
             status:
                     nodeid  1:      localhost
                     nodeid  2:      connected
 
 We can see here that everything appears normal with our fixed IP address (not a
 127.0.0.x loopback address) listed as the **addr**, and **localhost** and
 **connected** for the statuses of nodeid 1 and nodeid 2, respectively.
 
 If you see something different, you might want to start by checking
 the node's network, firewall and SELinux configurations.
 
 Next, check the membership and quorum APIs:
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# corosync-cmapctl | grep members
     runtime.members.1.config_version (u64) = 0
     runtime.members.1.ip (str) = r(0) ip(192.168.122.101)
     runtime.members.1.join_count (u32) = 1
     runtime.members.1.status (str) = joined
     runtime.members.2.config_version (u64) = 0
     runtime.members.2.ip (str) = r(0) ip(192.168.122.102)
     runtime.members.2.join_count (u32) = 1
     runtime.members.2.status (str) = joined
 
     [root@pcmk-1 ~]# pcs status corosync
 
     Membership information
     ----------------------
         Nodeid      Votes Name
              1          1 pcmk-1 (local)
              2          1 pcmk-2
 
 You should see that both nodes have joined the cluster.
 
 Verify Pacemaker Installation
 #############################
 
 Now that we have confirmed that Corosync is functional, we can check
 the rest of the stack. Pacemaker has already been started, so verify
 that the necessary processes are running:
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# ps axf
       PID TTY      STAT   TIME COMMAND
         2 ?        S      0:00 [kthreadd]
     ...lots of processes...
     17121 ?        SLsl   0:01 /usr/sbin/corosync -f
-    17133 ?        Ss     0:00 /usr/sbin/pacemakerd -f
+    17133 ?        Ss     0:00 /usr/sbin/pacemakerd
     17134 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-based
     17135 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-fenced
     17136 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-execd
     17137 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-attrd
     17138 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
     17139 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-controld
 
 If that looks OK, check the ``pcs status`` output:
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# pcs status
     Cluster name: mycluster
 
     WARNINGS:
     No stonith devices and stonith-enabled is not false
 
     Cluster Summary:
       * Stack: corosync
       * Current DC: pcmk-2 (version 2.0.5-4.el8-ba59be7122) - partition with quorum
       * Last updated: Wed Jan 20 07:54:02 2021
       * Last change:  Wed Jan 20 07:48:25 2021 by hacluster via crmd on pcmk-2
       * 2 nodes configured
       * 0 resource instances configured
 
     Node List:
       * Online: [ pcmk-1 pcmk-2 ]
 
     Full List of Resources:
       * No resources
 
     Daemon Status:
       corosync: active/disabled
       pacemaker: active/disabled
       pcsd: active/enabled
 
 Finally, ensure there are no start-up errors from corosync or pacemaker (aside
 from messages relating to not having STONITH configured, which are OK at this
 point):
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# journalctl -b | grep -i error
 
 .. NOTE::
 
     Other operating systems may report startup errors in other locations,
     for example ``/var/log/messages``.
 
 Repeat these checks on the other node. The results should be the same.
 
 Explore the Existing Configuration
 ##################################
 
 If you are not afraid of XML, you can see the raw cluster configuration
 and status by using the ``pcs cluster cib`` command.
 
 .. topic:: The last XML you'll see in this document
 
     .. code-block:: none
 
         [root@pcmk-1 ~]# pcs cluster cib
 
     .. code-block:: xml
 
 Before we make any changes, it's a good idea to check the validity of
 the configuration.
 
 .. code-block:: none
 
     [root@pcmk-1 ~]# crm_verify -L -V
        error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
        error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
        error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
     Errors found during check: config not valid
 
 As you can see, the tool has found some errors. The cluster will not start any
 resources until we configure STONITH.
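The membership check under Verify Corosync Installation ("both nodes have joined the cluster") is done by eye. If you want to script it, the following is a minimal POSIX shell sketch. It assumes the ``runtime.members.<id>.status (str) = joined`` line format shown in the sample ``corosync-cmapctl`` output. The function name ``check_members_joined`` is hypothetical and is not part of Corosync or pcs.

```shell
# Hypothetical helper (not shipped with Corosync or pcs).
# Reads corosync-cmapctl output on stdin, prints how many members are
# "joined", and returns 0 only if at least one member was found and
# every member's status is "joined".
check_members_joined() {
    awk '
        BEGIN { total = 0; joined = 0 }
        # Count only status keys, e.g.: runtime.members.1.status (str) = joined
        /^runtime\.members\.[0-9]+\.status / {
            total++
            if ($NF == "joined") joined++
        }
        END {
            printf "%d of %d members joined\n", joined, total
            # Exit status 0 means success; any missing or non-joined member fails.
            exit !(total > 0 && joined == total)
        }
    '
}
```

For example, ``corosync-cmapctl | check_members_joined`` would print ``2 of 2 members joined`` for the two-node output shown earlier. It returns a non-zero status if any member reports a status other than ``joined``. Other keys, such as ``.ip`` and ``.join_count``, are ignored.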