diff --git a/doc/sphinx/Clusters_from_Scratch/installation.rst b/doc/sphinx/Clusters_from_Scratch/installation.rst index db61a49b74..6a0d4698cd 100644 --- a/doc/sphinx/Clusters_from_Scratch/installation.rst +++ b/doc/sphinx/Clusters_from_Scratch/installation.rst @@ -1,437 +1,416 @@ Installation ------------ Install |CFS_DISTRO| |CFS_DISTRO_VER| ################################################################################################ Boot the Install Image ______________________ Download the latest |CFS_DISTRO| |CFS_DISTRO_VER| DVD ISO by navigating to the `CentOS Mirrors List `_, selecting a download mirror which is close to you, and finally selecting the .iso file that has "dvd" in its name. Use the image to boot a virtual machine, or burn it to a DVD or USB drive and boot a physical server from that. After starting the installation, select your language and keyboard layout at the welcome screen. .. figure:: images/WelcomeToCentos.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Installation Welcome Screen |CFS_DISTRO| |CFS_DISTRO_VER| Installation Welcome Screen Installation Options ____________________ At this point, you get a chance to tweak the default installation options. .. figure:: images/InstallationSummary.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Installation Summary Screen |CFS_DISTRO| |CFS_DISTRO_VER| Installation Summary Screen Click on the **SOFTWARE SELECTION** section (try saying that 10 times quickly). The default environment, **Server with GUI**, does have add-ons with much of the software we need, but we will change the environment to a **Minimal Install** here, so that we can see exactly what software is required later, and press **Done**. .. 
figure:: images/SoftwareSelection.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Software Selection Screen |CFS_DISTRO| |CFS_DISTRO_VER| Software Selection Screen Configure Network _________________ In the **NETWORK & HOSTNAME** section: - Edit **Host Name:** as desired. For this example, we will use **pcmk-1.localdomain** and then press **Apply**. - Select your network device, press **Configure...**, and use the **Manual** method to assign a fixed IP address. For this example, we'll use 192.168.122.101 under **IPv4 Settings** (with an appropriate netmask, gateway and DNS server). - Press **Save**. - Flip the switch to turn your network device on, and press **Done**. .. figure:: images/NetworkAndHostName.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Editing network settings |CFS_DISTRO| |CFS_DISTRO_VER| Network Interface Screen .. IMPORTANT:: Do not accept the default network settings. Cluster machines should never obtain an IP address via DHCP, because DHCP's periodic address renewal will interfere with corosync. Configure Disk ______________ By default, the installer's automatic partitioning will use LVM (which allows us to dynamically change the amount of space allocated to a given partition). However, it allocates all free space to the ``/`` (aka. **root**) partition, which cannot be reduced in size later (dynamic increases are fine). In order to follow the DRBD and GFS2 portions of this guide, we need to reserve space on each machine for a replicated volume. Enter the **INSTALLATION DESTINATION** section, ensure the hard drive you want to install to is selected, select **Custom** to be the **Storage Configuration**, and press **Done**. In the **MANUAL PARTITIONING** screen that comes next, click the option to create mountpoints automatically. Select the ``/`` mountpoint, and reduce the desired capacity by 3GiB or so. 
Select **Modify...** by the volume group name, and change the **Size policy:** to **As large as possible**, to make the reclaimed space available inside the LVM volume group. We'll add the additional volume later. .. figure:: images/ManualPartitioning.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Manual Partitioning Screen |CFS_DISTRO| |CFS_DISTRO_VER| Manual Partitioning Screen Press **Done**, then **Accept changes**. Configure Time Synchronization ______________________________ It is highly recommended to enable NTP on your cluster nodes. Doing so ensures all nodes agree on the current time and makes reading log files significantly easier. |CFS_DISTRO| will enable NTP automatically. If you want to change any time-related settings (such as time zone or NTP server), you can do this in the **TIME & DATE** section. Root Password ______________________________ In order to continue to the next step, a **Root Password** must be set. .. figure:: images/RootPassword.png - :scale: 80% - :width: 1024 - :height: 800 :align: center :alt: Root Password Screen |CFS_DISTRO| |CFS_DISTRO_VER| Root Password Screen Press **Done** (depending on the password you chose, you may need to do so twice). Finish Install ______________ Select **Begin Installation**. Once it completes, **Reboot System** as instructed. After the node reboots, you'll see a login prompt on the console. Login using **root** and the password you created earlier. .. figure:: images/ConsolePrompt.png - :scale: 80% - :width: 1024 - :height: 768 :align: center :alt: Console Prompt |CFS_DISTRO| |CFS_DISTRO_VER| Console Prompt .. NOTE:: From here on, we're going to be working exclusively from the terminal. Configure the OS ################ Verify Networking _________________ Ensure that the machine has the static IP address you configured earlier. .. 
code-block:: none [root@pcmk-1 ~]# ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: enp1s0: mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 52:54:00:32:cf:a9 brd ff:ff:ff:ff:ff:ff inet 192.168.122.101/24 brd 192.168.122.255 scope global noprefixroute enp1s0 valid_lft forever preferred_lft forever inet6 fe80::c3e1:3ba:959:fa96/64 scope link noprefixroute valid_lft forever preferred_lft forever .. NOTE:: If you ever need to change the node's IP address from the command line, follow these instructions, replacing **${device}** with the name of your network device: .. code-block:: none [root@pcmk-1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-${device} # manually edit as desired [root@pcmk-1 ~]# nmcli dev disconnect ${device} [root@pcmk-1 ~]# nmcli con reload ${device} [root@pcmk-1 ~]# nmcli con up ${device} This makes **NetworkManager** aware that a change was made on the config file. Next, ensure that the routes are as expected: .. code-block:: none [root@pcmk-1 ~]# ip route default via 192.168.122.1 dev enp1s0 proto static metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.101 metric 100 If there is no line beginning with **default via**, then you may need to add a line such as ``GATEWAY="192.168.122.1"`` to the device configuration using the same process as described above for changing the IP address. Now, check for connectivity to the outside world. Start small by testing whether we can reach the gateway we configured. .. code-block:: none [root@pcmk-1 ~]# ping -c 1 192.168.122.1 PING 192.168.122.1 (192.168.122.1) 56(84) bytes of data. 
64 bytes from 192.168.122.1: icmp_seq=1 ttl=64 time=0.492 ms --- 192.168.122.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.492/0.492/0.492/0.000 ms Now try something external; choose a location you know should be available. .. code-block:: none [root@pcmk-1 ~]# ping -c 1 www.clusterlabs.org PING mx1.clusterlabs.org (95.217.104.78) 56(84) bytes of data. 64 bytes from mx1.clusterlabs.org (95.217.104.78): icmp_seq=1 ttl=54 time=134 ms --- mx1.clusterlabs.org ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 133.987/133.987/133.987/0.000 ms Login Remotely ______________ The console isn't a very friendly place to work from, so we will now switch to accessing the machine remotely via SSH where we can use copy and paste, etc. From another host, check whether we can see the new host at all: .. code-block:: none [gchin@gchin ~]$ ping -c 1 192.168.122.101 PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data. 64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.344 ms --- 192.168.122.101 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.344/0.344/0.344/0.000 ms Next, login as root via SSH. .. code-block:: none [gchin@gchin ~]$ ssh root@192.168.122.101 The authenticity of host '192.168.122.101 (192.168.122.101)' can't be established. ECDSA key fingerprint is SHA256:NBvcRrPDLIt39Rf0Tz4/f2Rd/FA5wUiDOd9bZ9QWWjo. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added '192.168.122.101' (ECDSA) to the list of known hosts. root@192.168.122.101's password: Last login: Tue Jan 10 20:46:30 2021 [root@pcmk-1 ~]# Apply Updates _____________ Apply any package updates released since your installation image was created: .. code-block:: none [root@pcmk-1 ~]# yum update .. 
index:: single: node; short name Use Short Node Names ____________________ During installation, we filled in the machine's fully qualified domain name (FQDN), which can be rather long when it appears in cluster logs and status output. See for yourself how the machine identifies itself: .. code-block:: none [root@pcmk-1 ~]# uname -n pcmk-1.localdomain We can use the `hostnamectl` tool to strip off the domain name: .. code-block:: none [root@pcmk-1 ~]# hostnamectl set-hostname $(uname -n | sed s/\\..*//) Now, check that the machine is using the correct name: .. code-block:: none [root@pcmk-1 ~]# uname -n pcmk-1 You may want to reboot to ensure all updates take effect. Repeat for Second Node ###################### Repeat the Installation steps so far, so that you have two nodes ready to have the cluster software installed. For the purposes of this document, the additional node is called pcmk-2 with address 192.168.122.102. Configure Communication Between Nodes ##################################### Configure Host Name Resolution ______________________________ Confirm that you can communicate between the two new nodes: .. code-block:: none [root@pcmk-1 ~]# ping -c 3 192.168.122.102 PING 192.168.122.102 (192.168.122.102) 56(84) bytes of data. 64 bytes from 192.168.122.102: icmp_seq=1 ttl=64 time=1.22 ms 64 bytes from 192.168.122.102: icmp_seq=2 ttl=64 time=0.795 ms 64 bytes from 192.168.122.102: icmp_seq=3 ttl=64 time=0.751 ms --- 192.168.122.102 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2054ms rtt min/avg/max/mdev = 0.751/0.923/1.224/0.214 ms Now we need to make sure we can communicate with the machines by their name. If you have a DNS server, add additional entries for the two machines. Otherwise, you'll need to add the machines to ``/etc/hosts`` on both nodes. Below are the entries for my cluster nodes: .. 
code-block:: none [root@pcmk-1 ~]# grep pcmk /etc/hosts 192.168.122.101 pcmk-1.clusterlabs.org pcmk-1 192.168.122.102 pcmk-2.clusterlabs.org pcmk-2 We can now verify the setup by again using ping: .. code-block:: none [root@pcmk-1 ~]# ping -c 3 pcmk-2 PING pcmk-2.clusterlabs.org (192.168.122.102) 56(84) bytes of data. 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=1 ttl=64 time=0.295 ms 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=2 ttl=64 time=0.616 ms 64 bytes from pcmk-2.clusterlabs.org (192.168.122.102): icmp_seq=3 ttl=64 time=0.809 ms --- pcmk-2.clusterlabs.org ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2043ms rtt min/avg/max/mdev = 0.295/0.573/0.809/0.212 ms .. index:: SSH Configure SSH _____________ SSH is a convenient and secure way to copy files and perform commands remotely. For the purposes of this guide, we will create a key without a password (using the -N option) so that we can perform remote actions without being prompted. .. WARNING:: Unprotected SSH keys (those without a password) are not recommended for servers exposed to the outside world. We use them here only to simplify the demo. Create a new key and allow anyone with that key to log in: .. index:: single: SSH; key .. topic:: Creating and Activating a new SSH Key .. code-block:: none [root@pcmk-1 ~]# ssh-keygen -t dsa -f ~/.ssh/id_dsa -N "" Generating public/private dsa key pair. Created directory '/root/.ssh'. Your identification has been saved in /root/.ssh/id_dsa. Your public key has been saved in /root/.ssh/id_dsa.pub. The key fingerprint is: SHA256:ehR595AVLAVpvFgqYXiayds2qx8emkvnHmfQZMTZ4jM root@pcmk-1 The key's randomart image is: +---[DSA 1024]----+ | . ..+.=+. | | . +o+ Bo. | | . *oo+*+o | | = .*E..o | | oS..o . | | .o+. | | o.*oo | | . B.* | | === | +----[SHA256]-----+ [root@pcmk-1 ~]# cp ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys Install the key on the other node: .. 
code-block:: none [root@pcmk-1 ~]# scp -r ~/.ssh pcmk-2: The authenticity of host 'pcmk-2 (192.168.122.102)' can't be established. ECDSA key fingerprint is SHA256:FQ4sVubTiHdQ6IetbN96fixoTVx/LuQUV8qoyiywnfs. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added 'pcmk-2,192.168.122.102' (ECDSA) to the list of known hosts. root@pcmk-2's password: id_dsa 100% 1385 1.6MB/s 00:00 id_dsa.pub 100% 601 1.0MB/s 00:00 authorized_keys 100% 601 1.3MB/s 00:00 known_hosts 100% 184 389.2KB/s 00:00 Test that you can now run commands remotely, without being prompted: .. code-block:: none [root@pcmk-1 ~]# ssh pcmk-2 -- uname -n root@pcmk-2's password: pcmk-2 diff --git a/doc/sphinx/Pacemaker_Administration/tools.rst b/doc/sphinx/Pacemaker_Administration/tools.rst index 16353216b8..e85edee403 100644 --- a/doc/sphinx/Pacemaker_Administration/tools.rst +++ b/doc/sphinx/Pacemaker_Administration/tools.rst @@ -1,568 +1,562 @@ .. index:: command-line tool Using Pacemaker Command-Line Tools ---------------------------------- .. index:: single: command-line tool; output format .. _cmdline_output: Controlling Command Line Output ############################### Some of the pacemaker command line utilities have been converted to a new output system. Among these tools are ``crm_mon`` and ``stonith_admin``. This is an ongoing project, and more tools will be converted over time. This system lets you control the formatting of output with ``--output-as=`` and the destination of output with ``--output-to=``. The available formats vary by tool, but at least plain text and XML are supported by all tools that use the new system. The default format is plain text. The default destination is stdout but can be redirected to any file. Some formats support command line options for changing the style of the output. For instance: .. code-block:: none # crm_mon --help-output Usage: crm_mon [OPTION?] Provides a summary of cluster's current state. 
Outputs varying levels of detail in a number of different formats. Output Options: --output-as=FORMAT Specify output format as one of: console (default), html, text, xml --output-to=DEST Specify file name for output (or "-" for stdout) --html-cgi Add text needed to use output in a CGI program --html-stylesheet=URI Link to an external CSS stylesheet --html-title=TITLE Page title --text-fancy Use more highly formatted output .. index:: single: crm_mon single: command-line tool; crm_mon .. _crm_mon: Monitor a Cluster with crm_mon ############################## The ``crm_mon`` utility displays the current state of an active cluster. It can show the cluster status organized by node or by resource, and can be used in either single-shot or dynamically updating mode. It can also display operations performed and information about failures. Using this tool, you can examine the state of the cluster for irregularities, and see how it responds when you cause or simulate failures. See the manual page or the output of ``crm_mon --help`` for a full description of its many options. .. topic:: Sample output from crm_mon -1 .. code-block:: none Cluster Summary: * Stack: corosync * Current DC: node2 (version 2.0.0-1) - partition with quorum * Last updated: Mon Jan 29 12:18:42 2018 * Last change: Mon Jan 29 12:18:40 2018 by root via crm_attribute on node3 * 5 nodes configured * 2 resources configured Node List: * Online: [ node1 node2 node3 node4 node5 ] * Active resources: * Fencing (stonith:fence_xvm): Started node1 * IP (ocf:heartbeat:IPaddr2): Started node2 .. topic:: Sample output from crm_mon -n -1 .. 
code-block:: none Cluster Summary: * Stack: corosync * Current DC: node2 (version 2.0.0-1) - partition with quorum * Last updated: Mon Jan 29 12:21:48 2018 * Last change: Mon Jan 29 12:18:40 2018 by root via crm_attribute on node3 * 5 nodes configured * 2 resources configured * Node List: * Node node1: online * Fencing (stonith:fence_xvm): Started * Node node2: online * IP (ocf:heartbeat:IPaddr2): Started * Node node3: online * Node node4: online * Node node5: online As mentioned in an earlier chapter, the DC is the node where decisions are made. The cluster elects a node to be DC as needed. The only significance of the choice of DC to an administrator is the fact that its logs will have the most information about why decisions were made. .. index:: pair: crm_mon; CSS .. _crm_mon_css: Styling crm_mon HTML output ___________________________ Various parts of ``crm_mon``'s HTML output have a CSS class associated with them. Not everything does, but some of the most interesting portions do. In the following example, the status of each node has an ``online`` class and the details of each resource have an ``rsc-ok`` class. .. code-block:: html

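   <h2>Node List</h2>
   <ul>
     <li>
       <span>Node: cluster01</span><span class="online"> online</span>
       <ul>
         <li><span class="rsc-ok">ping (ocf::pacemaker:ping): Started</span></li>
       </ul>
     </li>
     <li>
       <span>Node: cluster02</span><span class="online"> online</span>
       <ul>
         <li><span class="rsc-ok">ping (ocf::pacemaker:ping): Started</span></li>
       </ul>
     </li>
   </ul>
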
By default, a stylesheet for styling these classes is included in the head of the HTML output. The portions of this stylesheet relevant to the above example are: .. code-block:: css If you want to override some or all of the styling, simply create your own stylesheet, place it on a web server, and pass ``--html-stylesheet=URI`` to ``crm_mon``. The link is added after the default stylesheet, so your changes take precedence. You don't need to duplicate the entire default. Only include what you want to change. .. index:: single: cibadmin single: command-line tool; cibadmin .. _cibadmin: Edit the CIB XML with cibadmin ############################## The most flexible tool for modifying the configuration is Pacemaker's ``cibadmin`` command. With ``cibadmin``, you can query, add, remove, update or replace any part of the configuration. All changes take effect immediately, so there is no need to perform a reload-like operation. The simplest way of using ``cibadmin`` is to use it to save the current configuration to a temporary file, edit that file with your favorite text or XML editor, and then upload the revised configuration. .. topic:: Safely using an editor to modify the cluster configuration .. code-block:: none # cibadmin --query > tmp.xml # vi tmp.xml # cibadmin --replace --xml-file tmp.xml Some of the better XML editors can make use of a RELAX NG schema to help make sure any changes you make are valid. The schema describing the configuration can be found in ``pacemaker.rng``, which may be deployed in a location such as ``/usr/share/pacemaker`` depending on your operating system distribution and how you installed the software. If you want to modify just one section of the configuration, you can query and replace just that section to avoid modifying any others. .. topic:: Safely using an editor to modify only the resources section ..
code-block:: none # cibadmin --query --scope resources > tmp.xml # vi tmp.xml # cibadmin --replace --scope resources --xml-file tmp.xml To quickly delete a part of the configuration, identify the object you wish to delete by XML tag and id. For example, you might search the CIB for all STONITH-related configuration: .. topic:: Searching for STONITH-related configuration items .. code-block:: none # cibadmin --query | grep stonith If you wanted to delete the ``primitive`` tag with id ``child_DoFencing``, you would run: .. code-block:: none # cibadmin --delete --xml-text '<primitive id="child_DoFencing"/>' See the ``cibadmin`` man page for more options. .. warning:: Never edit the live ``cib.xml`` file directly. Pacemaker will detect such changes and refuse to use the configuration. .. index:: single: crm_shadow single: command-line tool; crm_shadow .. _crm_shadow: Batch Configuration Changes with crm_shadow ########################################### Often, it is desirable to preview the effects of a series of configuration changes before updating the live configuration all at once. For this purpose, ``crm_shadow`` creates a "shadow" copy of the configuration and arranges for all the command-line tools to use it. To begin, simply invoke ``crm_shadow --create`` with a name of your choice, and follow the simple on-screen instructions. Shadow copies are identified with a name to make it possible to have more than one. .. warning:: Read this section and the on-screen instructions carefully; failure to do so could result in destroying the cluster's active configuration! .. topic:: Creating and displaying the active sandbox .. code-block:: none # crm_shadow --create test Setting up shadow instance Type Ctrl-D to exit the crm_shadow shell shadow[test]: shadow[test] # crm_shadow --which test From this point on, all cluster commands will automatically use the shadow copy instead of talking to the cluster's active configuration.
Once you have finished experimenting, you can either make the changes active via the ``--commit`` option, or discard them using the ``--delete`` option. Again, be sure to follow the on-screen instructions carefully! For a full list of ``crm_shadow`` options and commands, invoke it with the ``--help`` option. .. topic:: Use sandbox to make multiple changes all at once, discard them, and verify real configuration is untouched .. code-block:: none shadow[test] # crm_failcount -r rsc_c001n01 -G scope=status name=fail-count-rsc_c001n01 value=0 shadow[test] # crm_standby --node c001n02 -v on shadow[test] # crm_standby --node c001n02 -G scope=nodes name=standby value=on shadow[test] # cibadmin --erase --force shadow[test] # cibadmin --query shadow[test] # crm_shadow --delete test --force Now type Ctrl-D to exit the crm_shadow shell shadow[test] # exit # crm_shadow --which No active shadow configuration defined # cibadmin -Q See the next section, :ref:`crm_simulate`, for how to test your changes before committing them to the live cluster. .. index:: single: crm_simulate single: command-line tool; crm_simulate .. _crm_simulate: Simulate Cluster Activity with crm_simulate ########################################### The command-line tool `crm_simulate` shows the results of the same logic the cluster itself uses to respond to a particular cluster configuration and status. As always, the man page is the primary documentation, and should be consulted for further details. This section aims for a better conceptual explanation and practical examples. Replaying cluster decision-making logic _______________________________________ At any given time, one node in a Pacemaker cluster will be elected DC, and that node will run Pacemaker's scheduler to make decisions. Each time decisions need to be made (a "transition"), the DC will have log messages like "Calculated transition ... saving inputs in ..." with a file name. 
You can grab the named file and replay the cluster logic to see why particular decisions were made. The file contains the live cluster configuration at that moment, so you can also look at it directly to see the value of node attributes, etc., at that time. The simplest usage is (replacing $FILENAME with the actual file name): .. topic:: Simulate cluster response to a given CIB .. code-block:: none # crm_simulate --simulate --xml-file $FILENAME That will show the cluster state when the process started, the actions that need to be taken ("Transition Summary"), and the resulting cluster state if the actions succeed. Most actions will have a brief description of why they were required. The transition inputs may be compressed. ``crm_simulate`` can handle these compressed files directly, though if you want to edit the file, you'll need to uncompress it first. You can do the same simulation for the live cluster configuration at the current moment. This is useful mainly when using ``crm_shadow`` to create a sandbox version of the CIB; the ``--live-check`` option will use the shadow CIB if one is in effect. .. topic:: Simulate cluster response to current live CIB or shadow CIB .. code-block:: none # crm_simulate --simulate --live-check Why decisions were made _______________________ Digging further into the "why" gets user-unfriendly very quickly. If you add the ``--show-scores`` option, you will also see all the scores that went into the decision-making. The node with the highest cumulative score for a resource will run it. You can look for ``-INFINITY`` scores in particular to see where complete bans came into effect. You can also add ``-VVVV`` to get more detailed messages about what's happening under the hood. You can even add up to two more V's, but that's usually useful only if you're a masochist or tracing through the source code.
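As a rough sketch of the ``-INFINITY`` search described above, the score output can be piped through a small filter. The function name ``find_bans`` is hypothetical, and because the exact wording of score lines varies between Pacemaker versions, the filter assumes nothing beyond the literal string ``-INFINITY`` appearing on the line:

```shell
# find_bans: hypothetical helper that keeps only the lines mentioning a
# -INFINITY score. It reads standard input and makes no assumption about
# the rest of the line format.
find_bans() {
    # -F treats the pattern as a fixed string rather than a regex;
    # "--" stops option parsing so the leading "-" is not read as a flag.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -F -- '-INFINITY' || true
}

# Typical use on a cluster node (requires Pacemaker to be installed):
#   crm_simulate --simulate --live-check --show-scores | find_bans
```

An empty result simply means no complete bans were involved in the decisions shown.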
Visualizing the action sequence _______________________________ Another handy feature is the ability to generate a visual graph of the actions needed, using the ``--dot-file`` option. This relies on the separate Graphviz [#]_ project. .. topic:: Generate a visual graph of cluster actions from a saved CIB .. code-block:: none # crm_simulate --simulate --xml-file $FILENAME --dot-file $FILENAME.dot # dot $FILENAME.dot -Tsvg > $FILENAME.svg ``$FILENAME.dot`` will contain a Graphviz representation of the cluster's response to your changes, including all actions with their ordering dependencies. ``$FILENAME.svg`` will be the same information in a standard graphical format that you can view in your browser or other app of choice. You could, of course, use other ``dot`` options to generate other formats. How to interpret the graphical output: * Bubbles indicate actions, and arrows indicate ordering dependencies * Resource actions have text of the form ``<RESOURCE>_<ACTION>_<INTERVAL_IN_MS> <NODE>`` indicating that the specified action will be executed for the specified resource on the specified node, once if interval is 0 or at the specified recurring interval otherwise * Actions with black text will be sent to the executor (that is, the appropriate agent will be invoked) * Actions with orange text are "pseudo" actions that the cluster uses internally for ordering but that require no real activity * Actions with a solid green border are part of the transition (that is, the cluster will attempt to execute them in the given order -- though a transition can be interrupted by action failure or new events) * Dashed arrows indicate dependencies that are not present in the transition graph * Actions with a dashed border will not be executed. If the dashed border is blue, the cluster does not feel the action needs to be executed. If the dashed border is red, the cluster would like to execute the action but cannot. Any actions depending on an action with a dashed border will not be able to execute.
* Loops should not happen, and should be reported as a bug if found. .. topic:: Small Cluster Transition .. image:: ../shared/images/Policy-Engine-small.png :alt: An example transition graph as represented by Graphviz - :height: 325 - :width: 1161 - :scale: 75 % :align: center In the above example, it appears that a new node, ``pcmk-2``, has come online and that the cluster is checking to make sure ``rsc1``, ``rsc2`` and ``rsc3`` are not already running there (indicated by the ``rscN_monitor_0`` entries). Once it did that, and assuming the resources were not active there, it would have liked to stop ``rsc1`` and ``rsc2`` on ``pcmk-1`` and move them to ``pcmk-2``. However, there appears to be some problem and the cluster cannot or is not permitted to perform the stop actions which implies it also cannot perform the start actions. For some reason, the cluster does not want to start ``rsc3`` anywhere. .. topic:: Complex Cluster Transition .. image:: ../shared/images/Policy-Engine-big.png :alt: Complex transition graph that you're not expected to be able to read - :width: 1455 - :height: 1945 - :scale: 75 % :align: center What-if scenarios _________________ You can make changes to the saved or shadow CIB and simulate it again, to see how Pacemaker would react differently. You can edit the XML by hand, use command-line tools such as ``cibadmin`` with either a shadow CIB or the ``CIB_file`` environment variable set to the filename, or use higher-level tool support (see the man pages of the specific tool you're using for how to perform actions on a saved CIB file rather than the live CIB). You can also inject node failures and/or action failures into the simulation; see the ``crm_simulate`` man page for more details. This capability is useful when using a shadow CIB to edit the configuration. Before committing the changes to the live cluster with ``crm_shadow --commit``, you can use ``crm_simulate`` to see how the cluster will react to the changes. .. _crm_attribute: .. 
index:: single: attrd_updater single: command-line tool; attrd_updater single: crm_attribute single: command-line tool; crm_attribute Manage Node Attributes, Cluster Options and Defaults with crm_attribute and attrd_updater ######################################################################################### ``crm_attribute`` and ``attrd_updater`` are confusingly similar tools with subtle differences. ``attrd_updater`` can query and update node attributes. ``crm_attribute`` can query and update not only node attributes, but also cluster options, resource defaults, and operation defaults. To understand the differences, it helps to understand the various types of node attribute. .. table:: **Types of Node Attributes** +-----------+----------+-------------------+------------------+----------------+----------------+ | Type | Recorded | Recorded in | Survive full | Manageable by | Manageable by | | | in CIB? | attribute manager | cluster restart? | crm_attribute? | attrd_updater? | | | | memory? | | | | +===========+==========+===================+==================+================+================+ | permanent | yes | no | yes | yes | no | +-----------+----------+-------------------+------------------+----------------+----------------+ | transient | yes | yes | no | yes | yes | +-----------+----------+-------------------+------------------+----------------+----------------+ | private | no | yes | no | no | yes | +-----------+----------+-------------------+------------------+----------------+----------------+ As you can see from the table above, ``crm_attribute`` can manage permanent and transient node attributes, while ``attrd_updater`` can manage transient and private node attributes. 
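The table above can be sketched as a small dispatch function. This is only an illustration: the name ``attr_update_cmd`` and the attribute name ``my-attr`` are hypothetical, the function prints a command line rather than running it, and the option spellings should be checked against the ``crm_attribute`` and ``attrd_updater`` man pages for your version. For transient attributes either tool would do; the sketch arbitrarily picks ``attrd_updater``.

```shell
# attr_update_cmd: hypothetical helper that prints (does not run) a command
# suited to the given node attribute type, following the table above.
# Arguments: TYPE NODE NAME VALUE
attr_update_cmd() {
    case "$1" in
        permanent)
            # Permanent attributes live only in the CIB: use crm_attribute.
            echo "crm_attribute --node $2 --name $3 --update $4 --lifetime forever" ;;
        transient)
            # Either tool can manage transient attributes; this sketch
            # picks attrd_updater.
            echo "attrd_updater --node $2 --name $3 --update $4" ;;
        private)
            # Private attributes are never written to the CIB, so only
            # attrd_updater can set them.
            echo "attrd_updater --node $2 --name $3 --update $4 --private" ;;
        *)
            echo "unknown attribute type: $1" >&2
            return 1 ;;
    esac
}

# Example: print the command for a private attribute on pcmk-1
attr_update_cmd private pcmk-1 my-attr 1
```

Printing the command first makes it easy to review which tool would be used before touching a live cluster.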
The difference between the two tools lies mainly in *how* they update node attributes: ``attrd_updater`` always contacts the Pacemaker attribute manager directly, while ``crm_attribute`` will contact the attribute manager only for transient node attributes, and will instead modify the CIB directly for permanent node attributes (and for transient node attributes when unable to contact the attribute manager). By contacting the attribute manager directly, ``attrd_updater`` can change an attribute's "dampening" (whether changes are immediately flushed to the CIB or after a specified amount of time, to minimize disk writes for frequent changes), set private node attributes (which are never written to the CIB), and set attributes for nodes that don't yet exist. By modifying the CIB directly, ``crm_attribute`` can set permanent node attributes (which are only in the CIB and not managed by the attribute manager), and can be used with saved CIB files and shadow CIBs. However a transient node attribute is set, it is synchronized between the CIB and the attribute manager, on all nodes. .. index:: single: crm_failcount single: command-line tool; crm_failcount single: crm_node single: command-line tool; crm_node single: crm_report single: command-line tool; crm_report single: crm_standby single: command-line tool; crm_standby single: crm_verify single: command-line tool; crm_verify single: stonith_admin single: command-line tool; stonith_admin Other Commonly Used Tools ######################### Other command-line tools include: * ``crm_failcount``: query or delete resource fail counts * ``crm_node``: manage cluster nodes * ``crm_report``: generate a detailed cluster report for bug submissions * ``crm_resource``: manage cluster resources * ``crm_standby``: manage standby status of nodes * ``crm_verify``: validate a CIB * ``stonith_admin``: manage fencing devices See the manual pages for details. .. rubric:: Footnotes .. [#] Graph visualization software. 
See http://www.graphviz.org/ for details. diff --git a/doc/sphinx/Pacemaker_Remote/intro.rst b/doc/sphinx/Pacemaker_Remote/intro.rst index 361d4fb82d..9c5dab81a0 100644 --- a/doc/sphinx/Pacemaker_Remote/intro.rst +++ b/doc/sphinx/Pacemaker_Remote/intro.rst @@ -1,190 +1,186 @@ Scaling a Pacemaker Cluster --------------------------- Overview ######## In a basic Pacemaker high-availability cluster [#]_ each node runs the full cluster stack of Corosync and all Pacemaker components. This allows great flexibility but limits scalability to around 16 nodes. To allow for scalability to dozens or even hundreds of nodes, Pacemaker allows nodes not running the full cluster stack to integrate into the cluster and have the cluster manage their resources as if they were a cluster node. Terms ##### .. index:: single: cluster node single: node; cluster node **cluster node** A node running the full high-availability stack of corosync and all Pacemaker components. Cluster nodes may run cluster resources, run all Pacemaker command-line tools (``crm_mon``, ``crm_resource`` and so on), execute fencing actions, count toward cluster quorum, and serve as the cluster's Designated Controller (DC). .. index:: pacemaker_remoted **pacemaker_remoted** A small service daemon that allows a host to be used as a Pacemaker node without running the full cluster stack. Nodes running ``pacemaker_remoted`` may run cluster resources and most command-line tools, but cannot perform other functions of full cluster nodes such as fencing execution, quorum voting, or DC eligibility. The ``pacemaker_remoted`` daemon is an enhanced version of Pacemaker's local resource management daemon (LRMD). .. 
index::
   single: remote node
   single: node; remote node

**pacemaker_remote**
    The name of the systemd service that manages ``pacemaker_remoted``.

**Pacemaker Remote**
    A way to refer to the general technology implementing nodes running
    ``pacemaker_remoted``, including the cluster-side implementation
    and the communication protocol between them.

**remote node**
    A physical host running ``pacemaker_remoted``. Remote nodes have a special
    resource that manages communication with the cluster. This is sometimes
    referred to as the *bare metal* case.

.. index::
   single: guest node
   single: node; guest node

**guest node**
    A virtual host running ``pacemaker_remoted``. Guest nodes differ from remote
    nodes mainly in that the guest node is itself a resource that the cluster
    manages.

.. NOTE::

    *Remote* in this document refers to the node not being a part of the underlying
    corosync cluster. It has nothing to do with physical proximity. Remote nodes and
    guest nodes are subject to the same latency requirements as cluster nodes, which
    means they are typically in the same data center.

.. NOTE::

    It is important to distinguish the various roles a virtual machine can serve
    in Pacemaker clusters:

    * A virtual machine can run the full cluster stack, in which case it is a
      cluster node and is not itself managed by the cluster.
    * A virtual machine can be managed by the cluster as a resource, without the
      cluster having any awareness of the services running inside the virtual
      machine. The virtual machine is *opaque* to the cluster.
    * A virtual machine can be a cluster resource, and run ``pacemaker_remoted``
      to make it a guest node, allowing the cluster to manage services
      inside it. The virtual machine is *transparent* to the cluster.

..
index:: single: virtual machine; as guest node Guest Nodes ########### **"I want a Pacemaker cluster to manage virtual machine resources, but I also want Pacemaker to be able to manage the resources that live within those virtual machines."** Without ``pacemaker_remoted``, the possibilities for implementing the above use case have significant limitations: * The cluster stack could be run on the physical hosts only, which loses the ability to monitor resources within the guests. * A separate cluster could be on the virtual guests, which quickly hits scalability issues. * The cluster stack could be run on the guests using the same cluster as the physical hosts, which also hits scalability issues and complicates fencing. With ``pacemaker_remoted``: * The physical hosts are cluster nodes (running the full cluster stack). * The virtual machines are guest nodes (running ``pacemaker_remoted``). Nearly zero configuration is required on the virtual machine. * The cluster stack on the cluster nodes launches the virtual machines and immediately connects to ``pacemaker_remoted`` on them, allowing the virtual machines to integrate into the cluster. The key difference here between the guest nodes and the cluster nodes is that the guest nodes do not run the cluster stack. This means they will never become the DC, initiate fencing actions or participate in quorum voting. On the other hand, this also means that they are not bound to the scalability limits associated with the cluster stack (no 16-node corosync member limits to deal with). That isn't to say that guest nodes can scale indefinitely, but it is known that guest nodes scale horizontally much further than cluster nodes. Other than the quorum limitation, these guest nodes behave just like cluster nodes with respect to resource management. The cluster is fully capable of managing and monitoring resources on each guest node. 
You can build constraints against guest nodes, put them in standby, or do
whatever else you'd expect to be able to do with cluster nodes. They even show
up in ``crm_mon`` output as nodes.

To solidify the concept, below is an example that is very similar to an actual
deployment we test in our developer environment to verify guest node
scalability:

* 16 cluster nodes running the full Corosync + Pacemaker stack
* 64 Pacemaker-managed virtual machine resources running ``pacemaker_remoted``
  configured as guest nodes
* 64 Pacemaker-managed webserver and database resources configured to run on
  the 64 guest nodes

With this deployment, you would have 64 webservers and databases running on 64
virtual machines on 16 hardware nodes, all of which are managed and monitored by
the same Pacemaker deployment. ``pacemaker_remoted`` is known to scale to this
size, and possibly much further, depending on the specific scenario.

Remote Nodes
############

**"I want my traditional high-availability cluster to scale beyond the limits
imposed by the corosync messaging layer."**

Ultimately, the primary advantage of remote nodes over cluster nodes is
scalability. Remote nodes may also serve a purpose in other use cases, such as
geographically distributed HA clusters, but those use cases are not well
understood at this point.

Like guest nodes, remote nodes will never become the DC, initiate fencing
actions or participate in quorum voting.

However, fencing of a remote node works no differently from fencing of a
cluster node. The Pacemaker scheduler understands how to fence remote nodes. As
long as a fencing device exists, the cluster is capable of ensuring remote
nodes are fenced in exactly the same way as cluster nodes.

Expanding the Cluster Stack
###########################

With ``pacemaker_remoted``, the traditional view of the high-availability stack
can be expanded to include a new layer:

Traditional HA Stack
____________________

..
image:: images/pcmk-ha-cluster-stack.png - :width: 17cm - :height: 9cm :alt: Traditional Pacemaker+Corosync Stack :align: center HA Stack With Guest Nodes _________________________ .. image:: images/pcmk-ha-remote-stack.png - :width: 20cm - :height: 10cm :alt: Pacemaker+Corosync Stack with pacemaker_remoted :align: center .. [#] See the ``_ Pacemaker documentation, especially *Clusters From Scratch* and *Pacemaker Explained*. diff --git a/doc/sphinx/shared/pacemaker-intro.rst b/doc/sphinx/shared/pacemaker-intro.rst index c7aaeab86d..3473636843 100644 --- a/doc/sphinx/shared/pacemaker-intro.rst +++ b/doc/sphinx/shared/pacemaker-intro.rst @@ -1,201 +1,196 @@ What Is Pacemaker? #################### Pacemaker is a high-availability *cluster resource manager* -- software that runs on a set of hosts (a *cluster* of *nodes*) in order to preserve integrity and minimize downtime of desired services (*resources*). [#]_ It is maintained by the `ClusterLabs `_ community. Pacemaker's key features include: * Detection of and recovery from node- and service-level failures * Ability to ensure data integrity by fencing faulty nodes * Support for one or more nodes per cluster * Support for multiple resource interface standards (anything that can be scripted can be clustered) * Support (but no requirement) for shared storage * Support for practically any redundancy configuration (active/passive, N+1, etc.) * Automatically replicated configuration that can be updated from any node * Ability to specify cluster-wide relationships between services, such as ordering, colocation and anti-colocation * Support for advanced service types, such as *clones* (services that need to be active on multiple nodes), *promotable clones* (clones that can run in one of two roles), and containerized services * Unified, scriptable cluster management tools .. 
note:: **Fencing** *Fencing*, also known as *STONITH* (an acronym for Shoot The Other Node In The Head), is the ability to ensure that it is not possible for a node to be running a service. This is accomplished via *fence devices* such as intelligent power switches that cut power to the target, or intelligent network switches that cut the target's access to the local network. Pacemaker represents fence devices as a special class of resource. A cluster cannot safely recover from certain failure conditions, such as an unresponsive node, without fencing. Cluster Architecture ____________________ At a high level, a cluster can be viewed as having these parts (which together are often referred to as the *cluster stack*): * **Resources:** These are the reason for the cluster's being -- the services that need to be kept highly available. * **Resource agents:** These are scripts or operating system components that start, stop, and monitor resources, given a set of resource parameters. These provide a uniform interface between Pacemaker and the managed services. * **Fence agents:** These are scripts that execute node fencing actions, given a target and fence device parameters. * **Cluster membership layer:** This component provides reliable messaging, membership, and quorum information about the cluster. Currently, Pacemaker supports `Corosync `_ as this layer. * **Cluster resource manager:** Pacemaker provides the brain that processes and reacts to events that occur in the cluster. These events may include nodes joining or leaving the cluster; resource events caused by failures, maintenance, or scheduled activities; and other administrative actions. To achieve the desired availability, Pacemaker may start and stop resources and fence nodes. * **Cluster tools:** These provide an interface for users to interact with the cluster. Various command-line and graphical (GUI) interfaces are available. Most managed services are not, themselves, cluster-aware. 
However, many popular open-source cluster filesystems make use of a common *Distributed Lock Manager* (DLM), which makes direct use of Corosync for its messaging and membership capabilities and Pacemaker for the ability to fence nodes. .. image:: ../shared/images/pcmk-stack.png :alt: Example cluster stack - :scale: 75 % :align: center Pacemaker Architecture ______________________ Pacemaker itself is composed of multiple daemons that work together: * pacemakerd * pacemaker-attrd * pacemaker-based * pacemaker-controld * pacemaker-execd * pacemaker-fenced * pacemaker-schedulerd .. image:: ../shared/images/pcmk-internals.png :alt: Pacemaker software components - :scale: 65 % :align: center The Pacemaker master process (pacemakerd) spawns all the other daemons, and respawns them if they unexpectedly exit. The *Cluster Information Base* (CIB) is an `XML `_ representation of the cluster's configuration and the state of all nodes and resources. The *CIB manager* (pacemaker-based) keeps the CIB synchronized across the cluster, and handles requests to modify it. The *attribute manager* (pacemaker-attrd) maintains a database of attributes for all nodes, keeps it synchronized across the cluster, and handles requests to modify them. These attributes are usually recorded in the CIB. Given a snapshot of the CIB as input, the *scheduler* (pacemaker-schedulerd) determines what actions are necessary to achieve the desired state of the cluster. The *local executor* (pacemaker-execd) handles requests to execute resource agents on the local cluster node, and returns the result. The *fencer* (pacemaker-fenced) handles requests to fence nodes. Given a target node, the fencer decides which cluster node(s) should execute which fencing device(s), and calls the necessary fencing agents (either directly, or via requests to the fencer peers on other nodes), and returns the result. 
The *controller* (pacemaker-controld) is Pacemaker's coordinator, maintaining a
consistent view of the cluster membership and orchestrating all the other
components.

Pacemaker centralizes cluster decision-making by electing one of the controller
instances as the 'Designated Controller' ('DC'). Should the elected DC process
(or the node it is on) fail, a new one is quickly established. The DC responds
to cluster events by taking a current snapshot of the CIB, feeding it to the
scheduler, then asking the executors (either directly on the local node, or via
requests to controller peers on other nodes) and the fencer to execute any
necessary actions.

.. note:: **Old daemon names**

   The Pacemaker daemons were renamed in version 2.0. You may still find
   references to the old names, especially in documentation targeted to
   version 1.1.

   .. table::

      +-------------------+---------------------+
      | Old name          | New name            |
      +===================+=====================+
      | attrd             | pacemaker-attrd     |
      +-------------------+---------------------+
      | cib               | pacemaker-based     |
      +-------------------+---------------------+
      | crmd              | pacemaker-controld  |
      +-------------------+---------------------+
      | lrmd              | pacemaker-execd     |
      +-------------------+---------------------+
      | stonithd          | pacemaker-fenced    |
      +-------------------+---------------------+
      | pacemaker_remoted | pacemaker-remoted   |
      +-------------------+---------------------+

Node Redundancy Designs
_______________________

Pacemaker supports practically any `node redundancy configuration `_ including
*Active/Active*, *Active/Passive*, *N+1*, *N+M*, *N-to-1* and *N-to-N*.

Active/passive clusters with two (or more) nodes using Pacemaker and `DRBD `_
are a cost-effective high-availability solution for many situations. One of the
nodes provides the desired services, and if it fails, the other node takes
over.

..
image:: ../shared/images/pcmk-active-passive.png :alt: Active/Passive Redundancy :align: center - :scale: 75 % Pacemaker also supports multiple nodes in a shared-failover design, reducing hardware costs by allowing several active/passive clusters to be combined and share a common backup node. .. image:: ../shared/images/pcmk-shared-failover.png :alt: Shared Failover :align: center - :scale: 75 % When shared storage is available, every node can potentially be used for failover. Pacemaker can even run multiple copies of services to spread out the workload. This is sometimes called N to N Redundancy. .. image:: ../shared/images/pcmk-active-active.png :alt: N to N Redundancy :align: center - :scale: 75 % .. rubric:: Footnotes .. [#] *Cluster* is sometimes used in other contexts to refer to hosts grouped together for other purposes, such as high-performance computing (HPC), but Pacemaker is not intended for those purposes.