diff --git a/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst index 133e79096f..60421644ea 100644 --- a/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst +++ b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst @@ -1,345 +1,342 @@ Multi-Site Clusters and Tickets ------------------------------- -.. Convert_to_RST: - - Apart from local clusters, Pacemaker also supports multi-site clusters. - That means you can have multiple, geographically dispersed sites, each with a - local cluster. Failover between these clusters can be coordinated - manually by the administrator, or automatically by a higher-level entity called - a 'Cluster Ticket Registry (CTR)'. - - == Challenges for Multi-Site Clusters == - - Typically, multi-site environments are too far apart to support - synchronous communication and data replication between the sites. - That leads to significant challenges: - - - How do we make sure that a cluster site is up and running? - - - How do we make sure that resources are only started once? - - - How do we make sure that quorum can be reached between the different - sites and a split-brain scenario avoided? - - - How do we manage failover between sites? - - - How do we deal with high latency in case of resources that need to be - stopped? - - In the following sections, learn how to meet these challenges. - - == Conceptual Overview == - - Multi-site clusters can be considered as “overlay” clusters where - each cluster site corresponds to a cluster node in a traditional cluster. - The overlay cluster can be managed by a CTR in order to - guarantee that any cluster resource will be active - on no more than one cluster site. This is achieved by using - 'tickets' that are treated as failover domain between cluster - sites, in case a site should be down. - - The following sections explain the individual components and mechanisms - that were introduced for multi-site clusters in more detail. 
- - === Ticket === - - Tickets are, essentially, cluster-wide attributes. A ticket grants the - right to run certain resources on a specific cluster site. Resources can - be bound to a certain ticket by +rsc_ticket+ constraints. Only if the - ticket is available at a site can the respective resources be started there. - Vice versa, if the ticket is revoked, the resources depending on that - ticket must be stopped. - - The ticket thus is similar to a 'site quorum', i.e. the permission to - manage/own resources associated with that site. (One can also think of the - current +have-quorum+ flag as a special, cluster-wide ticket that is granted in - case of node majority.) - - Tickets can be granted and revoked either manually by administrators - (which could be the default for classic enterprise clusters), or via - the automated CTR mechanism described below. - - A ticket can only be owned by one site at a time. Initially, none - of the sites has a ticket. Each ticket must be granted once by the cluster - administrator. - - The presence or absence of tickets for a site is stored in the CIB as a - cluster status. With regards to a certain ticket, there are only two states - for a site: +true+ (the site has the ticket) or +false+ (the site does - not have the ticket). The absence of a certain ticket (during the initial - state of the multi-site cluster) is the same as the value +false+. - - === Dead Man Dependency === - - A site can only activate resources safely if it can be sure that the - other site has deactivated them. However after a ticket is revoked, it can - take a long time until all resources depending on that ticket are stopped - "cleanly", especially in case of cascaded resources. To cut that process - short, the concept of a 'Dead Man Dependency' was introduced. - - If a dead man dependency is in force, if a ticket is revoked from a site, the - nodes that are hosting dependent resources are fenced. 
This considerably speeds - up the recovery process of the cluster and makes sure that resources can be - migrated more quickly. - - This can be configured by specifying a +loss-policy="fence"+ in - +rsc_ticket+ constraints. - - === Cluster Ticket Registry === - - A CTR is a coordinated group of network daemons that automatically handles - granting, revoking, and timing out tickets (instead of the administrator - revoking the ticket somewhere, waiting for everything to stop, and then - granting it on the desired site). - - Pacemaker does not implement its own CTR, but interoperates with external - software designed for that purpose (similar to how resource and fencing agents - are not directly part of pacemaker). - - Participating clusters run the CTR daemons, which connect to each other, exchange - information about their connectivity, and vote on which sites gets which - tickets. - - A ticket is granted to a site only once the CTR is sure that the ticket - has been relinquished by the previous owner, implemented via a timer in most - scenarios. If a site loses connection to its peers, its tickets time out and - recovery occurs. After the connection timeout plus the recovery timeout has - passed, the other sites are allowed to re-acquire the ticket and start the - resources again. - - This can also be thought of as a "quorum server", except that it is not - a single quorum ticket, but several. - - === Configuration Replication === - - As usual, the CIB is synchronized within each cluster, but it is 'not' synchronized - across cluster sites of a multi-site cluster. You have to configure the resources - that will be highly available across the multi-site cluster for every site - accordingly. +Apart from local clusters, Pacemaker also supports multi-site clusters. +That means you can have multiple, geographically dispersed sites, each with a +local cluster. 
Failover between these clusters can be coordinated
+manually by the administrator, or automatically by a higher-level entity called
+a *Cluster Ticket Registry (CTR)*.
+
+Challenges for Multi-Site Clusters
+##################################
+
+Typically, multi-site environments are too far apart to support
+synchronous communication and data replication between the sites.
+That leads to significant challenges:
+
+- How do we make sure that a cluster site is up and running?
+
+- How do we make sure that resources are only started once?
+
+- How do we make sure that quorum can be reached between the different
+  sites and a split-brain scenario avoided?
+
+- How do we manage failover between sites?
+
+- How do we deal with high latency in case of resources that need to be
+  stopped?
+
+In the following sections, learn how to meet these challenges.
+
+Conceptual Overview
+###################
+
+Multi-site clusters can be considered “overlay” clusters where
+each cluster site corresponds to a cluster node in a traditional cluster.
+The overlay cluster can be managed by a CTR in order to
+guarantee that any cluster resource will be active
+on no more than one cluster site. This is achieved by using
+*tickets*, which act as a failover domain between cluster
+sites in case a site goes down.
+
+The following sections explain the individual components and mechanisms
+that were introduced for multi-site clusters in more detail.
+
+Ticket
+______
+
+Tickets are, essentially, cluster-wide attributes. A ticket grants the
+right to run certain resources on a specific cluster site. Resources can
+be bound to a certain ticket by ``rsc_ticket`` constraints. Only if the
+ticket is available at a site can the respective resources be started there.
+Vice versa, if the ticket is revoked, the resources depending on that
+ticket must be stopped.
+
+A ticket is thus similar to a *site quorum*, i.e. the permission to
+manage/own resources associated with that site.
(One can also think of the
+current ``have-quorum`` flag as a special, cluster-wide ticket that is
+granted in case of node majority.)
+
+Tickets can be granted and revoked either manually by administrators
+(which could be the default for classic enterprise clusters), or via
+the automated CTR mechanism described below.
+
+A ticket can only be owned by one site at a time. Initially, none
+of the sites has a ticket. Each ticket must be granted once by the cluster
+administrator.
+
+The presence or absence of tickets for a site is stored in the CIB as a
+cluster status. With regard to a certain ticket, there are only two states
+for a site: ``true`` (the site has the ticket) or ``false`` (the site does
+not have the ticket). The absence of a certain ticket (during the initial
+state of the multi-site cluster) is the same as the value ``false``.
+
+Dead Man Dependency
+___________________
+
+A site can only activate resources safely if it can be sure that the
+other site has deactivated them. However, after a ticket is revoked, it can
+take a long time until all resources depending on that ticket are stopped
+"cleanly", especially in case of cascaded resources. To cut that process
+short, the concept of a *Dead Man Dependency* was introduced.
+
+If a dead man dependency is in force and a ticket is revoked from a site, the
+nodes hosting dependent resources are fenced. This considerably speeds
+up the recovery process of the cluster and makes sure that resources can be
+migrated more quickly.
+
+This can be configured by specifying a ``loss-policy="fence"`` in
+``rsc_ticket`` constraints.
+
+Cluster Ticket Registry
+_______________________
+
+A CTR is a coordinated group of network daemons that automatically handles
+granting, revoking, and timing out tickets (instead of the administrator
+revoking the ticket somewhere, waiting for everything to stop, and then
+granting it on the desired site).
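The manual procedure that such a CTR automates can be sketched with ``crm_ticket`` (an illustrative sequence only, not a literal transcript; ``ticketA`` is an example ticket name, and these commands may additionally require ``--force``):

```shell
# On the site that currently holds ticketA: give up the ticket
crm_ticket --ticket ticketA --revoke

# Wait until all resources that depend on ticketA have stopped cleanly
# at that site, then run this on the site that should take over:
crm_ticket --ticket ticketA --grant
```

A CTR performs these steps automatically, and only grants the ticket to the new site once it is sure the previous owner has relinquished it.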
+
+Pacemaker does not implement its own CTR, but interoperates with external
+software designed for that purpose (similar to how resource and fencing agents
+are not directly part of Pacemaker).
+
+Participating clusters run the CTR daemons, which connect to each other, exchange
+information about their connectivity, and vote on which sites get which
+tickets.
+
+A ticket is granted to a site only once the CTR is sure that the ticket
+has been relinquished by the previous owner, implemented via a timer in most
+scenarios. If a site loses connection to its peers, its tickets time out and
+recovery occurs. After the connection timeout plus the recovery timeout has
+passed, the other sites are allowed to re-acquire the ticket and start the
+resources again.
+
+This can also be thought of as a "quorum server", except that it is not
+a single quorum ticket, but several.
+
+Configuration Replication
+_________________________
+
+As usual, the CIB is synchronized within each cluster, but it is *not* synchronized
+across cluster sites of a multi-site cluster. You have to configure the resources
+that will be highly available across the multi-site cluster for every site
+accordingly.

.. _ticket-constraints:

Configuring Ticket Dependencies
###############################

-.. Convert_to_RST_2:
-
-   The `rsc_ticket` constraint lets you specify the resources depending on a certain
-   ticket. Together with the constraint, you can set a `loss-policy` that defines
-   what should happen to the respective resources if the ticket is revoked.
-
-   The attribute `loss-policy` can have the following values:
-
-   * +fence:+ Fence the nodes that are running the relevant resources.
-
-   * +stop:+ Stop the relevant resources.
-
-   * +freeze:+ Do nothing to the relevant resources.
-
-   * +demote:+ Demote relevant resources that are running in master mode to slave mode.
- - - .Constraint that fences node if +ticketA+ is revoked - ==== - [source,XML] - ------- - - ------- - ==== - - The example above creates a constraint with the ID +rsc1-req-ticketA+. It - defines that the resource +rsc1+ depends on +ticketA+ and that the node running - the resource should be fenced if +ticketA+ is revoked. - - If resource +rsc1+ were a promotable resource (i.e. it could run in master or - slave mode), you might want to configure that only master mode - depends on +ticketA+. With the following configuration, +rsc1+ will be - demoted to slave mode if +ticketA+ is revoked: - - .Constraint that demotes +rsc1+ if +ticketA+ is revoked - ==== - [source,XML] - ------- - - ------- - ==== - - You can create multiple `rsc_ticket` constraints to let multiple resources - depend on the same ticket. However, `rsc_ticket` also supports resource sets - (see <>), - so one can easily list all the resources in one `rsc_ticket` constraint instead. - - .Ticket constraint for multiple resources - ==== - [source,XML] - ------- - - - - - - - - - - - ------- - ==== - - In the example above, there are two resource sets, so we can list resources - with different roles in a single +rsc_ticket+ constraint. There's no dependency - between the two resource sets, and there's no dependency among the - resources within a resource set. Each of the resources just depends on - +ticketA+. - - Referencing resource templates in +rsc_ticket+ constraints, and even - referencing them within resource sets, is also supported. - - If you want other resources to depend on further tickets, create as many - constraints as necessary with +rsc_ticket+. - - - == Managing Multi-Site Clusters == - - === Granting and Revoking Tickets Manually === - - You can grant tickets to sites or revoke them from sites manually. - If you want to re-distribute a ticket, you should wait for - the dependent resources to stop cleanly at the previous site before you - grant the ticket to the new site. 
-
-   Use the `crm_ticket` command line tool to grant and revoke tickets.
-
-   ////
-   These commands will actually just print a message telling the user that they
-   require '--force'. That is probably a good exercise rather than letting novice
-   users cut and paste '--force' here.
-   ////
-
-   To grant a ticket to this site:
-   -------
-   # crm_ticket --ticket ticketA --grant
-   -------
-
-   To revoke a ticket from this site:
-   -------
-   # crm_ticket --ticket ticketA --revoke
-   -------
-
-   [IMPORTANT]
-   ====
-   If you are managing tickets manually, use the `crm_ticket` command with
+The **rsc_ticket** constraint lets you specify the resources depending on a certain
+ticket. Together with the constraint, you can set a **loss-policy** that defines
+what should happen to the respective resources if the ticket is revoked.
+
+The attribute **loss-policy** can have the following values:
+
+* ``fence:`` Fence the nodes that are running the relevant resources.
+
+* ``stop:`` Stop the relevant resources.
+
+* ``freeze:`` Do nothing to the relevant resources.
+
+* ``demote:`` Demote relevant resources that are running in master mode to slave mode.
+
+.. topic:: Constraint that fences a node if ``ticketA`` is revoked
+
+   .. code-block:: xml
+
+      <rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" ticket="ticketA" loss-policy="fence"/>
+
+The example above creates a constraint with the ID ``rsc1-req-ticketA``. It
+defines that the resource ``rsc1`` depends on ``ticketA`` and that the node running
+the resource should be fenced if ``ticketA`` is revoked.
+
+If resource ``rsc1`` were a promotable resource (i.e. it could run in master or
+slave mode), you might want to configure that only master mode
+depends on ``ticketA``. With the following configuration, ``rsc1`` will be
+demoted to slave mode if ``ticketA`` is revoked:
+
+.. topic:: Constraint that demotes ``rsc1`` if ``ticketA`` is revoked
+
+   .. code-block:: xml
+
+      <rsc_ticket id="rsc1-req-ticketA-master" rsc="rsc1" rsc-role="Master" ticket="ticketA" loss-policy="demote"/>
+
+You can create multiple **rsc_ticket** constraints to let multiple resources
+depend on the same ticket.
However, **rsc_ticket** also supports resource sets
+(see :ref:`s-resource-sets`), so one can easily list all the resources in one
+**rsc_ticket** constraint instead.
+
+.. topic:: Ticket constraint for multiple resources
+
+   .. code-block:: xml
+
+      <rsc_ticket id="resources-dep-ticketA" ticket="ticketA" loss-policy="fence">
+        <resource_set id="resources-dep-ticketA-0" role="Started">
+          <resource_ref id="rsc1"/>
+          <resource_ref id="group1"/>
+          <resource_ref id="clone1"/>
+        </resource_set>
+        <resource_set id="resources-dep-ticketA-1" role="Master">
+          <resource_ref id="ms1"/>
+        </resource_set>
+      </rsc_ticket>
+
+In the example above, there are two resource sets, so we can list resources
+with different roles in a single ``rsc_ticket`` constraint. There's no dependency
+between the two resource sets, and there's no dependency among the
+resources within a resource set. Each of the resources just depends on
+``ticketA``.
+
+Referencing resource templates in ``rsc_ticket`` constraints, and even
+referencing them within resource sets, is also supported.
+
+If you want other resources to depend on further tickets, create as many
+constraints as necessary with ``rsc_ticket``.
+
+Managing Multi-Site Clusters
+############################
+
+Granting and Revoking Tickets Manually
+______________________________________
+
+You can grant tickets to sites or revoke them from sites manually.
+If you want to re-distribute a ticket, you should wait for
+the dependent resources to stop cleanly at the previous site before you
+grant the ticket to the new site.
+
+Use the **crm_ticket** command line tool to grant and revoke tickets.
+
+To grant a ticket to this site:
+
+   .. code-block:: none
+
+      # crm_ticket --ticket ticketA --grant
+
+To revoke a ticket from this site:
+
+   .. code-block:: none
+
+      # crm_ticket --ticket ticketA --revoke
+
+.. important::
+
+   If you are managing tickets manually, use the **crm_ticket** command with
   great care, because it cannot check whether the same ticket is already
   granted elsewhere.
-   ====
-
-
-   === Granting and Revoking Tickets via a Cluster Ticket Registry ===
-
-   We will use https://github.com/ClusterLabs/booth[Booth] here as an example of
-   software that can be used with pacemaker as a Cluster Ticket Registry.
Booth - implements the - http://en.wikipedia.org/wiki/Raft_%28computer_science%29[Raft] - algorithm to guarantee the distributed consensus among different - cluster sites, and manages the ticket distribution (and thus the failover - process between sites). - - Each of the participating clusters and 'arbitrators' runs the Booth daemon - `boothd`. - - An 'arbitrator' is the multi-site equivalent of a quorum-only node in a local - cluster. If you have a setup with an even number of sites, - you need an additional instance to reach consensus about decisions such - as failover of resources across sites. In this case, add one or more - arbitrators running at additional sites. Arbitrators are single machines - that run a booth instance in a special mode. An arbitrator is especially - important for a two-site scenario, otherwise there is no way for one site - to distinguish between a network failure between it and the other site, and - a failure of the other site. - - The most common multi-site scenario is probably a multi-site cluster with two - sites and a single arbitrator on a third site. However, technically, there are - no limitations with regards to the number of sites and the number of - arbitrators involved. - - `Boothd` at each site connects to its peers running at the other sites and - exchanges connectivity details. Once a ticket is granted to a site, the - booth mechanism will manage the ticket automatically: If the site which - holds the ticket is out of service, the booth daemons will vote which - of the other sites will get the ticket. To protect against brief - connection failures, sites that lose the vote (either explicitly or - implicitly by being disconnected from the voting body) need to - relinquish the ticket after a time-out. Thus, it is made sure that a - ticket will only be re-distributed after it has been relinquished by the - previous site. The resources that depend on that ticket will fail over - to the new site holding the ticket. 
The nodes that have run the - resources before will be treated according to the `loss-policy` you set - within the `rsc_ticket` constraint. - - Before the booth can manage a certain ticket within the multi-site cluster, - you initially need to grant it to a site manually via the `booth` command-line - tool. After you have initially granted a ticket to a site, `boothd` - will take over and manage the ticket automatically. - - [IMPORTANT] - ==== - The `booth` command-line tool can be used to grant, list, or - revoke tickets and can be run on any machine where `boothd` is running. - If you are managing tickets via Booth, use only `booth` for manual - intervention, not `crm_ticket`. That ensures the same ticket + +Granting and Revoking Tickets via a Cluster Ticket Registry +___________________________________________________________ + +We will use `Booth `_ here as an example of +software that can be used with pacemaker as a Cluster Ticket Registry. Booth +implements the `Raft `_ +algorithm to guarantee the distributed consensus among different +cluster sites, and manages the ticket distribution (and thus the failover +process between sites). + +Each of the participating clusters and *arbitrators* runs the Booth daemon +**boothd**. + +An *arbitrator* is the multi-site equivalent of a quorum-only node in a local +cluster. If you have a setup with an even number of sites, +you need an additional instance to reach consensus about decisions such +as failover of resources across sites. In this case, add one or more +arbitrators running at additional sites. Arbitrators are single machines +that run a booth instance in a special mode. An arbitrator is especially +important for a two-site scenario, otherwise there is no way for one site +to distinguish between a network failure between it and the other site, and +a failure of the other site. + +The most common multi-site scenario is probably a multi-site cluster with two +sites and a single arbitrator on a third site. 
However, technically, there are
+no limitations with regard to the number of sites and the number of
+arbitrators involved.
+
+**Boothd** at each site connects to its peers running at the other sites and
+exchanges connectivity details. Once a ticket is granted to a site, the
+booth mechanism will manage the ticket automatically: If the site which
+holds the ticket is out of service, the booth daemons will vote which
+of the other sites will get the ticket. To protect against brief
+connection failures, sites that lose the vote (either explicitly or
+implicitly by being disconnected from the voting body) need to
+relinquish the ticket after a time-out. This ensures that a
+ticket will only be re-distributed after it has been relinquished by the
+previous site. The resources that depend on that ticket will fail over
+to the new site holding the ticket. The nodes that have run the
+resources before will be treated according to the **loss-policy** you set
+within the **rsc_ticket** constraint.
+
+Before Booth can manage a certain ticket within the multi-site cluster,
+you initially need to grant it to a site manually via the **booth** command-line
+tool. After you have initially granted a ticket to a site, **boothd**
+will take over and manage the ticket automatically.
+
+.. important::
+
+   The **booth** command-line tool can be used to grant, list, or
+   revoke tickets and can be run on any machine where **boothd** is running.
+   If you are managing tickets via Booth, use only **booth** for manual
+   intervention, not **crm_ticket**. That ensures the same ticket
   will only be owned by one cluster site at a time.
-   ====
-
-   ==== Booth Requirements ====
-
-   * All clusters that will be part of the multi-site cluster must be based on
-   Pacemaker.
-
-   * Booth must be installed on all cluster nodes and on all arbitrators that will
-   be part of the multi-site cluster.
-
-   * Nodes belonging to the same cluster site should be synchronized via NTP.
However, - time synchronization is not required between the individual cluster sites. - - === General Management of Tickets === - - Display the information of tickets: - ------- - # crm_ticket --info - ------- - - Or you can monitor them with: - ------- - # crm_mon --tickets - ------- - - Display the +rsc_ticket+ constraints that apply to a ticket: - ------- - # crm_ticket --ticket ticketA --constraints - ------- - - When you want to do maintenance or manual switch-over of a ticket, - revoking the ticket would trigger the loss policies. If - +loss-policy="fence"+, the dependent resources could not be gracefully - stopped/demoted, and other unrelated resources could even be affected. - - The proper way is making the ticket 'standby' first with: - ------- - # crm_ticket --ticket ticketA --standby - ------- - - Then the dependent resources will be stopped or demoted gracefully without - triggering the loss policies. - - If you have finished the maintenance and want to activate the ticket again, - you can run: - ------- - # crm_ticket --ticket ticketA --activate - ------- - - == For more information == - - * https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html[SUSE's Geo Clustering quick start] - - * https://github.com/ClusterLabs/booth[Booth] + +Booth Requirements +~~~~~~~~~~~~~~~~~~ + +* All clusters that will be part of the multi-site cluster must be based on + Pacemaker. + +* Booth must be installed on all cluster nodes and on all arbitrators that will + be part of the multi-site cluster. + +* Nodes belonging to the same cluster site should be synchronized via NTP. However, + time synchronization is not required between the individual cluster sites. + +General Management of Tickets +_____________________________ + +Display the information of tickets: + + .. code-block:: none + + # crm_ticket --info + +Or you can monitor them with: + + .. 
code-block:: none
+
+      # crm_mon --tickets
+
+Display the ``rsc_ticket`` constraints that apply to a ticket:
+
+   .. code-block:: none
+
+      # crm_ticket --ticket ticketA --constraints
+
+If you want to do maintenance on a ticket, or switch it over manually,
+revoking the ticket outright would trigger the loss policies. If
+``loss-policy="fence"``, the dependent resources could not be gracefully
+stopped/demoted, and other unrelated resources could even be affected.
+
+The proper way is to put the ticket in *standby* mode first:
+
+   .. code-block:: none
+
+      # crm_ticket --ticket ticketA --standby
+
+Then the dependent resources will be stopped or demoted gracefully without
+triggering the loss policies.
+
+When you have finished the maintenance and want to activate the ticket again,
+you can run:
+
+   .. code-block:: none
+
+      # crm_ticket --ticket ticketA --activate
+
+For more information
+####################
+
+* `SUSE's Geo Clustering quick start <https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html>`_
+
+* `Booth <https://github.com/ClusterLabs/booth>`_