This is a clustered resource group manager layered on top of Magma, a single API which can talk to multiple cluster infrastructures via their native APIs. This resource manager requires both magma and one or more plugins to be available (the dumb plugin will work just fine, but will only get you one node...).

Note: I tend to use "service" and "resource group" to mean the same thing. While they are slightly different, "services" were simply collections of different things (IPs, file systems, etc.), as are resource groups.

Introduction

The resource manager is a daemon which provides failover of user-defined resources collected into groups. It is a direct descendant of the Service Manager from Red Hat Cluster Manager 1.2, which is a descendant of Cluster Manager 1.0, which is a descendant of Kimberlite 1.1.0. The primary reason for the rename was to avoid confusion with the Service Manager present in David Teigland's Symmetric Cluster Architecture paper and software, with which this application must collaborate.

The simplicity, ease of maintenance, and stability in the field have led us to preserve as much of the method of operation of Cluster Manager's Service Manager as possible, but certain aspects needed to be removed in order to provide a more flexible framework for users.

At this point, the resource manager is designed primarily for cold failover (i.e. the application has to restart entirely). This allows a lot of flexibility in the sense that most off-the-shelf applications can benefit from increased availability with minimal effort. However, it may be extended to support warm and even hot failover if necessary, though these models often require application modification - which is outside the scope of this README.

Goals and Requirements

(1) Supersede Red Hat Cluster Manager's capabilities with respect to service modification. Red Hat Cluster Manager's resource model was based on an old model from Kimberlite 1.1.0. In this model, services were entirely monolithic and had to be disabled in order to be changed. This meant that if one had an NFS service with a list of clients and a new client was added, all other clients had to lose access in order to add the new client. In short, provide for on-line service modification and have intelligence about what to restart/reload after a new configuration is received. Note that this intelligence is not yet complete, but the framework exists for it to be completed.

(2) Be able to use the CMAN/DLM/SM and GuLM cluster infrastructures. This is done via magma.

(3) Use CCS for all configuration data.

(4) Be able to queue requests for services. This is forward-looking towards things such as event scheduling (e.g. disable this resource group at this time in the future).

(5) Provide an extensible set of rules which define resource structure. This is accomplished by providing rule sets which essentially define how resources may be defined in CCS. These rule sets attempt to be OCF RA API compliant. This has the side effect of partially solving (1) for us.

(6) Use a distributed model for resource group/service state. This is currently done with a port of the View-Formation code from Cluster Manager 1.2. Ideally, it would be nice to use LVBs in the DLM or GuLM to distribute this state, but it probably requires more than 64 bytes (the DLM's limit). Another model is client-server, like NFS, where clients tell the new master what resource groups they have. However, this has implications if the server fails while it was running an RG.
(7) The combination of (3) and (6) gives us no dependency on shared storage.

(8) Be practical. Try to follow the model from clumanager 1.x as much as possible without sacrificing flexibility. It should be easy for an existing clumanager 1.2 user to convert to this RM. While it will be impossible to perform a rolling upgrade from clumanager 1.2 to this RM, it should be easy to convert an existing installation's configuration information.

(9) Try to be OCF compliant in how the resource scripts are written. While this is a goal, it is not a guarantee at this point.

(10) Work with the same fencing model as clumanager and GFS (i.e. "top-down" - the infrastructure handles it). Resource-based fencing, for example, is specifically not a goal at this point.

(11) Be scalable without being overly complex ;) Note that we're not very scalable at the moment (VF is too slow for this kind of thing).

(12) Introduce other failover/balancing policies (instead of just failover domains).

Directory Structure

    include/        Include files
    src/daemons/    The RM daemon and clurmtabd (from RHEL3), which keeps /var/lib/nfs/rmtab in sync with clustered exports, or tries to.
    src/utils/      Home of various utilities, including clustat and clusvcadm.
    src/clulib/     Library functions which aren't necessarily tied to a specific daemon. This was more important in clumanager 1.2 and is mostly cruft at this point.
    src/resources/  Shell scripts and resource XML rules for various resources. Includes a DTD to validate a given resource XML rule.

Implementation - Handling of Client Requests

The resource manager currently uses a threaded model with a producer thread and a set of consumer threads, each with its own work queue.

    Client <--(result)---------+
      |                        |
    (req)                      |
      |                 [Handle request]
      |                        ^
      v                        |
    Listener --(req)--> Resource group thread

In a typical example, the client sends a request to the listener thread. The listener accepts it, hands it off to a resource group queue, and immediately resumes listening for requests. Queueing the request on a resource group thread's queue has the side effect of spawning a new resource group thread if one is needed. The resource group thread pulls the request off of its work queue and handles it. After the request completes, the resource group thread sends the result of the operation back to the client. If no other requests are pending and the last operation left the resource group in a non-running state (stopped, disabled, etc.), the resource group thread cleans itself up and exits.

The resource manager handles failed resource groups in a simplistic manner which is the same as how Red Hat Cluster Manager handled failover. If a resource group fails a status check, the resource manager stops the affected RG. If the stop operation succeeds, the RM tries to restart the RG. If this succeeds, the RM is done. However, if it fails to restart/recover the resource group, it again stops the RG and attempts to determine the best available online node based on the failover domain or (not yet implemented) the least-loaded cluster member. If none are available, or all members fail to start the RG, the RG is disabled.

Cluster Events

Cluster events are provided by a cluster-abstraction library called Magma. Magma currently runs with CMAN/DLM, GuLM, or no cluster infrastructure at all (e.g. one node, an always-quorate pseudo-cluster). Magma provides group membership lists, the state of the cluster quorum (if one exists), cluster locking, and (in some cases) barriers - so the RM needs to implement none of the above.

Cluster events are handed to the listener thread via a file descriptor and can affect resource group states. For instance, when a member is no longer quorate, all resource groups are stopped immediately (and, quite normally, the member is fenced before this actually completes!). Whenever a node transition occurs (or, in the SM case, a node joins the requisite service group), all RG states are evaluated to see if they should be moved about, started, or failed over, depending on the failover domain (or other policies which are not yet implemented).

The Resource Tree

In clumanager 1.x (and Kimberlite, for that matter), the attributes of the various resources were read one by one, as required, inside of a large shell script. This had several benefits (hard dependency enforcement, maintenance, support, etc.), but it had several drawbacks as well, the most important being a lack of flexibility.

In contrast to clumanager 1.x, resource groups are now modeled in the daemon's RAM as tree structures, with all their requisite attributes loaded from CCS based on external XML rules. After a resource is started, the daemon follows down the tree and starts all dependent children. Before a resource is stopped, all of its dependent children are stopped first. Because of this structure, it is possible to add or restart, for instance, an "NFS client" resource without affecting its parent "export" resource. By determining the delta between resource lists and/or resource trees, it is possible to automatically restart a node in the tree and all its dependent children in the event of a configuration change.

The tree can be summarized as follows:

    group
        ip address...
        file system...
            NFS export...
                NFS client...
        samba share(s)...
        script...
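As an illustration, such a group might be described in CCS roughly as follows. This is only a sketch: the actual element and attribute names are defined by the XML rules shipped in src/resources/, and the ones used here (ip, fs, nfsexport, nfsclient, script and their attributes) are illustrative.

    <!-- hypothetical resource group; element names come from the resource rules -->
    <group name="nfs_group">
        <ip address="10.1.1.10"/>
        <fs name="data" device="/dev/vg0/data" mountpoint="/mnt/data">
            <nfsexport name="exports">
                <nfsclient name="admin" target="10.1.1.0/24" options="rw"/>
            </nfsexport>
        </fs>
        <script name="appscript" file="/etc/init.d/app"/>
    </group>

Starting the group starts each resource before its dependent children; stopping it stops the children first. With a layout like this, the 'nfsclient' entry can be added, removed, or restarted without touching its parent 'nfsexport' or the file system above it.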
Resource Agents

Resource agents are scripts or executables which handle operations for a given resource (such as start, stop, restart, etc.). See the OCF RA API v1.0 for more details on how resource agents work:

    http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/

The resource agents and their handling should be OCF compliant (at least in how they are called), primarily because it is fairly nonsensical to require third-party application developers to write multiple, differing application scripts in order to have their application work in different cluster environments on an otherwise similar system [1].

The XML Resource Rules

The XML resource rules are encapsulated in the resource agent scripts, which attempt to follow the OCF RA API 1.0. The RA API, however, does not entirely fit the goals of this project as it exists today, so there are a few extensions to the ra-api-1.dtd [2]:

* Addition of a 'required' tag to parameter elements. This allows the UI and/or the resource group manager to determine whether a resource is configured properly without having to ask the resource agent itself.

* Addition of a 'primary' tag to parameter elements. This allows the resource tree to reference resource instances using the 'ref=' tag in the configuration.

* Addition of an 'inherit' tag to allow resources to inherit a parent resource's parameters.

Additionally, there is some data which is used in the 'special' element of the RA API DTD.

Description of the rgmanager-specific block:

* root: This is the root resource type. This should generally only exist in the 'group' resource.

* maxinstances: This is the maximum number of instances of this resource which may exist in the resource tree.

Description of a valid child rule type (a resource type or attribute which may be declared as a child of this resource type):

* type: Resource type which is to be declared as a valid child of this resource type.

* start: Start level of this child type. All resource children at level '1' are started before all resource children at level '2'. Valid values are 1-99.

* stop: Stop level of this child type. All resource children at level '1' are stopped before all resource children at level '2'. Valid values are 1-99.
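As a sketch of what a rule looks like with these extensions, the metadata for a hypothetical 'foo' resource type (with a 'bar' child type) might read roughly as follows. The layout follows the extended ra-api-1.dtd described above, but the names, values, and exact element nesting here are purely illustrative; the rules shipped in src/resources/ are authoritative.

    <?xml version="1.0"?>
    <resource-agent name="foo" version="rgmanager 2.0">
        <parameters>
            <!-- 'primary': the attribute by which the rest of the tree
                 may reference this instance (ref=). -->
            <parameter name="name" primary="1">
                <shortdesc lang="en">Name of this foo instance</shortdesc>
                <content type="string"/>
            </parameter>

            <!-- 'required': lets the UI/rgmanager flag a missing value
                 without asking the agent itself. -->
            <parameter name="config" required="1">
                <shortdesc lang="en">Path to the foo configuration file</shortdesc>
                <content type="string"/>
            </parameter>

            <!-- 'inherit': take the value from the parent resource when
                 it is not set here. -->
            <parameter name="logdir" inherit="logdir">
                <shortdesc lang="en">Log directory</shortdesc>
                <content type="string"/>
            </parameter>
        </parameters>

        <special tag="rgmanager">
            <!-- rgmanager-specific data: instance limit for this type and
                 the valid child types with their start/stop levels
                 (layout illustrative). -->
            <attributes maxinstances="10"/>
            <child type="bar" start="1" stop="2"/>
        </special>
    </resource-agent>

Here 'bar' children at start level 1 would be started before any child type declared at level 2 (and stopped according to their stop levels in the same fashion), and a 'group'-like root rule would additionally carry the root attribute in its rgmanager-specific block.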
Failover Domains

A failover domain is an ordered subset of cluster members to which a resource group may be bound. The following semantics govern how the different configuration options affect the behavior of a resource group which is bound to a particular failover domain:

* Restricted domain: Resource groups bound to the domain may only run on cluster members which are also members of the failover domain. If no members of the failover domain are available, the resource group is placed in the stopped state.

* Unrestricted domain: Resource groups bound to this domain may run on all cluster members, but will run on a member of the domain whenever one is available. This means that if a resource group is running outside of the domain and a member of the domain transitions online, the resource group will migrate to that cluster member.

* Ordered domain: Nodes in an ordered domain are assigned a priority level from 1 to 100, with priority 1 being the highest and 100 the lowest. The highest-priority online member of the domain will run a resource group bound to that domain. This means that if member A has a higher priority than member B, the resource group will migrate to A (if it was running on B) when A transitions from offline to online.

* Unordered domain: Members of the domain have no order of preference; any member may run the resource group. However, even in an unordered domain, resource groups will always migrate to a member of their failover domain whenever possible.

Ordering and restriction are flags and may be combined in any way (i.e. ordered+restricted, unordered+unrestricted, etc.). These combinations affect both where resource groups start after initial quorum formation and which cluster members will take over a resource group in the event that the resource group, or the member running it, has failed (without being recoverable on that member).

Failover Domains (Examples)

Given a cluster composed of this set of members: {A, B, C, D, E, F, G}.

Ordered, restricted failover domain {A(1), B(2), C(3)}: A resource group 'S' will always run on member 'A' whenever member 'A' is online and there is a quorum. If all members of {A, B, C} are offline, the resource group will be stopped. If the resource group is running on 'C' and 'A' transitions online, the resource group will migrate to 'A'. (A configuration sketch for this domain appears below.)

Unordered, restricted failover domain {A, B, C}: A service 'S' will only run if there is a quorum and at least one member of {A, B, C} is online. If another member of the domain transitions online, the service does not relocate.

Ordered, unrestricted failover domain {A(1), B(2), C(3)}: A resource group 'S' will run whenever there is a quorum. If a member of the failover domain is online, the resource group will run on the highest-ranking member; that is, if 'A' is online, the resource group will run on 'A'.

Unordered, unrestricted failover domain {A, B, C}: This is also called a "set of preferred members". When one or more members of the failover domain are online, the service will run on a nonspecific online member of the failover domain. If another member of the failover domain transitions online, the service does not relocate.
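For the first (ordered, restricted) example above, the failover domain and the binding of 'S' to it might be expressed in CCS along these lines. This is only a sketch: the element and attribute names used here (failoverdomain, failoverdomainnode, ordered, restricted, priority, domain) are illustrative, and the authoritative syntax is whatever the shipped rules and CCS schema define.

    <failoverdomains>
        <!-- Ordered + restricted domain containing A, B, and C;
             lower priority number = more preferred. -->
        <failoverdomain name="prefer_A" ordered="1" restricted="1">
            <failoverdomainnode name="A" priority="1"/>
            <failoverdomainnode name="B" priority="2"/>
            <failoverdomainnode name="C" priority="3"/>
        </failoverdomain>
    </failoverdomains>

    <!-- Bind resource group 'S' to that domain. -->
    <group name="S" domain="prefer_A">
        <!-- resources for S go here -->
    </group>

With a configuration like this, 'S' runs on A whenever A is online and the cluster is quorate, falls back to B and then C in that order, and is stopped if no member of {A, B, C} is available.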
Notes:

[1] Lars Marowsky-Bree pointed this out, which makes perfect sense. The OCF RA API attempts to follow LSB conventions for how init scripts handle starting, stopping, and status of daemons and such, so it was directly in line with where we were going with this RM.

[2] This is derived from the OCF RA metadata DTD: http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/ra-api-1.dtd