This is a clustered resource group manager layered on top of Magma, a single API which can talk to multiple cluster infrastructures via their native APIs. This resource manager requires both magma and one or more plugins to be available (the dumb plugin will work just fine, but will only get you one node...).

Note: I tend to use "service" and "resource group" to mean the same thing. While they are slightly different, "services" were simply collections of different things (IPs, file systems, etc.), as are resource groups.

Introduction

The resource manager is a daemon which provides failover of user-defined resources collected into groups. It is a direct descendant of the Service Manager from Red Hat Cluster Manager 1.2, which is a descendant of Cluster Manager 1.0, which is a descendant of Kimberlite 1.1.0. The primary reason for the rename was to avoid confusion with the Service Manager present in David Teigland's Symmetric Cluster Architecture paper and software, with which this application must collaborate.

The simplicity, ease of maintenance, and stability in the field have led us to preserve as much of the method of operation of Cluster Manager's Service Manager as possible, but certain aspects needed to be removed in order to provide a more flexible framework for users.

At this point, the resource manager is designed primarily for cold failover (i.e. the application has to restart entirely). This allows a lot of flexibility in the sense that most off-the-shelf applications can benefit from increased availability with minimal effort. However, it may be extended to support warm and even hot failover if necessary, though these models often require application modification - which is outside the scope of this README.

Goals and Requirements

(1) Supersede Red Hat Cluster Manager's capabilities with respect to service modification. Red Hat Cluster Manager's resource model was based on an old model from Kimberlite 1.1.0. In this model, services were entirely monolithic and had to be disabled in order to be changed. This meant that if one had an NFS service with a list of clients and a new client was added, all other clients had to lose access in order to add the new client. In short, provide for on-line service modification and have intelligence about what to restart/reload after a new configuration is received. Note that this intelligence is not yet complete, but the framework exists for it to be completed.

(2) Be able to use the CMAN/DLM/SM and GuLM cluster infrastructures. This is done via magma.

(3) Use CCS for all configuration data.

(4) Be able to queue requests for services. This is forward-looking towards things such as event scheduling (e.g. disable this resource group at this time in the future).

(5) Provide an extensible set of rules which define resource structure. This is accomplished by providing rule sets which essentially define how resources may be defined in CCS. These rule sets attempt to be OCF RA API compliant. This has the side effect of partially solving (1) for us.

(6) Use a distributed model for resource group/service state. This is currently done with a port of the View-Formation code from Cluster Manager 1.2. Ideally, it would be nice to use LVBs in the DLM or GuLM to distribute this state, but it probably requires more than 64 bytes (the DLM's limit). Another model is client-server, like NFS, where clients tell the new master what resource groups they have. However, this has implications if the server fails while it was running an RG.
(7) The combination of (3) and (6) gives us no dependency on shared storage.

(8) Be practical. Try to follow the model from clumanager 1.x as much as possible without sacrificing flexibility. It should be easy for an existing clumanager 1.2 user to convert to this RM. While it will be impossible to perform a rolling upgrade from clumanager 1.2 to this RM, it should be easy to convert an existing installation's configuration information.

(9) Try to be OCF compliant in how the resource scripts are written. While this is a goal, it is not a guarantee at this point.

(10) Work with the same fencing model as clumanager and GFS (i.e. "top-down" - the infrastructure handles it). Resource-based fencing, for example, is specifically not a goal at this point.

(11) Be scalable without being overly complex ;) Note that we're not very scalable at the moment (VF is too slow for this kind of thing).

(12) Introduce other failover/balancing policies (instead of just failover domains).

Directory Structure

    include/        Include files
    src/daemons/    The RM daemon and clurmtabd (from RHEL3), which keeps /var/lib/nfs/rmtab in sync with clustered exports, or tries to.
    src/utils/      Home of various utilities, including clustat and clusvcadm.
    src/clulib/     Library functions which aren't necessarily tied to a specific daemon. This was more important in clumanager 1.2 and is mostly cruft at this point.
    src/resources/  Shell scripts and resource XML rules for various resources. Includes a DTD to validate a given resource XML rule.

Implementation - Handling of Client Requests

The resource manager currently uses a threaded model with a producer thread and a set of consumer threads, each with its own work queue.

    Client <--(result)---------+
      |                        |
    (req)                      |
      |                 [Handle request]
      |                        ^
      v                        |
    Listener --(req)--> Resource group thread

In a typical example, the client sends a request to the listener thread. The listener accepts it, hands it off to a resource group queue, and immediately resumes listening for requests. Queueing the request on a resource group thread's queue has the side effect of spawning a new resource group thread if one is needed. The resource group thread pulls the request off of its work queue and handles it. After the request completes, the resource group thread sends the result of the operation back to the client. If no other requests are pending and the last operation left the resource group in a non-running state (stopped, disabled, etc.), the resource group thread cleans itself up and exits.

The resource manager handles failed resource groups in a simplistic manner which is the same as how Red Hat Cluster Manager handled failover. If a resource group fails a status check, the resource manager stops the affected RG. If the stop operation succeeds, the RM tries to restart the RG. If this succeeds, the RM is done. However, if it fails to restart/recover the resource group, it again stops the RG and attempts to determine the best available online node based on the failover domain or (not yet implemented) the least-loaded cluster member. If none are available, or all members fail to start the RG, the RG is disabled.

Cluster Events

Cluster events are provided by a cluster-abstraction library called Magma. Magma currently runs with CMAN/DLM, GuLM, or no cluster infrastructure at all (e.g. one node, an always-quorate pseudo-cluster). Magma provides group membership lists, the state of the cluster quorum (if one exists), cluster locking, and (in some cases) barriers - so the RM needs to implement none of the above.

Cluster events are handed to the listener thread via a file descriptor and can affect resource group states. For instance, when a member is no longer quorate, all resource groups are stopped immediately (and, quite normally, the member is fenced before this actually completes!). Whenever a node transition occurs (or, in the SM case, a node joins the requisite service group), all RG states are evaluated to see if they should be moved about, started, or failed over, depending on the failover domain (or other policies which are not yet implemented).

The Resource Tree

In clumanager 1.x (and Kimberlite, for that matter), the attributes of the various resources were read one by one, as required, inside of a large shell script. This had several benefits (hard dependency enforcement, maintenance, support, etc.), but it had several drawbacks as well, the most important being a lack of flexibility.

In contrast to clumanager 1.x, resource groups are now modeled in the daemon's RAM as tree structures, with all their requisite attributes loaded from CCS based on external XML rules. After a resource is started, the daemon follows down the tree and starts all dependent children. Before a resource is stopped, all of its dependent children are stopped first. Because of this structure, it is possible to add or restart, for instance, an "NFS client" resource without affecting its parent "export" resource. By determining the delta between resource lists and/or resource trees, it is possible to automatically restart a node in the tree and all its dependent children in the event of a configuration change.

The tree can be summarized as follows:

    group
        ip address...
        file system...
            NFS export...
                NFS client...
        samba share(s)...
        script...
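As an illustration, such a group might be described in CCS roughly as follows. This is only a sketch: the actual element and attribute names are defined by the XML rules shipped in src/resources/, and the ones used here (ip, fs, nfsexport, nfsclient, script and their attributes) are illustrative.

    <!-- hypothetical resource group; element names come from the resource rules -->
    <group name="nfs_group">
        <ip address="10.1.1.10"/>
        <fs name="data" device="/dev/vg0/data" mountpoint="/mnt/data">
            <nfsexport name="exports">
                <nfsclient name="admin" target="10.1.1.0/24" options="rw"/>
            </nfsexport>
        </fs>
        <script name="appscript" file="/etc/init.d/app"/>
    </group>

Starting the group starts each resource before its dependent children; stopping it stops the children first. With a layout like this, the 'nfsclient' entry can be added, removed, or restarted without touching its parent 'nfsexport' or the file system above it.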
Resource Agents

Resource agents are scripts or executables which handle operations for a given resource (such as start, stop, restart, etc.). See the OCF RA API v1.0 for more details on how resource agents work:

    http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/

The resource agents and their handling should be OCF compliant (at least in how they are called), primarily because it is fairly nonsensical to require third-party application developers to write multiple, differing application scripts in order to have their application work in different cluster environments on an otherwise similar system [1].

The XML Resource Rules

The XML resource rules are encapsulated in the resource agent scripts, which attempt to follow the OCF RA API 1.0. The RA API, however, does not entirely fit the goals of this project as it exists today, so there are a few extensions to the ra-api-1.dtd [2]:

* Addition of a 'required' tag to parameter elements. This allows the UI and/or the resource group manager to determine whether a resource is configured properly without having to ask the resource agent itself.

* Addition of a 'primary' tag to parameter elements. This allows the resource tree to reference resource instances using the 'ref=' tag in the configuration.

* Addition of an 'inherit' tag to allow resources to inherit a parent resource's parameters.

Additionally, there is some data which is used in the 'special' element of the RA API DTD.

Description of the rgmanager-specific block:

* root: This is the root resource type. This should generally only exist in the 'group' resource.

* maxinstances: This is the maximum number of instances of this resource which may exist in the resource tree.

Description of a valid child rule type (a resource type or attribute which may be declared as a child of this resource type):

* type: Resource type which is to be declared as a valid child of this resource type.

* start: Start level of this child type. All resource children at level '1' are started before all resource children at level '2'. Valid values are 1-99.

* stop: Stop level of this child type. All resource children at level '1' are stopped before all resource children at level '2'. Valid values are 1-99.
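As a sketch of what a rule looks like with these extensions, the metadata for a hypothetical 'foo' resource type (with a 'bar' child type) might read roughly as follows. The layout follows the extended ra-api-1.dtd described above, but the names, values, and exact element nesting here are purely illustrative; the rules shipped in src/resources/ are authoritative.

    <?xml version="1.0"?>
    <resource-agent name="foo" version="rgmanager 2.0">
        <parameters>
            <!-- 'primary': the attribute by which the rest of the tree
                 may reference this instance (ref=). -->
            <parameter name="name" primary="1">
                <shortdesc lang="en">Name of this foo instance</shortdesc>
                <content type="string"/>
            </parameter>

            <!-- 'required': lets the UI/rgmanager flag a missing value
                 without asking the agent itself. -->
            <parameter name="config" required="1">
                <shortdesc lang="en">Path to the foo configuration file</shortdesc>
                <content type="string"/>
            </parameter>

            <!-- 'inherit': take the value from the parent resource when
                 it is not set here. -->
            <parameter name="logdir" inherit="logdir">
                <shortdesc lang="en">Log directory</shortdesc>
                <content type="string"/>
            </parameter>
        </parameters>

        <special tag="rgmanager">
            <!-- rgmanager-specific data: instance limit for this type and
                 the valid child types with their start/stop levels
                 (layout illustrative). -->
            <attributes maxinstances="10"/>
            <child type="bar" start="1" stop="2"/>
        </special>
    </resource-agent>

Here 'bar' children at start level 1 would be started before any child type declared at level 2 (and stopped according to their stop levels in the same fashion), and a 'group'-like root rule would additionally carry the root attribute in its rgmanager-specific block.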
Failover Domains

A failover domain is an ordered subset of cluster members to which a resource group may be bound. The following semantics govern how the different configuration options affect the behavior of a resource group which is bound to a particular failover domain:

* Restricted domain: Resource groups bound to the domain may only run on cluster members which are also members of the failover domain. If no members of the failover domain are available, the resource group is placed in the stopped state.

* Unrestricted domain: Resource groups bound to this domain may run on all cluster members, but will run on a member of the domain whenever one is available. This means that if a resource group is running outside of the domain and a member of the domain transitions online, the resource group will migrate to that cluster member.

* Ordered domain: Nodes in an ordered domain are assigned a priority level from 1 to 100, with priority 1 being the highest and 100 the lowest. The highest-priority online member of the domain will run a resource group bound to that domain. This means that if member A has a higher priority than member B, the resource group will migrate to A (if it was running on B) when A transitions from offline to online.

* Unordered domain: Members of the domain have no order of preference; any member may run the resource group. However, even in an unordered domain, resource groups will always migrate to a member of their failover domain whenever possible.

Ordering and restriction are flags and may be combined in any way (i.e. ordered+restricted, unordered+unrestricted, etc.). These combinations affect both where resource groups start after initial quorum formation and which cluster members will take over a resource group in the event that the resource group, or the member running it, has failed (without being recoverable on that member).

Failover Domains (Examples)

Given a cluster composed of this set of members: {A, B, C, D, E, F, G}.

Ordered, restricted failover domain {A(1), B(2), C(3)}: A resource group 'S' will always run on member 'A' whenever member 'A' is online and there is a quorum. If all members of {A, B, C} are offline, the resource group will be stopped. If the resource group is running on 'C' and 'A' transitions online, the resource group will migrate to 'A'. (A configuration sketch for this domain appears below.)

Unordered, restricted failover domain {A, B, C}: A service 'S' will only run if there is a quorum and at least one member of {A, B, C} is online. If another member of the domain transitions online, the service does not relocate.

Ordered, unrestricted failover domain {A(1), B(2), C(3)}: A resource group 'S' will run whenever there is a quorum. If a member of the failover domain is online, the resource group will run on the highest-ranking member; that is, if 'A' is online, the resource group will run on 'A'.

Unordered, unrestricted failover domain {A, B, C}: This is also called a "set of preferred members". When one or more members of the failover domain are online, the service will run on a nonspecific online member of the failover domain. If another member of the failover domain transitions online, the service does not relocate.
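For the first (ordered, restricted) example above, the failover domain and the binding of 'S' to it might be expressed in CCS along these lines. This is only a sketch: the element and attribute names used here (failoverdomain, failoverdomainnode, ordered, restricted, priority, domain) are illustrative, and the authoritative syntax is whatever the shipped rules and CCS schema define.

    <failoverdomains>
        <!-- Ordered + restricted domain containing A, B, and C;
             lower priority number = more preferred. -->
        <failoverdomain name="prefer_A" ordered="1" restricted="1">
            <failoverdomainnode name="A" priority="1"/>
            <failoverdomainnode name="B" priority="2"/>
            <failoverdomainnode name="C" priority="3"/>
        </failoverdomain>
    </failoverdomains>

    <!-- Bind resource group 'S' to that domain. -->
    <group name="S" domain="prefer_A">
        <!-- resources for S go here -->
    </group>

With a configuration like this, 'S' runs on A whenever A is online and the cluster is quorate, falls back to B and then C in that order, and is stopped if no member of {A, B, C} is available.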
Notes:

[1] Lars Marowsky-Bree pointed this out, which makes perfect sense. The OCF RA API attempts to follow LSB conventions for how init scripts handle starting, stopping, and status of daemons and such, so it was directly in line with where we were going with this RM.

[2] This is derived from the OCF RA metadata DTD: http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/ra-api-1.dtd