DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT!
NOTICE: Some ideas in this paper aren't yet well sorted. Some ideas aren't
complete. Some phrasings I am myself not yet happy with. Some ideas need
further explanation. Most of the ideas presented are not final yet. It is
mostly a braindump.
And did I mention yet that this is still a DRAFT!!!!!! ?
Title: Design of a Cluster Resource Manager
Revision: $Id: crm.txt,v 1.2 2003/12/01 14:10:14 lars Exp $
Author: Lars Marowsky-Brée <lmb@suse.de>
Acknowledgements: Andrew Beekhof <andrew@beekhof.net>
Luis Claudio R. Goncalves <lclaudio@conectiva.com.br>
Fábio Olivé Leite <olive@conectiva.com.br>
Alan Robertson <alanr@unix.sh>
XX. Global TODO for this document
In this section I keep track of tasks which I still want to perform on this
document; probably of little interest to anyone else, and this section should
be gone in the final version ;-)
- Break out major parts (CIB, Policy Engine etc) into their own documents.
- Add references to sub-features used and do not replicate too much
information here. References should take the form of the feature id from the
feature list.
- Ensure unified use of words and terms when referring to the
components.
- Convert to docbook (not pressing)
0. Abstract
This paper outlines the design of a clustered recovery/resource manager
to run on top of, and enhance, the Open Clustering Framework (OCF)
infrastructure provided by heartbeat.
The goal is to allow flexible resource allocation and globally ordered
actions in a cluster of N nodes and dynamic reallocation of resources in
case of failures (ie, "fail-over") or administrative changes to the
cluster ("switch-over").
This new Cluster Resource Manager is intended to be a replacement for the
currently used "resource manager" of heartbeat.
1. Introduction
1.1. Requirements overview
The CRM needs to provide the following properties and functionality:
- Secure.
- Simple.
heartbeat already provides these two properties; the new resource manager
must not be any less secure. For the sanity and stability of both the
developers and, in particular, the users, complexity needs to be kept to a
minimum (but no simpler).
- Support for more than 2 nodes.
- Fault tolerant.
(Failures can be node, network or resource failures.)
- Ability to deal with complex policies for resource allocation and
dependencies. Examples:
- Support for a globally ordered start sequence,
- Support for feedback in the allocation process to better support
replicated resources,
- Negated dependencies etc
- Ability to make administrative requests for resource migration, adding new
resources, removing resources et cetera while online.
- Extensible framework.
1.2. Scenario description
The design outlined in this document is aimed at a cluster with the following
properties; some of them will be further specified later.
- The cluster provides a Consensus Membership Layer, as outlined in the OCF
documents and provided by the CCM implementation by Ram Pai.
(This provides all nodes in a partition with a common and agreed upon
view of cluster membership, which tries to compute the largest set of
fully connected nodes.)
- It is possible for the nodes in a given partition to order all nodes
in the partition in the same way, without the need for a distributed
decision. This can be either achieved by having the membership be
returned in the same order on all nodes, or by attaching a
distinguishing attribute to each node.
- The cluster provides a communication mechanism for unicast as well as
broadcast messaging.
Messages don't necessarily satisfy any special ordering, but they are
atomic.
- Byzantine failures are rare and self-contained. Stable storage is stable and
not otherwise corrupted; network packets aren't spuriously generated etc.
These errors are self-contained to the respective components which take
appropriate precautions to prevent them from propagating upwards.
(Checksums, authentication, error recovery or fail-fast)
- Time is synchronized cluster-wide. Even in the face of a network
partition, an upper bound on clock divergence can safely be assumed.
This can be virtually guaranteed by running NTP across all nodes in
the cluster.
- An IO fencing mechanism is available in the cluster.
The fencing mechanism provides definitive feedback on whether a given
fencing request succeeded or not.
- A node knows which resources it currently holds and their state
(running / failed / stopped); it can provide this information if
queried and will inform the CRM of state changes.
(This part will be provided by the Local Resource Manager.)
2.1. Basic algorithm
The basic algorithm can be summarized as follows:
a) Every partition elects a "Designated Coordinator"; this node will
activate special logic to coordinate the recovery and administrative
actions on all nodes in the cluster.
a.1) The DC has the full state of the cluster available (or is able to
retrieve it), as well as an up-to-date copy of the administrative
policies, information about fenced nodes etc; this shall further be
referred to as the "Cluster Information Base".
b) Whenever a cluster event occurs, be it an administrative request, a node
failure reported by membership services or a resource failure reported by a
participating LRM, it is forwarded to the "Designated Coordinator".
b1) For administrative requests, the DC arbitrates whether or not they will be
accepted into the CIB and serializes these updates; ie, policy changes which
cannot be satisfied or would lead to an inconsistent state of the cluster will
be rejected (unless explicitly overridden).
c) It then computes via the Policy Engine:
c.1) the new CIB
c.2) The Transition Graph, a dependency-ordered graph of the actions
necessary to go from the current cluster state as close as possible to
the cluster state described by the CIB.
d) Leading the transition to the target state:
d1) Replicating the new CIB to all clients.
d2) Executing each step of the transition graph in dependency order.
(Potentially parallelized.)
e) Exception handling if _any_ event or failure occurs:
e1) The algorithm is aborted cleanly; pending operations are allowed to
finish, but no new commands are issued to clients (in particular during
phase d2)
e2) The algorithm is invoked again from scratch.
(It is obvious that there is room for optimization here by only
recomputing smaller parts of the dependency tree or not broadcasting the
full CIB every time, in particular if the DC has not been re-elected.
However, these complicate the implementation and are not necessary for
the first phase.)
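To make the control flow of a) through e) a bit more concrete, here is a
minimal, purely illustrative Python sketch. Every name in it is a placeholder
and not part of any actual CRM interface:

  class TransitionAborted(Exception):
      """A new cluster event arrived while the transition was running (e1)."""

  def run_recovery(collect_state, compute_cib, compute_graph,
                   replicate_cib, dispatch, pending_events):
      """One pass of the algorithm; it is re-invoked from scratch on abort (e2)."""
      state = collect_state()                # a.1) gather the CIB inputs
      cib = compute_cib(state)               # c.1) compute the new CIB
      graph = compute_graph(state, cib)      # c.2) dependency-ordered action list
      replicate_cib(cib)                     # d1) push the new CIB to all nodes
      for action in graph:                   # d2) dependency order, parallelizable
          if pending_events():
              raise TransitionAborted()      # e1) stop issuing new commands
          dispatch(action)                   # relayed to the owning node's LRM

  # Toy invocation with stand-in callables, just to show the control flow:
  run_recovery(collect_state=lambda: {},
               compute_cib=lambda state: {},
               compute_graph=lambda state, cib: ["stop r1 n2", "start r1 n1"],
               replicate_cib=lambda cib: None,
               dispatch=print,
               pending_events=lambda: False)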
2.2. Feature analysis
This meets the requirements listed as follows:
- It is reasonably simple to have a policy engine capable of dealing
with more than two nodes as distributed decisions are avoided as far
as possible; only a single node leads the transition and coordinates
participating nodes.
Local nodes are only required to know their own state; whenever the
coordinator fails, the necessary state can simply be reconstructed by the
newly elected DC.
- An event can be anything from a failed node or a request to add a new policy
rule (ie, a new resource, a change to an allocation policy etc); this
satisfies the requirement to deal with administrative requests at runtime.
- Support for new kinds of policies, types of resources, ... can be added via
the policy engine.
- The modular design keeps complexity in any single component down.
2.3. Stability of the algorithm
This algorithm will eventually converge if no new events occur for a
sufficiently long period of time.
TODO: Add a discussion on factors affecting stability or kill the section
entirely.
2.4. Components
This approach neatly divides the task into various components: (see
crm-flowchart.fig!)
- Cluster infrastructure
- heartbeat
- Consensus Cluster Membership
- Messaging
- Cluster Resource Manager
- Policy Engine
- Cluster Information Base
- Transitioner
- Local Resource Manager
- Executioner
- Resource Agents
3. Local Resource Manager
Note: This section only documents the requirements on the LRM from the point
of view of the cluster-wide resource management. For a more detailed
explanation, see the documentation by lclaudio.
[ TODO: Add reference to lclaudio's document as soon as available ]
This component knows which resources the node currently holds, their status
(running, running/failed, stopping, stopped, stopped/failed, etc) and can
provide this information to the CRM. It can start, restart and stop resources
on demand. It can provide the CRM with a list of supported resource types.
It will initiate any recovery action which is applicable and limited to the
local node and escalate all other events to the CRM.
Any correct node in the cluster is running this service. Any node which fails
to run this service will be evicted and fenced.
It has access to the CIB for the resource parameters when starting a resource.
NOTE: Because the monitoring operation may need to be invoked with the same
instance parameters the resource was started with in order to succeed, the
LRM needs to keep a copy of that data, because the CIB might change at
runtime.
4. Cluster Resource Manager
The CRM coordinates all non-local interactions in the cluster. It interacts
with:
- The membership layer
- The local resource managers
- non-local CRMs
- Administrative commands
- Fencing functionality
Only one node is running the "Designated Coordinator" CRM at any given time in
any given partition. All other nodes forward their input to this node only,
and will relay its decisions to the local LRM.
The coordinator is a "primus inter pares"; in theory, any CRM can act in this
fashion, but the arbitration algorithm will distinguish a designated node.
4.1. How to deal with failure of the designated CRM
There are two major groups of failures here; if the entire node running the
CRM fails, the underlying membership algorithm shall take care of this.
If the CRM logic fails, this shall be detected by internal consistency checks,
and local heartbeating to apphbd must stop immediately, effectively committing
'suicide' and providing a fail-fast mechanism. However, for now we will
assume that the CRM does not fail. Coping with internal failures
internally is always difficult.
This can later be enhanced by providing 'peer monitoring' of the CRMs among
each other and initiating fencing if necessary; however, this has many pitfalls
and is not inherently better.
4.2. Election algorithm for the DC
The election algorithm exploits the fact that there is a global ordering of
the nodes in a given partition; it will simply select the first one. The node
will know that it has been "distinguished" and take control.
As the first action in the algorithm is to collect the status from all
nodes, this will inform every node about the newly elected leader. Should
any of them have been a DC until the new membership was formed (in the case
of cluster partitions healing), it will cease operation and hand over
control in an orderly fashion.
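As a purely illustrative sketch (sorting by node name stands in for the
distinguishing attribute mentioned in 1.2; none of this is actual CRM code),
the rule boils down to:

  def elect_dc(membership):
      """Return the designated coordinator for this partition.

      'membership' is assumed to be the node list as delivered by the
      membership layer; since every node sees the same ordering, every node
      computes the same answer locally, without any distributed decision."""
      return sorted(membership)[0]

  # Every node in the partition arrives at the same DC without messaging:
  assert elect_dc(["nodeb", "nodea", "nodec"]) == "nodea"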
4.3. Consistency audits
The DC can perform a variety of consistency checks:
- Exclusive allocation of resources
- All nodes have the necessary Resource Agents for the resources which might
be running on them
- Administrative requests (ie, rule additions) can be rejected if they would
prevent the policy engine from computing a valid target state
Implementing any of the audits is optional.
4.4. Communication between the CRM and LRM
- Every resource in the system is clearly identified in the CIB via a "UUID".
This is the key passed around between CRM/LRM.
(This avoids issues seen with FailSafe, which used a "resource name / type"
combination as the key; that made it very difficult to have, for example,
two filesystems mounted at /usr/sap - the production system and the test
system - because that is a key clash. It is, however, a perfectly legal
combination as long as the resources are not activated on the same node,
which can be ensured by an appropriate negative dependency. Combined with
"resource priority", the higher-priority production system will push off the
lower-priority test system and operation will continue.)
TODO: Protocol and channel need to be decided; based on the heartbeat IPC
layer, wrapper library so they don't have to know all the gory details. AI
lclaudio
4.5. Executing the Transition Graph
It is also the task of the DC to execute the computed dependency graph.
The graph will be traversed and evaluated in dependency-compliant order.
Every node in the graph corresponds to a single action; the links between them
correspond to the dependencies.
Only nodes for which all dependencies have been satisfied will be executed;
the CRM is allowed to parallelize these however.
If a task cannot be successfully executed and ultimately has to be considered
failed, this will be treated as a hard barrier - all currently pending tasks
will run to completion, appropriate constraints will be added to the CIB
("resource X cannot start on node N1", for example) and the graph execution
will be aborted with a failure; this escalates the recovery back to the
higher level again (ie, triggers a rerun of the recovery algorithm).
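A rough sketch of these execution rules (the graph encoding and all names are
assumptions made for illustration only):

  def execute(graph, run_action):
      """graph: dict mapping each action to the set of actions it depends on.
      run_action: callable returning True on success, False on failure."""
      done = set()
      remaining = {a: set(deps) for a, deps in graph.items()}
      failed = False
      while remaining and not failed:
          ready = [a for a, deps in remaining.items() if deps <= done]
          if not ready:
              break                        # the rest can never be satisfied
          for action in ready:             # these could be dispatched in parallel
              if failed:
                  break                    # hard barrier: issue nothing new
              del remaining[action]
              if run_action(action):
                  done.add(action)
              else:
                  failed = True            # pending work finishes, nothing new starts
      return not failed and not remaining

  # Toy example: the web server may only be started once its IP address is up.
  toy = {"start ip on n1": set(), "start web on n1": {"start ip on n1"}}
  print(execute(toy, run_action=lambda action: True))    # -> True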
TODO: This needs more thought. Especially the failure paths could simply be
treated just as normal nodes in the graph, simplifying the logic here; and for
some tasks where re-running the recovery algorithm is pointless, the Policy
Engine could precompute alternate actions (like marking the resource
hierarchy failed from that node upwards etc). Could be a possible future
extension not implemented in v1.0.
TODO: A pure dependency graph as outlined doesn't easily allow expressing "OR"
conditions, only "AND". "OR" conditions might be helpful if a node could
be fenced via multiple mechanisms, and if each one should be tried in order as
a fallback; this could be implemented by allowing a node in the graph to be a
_list_ of actions to be tried in order.
5. Cluster Information Base
The CIB also runs on every node in the cluster. In essence, it provides
a distributed database with weak transactional semantics, exploiting the fact
that all updates are serialized by the DC and that each node itself knows
its own latest status.
5.1. Contents of the CIB
The CIB is divided into two major parts; the "configuration" data and the
"runtime" data.
a) The configuration part of the database is set up by the
administrator.
The configuration present on any given node is appropriately versioned; a
combination of timestamp and generation counter seems sensible. Thus the most
recent version available in a partition can always be clearly identified and
selected.
- Configured resources
- Resource identifier
- Resource instance parameters
- Special node attributes
- Administrative policies
- Resource placement constraints
- Resource dependencies
b) Runtime information
This includes:
- Information about the currently participating nodes in the partition.
- Resource Agents present on each node
- Resources active on the node and their status
- Operational status of the node itself
- Fencing data
- results: the timestamped result of a fencing request.
- metadata: each node contributes the list of fencing devices available
to it.
TODO: Maybe this is part of the "static" configuration data?
- Dynamic administrative policies and constraints:
- Temporary constraints to deny placing a resource on a node, ie in response to
failures etc.
- Resource migration requests by the administrator.
(These might be ignored by the policy engine if otherwise a consistent
state cannot be computed, or because they have a limited life time, for
example "until the node has booted again".)
The runtime data can be constructed by merging all available data in the
partition, exploiting the fact that every node holds authoritative data on
itself.
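Purely as an illustration of the split described above (every key, value and
identifier below is invented for this example), a CIB might look roughly like
this:

  example_cib = {
      "configuration": {                   # set up by the administrator, versioned
          "version": {"timestamp": 1070287814, "generation": 42},
          "resources": {
              "f0e1d2c3": {                # resource UUID, the CRM/LRM key
                  "type": "IPaddr",
                  "parameters": {"ip": "10.0.0.10"},
              },
          },
          "constraints": [
              {"kind": "placement", "resource": "f0e1d2c3",
               "nodes": ["n1", "n2"]},
          ],
      },
      "runtime": {                         # merged from per-node authoritative data
          "nodes": {
              "n1": {"status": "member", "agents": ["IPaddr"],
                     "resources": {"f0e1d2c3": "running"},
                     "timestamp": 1070287900},
              "n2": {"status": "fenced", "timestamp": 1070287950,
                     "fence_result": {"ok": True}},
          },
          "dynamic_constraints": [
              {"kind": "deny", "resource": "f0e1d2c3", "node": "n2",
               "expires": "until node n2 rejoins"},
          ],
      },
  }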
5.2. Process of generating an up-to-date CIB
If necessary, the DC will retrieve the CIB from each node, compute the
up-to-date CIB from this data and broadcast the result to all nodes.
The algorithm for merging the CIBs is rather straightforward:
- Select the most recent configuration from all nodes.
Note that this relies on the fact that an administrator has the wits not to
force incompatible changes onto separate cluster partitions; if he does, some
configuration changes might be lost (or rather, overridden by the others),
but that is a classical PEBKAC anyway.
More complex merging algorithms could always be devised later for this step;
the straightforward approach is to complain loudly about this problem as it
can be clearly detected, and to also try to reject configuration changes in
a degraded cluster unless forced.
If any node in the cluster has a valid copy of the configuration, the
cluster will be able to proceed. The case where this is not true is very
unlikely; it requires at least a double failure to occur. The worst case in
this scenario is that some configuration changes might be reverted.
- Merge the runtime data.
- A node present in the partition is assumed to always have authoritative data
on itself, its current status, the resources it holds etc. This overrides any
other data about it from other nodes.
("Normative power of facts", as opposed to rumours.)
- For nodes not present in the partition, their latest status can be
identified from the partial information present on other nodes because it
is timestamped; ie, it is possible to say whether they were cleanly
shut down, whether they have already been fenced cleanly or whether a
fencing attempt has failed etc.
- Update timestamps and generation counters as appropriate.
- Commit locally and broadcast.
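A hedged sketch of this merge procedure, reusing the illustrative dict layout
from 5.1 (the version comparison and all field names are assumptions, not a
fixed format):

  def merge_cibs(local_cibs):
      """local_cibs: node name -> that node's last locally stored CIB copy."""
      # Select the most recent configuration (generation counter, then timestamp).
      def version(cib):
          v = cib["configuration"]["version"]
          return (v["generation"], v["timestamp"])
      config = max(local_cibs.values(), key=version)["configuration"]

      # Merge the runtime data: a member is authoritative about itself ...
      nodes = {}
      for name, cib in local_cibs.items():
          own = cib["runtime"]["nodes"].get(name)
          if own is not None:
              nodes[name] = own
      # ... while for absent nodes the newest timestamped rumour is kept.
      for cib in local_cibs.values():
          for other, data in cib["runtime"]["nodes"].items():
              if other not in local_cibs:
                  known = nodes.get(other, {})
                  if data.get("timestamp", 0) >= known.get("timestamp", 0):
                      nodes[other] = data

      return {"configuration": config, "runtime": {"nodes": nodes}}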
5.3. How are updates to the CIB handled?
Any updates to the configuration will be serialized by the DC; they will be
verified, committed to its own CIB and also broadcast to all nodes in the
partition as an incremental update.
Any updates pertaining to the runtime portion of the CIB can simply be
broadcast to all nodes; the DC will receive them too, and all other nodes can
save them for further reference so they will be available if the (new) DC
needs to compute a new CIB.
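As an illustration of the two update paths (the message format and transport
primitives below are placeholders, not a real heartbeat API):

  def submit_update(update, is_dc,
                    forward_to_dc, broadcast, verify, commit_local):
      if update["section"] == "configuration":
          if not is_dc:
              forward_to_dc(update)        # the DC serializes all config changes
          elif verify(update):
              commit_local(update)         # commit to the DC's own CIB first
              broadcast({"kind": "incremental", "update": update})
      else:                                # runtime data needs no serialization
          broadcast({"kind": "runtime", "update": update})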
6. Policy Engine
6.1. Functionality provided
Even if only simple dependencies (probably just the two required constraint
types below) are supported at first, which is straightforward to implement,
the model can easily be extended.
6.1.1. Required constraints
- To be started on the same node as resX after resX
- To only be started on a node from the set {A, B, C}
6.1.2. Future extensions
- To be started after resX, but without tying them both to a given node, ie
providing a global ordering
- To NOT be run on the same node as resX (or generically, negated constraints)
- To only be placed on a node with the attribute FOO (for example, connected
to the FC-AL rack, or with at least XX MB of memory, and so on)
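To illustrate how these constraint kinds might be represented internally (all
class and field names below are invented for this sketch):

  from dataclasses import dataclass

  @dataclass
  class Colocate:          # required: same node as resX, started after resX
      resource: str
      with_resource: str

  @dataclass
  class Placement:         # required: only on a node from the set {A, B, C}
      resource: str
      allowed_nodes: frozenset

  @dataclass
  class OrderAfter:        # future: global order without tying both to one node
      resource: str
      after_resource: str

  @dataclass
  class AntiColocate:      # future: NOT on the same node as resX
      resource: str
      away_from: str

  @dataclass
  class NodeAttribute:     # future: only on nodes carrying a given attribute
      resource: str
      attribute: str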
6.2. Thoughts about algorithm implementation
The process of arriving at a target state / transition graph for the cluster
from the set of rules and the current cluster state (basically, all
information in the CIB is available) can be implemented by a "constraint
solver".
- Analyse dependency graph in the configuration;
- Compute eligible nodes for each subtree (intersection of nodes
configured for resources within a dependency subtree)
- Order eligible nodes by priority, stickiness etc and select target node
- All other eligible nodes either have to be part of the partition or must
be fenced successfully.
...
TODO: UNFINISHED
TODO: Alternate implementation: steal the Finite Domain solver from GNU Prolog,
or the constraint solver from
http://www.cs.washington.edu/research/constraints/cassowary/ etc; this
would allow policies to be expressed as intuitive constraints. Rather
cool indeed. Needs evaluation.
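Independently of which solver ends up being used, the eligible-node step from
the outline above could look roughly like this (data shapes and names are
assumptions):

  def eligible_nodes(subtree, allowed, preference):
      """subtree: resource ids forming one dependency subtree.
      allowed: resource id -> set of nodes it is configured for.
      preference: node -> sort key (priority, stickiness, ...); higher wins."""
      candidates = set.intersection(*(allowed[r] for r in subtree))
      return sorted(candidates, key=preference, reverse=True)

  # Example: a web server and its IP address must end up on a common node.
  allowed = {"ip": {"n1", "n2", "n3"}, "web": {"n2", "n3"}}
  print(eligible_nodes(["ip", "web"], allowed,
                       preference=lambda n: {"n2": 2, "n3": 1}[n]))
  # -> ['n2', 'n3']; n2 would be picked, the others must be members or fenced.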
6.4. Design considerations
The following part should answer the most common questions regarding this
approach:
6.4.1. Whether to deal with resource groups or "only" resources and
dependencies?
At first glance, resource groups a la FailSafe are a welcome
simplification. They can be treated as atomic units, resource allocation
policy seems to become simpler, the CRM wouldn't even have to know about
resources but simply allocate/reallocate resource groups. It seems natural to
group resources in this fashion; all related resources are put into a resource
group and done.
However, it is also limiting. Resource groups become awkward if they stop
being a logical mapping between resource and provided service; for example,
the natural answer is to throw everything which should be running on a single
node into a single resource group. If this becomes unnecessary later, resource
group splitting is difficult. (Same for merging)
Dependencies which span resource groups are also commonly requested; ie, a
resource group should not run on the same node as a given other one. They are
also cumbersome if one has to have a resource group for _everything_, even if
it might only be a single resource or two; then the abstraction gains little.
Some decisions also span the boundary between single resources and resource
groups, in which case the CRM has to know about both again. For example, the
allocation policy of a resource group must be in line with how each resource
itself can be allocated.
In short, resource groups seem to fall short of easing bigger clusters and
also lack some flexibility which can only be added at the expense of
complexity.
It seems more natural to me (by now) to only deal with resources,
dependencies between them, and external constraints. I believe that in the
end both are just mindsets, and that neither is inherently harder to
understand than the other; one is just cleaner to implement.
A very important point to keep in mind is that this document reflects the
_internal_ representation of the configuration. An appropriate configuration
wizard could (reasonably easily, although with limited features) present the
user with a 'resource group'-style frontend.
6.4.2. Why a transition graph
The dependency information in the graph will allow operations to be
parallelized to speed up recovery. (All operations for which all requirements
are satisfied can be carried out in parallel.)
The graph is "global" (by virtue of being centralized at the DC) and thus even
allows the possibility of "barriers"; ie, starting a resource only if a
resource has been started on another node.
6.4.3. Who orders resource operations on a single node
The LRM does not have enough information to order resource operations (start,
stop etc) even for the local node, because it might - in theory - depend on
non-local actions (resource operations, fencing etc).
Only the transition graph as constructed by the DC contains such information.
A possible optimisation here is that the DC transmits all operations which do
not depend on external events as a single block to a given node, which can
then proceed accordingly. Initially, just issuing the actions one by one
shall suffice.
7. Executioner
7.1. Integration
The Executioner is responsible for the fencing of nodes on request. It is a
local component on each node and knows which fencing devices are available to
it.
This information is contributed to the CIB.
On request of the CRM (relayed from the DC), it will carry out a fencing
attempt and report the status back. The CRM might of course try to fence a
given node from multiple nodes until one succeeds.
See "executioner.txt" for details.
7.2. Fencing algorithm
At the start of the recovery process, the CRM shall verify the list of devices
reachable by each node; this shall be done as part of querying the current CIB
from them.
It will then retrieve the list of nodes fenceable via each device. If this
list has already been retrieved in the past, it may be reused from cache
appropriately. (Reloaded only when new nodes get added to the cluster, or
something)
For nodes which need to be fenced, appropriate dependencies will need to be
added to the transition graph. These shall ensure that any given device is not
concurrently accessed, and that fencing requests are appropriately retried on
other devices if available.
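One possible way to derive these dependencies (illustrative only; note that a
pure AND graph can only express the ordering, not "try the next device only on
failure", which needs the list-of-actions extension discussed in the TODO
under "Executing the Transition Graph"):

  def fencing_actions(to_fence, devices_for):
      """to_fence: nodes that need fencing; devices_for: node -> ordered devices.
      Returns a graph fragment: action -> set of actions it depends on."""
      graph, last_on_device = {}, {}
      for node in to_fence:
          previous_attempt = None
          for device in devices_for(node):
              action = "fence %s via %s" % (node, device)
              deps = set()
              if device in last_on_device:
                  deps.add(last_on_device[device])  # never hit a device concurrently
              if previous_attempt:
                  deps.add(previous_attempt)        # fallback ordered after earlier try
              graph[action] = deps
              last_on_device[device] = action
              previous_attempt = action
      return graph

  print(fencing_actions(["n3"], devices_for=lambda n: ["stonith1", "meatware"]))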
THOUGHT: Adding "meatware" as an ultima ratio fallback fencing method would
allow disaster resilient setups; if one site failed completely, the
administrator could flip the switch and the transition graph would proceed.
This would neatly address this part of the requirements list.
7.3. Un-fencing
Aborting an on-going fencing / STONITH operation is not supported.
7.3.1. After a successful fencing
Okay, so a node has been fenced. How does it properly rejoin the cluster?
For example, a node n1 can notice an unclean shutdown at startup, and must
assume it has been STONITHed (in particular if its uptime is very low ;-).
When can it clear this flag and initiate recovery / startup actions on its
own?
The answer seems to relate to the "thought" under 6.2: when is recovery
action initiated for a resource group anyway? An additional option could be:
"Proceed if more than 50% of the nodes for the resource group are in our
partition; if it is a tie, do not proceed to initiate recovery if the
'unclean' flag is set on any node."
(This moves the unclean flag into the runtime portion of the CIB)
The flag can be cleared once majority for all resources has been reached
again.
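A small sketch of that rule (reading "any node" as any node the resource
group is configured on; all names are made up for illustration):

  def may_recover(resource_nodes, partition, unclean):
      """resource_nodes: nodes the resource (group) is configured on.
      partition: nodes currently in our partition.
      unclean: nodes carrying the 'unclean shutdown' flag."""
      present = resource_nodes & partition
      if 2 * len(present) > len(resource_nodes):
          return True                            # clear majority: proceed
      if 2 * len(present) == len(resource_nodes):
          return not (resource_nodes & unclean)  # tie: only if nobody is unclean
      return False

  print(may_recover({"n1", "n2"}, {"n1"}, unclean={"n1"}))  # tie + unclean -> False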
7.3.2. Rejoining node while fencing requests are still pending
Answer: Running fencing requests cannot be aborted. In this rather
unfortunate case, the node will most likely disappear soon because of the
fencing request. Provisions might be made to consider this node as "down"
already, to prevent resource pingpong.
TODO: Ok, so n1 tried to fence n2 and vice-versa because of a partition, they
failed via the STONITH device and have fallen back to meatware. The partition
resolves. Shouldn't the system try to abort the meatware request? Maybe a
special case for "meatware" requests? Or should meatware just notice this?
7.3.3. Rejoining node for which fencing has ultimately failed in the past
Two options are basically possible; if a node has rejoined the cluster, the
"fencing failed" flag could be cleared for it, and the cluster could proceed
as normal. (This seems sensible)
The other option would be to request the node to commit suicide. Again, this
is difficult in the case of a resolving cluster partition; both sides would
commit suicide. This doesn't seem sensible.
If the failure was due to a local hang - ie, scheduler bug, power management
running wild etc - this shall be considered outside the scope of this
discussion. The local health monitoring on such a node shall be responsible
for containing such failures and reacting to them appropriately.
8. Interaction with quorum
TODO: Highly unfinished!
Summary: CRM does not need quorum. However, the CRM could easily compute
'quorum' as just another resource.
"Quorum" is in fact not necessary for this design. It is implicit in the
policy engine / CRM which will only bring a resources for which all
dependencies - including fencing - have been satisfied.
This in fact is quorum with slightly finer granularity. It allows the cluster
to proceed in a scenario like:
- {a,b,c,d} form a cluster. {a,b} share resources (called R1) and {c,d} share
resources (called R2). If the cluster is partitioned into {a,b} and {c,d}, no
fencing has to be carried out; cluster operation can proceed as normal.
Of course, as soon as a global resource spanning {a,b,c,d} is added, this in
fact translates to "global quorum".
This makes me think that if global quorum is in fact required, it can best be
expressed in this design by mapping it to such a global ('configured on all
nodes') resource and communicating to the partition that it has quorum if it
was able to recover this resource (or that it lacks quorum if it failed to).
However, it also allows for "sub-quorum"; ie, given the example of an
application requiring quorum to operate, it will usually only be interested in
quorum of the _nodes eligible for the related resources_. So quorum could
potentially be different if reported to different clients...
8.1. Issues wrt quorum
TODO: Certainly ;-)
10. Integration with other projects
10.1. Integration with heartbeat
TODO: Elaborate.
This component is supposed to replace the "resource manager" present in
heartbeat currently.
Issues which need to be addressed / kept in mind:
- Start order; heartbeat should only join the cluster if the startup of CRM is
successfully completed locally.
- CRM makes extensive use of heartbeat's libraries (IPC, HBcomm,
STONITH, PILS etc).
...
10.2. Integration with CCM
TODO: Elaborate, if necessary.
CRM is a client of CCM; CCM provides the set of nodes in the partition, CRM
only operates on this data.
It would be nice if CCM enforced policy before allowing a node to join a
partition; ie, time not vastly desynchronized, CRM/LRM etc running; other
ideas come to mind.
10.3. Integration with Group Services
Should Group Services functionality - in particular, group messaging - be
available one day, the exact semantics will of course have to be taken into
account.
The obvious simplification possible with this component would be that the CRM
could form a distributed process group in the cluster, instead of building on
top of node messaging primitives and filtering those.
10.4. Integration with non-heartbeat clusters
In theory, the software should be able to run in any "compliant" environment;
I'd safely assume that it should be reasonably easy to port on top of the
Compaq CI, for example, or any Service-Availability Forum AIS.
This would complement the respective feature lists as well as demonstrate that
such interoperability is in fact possible, providing great leverage for OCF!
TODO: Keep that in mind while designing the code. Encapsulate
interactions with the lower levels cleanly.
10.5. Integration with cluster-aware applications
Software like cluster-aware volume managers, filesystems or distributed
applications (databases) needs to be integrated into the 'recovery' tree just
like Resource Agent based resources.
This should be reasonably simple - it is a matter of inserting the necessary
triggers in the right order into the transition graph - but these clients
need to be aware that they don't need to provide any fencing themselves etc,
because the external framework has already taken care of this.
10.6. Relation to OCF
All applicable OCF specifications shall be implemented. Areas which OCF does
not specify yet shall be implemented as prototypes, hopefully serving as a
basis for OCF discussion.
11. Monitoring
11.1. Integration with health monitoring
Health monitoring should prevent startup of cluster software on a "sick"
node. ("Hey, I've rebooted 12 times within the last 60 minutes, maybe
bringing the cluster software online wouldn't be so smart!") Should this be
handled by the Local Resource Manager?
Health monitoring can send events to the DC:
- I'm about to crash, migrate everything while you can
- I'm overloaded, please migrate some resources if possible
etc
11.2. Monitoring the CRM
Monitoring software can query CIB to get access to all data.
TODO: Should traps/events be triggered from inside the different components
themselves, or might it be a good idea to allow clients to "subscribe" to
certain parts of the CIB and be notified if a change occurs? This would
be helpful for SNMP/CIM traps.
(The latter would be somewhat like FailSafe's cdbd.)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Attention: Here be dragons. Everything following these lines is unordered
thoughts which haven't yet been incorporated into the grand scheme of things.
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
X. ...
Question: Why isn't moving a resource to another node a basic operation?
Answer: Because it involves more than one node and needs to be coordinated by
the designated CRM.
Question: How can this deal with disaster resilient setups, where one site may
be physically separate and fencing the other side is not possible?
Answer: See the "note" under "Hangman".
