Background ################## First of all, go look at the diagram (comms.gif), read this, then look at the diagram again. Next, some terminology... Here I will use CRM to refer to the light blue section. That is, the entire collection of processes/daemons/modules on a node that, as a whole, manage resources in the cluster. CRMd refers to one of the dark blue boxes. It is the "master subsystem" if you like. Its role is to co-ordinate the actions of all the other pieces of the puzzle, including those on other nodes. Key points from the diagram: - All communications with the CRM are done with Heartbeat messages routed through the CRMd. These messages contain a text representation of an XML document, the schema of which is outlined at the end of this document. - All communications internal to the CRM (ie. between its subsystems) is performed with IPC messages. Again all messages are routed via the CRMd and contain the same XML documents as Heartbeat messages. - All admin clients (eventually) end up sending Heartbeat messages and are thus subject to existing HA client security is available. - The RPC layer allows the cluster to be controled from non-member hosts (subject to RPC security which is available for free). - The option of syncronous or asynchronous RPC calls will be provided. This will probably be in the form of a flag sent as part of the function call. Advantages: - The only source of "requests" is the CRMd which means it *never* has to forward on "request" messages for any of it sub-systems. This is useful for the security of the system (see security.txt). - Potentially, most CRM<-->CRM communications can be replaced with RPC calls. - We are able to re-use existing security mechanisms (IPC, HA, RPC, unix_auth via RPC) to protect the system. Message scenarios: ################## There are really only 3 messaging scenarios in this system (exluding broadcast vs. point-to-point). Again this is nice as it keeps down the number of "special cases". 1) Sub-system <--(IPC)--> CRM <--(IPC)--> Sub-system 2) Sub-system <--(IPC)--> Local CRM <--(Heartbeat)--> Remote CRM <--(IPC)--> Sub-system 3) Admin Client <--(Heartbeat Broadcast)--> Remote CRM <--(IPC)--> Sub-system Message examples: ################## 1.1) the DC telling the local LRM to start a resource 1.2) the LRM asking the CIB about a resource 2.1) the DC telling a remote LRM to start a resource 2.2) the DC asking (all) the CIB(s) to provide their view of the world 3.1) an admin request to add/remove/modify a resource 3.2) an admin request to force a failover of a resource or a recomputation of the resource dependancies. Message Notes: ################## Messages may be sent to the CRM from local sub-systems via IPC or from other HA clients via Heartbeat. It is then the responsibility of the CRM to unpack the message and pass it on to the correct sub-system. If the destination sub-system is the DC and the DC is not running on the current node, the message is discarded without error. Where the DC recieves a message from another node, it will also keep track of the sending host and the reference number so that it can direct the replies appropriately. The exception to this is where the message is from the DC. Messages to the DC are *always* sent as broadcast messages and the DC *must always* acknowledge the message with either the results of the message or a "thankyou" message. The reason for this is that the DC may change or a DC election may be in progress. The implication of this is that the sender should always set a timer and resend dc_messages if they have not been acknowledged. The DC will be able to detect duplicates by examining the destination sub-system and the reference number and we will rely on HA to ensure the delivery of DC responses. All messages are full crm_messages. I toyed with only sending the *_request or *_response piece of the message to and/or from the relevant sub-systems, but it just got messy. This way, the routing role of the CRM is much easier. And easier equals lower complexity, which means less bugs, which is good for everyone. Schema Notes: ################### Key Attributes =============== reference: provides the ability to track which request a responce is in relation to and where the local CRM should send it. *_filter: allow the operation to be limited to a particular type, id and/or priority timeout: allows the receiver to know how long the sender is expecting the task to take so we can act and report back accordingly. Attribute values ================= Where the list ends with |... , the complete list of possibilities will be fleshed out at a later date. Message Schema: ################### filter_type? #CDATA filter_id? #CDATA>