diff --git a/crmd/fsa_defines.h b/crmd/fsa_defines.h
index f8ccfb23bf..43567e78c2 100644
--- a/crmd/fsa_defines.h
+++ b/crmd/fsa_defines.h
@@ -1,499 +1,499 @@
 /*
  * Copyright (C) 2004 Andrew Beekhof
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
  * License as published by the Free Software Foundation; either
  * version 2 of the License, or (at your option) any later version.
  *
  * This software is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  *
  * You should have received a copy of the GNU General Public
  * License along with this library; if not, write to the Free Software
  * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
  */
 #ifndef FSA_DEFINES__H
 # define FSA_DEFINES__H

 /*======================================
  * States the DC/CRMd can be in
  *======================================*/
 enum crmd_fsa_state {
     S_IDLE = 0,             /* Nothing happening */

     S_ELECTION,             /* Take part in the election algorithm as
                              * described below */
     S_INTEGRATION,          /* integrate the status of new nodes (which is
                              * all of them if we have just been elected DC)
                              * to form a complete and up-to-date picture of
                              * the CIB */
     S_FINALIZE_JOIN,        /* integrate the status of new nodes (which is
                              * all of them if we have just been elected DC)
                              * to form a complete and up-to-date picture of
                              * the CIB */
     S_NOT_DC,               /* we are in crmd/slave mode */
     S_POLICY_ENGINE,        /* Determine the next stable state of the cluster */
     S_RECOVERY,             /* Something bad happened, check everything is ok
                              * before continuing and attempt to recover if
                              * required */
     S_RELEASE_DC,           /* we were the DC, but now we aren't anymore,
                              * possibly by our own request, and we should
                              * release all unnecessary sub-systems, finish
                              * any pending actions, do general cleanup and
                              * unset anything that makes us think we are
                              * special :) */
     S_STARTING,             /* we are just starting out */
     S_PENDING,              /* we are not a full/active member yet */
     S_STOPPING,             /* We are in the final stages of shutting down */
     S_TERMINATE,            /* We are going to shut down, this is the equiv of
                              * "Sending TERM signal to all processes" in Linux
                              * and in worst case scenarios could be considered
                              * a self STONITH */
     S_TRANSITION_ENGINE,    /* Attempt to make the calculated next stable
                              * state of the cluster a reality */

     S_HALT,                 /* Freeze - don't do anything
                              * Something bad happened that needs the admin to fix
                              * Wait for I_ELECTION */

     /*  ----------- Last input found in table is above ---------- */
     S_ILLEGAL               /* This is an illegal FSA state */
                             /* (must be last) */
 };
 # define MAXSTATE S_ILLEGAL

 /* A state diagram can be constructed from crmd_fsa.dot with the
    following command:

      dot -Tpng crmd_fsa.dot > crmd_fsa.png

    Description:

      Once we start and do some basic sanity checks, we go into the
      S_NOT_DC state and await instructions from the DC or input from
      the CCM which indicates the election algorithm needs to run.

      If the election algorithm is triggered, we enter the S_ELECTION
      state from where we can either go back to the S_NOT_DC state or
      progress to the S_INTEGRATION state (or S_RELEASE_DC if we used
      to be the DC but aren't anymore).

      The election algorithm has been adapted from
      http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR521

      Loosely known as the Bully Algorithm, its major points are:
-     - Election is initiated by any node (N) notices that the coordinator
+     - Election is initiated when any node (N) notices that the controller
        is no longer responding
      - Concurrent multiple elections are possible
      - Algorithm
        + N sends ELECTION messages to all nodes that occur earlier in
          the CCM's membership list.
-       + If no one responds, N wins and becomes coordinator
-       + N sends out COORDINATOR messages to all other nodes in the
+       + If no one responds, N wins and becomes controller
+       + N sends out CONTROLLER messages to all other nodes in the
          partition
        + If one of the higher-ups answers, it takes over.  N is done.

      Once the election is complete, if we are the DC, we enter the
      S_INTEGRATION state, which is a DC-in-waiting style state.  We are
      the DC, but we shouldn't do anything yet because we may not have
      an up-to-date picture of the cluster.  There may of course be
      times when this fails, so we should go back to the S_RECOVERY
      stage and check everything is ok.  We may also end up here if a
      new node came online, since each node is authoritative on itself
      and we would want to incorporate its information into the CIB.

      Once we have the latest CIB, we then enter the S_POLICY_ENGINE
      state where we invoke the Policy Engine.  It is possible that,
      between invoking the Policy Engine and receiving an answer, we
      receive more input.  In this case we would discard the original
      result and invoke it again.

      Once we are satisfied with the output from the Policy Engine, we
      enter S_TRANSITION_ENGINE and feed the Policy Engine's output to
      the Transition Engine, which attempts to make the Policy Engine's
      calculation a reality.  If the transition completes successfully,
      we enter S_IDLE, otherwise we go back to S_POLICY_ENGINE with the
      current unstable state and try again.

      Of course we may be asked to shut down at any time, however we
      must progress to S_NOT_DC before doing so.  Once we have handed
      over DC duties to another node, we can then shut down like
      everyone else, that is by asking the DC for permission and
      waiting for it to take all our resources away.

      The case where we are the DC and the only node in the cluster is
      a special case and handled as an escalation which takes us to
      S_SHUTDOWN.  Similarly, if any other point in the shutdown fails
      or stalls, this is escalated and we end up in S_TERMINATE.

      At any point, the CRMd/DC can relay messages for its sub-systems,
      but outbound messages (from sub-systems) should probably be
      blocked until S_INTEGRATION (for the DC case) or the join
      protocol has completed (for the CRMd case)
 */

 /*======================================
  *
  * Inputs/Events/Stimuli to be given to the finite state machine
  *
  * Some of these are true events, and others are synthesised based on
  * the "register" (see below) and the contents or source of messages.
  *
  * At this point, my plan is to have a loop of some sort that keeps
  * going until receiving I_NULL
  *
  *======================================*/
 enum crmd_fsa_input {
 /*  0 */
     I_NULL,                 /* Nothing happened */
 /*  1 */
     I_CIB_OP,               /* An update to the CIB occurred */
     I_CIB_UPDATE,           /* An update to the CIB occurred */
     I_DC_TIMEOUT,           /* We have lost communication with the DC */
     I_ELECTION,             /* Someone started an election */
     I_PE_CALC,              /* The Policy Engine needs to be invoked */
     I_RELEASE_DC,           /* The election completed and we were not
                              * elected, but we were the DC beforehand */
     I_ELECTION_DC,          /* The election completed and we were (re-)elected
                              * DC */
     I_ERROR,                /* Something bad happened (more serious than
                              * I_FAIL) and may not have been due to the action
                              * being performed.  For example, we may have lost
                              * our connection to the CIB. */
 /*  9 */
     I_FAIL,                 /* The action failed to complete successfully */
     I_INTEGRATED,
     I_FINALIZED,
     I_NODE_JOIN,            /* A node has entered the cluster */
     I_NOT_DC,               /* We are not and were not the DC before or after
                              * the current operation or state */
     I_RECOVERED,            /* The recovery process completed successfully */
     I_RELEASE_FAIL,         /* We could not give up DC status for some reason */
     I_RELEASE_SUCCESS,      /* We are no longer the DC */
     I_RESTART,              /* The current set of actions needs to be
                              * restarted */
     I_TE_SUCCESS,           /* Some non-resource, non-ccm action is required
                              * of us, eg. ping */
 /* 20 */
     I_ROUTER,               /* Do our job as router and forward this to the
                              * right place */
     I_SHUTDOWN,             /* We are asking to shut down */
     I_STOP,                 /* We have been told to shut down */
     I_TERMINATE,            /* Actually exit */
     I_STARTUP,
     I_PE_SUCCESS,           /* The action completed successfully */
     I_JOIN_OFFER,           /* The DC is offering membership */
     I_JOIN_REQUEST,         /* The client is requesting membership */
     I_JOIN_RESULT,          /* If not the DC: The result of a join request
                              * Else: A client is responding with its local state info */
     I_WAIT_FOR_EVENT,       /* we may be waiting for an async task to "happen"
                              * and until it does, we can't do anything else */
     I_DC_HEARTBEAT,         /* The DC is telling us that it is alive and well */
     I_LRM_EVENT,
 /* 30 */
     I_PENDING,
     I_HALT,

     /*  ------------ Last input found in table is above ----------- */
     I_ILLEGAL               /* This is an illegal value for an FSA input */
                             /* (must be last) */
 };
 # define MAXINPUT I_ILLEGAL
 # define I_MESSAGE I_ROUTER

 /*======================================
  *
  * actions
  *
  * Some of the actions below will always occur together for now, but I can
  * foresee that this may not always be the case.  So I've split them up so
  * that if they ever do need to be called independently in the future, it
  * won't be a problem.
  *
  * For example, separating A_LRM_CONNECT from A_STARTUP might be useful
  * if we ever try to recover from a faulty or disconnected LRM.
  *
  *======================================*/

 /* Don't do anything */
 # define A_NOTHING 0x0000000000000000ULL

 /* -- Startup actions -- */
 /* Hook to perform any actions (other than starting the CIB,
  * connecting to HA or the CCM) that might be needed as part
  * of the startup.
  */
 # define A_STARTUP 0x0000000000000001ULL
 /* Hook to perform any actions that might be needed
  * after startup is successful.
*/ # define A_STARTED 0x0000000000000002ULL /* Connect to Heartbeat */ # define A_HA_CONNECT 0x0000000000000004ULL # define A_HA_DISCONNECT 0x0000000000000008ULL # define A_INTEGRATE_TIMER_START 0x0000000000000010ULL # define A_INTEGRATE_TIMER_STOP 0x0000000000000020ULL # define A_FINALIZE_TIMER_START 0x0000000000000040ULL # define A_FINALIZE_TIMER_STOP 0x0000000000000080ULL /* -- Election actions -- */ # define A_DC_TIMER_START 0x0000000000000100ULL # define A_DC_TIMER_STOP 0x0000000000000200ULL # define A_ELECTION_COUNT 0x0000000000000400ULL # define A_ELECTION_VOTE 0x0000000000000800ULL # define A_ELECTION_START 0x0000000000001000ULL /* -- Message processing -- */ /* Process the queue of requests */ # define A_MSG_PROCESS 0x0000000000002000ULL /* Send the message to the correct recipient */ # define A_MSG_ROUTE 0x0000000000004000ULL /* Send a welcome message to new node(s) */ # define A_DC_JOIN_OFFER_ONE 0x0000000000008000ULL /* -- Server Join protocol actions -- */ /* Send a welcome message to all nodes */ # define A_DC_JOIN_OFFER_ALL 0x0000000000010000ULL /* Process the remote node's ack of our join message */ # define A_DC_JOIN_PROCESS_REQ 0x0000000000020000ULL /* Send out the reults of the Join phase */ # define A_DC_JOIN_FINALIZE 0x0000000000040000ULL /* Send out the reults of the Join phase */ # define A_DC_JOIN_PROCESS_ACK 0x0000000000080000ULL /* -- Client Join protocol actions -- */ # define A_CL_JOIN_QUERY 0x0000000000100000ULL # define A_CL_JOIN_ANNOUNCE 0x0000000000200000ULL /* Request membership to the DC list */ # define A_CL_JOIN_REQUEST 0x0000000000400000ULL /* Did the DC accept or reject the request */ # define A_CL_JOIN_RESULT 0x0000000000800000ULL /* -- Recovery, DC start/stop -- */ /* Something bad happened, try to recover */ # define A_RECOVER 0x0000000001000000ULL /* Hook to perform any actions (apart from starting, the TE, PE * and gathering the latest CIB) that might be necessary before * giving up the responsibilities of being the DC. */ # define A_DC_RELEASE 0x0000000002000000ULL /* */ # define A_DC_RELEASED 0x0000000004000000ULL /* Hook to perform any actions (apart from starting, the TE, PE * and gathering the latest CIB) that might be necessary before * taking over the responsibilities of being the DC. */ # define A_DC_TAKEOVER 0x0000000008000000ULL /* -- Shutdown actions -- */ # define A_SHUTDOWN 0x0000000010000000ULL # define A_STOP 0x0000000020000000ULL # define A_EXIT_0 0x0000000040000000ULL # define A_EXIT_1 0x0000000080000000ULL # define A_SHUTDOWN_REQ 0x0000000100000000ULL # define A_ELECTION_CHECK 0x0000000200000000ULL # define A_DC_JOIN_FINAL 0x0000000400000000ULL /* -- CCM actions -- */ # define A_CCM_CONNECT 0x0000001000000000ULL # define A_CCM_DISCONNECT 0x0000002000000000ULL /* -- CIB actions -- */ # define A_CIB_START 0x0000020000000000ULL # define A_CIB_STOP 0x0000040000000000ULL /* -- Transition Engine actions -- */ /* Attempt to reach the newly calculated cluster state. This is * only called once per transition (except if it is asked to * stop the transition or start a new one). * Once given a cluster state to reach, the TE will determin * tasks that can be performed in parallel, execute them, wait * for replies and then determin the next set until the new * state is reached or no further tasks can be taken. 
*/ # define A_TE_INVOKE 0x0000100000000000ULL # define A_TE_START 0x0000200000000000ULL # define A_TE_STOP 0x0000400000000000ULL # define A_TE_CANCEL 0x0000800000000000ULL # define A_TE_HALT 0x0001000000000000ULL /* -- Policy Engine actions -- */ /* Calculate the next state for the cluster. This is only * invoked once per needed calculation. */ # define A_PE_INVOKE 0x0002000000000000ULL # define A_PE_START 0x0004000000000000ULL # define A_PE_STOP 0x0008000000000000ULL /* -- Misc actions -- */ /* Add a system generate "block" so that resources arent moved * to or are activly moved away from the affected node. This * way we can return quickly even if busy with other things. */ # define A_NODE_BLOCK 0x0010000000000000ULL /* Update our information in the local CIB */ # define A_UPDATE_NODESTATUS 0x0020000000000000ULL # define A_CIB_BUMPGEN 0x0040000000000000ULL # define A_READCONFIG 0x0080000000000000ULL /* -- LRM Actions -- */ /* Connect to the Local Resource Manager */ # define A_LRM_CONNECT 0x0100000000000000ULL /* Disconnect from the Local Resource Manager */ # define A_LRM_DISCONNECT 0x0200000000000000ULL # define A_LRM_INVOKE 0x0400000000000000ULL # define A_LRM_EVENT 0x0800000000000000ULL /* -- Logging actions -- */ # define A_LOG 0x1000000000000000ULL # define A_ERROR 0x2000000000000000ULL # define A_WARN 0x4000000000000000ULL # define O_EXIT (A_SHUTDOWN|A_STOP|A_CCM_DISCONNECT|A_LRM_DISCONNECT|A_HA_DISCONNECT|A_EXIT_0|A_CIB_STOP) # define O_RELEASE (A_DC_TIMER_STOP|A_DC_RELEASE|A_PE_STOP|A_TE_STOP|A_DC_RELEASED) # define O_PE_RESTART (A_PE_START|A_PE_STOP) # define O_TE_RESTART (A_TE_START|A_TE_STOP) # define O_CIB_RESTART (A_CIB_START|A_CIB_STOP) # define O_LRM_RECONNECT (A_LRM_CONNECT|A_LRM_DISCONNECT) # define O_DC_TIMER_RESTART (A_DC_TIMER_STOP|A_DC_TIMER_START) /*====================================== * * "register" contents * * Things we may want to remember regardless of which state we are in. * * These also count as inputs for synthesizing I_* * *======================================*/ # define R_THE_DC 0x00000001ULL /* Are we the DC? */ # define R_STARTING 0x00000002ULL /* Are we starting up? */ # define R_SHUTDOWN 0x00000004ULL /* Are we trying to shut down? */ # define R_STAYDOWN 0x00000008ULL /* Should we restart? */ # define R_JOIN_OK 0x00000010ULL /* Have we completed the join process */ # define R_READ_CONFIG 0x00000040ULL # define R_INVOKE_PE 0x00000080ULL /* Does the PE needed to be invoked at the next appropriate point? */ # define R_CIB_CONNECTED 0x00000100ULL /* Is the CIB connected? */ # define R_PE_CONNECTED 0x00000200ULL /* Is the Policy Engine connected? */ # define R_TE_CONNECTED 0x00000400ULL /* Is the Transition Engine connected? */ # define R_LRM_CONNECTED 0x00000800ULL /* Is the Local Resource Manager connected? */ # define R_CIB_REQUIRED 0x00001000ULL /* Is the CIB required? */ # define R_PE_REQUIRED 0x00002000ULL /* Is the Policy Engine required? */ # define R_TE_REQUIRED 0x00004000ULL /* Is the Transition Engine required? */ # define R_ST_REQUIRED 0x00008000ULL /* Is the Stonith daemon required? */ # define R_CIB_DONE 0x00010000ULL /* Have we calculated the CIB? 
*/ # define R_HAVE_CIB 0x00020000ULL /* Do we have an up-to-date CIB */ # define R_CIB_ASKED 0x00040000ULL /* Have we asked for an up-to-date CIB */ # define R_MEMBERSHIP 0x00100000ULL /* Have we got CCM data yet */ # define R_PEER_DATA 0x00200000ULL /* Have we got T_CL_STATUS data yet */ # define R_HA_DISCONNECTED 0x00400000ULL /* did we sign out of our own accord */ # define R_CCM_DISCONNECTED 0x00800000ULL /* did we sign out of our own accord */ # define R_REQ_PEND 0x01000000ULL /* Are there Requests waiting for processing? */ # define R_PE_PEND 0x02000000ULL /* Has the PE been invoked and we're awaiting a reply? */ # define R_TE_PEND 0x04000000ULL /* Has the TE been invoked and we're awaiting completion? */ # define R_RESP_PEND 0x08000000ULL /* Do we have clients waiting on a response? if so perhaps we shouldnt stop yet */ # define R_IN_TRANSITION 0x10000000ULL /* */ # define R_SENT_RSC_STOP 0x20000000ULL /* Have we sent a stop action to all * resources in preparation for * shutting down */ # define R_IN_RECOVERY 0x80000000ULL /* * Magic RC used within CRMd to indicate direct nacks * (operation is invalid in current state) */ #define CRM_DIRECT_NACK_RC (99) enum crmd_fsa_cause { C_UNKNOWN = 0, C_STARTUP, C_IPC_MESSAGE, C_HA_MESSAGE, C_CCM_CALLBACK, C_CRMD_STATUS_CALLBACK, C_LRM_OP_CALLBACK, C_LRM_MONITOR_CALLBACK, C_TIMER_POPPED, C_SHUTDOWN, C_HEARTBEAT_FAILED, C_SUBSYSTEM_CONNECT, C_HA_DISCONNECT, C_FSA_INTERNAL, C_ILLEGAL }; extern const char *fsa_input2string(enum crmd_fsa_input input); extern const char *fsa_state2string(enum crmd_fsa_state state); extern const char *fsa_cause2string(enum crmd_fsa_cause cause); extern const char *fsa_action2string(long long action); #endif diff --git a/doc/crm-flowchart.fig b/doc/crm-flowchart.fig index 73265bea9c..6f778cb646 100644 --- a/doc/crm-flowchart.fig +++ b/doc/crm-flowchart.fig @@ -1,335 +1,335 @@ #FIG 3.2 Landscape Center Metric A4 59.40 Single -2 1200 2 6 1620 1665 2970 2430 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 2925 2385 2925 1890 1845 1890 1845 2385 2925 2385 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 2835 2295 1755 2295 1755 1800 2835 1800 2835 2295 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 2745 2205 1665 2205 1665 1710 2745 1710 2745 2205 4 1 0 50 0 14 14 0.0000 4 120 360 2340 2115 RAs\001 -6 6 6255 2520 7785 3375 6 6345 2610 7695 3285 4 1 0 50 0 14 14 0.0000 4 135 840 7020 2745 Cluster\001 4 1 0 50 0 14 14 0.0000 4 135 1320 7020 3000 Information\001 4 1 0 50 0 14 14 0.0000 4 120 480 7020 3255 Base\001 -6 6 6255 2520 7785 3375 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 7740 3330 7740 2565 6300 2565 6300 3330 7740 3330 -6 -6 6 7875 2520 8820 3150 6 7875 2520 8820 3150 2 4 0 2 0 7 50 0 -1 0.000 0 0 12 0 0 5 8773 3102 8773 2568 7922 2568 7922 3102 8773 3102 -6 4 1 0 50 0 14 14 0.0000 4 180 720 8348 2762 Policy\001 4 1 0 50 0 14 14 0.0000 4 180 720 8348 3037 Engine\001 -6 6 8910 2520 10665 2925 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 10620 2880 10620 2565 8955 2565 8955 2880 10620 2880 4 1 0 50 0 14 14 0.0000 4 135 1440 9765 2790 Transitioner\001 -6 6 6480 1620 10305 2025 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 10260 1980 10260 1665 6525 1665 6525 1980 10260 1980 4 1 0 50 0 14 16 0.0000 4 195 3600 8415 1890 Cluster Resource Manager\001 -6 6 7875 4725 9450 5130 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 9405 5085 9405 4770 7920 4770 7920 5085 9405 5085 4 1 0 50 0 14 16 0.0000 4 150 1350 8685 4995 heartbeat\001 -6 6 8730 4095 9990 4455 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 9945 4410 9945 4140 8775 4140 8775 4410 9945 4410 4 1 0 50 0 14 14 0.0000 4 180 
1080 9360 4320 Messaging\001 -6 6 7200 3825 8640 4680 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 8595 4635 8595 3870 7245 3870 7245 4635 8595 4635 4 1 0 50 0 14 14 0.0000 4 120 1080 7920 4050 Concensus\001 4 1 0 50 0 14 14 0.0000 4 135 840 7920 4305 Cluster\001 4 1 0 50 0 14 14 0.0000 4 180 1200 7920 4560 Membership\001 -6 6 12465 1575 13815 2340 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 13770 2295 13770 1800 12690 1800 12690 2295 13770 2295 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 13680 2205 12600 2205 12600 1710 13680 1710 13680 2205 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 13590 2115 12510 2115 12510 1620 13590 1620 13590 2115 4 1 0 50 0 14 14 0.0000 4 120 360 13185 2025 RAs\001 -6 6 17325 1530 21150 1935 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 21105 1890 21105 1575 17370 1575 17370 1890 21105 1890 4 1 0 50 0 14 16 0.0000 4 195 3600 19260 1800 Cluster Resource Manager\001 -6 6 18720 4635 20295 5040 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 20250 4995 20250 4680 18765 4680 18765 4995 20250 4995 4 1 0 50 0 14 16 0.0000 4 150 1350 19530 4905 heartbeat\001 -6 6 19575 4005 20835 4365 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 20790 4320 20790 4050 19620 4050 19620 4320 20790 4320 4 1 0 50 0 14 14 0.0000 4 180 1080 20205 4230 Messaging\001 -6 6 18045 3735 19485 4590 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 19440 4545 19440 3780 18090 3780 18090 4545 19440 4545 4 1 0 50 0 14 14 0.0000 4 120 1080 18765 3960 Concensus\001 4 1 0 50 0 14 14 0.0000 4 135 840 18765 4215 Cluster\001 4 1 0 50 0 14 14 0.0000 4 180 1200 18765 4470 Membership\001 -6 6 18315 2115 19845 2970 6 18405 2205 19755 2880 4 1 0 50 0 14 14 0.0000 4 135 840 19080 2340 Cluster\001 4 1 0 50 0 14 14 0.0000 4 135 1320 19080 2595 Information\001 4 1 0 50 0 14 14 0.0000 4 120 480 19080 2850 Base\001 -6 6 18315 2115 19845 2970 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 19800 2925 19800 2160 18360 2160 18360 2925 19800 2925 -6 -6 6 6750 8370 8010 9090 4 1 0 50 0 14 14 0.0000 4 120 1080 7380 8505 Concensus\001 4 1 0 50 0 14 14 0.0000 4 135 840 7380 8760 Cluster\001 4 1 0 50 0 14 14 0.0000 4 180 1200 7380 9015 Membership\001 -6 6 6300 9945 7830 10800 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 6345 9990 6345 10755 7785 10755 7785 9990 6345 9990 -6 6 6390 10035 7740 10710 4 1 0 50 0 14 14 0.0000 4 135 840 7065 10170 Cluster\001 4 1 0 50 0 14 14 0.0000 4 135 1320 7065 10425 Information\001 4 1 0 50 0 14 14 0.0000 4 120 480 7065 10680 Base\001 -6 6 3240 1755 4905 2250 2 4 0 2 0 7 50 0 -1 6.000 0 0 15 0 0 5 4859 2222 4859 1783 3286 1783 3286 2222 4859 2222 4 1 0 50 0 14 14 0.0000 4 135 1320 4095 1980 Executioner\001 4 1 0 50 0 12 14 0.0000 4 165 1080 4095 2160 (STONITH)\001 -6 6 14085 1710 15750 2205 2 4 0 2 0 7 50 0 -1 6.000 0 0 15 0 0 5 15704 2177 15704 1738 14131 1738 14131 2177 15704 2177 4 1 0 50 0 14 14 0.0000 4 135 1320 14940 1935 Executioner\001 4 1 0 50 0 12 14 0.0000 4 165 1080 14940 2115 (STONITH)\001 -6 6 10485 10710 12150 11205 2 4 0 2 0 7 50 0 -1 6.000 0 0 15 0 0 5 12104 11177 12104 10738 10531 10738 10531 11177 12104 11177 4 1 0 50 0 14 14 0.0000 4 135 1320 11340 10935 Executioner\001 4 1 0 50 0 12 14 0.0000 4 165 1080 11340 11115 (STONITH)\001 -6 6 15300 4320 17415 4995 2 2 3 2 1 7 50 0 -1 6.000 0 0 -1 0 0 5 15345 4365 17370 4365 17370 4950 15345 4950 15345 4365 4 1 1 50 0 14 16 0.0000 4 150 1950 16380 4590 Adminstrative\001 4 1 1 50 0 14 16 0.0000 4 180 1050 16380 4875 request\001 -6 2 1 0 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 1350 6300 21600 6300 2 1 1 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 1845 6795 21780 6795 2 1 1 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 8775 6795 8775 
5085 2 1 1 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 19755 4995 19755 6795 2 1 0 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 19350 6255 19350 4995 2 1 0 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 8415 6300 8415 5085 2 1 1 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 6750 7920 6750 6795 2 1 0 4 0 7 50 0 -1 10.000 0 0 -1 0 0 2 6390 7920 6390 6300 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 5040 3195 5040 2880 1575 2880 1575 3195 5040 3195 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 2430 2880 2430 2385 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 5130 1620 5130 3285 1485 3285 1485 1620 5130 1620 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 10710 3465 10710 1530 6165 1530 6165 3465 10710 3465 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 10035 5175 10035 3780 7155 3780 7155 5175 10035 5175 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 15885 3105 15885 2790 12420 2790 12420 3105 15885 3105 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 13275 2790 13275 2295 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 14850 2790 14850 2160 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 15975 1530 15975 3195 12330 3195 12330 1530 15975 1530 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 20880 5085 20880 3690 18000 3690 18000 5085 20880 5085 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 21195 3015 21195 1485 17280 1485 17280 3015 21195 3015 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 21375 5220 21375 900 12150 900 12150 5220 21375 5220 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 10260 9810 10260 10125 13725 10125 13725 9810 10260 9810 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 12870 10125 12870 10620 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 10170 11385 10170 9720 13815 9720 13815 11385 10170 11385 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 5265 7830 5265 9225 8145 9225 8145 7830 5265 7830 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 5355 8595 5355 8865 6525 8865 6525 8595 5355 8595 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 5895 7920 5895 8235 7380 8235 7380 7920 5895 7920 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 6705 8370 6705 9135 8055 9135 8055 8370 6705 8370 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 4725 7650 4725 11970 13950 11970 13950 7650 4725 7650 2 4 0 2 0 7 50 0 -1 0.000 0 0 11 0 0 5 5040 11025 5040 11340 8775 11340 8775 11025 5040 11025 2 4 2 3 0 7 50 0 -1 2.000 0 0 11 0 0 5 4950 11430 4950 9900 8865 9900 8865 11430 4950 11430 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 11295 10125 11295 10755 2 4 0 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 12375 10620 12375 11115 13455 11115 13455 10620 12375 10620 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 12465 10710 13545 10710 13545 11205 12465 11205 12465 10710 2 4 1 2 0 7 50 0 -1 6.000 0 0 11 0 0 5 12555 10800 13635 10800 13635 11295 12555 11295 12555 10800 2 1 1 2 0 7 50 0 -1 6.000 0 0 11 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 4095 2925 4095 2295 2 2 3 2 1 7 50 0 -1 6.000 0 0 -1 0 0 5 12735 4185 14985 4185 14985 4815 12735 4815 12735 4185 2 4 2 3 2 7 50 0 -1 2.000 0 0 11 0 0 5 10935 5445 1305 5445 1305 855 10935 855 10935 5445 3 2 1 2 4 7 50 0 -1 6.000 0 1 1 3 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 7740 3150 14355 3780 18360 2790 0.000 -1.000 0.000 3 2 1 2 4 7 50 0 -1 6.000 0 1 0 2 1 1 1.00 120.00 150.00 10620 2745 12420 2970 0.000 0.000 3 2 1 2 4 7 50 0 -1 6.000 0 1 0 2 1 1 1.00 120.00 150.00 10125 2880 11250 9810 0.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 7245 4365 5535 3375 6525 1845 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 1 2 1 1 1.00 
120.00 150.00 1 1 1.00 120.00 150.00 5040 3060 6300 2925 0.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 7245 4275 6930 4005 6930 3330 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 6975 2565 7740 2340 8325 2565 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 8325 2565 9000 2340 9765 2565 0.000 -1.000 0.000 3 2 1 2 4 7 50 0 -1 6.000 0 1 0 4 1 1 1.00 120.00 150.00 10035 2565 9450 2115 6480 2205 4905 2880 0.000 -1.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 18090 4275 16380 3285 17370 1755 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 15885 2970 18315 2565 0.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 18090 4185 17820 3645 18405 2925 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 8055 8640 9765 9630 8775 11160 0.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 1 2 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 10260 9945 7830 10350 0.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 8055 8730 8370 9000 7740 9990 0.000 -1.000 0.000 3 2 1 2 4 7 50 0 -1 6.000 0 1 0 5 1 1 1.00 120.00 150.00 6345 10575 3330 9765 2610 6165 3780 4095 6300 3060 0.000 -1.000 -1.000 -1.000 0.000 3 2 1 2 4 7 50 0 -1 6.000 0 0 1 5 1 1 1.00 120.00 150.00 6300 10305 4095 8460 3420 5850 4635 3825 6300 3195 0.000 -1.000 -1.000 -1.000 0.000 3 2 1 2 0 7 50 0 -1 6.000 0 1 0 3 1 1 1.00 120.00 150.00 8055 2610 7785 2475 7380 2565 0.000 -1.000 0.000 3 2 3 2 1 7 50 0 -1 6.000 0 1 0 4 1 1 1.00 120.00 150.00 16650 4365 17730 1755 13050 1350 10260 1800 0.000 -1.000 -1.000 0.000 3 2 3 2 1 7 50 0 -1 6.000 0 1 1 3 1 1 1.00 120.00 150.00 1 1 1.00 120.00 150.00 14940 4185 17730 1800 18810 2160 0.000 -1.000 0.000 4 1 0 50 0 14 16 0.0000 4 195 3300 3330 3105 Local Resource Manager\001 4 1 0 50 0 14 14 0.0000 4 120 720 5625 4050 Events\001 -4 1 0 50 0 14 18 0.0000 4 240 4455 5850 1125 Designated Coordinator Node\001 +4 1 0 50 0 14 18 0.0000 4 240 4455 5850 1125 Designated Controller Node\001 4 1 0 50 0 14 16 0.0000 4 195 3300 14175 3015 Local Resource Manager\001 4 1 0 50 0 14 14 0.0000 4 120 720 16470 3960 Events\001 4 1 0 50 0 14 18 0.0000 4 240 5280 16650 1215 Any client node in the partition\001 4 1 0 50 0 14 14 0.0000 4 120 720 9675 8955 Events\001 4 1 0 50 0 14 16 0.0000 4 150 1350 6660 8145 heartbeat\001 4 1 0 50 0 14 14 0.0000 4 180 1080 5940 8775 Messaging\001 4 1 0 50 0 14 16 0.0000 4 195 3600 6885 11250 Cluster Resource Manager\001 4 1 0 50 0 14 16 0.0000 4 195 3300 12015 10035 Local Resource Manager\001 4 1 0 50 0 14 14 0.0000 4 120 360 13050 11025 RAs\001 4 1 0 50 0 14 18 0.0000 4 240 5280 9495 11700 Any client node in the partition\001 4 2 4 50 0 12 14 0.0000 4 120 720 2970 4920 status\001 4 2 4 50 0 12 14 0.0000 4 120 840 3105 4680 Gathers\001 4 0 4 50 0 12 14 0.0000 4 135 3000 10665 5940 Instructs and coordinates\001 4 0 4 50 0 12 14 0.0000 4 165 1200 3825 4905 Replicates\001 4 0 4 50 0 12 14 0.0000 4 165 1560 3735 5130 configuration\001 4 1 1 50 0 14 16 0.0000 4 150 1950 13860 4410 Adminstrative\001 4 1 1 50 0 14 16 0.0000 4 195 2100 13860 4695 status inquiry\001 diff --git a/doc/crm.txt b/doc/crm.txt deleted file mode 100644 index f003d78c43..0000000000 --- a/doc/crm.txt +++ /dev/null @@ -1,852 +0,0 @@ -DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! - -NOTICE: Some ideas in this paper aren't yet well sorted. Some ideas aren't -complete. 
Some phrasings I'm myself not happy with yet. Some ideas need -further explanation. Most of the ideas presented are not final yet. It is -mostly a braindump. - -And did I say yet that this is still a DRAFT!!!!!! ? - - -Title: Design of a Cluster Resource Manager -Revision: $Id: crm.txt,v 1.2 2003/12/01 14:10:14 lars Exp $ -Author: Lars Marowsky-Brée -Acknowledgements: Andrew Beekhof - Luis Claudio R. Goncalves - Fábio Olivé Leite - Alan Robertson - - -XX. Global TODO for this document - -In this section I keep track of tasks which I still want to perform on this -document; probably of little interest to anyone else, and this section should -be gone in the final version ;-) - -- Break out major parts (CIB, Policy Engine etc) into their own documents. -- Add references to sub-features used and do not replicate too much - information here. References should take the form of the feature id from the - feature list. -- Ensure unified use of words and terms when referring to the - components. -- Convert to docbook (not pressing) - - -0. Abstract - -This paper outlines the design of a clustered recovery/resource manager -to be running on top of and enhancing the Open Clustering Framework -(OCF) infrastructure provided by heartbeat. - -The goal is to allow flexible resource allocation and globally ordered -actions in a cluster of N nodes and dynamic reallocation of resources in -case of failures (ie, "fail-over") or administrative changes to the -cluster ("switch-over"). - -This new Cluster Resource Manager is intended to be a replacement for the -currently used "resource manager" of heartbeat. - - -1. Introduction - -1.1. Requirements overview - -The CRM needs to provide the following functionality: - -- Secure. -- Simple. - - heartbeat already provides these two properties; by no means may the - new resource manager be more insecure. For sanity and stability - both - of the developers, but in particular the users -, complexity needs to - be kept at a minimum (but no simpler). - -- Support for more than 2 nodes. - -- Fault tolerant. - (Failures can be node, network or resource failures.) - -- Ability to deal with complex policies for resource allocation and - dependencies. Examples: - - Support for globally ordered starting order, - - Support for feedback in the allocation process to better support - replicated resources, - - Negated dependencies etc - -- Ability to make administrative requests for resource migration, adding new - resources, removing resources et cetera while online. - -- Extensible framework. - - -1.2. Scenario description - -The design outlined in this document is aimed at a cluster with the following -properties; some of them will be further specified later. - -- The cluster provides a Concensus Membership Layer, as outlined in the OCF - documents and provided by the CCM implementation by Ram Pai. - - (This provides all nodes in a partition with a common and agreed upon - view of cluster membership, which tries to compute the largest set of - fully connected nodes.) - - - It is possible for the nodes in a given partition to order all nodes - in the partition in the same way, without the need for a distributed - decision. This can be either achieved by having the membership be - returned in the same order on all nodes, or by attaching a - distinguishing attribute to each node. - -- The cluster provides a communication mechanism for unicast as well as - broadcast messaging. - - Messages don't necessarily satisfy any special ordering, but they are - atomic. 
- -- Byzantine failures are rare and self-contained. Stable storage is stable and - not otherwise corrupted; network packets aren't spuriously generated etc. - - These errors are self-contained to the respective components which take - appropriate precautions to prevent them from propagating upwards. - (Checksums, authentication, error recovery or fail-fast) - -- Time is synchronized cluster-wide. Even in the face of a network - partition, an upper bound for time diversion can be safely assumed. - This can be virtually guaranteed by running NTP across all nodes in - the cluster. - -- An IO fencing mechanism is available in the cluster. - - The fencing mechanism provides definitive feedback on whether a given - fencing request succeeded or not. - -- A node knows which resources itself currently holds and their state - (running / failed / stopped); it can provide this information if - queried and will inform the CRM of state changes. - - (This part will be provided by the Local Resource Manager.) - - - -2.1. Basic algorithm - -The basic algorithm can be summarized as follows: - -a) Every partition elects a "Designated Coordinator"; this node will -activate special logic to coordinate the recovery and administrative -actions on all nodes in the cluster. - -a.1) The DC has the full state of the cluster available (or is able to -retrieve it), as well as an uptodate copy of the administrative -policies, information about fenced nodes etc; this shall further be -referred to as "Cluster Information Base". - -b) Whenever a cluster event occurs, be it an adminstrative request, a node -failure by membership services or a resource failure reported by a -participating LRM, it is forwarded to the "Designated Coordinator". - -b1) For administrative requests, the DC arbitates whether or not they will be -accepted into the CIB and serializes these updates. ie, policy changes which -cannot be satisfied or would lead to an inconsistent state of the cluster will -be rejected (unless explicitly overridden). - -c) It then computes via the Policy Engine: -c.1) the new CIB -c.2) The Transition Graph, an dependency-ordered graph of the actions - necessary to go from the current cluster state as close as possible to - the cluster state described by the CIB. - -d) Leading the transition to the target state: - -d1) Replicating the new CIB to all clients. - -d2) Executing each step of the transition graph in dependency order. - (Potentially parallelized.) - -e) Exception handling if _any_ event or failure occurs: - -e1) The algorithm is aborted cleanly; pending operations are allowed to -finish, but no new commands are issued to clients (in particular during -phase d2) - -e2) The algorithm is invoked again from scratch. - -(It is obvious that there is room for optimization here by only -recomputing smaller parts of the dependency tree or not broadcasting the -full CIB every time, in particular if the DC has not been re-elected. -However, these complicate the implementation and are not necessary for -the first phase.) - - -2.2) Feature analysis - -This meets the requirements listed as follows: - -- It is reasonably simple to have a policy engine capable of dealing - with more than two nodes as distributed decisions are avoided as far - as possible; only a single node leads the transition and coordinates - participating nodes. - - Local nodes are only required to know their own state; whenever the - coordinator fails, the necessary state can simply be reconstructed by the - DC. 
- -- An event can be anything from a failed node or a request to add a new policy - rule (ie, a new resource, a change to an allocation policy etc); this - satisfies the requirement to deal with adminstrative requests at runtime. - -- Support for new kinds of policies, types of resources, ... can be added via - the policy engine. - -- The modular design keeps complexity in any single component down. - - -2.3. Stability of the algorithm - -This algorithm will eventually converge if no new events occur for a -sufficiently long period of time. - -TODO: Add a discussion on factors affecting stability or kill the section -entirely. - - -2.4. Components - -This approach neatly divides the task into various components: (see -crm-flowchart.fig!) - -- Cluster infrastructure - - heartbeat - - Concensus Cluster Membership - - Messaging - -- Cluster Resource Manager - - Policy Engine - - Cluster Information Base - - Transitioner - -- Local Resource Manager - - Executioner - - Resource Agents - - -3. Local Resource Manager - -Note: This section only documents the requirements on the LRM from the point -of view of the cluster-wide resource management. For a more detailled -explanation, see the documentation by lclaudio. - -[ TODO: Add reference to lclaudio's document as soon as available ] - -This component knows which resources the node currently holds, their status -(running, running/failed, stopping, stopped, stopped/failed, etc) and can -provide this information to the CRM. It can start, restart and stop resources -on demand. It can provide the CRM with a list of supported resource types. - -It will initiate any recovery action which is applicable and limitted to the -local node and escalate all other events to the CRM. - -Any correct node in the cluster is running this service. Any node which fails -to run this service will be evicted and fenced. - -It has access to the CIB for the resource parameters when starting a resource. - -NOTE: Because it might be necessary for the success of the monitoring -operation that it is invoked with the same instance parameters as the resource -was started with, it needs to keep a copy of that data, because the CIB might -change at runtime. - - -4. Cluster Resource Manager - -The CRM coordinates all non-local interactions in the cluster. It interacts -with: - - - The membership layer - - The local resource managers - - non-local CRMs - - Administrative commands - - Fencing functionality - -Only one node is running the "Designated Coordinator" CRM at any given time in -any given partition. All other nodes forward their input to this node only, -and will relay its decisions to the local LRM. - -The coordinator is a "primus inter pares"; in theory, any CRM can act in this -fashion, but the arbitation algorithm will distinguish a designated node. - -4.1. How to deal with failure of the designated CRM - -There are two major groups of failures here; if the entire node running the -CRM fails, the underlaying membership algorithm shall take care of this. - -If the CRM logic fails, this shall be detected by internal consistency checks -and local heartbeating to apphbd must stop immediately, effectively committing -'suicide' and providing a failfast mechanism. However, for now we will -assume that the CRM does not fail. Coping with internal failures -internally is always difficult. - -This can later be enhanced by providing 'peer monitoring' of the CRMs among -eachother and initiating fencing if necessary; however this has many pitfalls -and is not inherently better. 
- - -4.2. Election algorithm for the DC - -The election algorithm exploits the fact that there is a global ordering of -the nodes in a given partition; it will simply select the first one. The node -will know that it has been "distinguished" and take control. - -As the first action in the algorithm is to collect the status from all -nodes, this will inform any node about the newly elected leader. Should -any of them have been a DC until the new membership (in the case of -cluster partitionings healing), he will cease operation and handover -control in an orderly fashion. - -4.3. Consistency audits - -The DC can perform a variety of consistency checks: - -- Exclusive allocation of resources - -- All nodes have the necessary Resource Agents for the resources which might - be running on them - -- Adminstrative requests (ie, rule additions) can be rejected if it would - prevent the policy engine from computing a valid target state - -Implementing any of the audits is optional. - -4.4. Communication between the CRM and LRM - -- Every resource in the system is clearly identified in the CIB via a "UUID". - This is the key passed around between CRM/LRM. - - (This avoids issues like with FailSafe, which used a "resource name / type" - combination as the key; however, this made it very difficult to have, for - example, two filesystems mounted at /usr/sap - production system and the - test system - , because that is a key-clash. It is however a perfectly - legal combination, as long as the resources are not activated on the same - node; which can however be avoided by an appropriate negative dependency. - Combined with "resource priority", the higher priority production system - will push off the lower-priority test system and operation will continue) - -TODO: Protocol and channel need to be decided; based on the heartbeat IPC -layer, wrapper library so they don't have to know all the gory details. AI -lclaudio - - -4.4. Executing the Transition Graph - -It is also the task of the DC to execute the computed dependency graph. - -The graph will be traversed and evaluated in dependency-compliant order. - -Every node in the graph corresponds to a single action; the links between them -correspond to the dependencies. - -Only nodes for which all dependencies have been satisfied will be executed; -the CRM is allowed to parallelize these however. - -If a task cannot be successfully evaluated and has to be ultimately considered -failed, this will be treated as a hard barrier - all currently pending tasks -will run to completion, appropriate constraints added to the CIB (ie, -"resource X cannot start on node N1", for example) and the graph execution -will be aborted with a failure; this will escalate the recovery back to the -higher level again. (ie, trigger a rerun of the recovery algorithm) - -TODO: This needs more thought. Especially the failure paths could simply be -treated just as normal nodes in the graph, simplifying the logic here; and for -some tasks where re-running the recovery algorithm is pointless, the Policy -Engine could precompute alternate actions (like marking the resource -hierarchy failed from that node upwards etc). Could be a possible future -extension not implemented in v1.0. - -TODO: A pure dependency graph as outlined doesn't easily allow to express "OR" -conditions, but only "AND". 
"OR" conditions might be helpful if a node could -be fenced via multiple mechanisms, and if each one should be tried in order as -a fallback; this could be implemented by allowing a node in the graph to be a -_list_ of actions to be tried in order. - - -5. Cluster Information Base - -The CIB is also running on every node in the cluster. In essence, it provides -a distributed database with weak transactional semantics, exploiting the fact -that all updates are serialized by the DC and that each node itself knows -its own latest status. - -5.1. Contents of the CIB - -The CIB is divided into two major parts; the "configuration" data and the -"runtime" data. - -a) The configuration part of the database is setup by the -administrator. - -The configuration present on any given node is appropriately versioned; a -combination of timestamps and generation counter seems sensible. Thus the most -recent version available in a partition can always be clearly identified and -selected. - -- Configured resources - - Resource identifier - - Resource instance parameters - -- Special node attributes - -- Administrative policies - - Resource placement constraints - - Resource dependencies - -b) Runtime information - -This includes: - -- Information about the currently participating nodes in the partition. - - Resource Agents present on each node - - Resources active on the node and their status - - Operational status of the node itself - -- Fencing data - - - results: the timestamped result of a fencing request. - - metadata: each nodes contributes the list of fencing devices available - to it. - - TODO: Maybe this is part of the "static" configuration data? - -- Dynamic administrative policies and constraints: - - Temporary constraints to deny placing a resource on a node, ie in response to - failures etc. - - Resource migration requests by the administrator. - - (These might be ignored by the policy engine if otherwise a consistent - state cannot be computed or because they have a limitted life time, for - example "until the node has booted again") - -The runtime data can be constructed by merging all available data in the -partition by exploiting that every node holds authoritive data on itself. - - -5.2. Process of generating an uptodate CIB - -If necessary, the DC will retrieve the CIB from each node and compute the -uptodate CIB from this data and broadcast the result to all nodes. - -The algorithm for merging the CIBs is rather straightforward: - -- Select the most recent configuration from all nodes. - - Note that this relies on the fact that an adminstrator has the wits to not - force incompatible changes to separate cluster partitions; if he does, some - configuration changes might be lost (or rather, overridden by the others), - but that is a classical PEBKAC anyway. - - More complex merging algorithms could always be devised later for this step; - the straightforward approach is to complain loudly about this problem as it - can be clearly detected, and to also try to reject configuration changes in - a degraded cluster unless forced. - - If any node in the cluster has a valid copy of the configuration, the - cluster will be able to proceed. The case where this is not true is very - unlikely; it requires at least a double failure to occur. The worstcase in - this scenario is that some configuration changes might be reverted. - -- Merge the runtime data. - - A node present in the partition is assumed to always have authoritive data - on itself, its current status, resources it helds etc. 
This overrides any - other data about it from other nodes. - - ("Normative power of facts" as opposed to rumours) - - - For nodes not present in the partition, their latest status can be - identified from the partial information present on other nodes because it - is timestamped. ie, it is possible to say whether they were cleanly - shutdown, whether they have already been fenced cleanly or whether a - fencing attempt has failed etc. - -- Update timestamps and generation counters as appropriate. - -- Commit locally and broadcast. - - -5.3. How are updates to the CIB handled? - -Any updates to the configuration will be serialized by the DC; they will be -verified, committed to its own CIB and also broadcast to all nodes in the -partition as an incremental update. - -Any updates pertaining to the runtime portion of the CIB can simply be -broadcast to all nodes; the DC will receive them too, and all other nodes can -save them for further reference so they will be available if the (new) DC -needs to compute a new CIB. - - -6. Policy Engine - -6.1. Functionality provided - -Even if only simple dependencies are supported at first (of the first two -types, probably) which is straightforward to implement, the model can be -easily extended. - -6.1.1. Required constraints - -- To be started on the same node as resX after resX -- To only be started on a node from the set {A, B, C} - -6.1.2. Future extensions - -- To be started after resX, but without tieing them both to a given node, ie - providing globally order -- To NOT be run on the same node as resX (or generically, negated constraints) -- To only be placed on a node with the attribute FOO (for example, connected - to the FC-AL rack, or with at least XX mb of memory and others) - -6.2. Thoughts about algorithm implementation - -The process of arriving at a target state / transition graph for the cluster -from the set of rules and the current cluster state (basically, all -information in the CIB is available) can be implemented by a "constraint -solver". - -- Analyse dependency graph in the configuration; -- Compute eligible nodes for each subtree (intersection of nodes - configured for resources within a dependency subtree) -- Order eligible nodes by priority, stickiness etc and select target node - - - All other eligible nodes either have to be part of the partition or must - be fenced successfully. - -... - -TODO: UNFINISHED - -TODO: Alternate implementation: Steal Finite Domain solver from GNU Prolog; - steal constraint solver from - http://www.cs.washington.edu/research/constraints/cassowary/ etc - - would allow policies to be expressed as intuitive constraints. Rather - cool indeed. Evaluation. - - -6.4. Design considerations - -The following part should answer the most common questions regarding this -approach: - -6.4.1. Whether to deal with resource groups or "only" resources and -dependencies? - -At the first glance, resource groups a la FailSafe are a welcome -simplification. They can be treated as atomic units, resource allocation -policy seems to become simpler, the CRM wouldn't even have to know about -resources but simply allocate/reallocate resource groups. It seems natural to -group resources in this fashion; all related resources are put into a resource -group and done. - -However, it is also limitting. 
Resource groups become awkward if they stop -being a logical mapping between resource and provided service; for example, -the natural answer is to throw everything which should be running on a single -node into a single resource group. If this becomes unnecessary later, resource -group splitting is difficult. (Same for merging) - -Dependencies which spawn resource groups are also commonly requested; ie, a -resource group should not run on the same node as a given other one. They are -also cumbersome if one has to have a resource group for _everything_, even if -it might only be a single resource or two. Then the abstraction gains little. - -Some decisions also spawn the boundary between single resources and resource -groups, in which case the CRM has to know about both again. For example, the -allocation policy of a resource group must be in line with how each resource -itself can be allocated. - -In short, resource groups seem to fall short of easing bigger clusters and -also lack some flexibility which can only be added at the expense of -complexity. - -It seems more natural to me (by now) to only deal with resources and -dependencies between them and to external constraints. I believe that in the -end, both are just mindsets and that none is inherently harder to understand -than the other, just one which is cleaner to implement. - -A very important point to keep in mind is that this document reflects the -_internal_ representation of the configuration. An appropriate configuration -wizard could (reasonably easily, although with limited features) present the -user with a 'resource group'-style frontend. - - -6.4.2. Why a transition graph - -The dependency information in the graph will allow operations to be -parallelized to speed up recovery. (All operations for which all requirements -are satisfied can be carried out in parallel.) - -The graph is "global" (by virtue of being centralized at the DC) and thus even -allows the possibility of "barriers"; ie, starting a resource only if a -resource has been started on another node. - - -6.4.3. Who orders resource operations on a single node - -The LRM does not have enough information to order resource operations (start, -stop etc) even for the local node, because it might - in theory - depend on -non-local actions (resource operations, fencing etc). - -Only the transition graph as constructed by the DC contains such information. - -A possible optimisation here is that the DC transmits all operations which do -not depend on external events as a single block to a given node, which can -then proceed accordingly. Initially, just issueing the actions one by one -shall suffice. - - -7. Executioner - -7.1. Integration - -The Executioner is responsible for the fencing of nodes on request. It is a -local component on each node and knows which fencing devices are available to -it. - -This information is contributed to the CIB. - -On request of the CRM (relayed from the DC), it will carry out a fencing -attempt and report the status back. The CRM might of course try to fence a -given node from multiple nodes until one succeeds. - -See "executioner.txt" for details. - - -7.2. Fencing algorithm - -At the start of the recovery process, the CRM shall verify the list of devices -reachable by each node; this shall be done as part of querying the current CIB -from them. - -It will then retrieve the list of nodes fenceable via each device. If this -list has already been retrieved in the past, it may be reused from cache -appropriately. 
(Reloaded only when new nodes get added to the cluster, or -something) - -For nodes which need to be fenced, appropriate dependencies will need to be -added to the transition graph. These shall ensure that any given device is not -concurrently accessed, and that fencing requests are appropriately retried on -other devices if available. - -THOUGHT: Adding "meatware" as a ultima ratio fallback fencing would allow -disaster resilient setups; if the one site failed completely, the -administrator could flip the switch and the transition graph would proceed. -This would neatly address this part of the requirements list. - -7.3. Un-fencing - -Aborting an on-going fencing / STONITH operation is not supported. - -7.3.1. After a successful fencing - -Okay, so a node has been fenced. How does it properly rejoin the cluster? - -For example, a node n1 can notice an unclean shutdown at startup, and must -assume it has been STONITHed (in particular, if uptime very low ;-). - -When can it clear this flag and initiate recovery / startup actions on its -own? - -The answer seems to relate to the "thought" under 6.2.; when is recovery -action initiated for a resource group anyway. An additional option "Proceed if -more than 50% of the nodes for the resource group are in our partition; if it -is a tie, do not proceed to initiate recovery if the 'unclean' flag is set on -any node." - -(This moves the unclean flag into the runtime portion of the CIB) - -The flag can be cleared once majority for all resources has been reached -again. - -7.3.2. Rejoining node while fencing requests are still pending - -Answer: Running fencing requests can not be aborted. In this rather -unfortunate case, the node will most likely disappear soon because of the -fencing request. Provisions might be made to consider this node as "down" -already, to prevent resource pingpong. - -TODO: Ok, so n1 tried to fence n2 and vice-versa because of a partition, they -failed via the STONITH device and have fallen back to meatware. The partition -resolves. Shouldn't the system try to abort the meatware request? Maybe a -special case for "meatware" requests? Or should meatware just notice this? - - -7.3.3. Rejoining node for which fencing has ultimately failed in the past - -Two options are basically possible; if a node has rejoined the cluster, the -"fencing failed" flag could be cleared for it, and the cluster could proceed -as normal. (This seems sensible) - -The other option would be to request the node to commit suicide. Again, this -is difficult in the case of a resolving cluster partition; both sides would -commit suicide. This doesn't seem sensible. - -If the failure was due to a local hang - ie, scheduler bug, power management -running wild etc - this shall be considered outside the scope of this -discussion. The local health monitoring on such a node shall be responsible -for containing such failures and reacting to it appropriately. - - -8. Interaction with quorum - -TODO: Highly unfinished! - -Summary: CRM does not need quorum. However, the CRM could easily compute -'quorum' as just another resource. - -"Quorum" is in fact not necessary for this design. It is implicit in the -policy engine / CRM which will only bring a resources for which all -dependencies - including fencing - have been satisfied. - -This in fact is quorum with slightly finer granularity. It allows the cluster -to proceed in a scenario like: - -- {a,b,c,d} form a cluster. {a,b} share resources (called R1) and {c,d} share - resources (called R2). 
If the cluster is partitioned int {a,b} and {c,d}, no - fencing has to carried out; cluster operation can proceed as normal. - -Of course, as soon as a global resource spawning {a,b,c,d} is added, this in -fact translates to "global quorum". - -This makes me think that if global quorum is in fact required it can be best -expressed in this design by mapping it to such a global ('configured on all -nodes') resource and communicating to the partition that it has quorum if it -was able to recover this resource (or failed to recover it, that is). - -However, it also allows for "sub-quorum"; ie, given the example of an -application requiring quorum to operate, it will usually only be interested in -quorum of the _nodes eligible for the related resources_. So quorum could -potentially be different if reported to different clients... - - -8.1. Issues wrt quorum - -TODO: Certainly ;-) - - -10. Integration with other projects - -10.1. Integration with heartbeat - -TODO: Elaborate. - -This component is supposed to replace the "resource manager" present in -heartbeat currently. - -Issues which need to be addressed / kept in mind: - -- Start order; heartbeat should only join the cluster if the startup of CRM is - successfully completed locally. - -- CRM makes extensive use of heartbeat's libraries (IPC, HBcomm, - STONITH, PILS etc). -... - - -10.2. Integration with CCM - -TODO: Elaborate, if necessary. - -CRM is a client of CCM; CCM provides the set of nodes in the partition, CRM -only operates on this data. - -It would be nice if CCM enforced policy before allowing a node to join a -partition; ie, time not vastly desynchronized, CRM/LRM etc running and other -ideas come to mind. - - -10.3. Integration with Group Services - -Should Group Services functionality - in particular, group messaging - be -available one day, the exact semantics will of course have to be taken into -account. - -The obvious simplification possible with this component would be that the CRM -could form a distributed process group in the cluster; instead of building on -top of node messaging primitives and filtering these. - - -10.4. Integration with non-heartbeat clusters - -In theory, the software should be able to run in any "compliant" environment; -I'd safely assume that it should be reasonably easy to port on top of the -Compaq CI, for example, or any Service-Availability Forum AIS. - -This would complement the respective feature lists as well as demonstrate that -such interoperability is in fact possible, providing great leverage for OCF! - -TODO: Keep that in mind while designing the code. Encapsulate -interactions with the lower levels cleanly. - -10.5. Integration with cluster-aware applications - -Software like cluster-aware volume managers, filesystems or distributed -applications (databases) need to be integrated into the 'recovery' tree just -like Resource Agent based ones. - -This should be reasonably simple - because it is a matter of inserting the -necessary trigger in the right order into the transition graph -, but these -clients need to be aware that they don't need to provide any fencing -themselves etc, because the external framework has already taken care of -this. - -10.6. Relation to OCF - -All applicable OCF specifications shall be implemented. Areas which OCF does -not specify yet shall be implemented as prototypes, hopefully serving as a -basis for OCF discussion. - - -11. Monitoring - -11.1. 
Integration with health monitoring - -Health monitoring should prevent startup of cluster software on a "sick" -node. ("Hey, I've rebooted 12 times within the last 60 minutes, maybe -bringing the cluster software online wouldn't be so smart!") Should be -handled by Local Resource Manager? - -Health monitoring can send events to DC: -- I'm about to crash, migrate everything while you can -- I'm overloaded, please migrate some resources if possible - -etc - -11.2. Monitoring the CRM - -Monitoring software can query CIB to get access to all data. - -TODO: Should traps/events be triggered from inside the different components -themselves, or might it be a good idea to allow clients to "subscribe" to -certain parts of the CIB and be notified if a change occurs? This would -be helpful for SNMP/CIM traps. - -(The later would be somewhat alike FailSafe's cdbd) - - -XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX -Attention: Here be dragons. Anything following these lines are unordered -thoughts which haven't yet been incorporated into the grand scheme of things. -XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - -X. ... - -Question: Why isn't moving a resource to another node an basic operation? - -Answer: Because it involves more than one node and needs to be coordinated by -the designated CRM. - -Question: How can this deal with disaster resilient setups, where one site may -be physical separate and fencing the other side is not possible? - -Answer: See the "note" under "Hangman". - - -
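
Illustration (appended for readers of this change, not part of the patch itself): the election logic that the renamed comment in fsa_defines.h describes, and that section 4.2 of the removed crm.txt spelled out, reduces to "every node orders the CCM membership list the same way, and the first reachable node in that ordering takes over as controller (DC)". A minimal C sketch of that selection rule follows; the node_t type, the pick_controller() helper and the local_uname variable are hypothetical names used only for illustration and are not crmd APIs.

/* Hypothetical sketch of the DC selection rule -- illustration only. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *uname;   /* node name, in the order reported by the CCM */
    bool        online;  /* is the node part of the current partition? */
} node_t;

/* The first online node in the CCM-ordered membership list wins the
 * election; returns NULL if the partition is empty. */
static const node_t *
pick_controller(const node_t *membership, size_t n_nodes)
{
    for (size_t i = 0; i < n_nodes; i++) {
        if (membership[i].online) {
            return &membership[i];
        }
    }
    return NULL;
}

/* A node knows it has won when the selected entry is itself, e.g.:
 *
 *   const node_t *dc = pick_controller(members, n_members);
 *   if (dc != NULL && strcmp(dc->uname, local_uname) == 0) {
 *       // we are the controller: announce it to the rest of the partition
 *   }
 *
 * The real crmd drives this through the A_ELECTION_* actions and the
 * I_ELECTION / I_ELECTION_DC inputs defined in fsa_defines.h above,
 * rather than through a standalone helper like this one.
 */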