Service (Resource Group) Manager Errors Herein lie explanations as to the various errors and warnings that you may see while running the resource group manager. This is meant to be all-inclusive; if any error messages or warning messages are experienced which are not explained below, please file a bugzilla: http://bugzilla.redhat.com/bugzilla #1: Quorum Dissolved The cluster infrastructure has reported to the resource group manager that the local node and/or entire cluster is inquorate. At this point, all services and resources managed by the resource group manager are stopped and the resource group manager restarts, waiting for a quorum to form. If this node was disconnected, it should be evicted and fenced by the rest of the cluster. Nodes which become inquorate may reboot themselves. #2: Service returned failure code. Last owner: The resource group named has failed to stop. This generally means that it may not be automatically recovered and that the system administrator must intervene in order to cleanly restart the service. Services which fail must first be disabled then re-enabled. However, be sure that all resources have been properly cleaned up first. Generally, a hard reset of the node on which the service failed will restore it to working order and is the safest measure to take prior to restarting the failed service. This, however, is not required. #3: Service returned failure code. Last owner: Same as #2, but the node name was not determinable given the node ID. #4: Administrator intervention required. Only occurs after #2 or #3. Complaint stating that the administrator must take action after a service failed. See #2. #5: Couldn't connect to ccsd! The resource group manager was unable to connect to ccsd. This generally means that ccsd was not running at the time the resource group manager tried to connect. Starting ccsd generally solves this. If this does not solve it, try checking firewall rules to ensure that the connection to ccsd's port is not blocked. Additionally, it could be that there is a conflict between ccsd and the version of the system library (libccs) that rgmanager was built against. #6: Error loading services The resource group manager was unable to load configuration information from ccsd. This could mean a communication problem or an invalid configuration. This error is fatal; the resource group manager aborts. #7: Error building resource tree The resource group manager was unable to load configuration information from ccsd. This could mean a communication problem or an invalid configuration. This error is fatal; the resource group manager aborts. #8: Couldn't initialize services The resource group manager was unable to load configuration information from ccsd. This could mean a communication problem or an invalid configuration. #9: Couldn't connect to cluster The resource group manager was unable to find a plugin which was able to talk to the cluster infrastructure. Generally, this occurs when no cluster infrastruture is running. Try starting the preferred cluster infrastructure for your configuration (e.g. CMAN+DLM, GuLM) and restarting rgmanager. #10: Couldn't set up listen socket The resource group manager was unable to bind to its listening socket. This generally happens when there is already a resource group manager running, but it is possible for other applications to use the port that rgmanager wants to use for this purpose. #11: Couldn't set up VF listen socket The resource group manager was unable to bind to its listening socket which is used fo internal state distribution. This generally happens when there is already a resource group manager running, but it is possible for other applications to use the port that rgmanager wants to use for this purpose. #12: RG failed to stop; intervention required The resource group manager has failed to cleanly stop a service. See #2 for actions to take. #13: Service failed to stop cleanly The resource group manager has failed to cleanly stop a service after failing to start the same resource group during an enable or start operation. See #2 for actions to take. #14: Failed to send bytes to During a broadcast operation to all nodes, the view-formation library has failed to send a message to one of the nodes. This is generally recovered automatically. #15: rmtab_modified: stat: The stat(2) function received an error while scanning for changes to /var/lib/nfs/rmtab. Though generally handled automatically, repeats of this kind of error could indicate a problem with /var/lib/nfs/rmtab. If repeated errors occur: * Ensure /var/lib/nfs exists and is a directory. * Ensure /var/lib/nfs/rmtab exists and is a regular file (not a directory) #16: Failed to reread rmtab: Clurmtabd had trouble reading or parsing /var/lib/nfs/rmtab. This could mean that there is garbage in /var/lib/nfs/rmtab or that the internal format has changed or some problem with the filesystem. If the latter is suspected, please file a Bugzilla. Ensure you include your /var/lib/nfs/rmtab and version of rgmanager. #17: Failed to prune rmtab: Clurmtabd failed to prune a new copy of /var/lib/nfs/rmtab against its specified mount point. #18: Failed to diff rbtab: Clurmtabd failed to determine the differences, if any, between a previous copy of /var/lib/nfs/rmtab and the current data stored there. #19: (Obsolete) #20: Failed to set log level Clurmtabd failed to change its log level to the specified level. This is non-fatal. The side effect is that more or less verbose logging will be seen depending on whether the log level was increased or decreased. #21: Couldn't read/create Clurmtabd stores service-specific rmtab information in /.clumanager/rmtab so that when the service fails over, the new instance of clurmtabd on the other node can pick up and know which hosts already have mounted any NFS exports. This error could indicate that clients will receive ESTALE (Stale NFS file handle) after failover. #22: Failed to read : Clurmtabd failed to read or parse . This could indicate garbage in that file, or another error. This could have side effects of ESTALEs being received by clients. The error should be indicative of the cause. #23: Failed to read /var/lib/nfs/rmtab: Clurmtabd failed to read or parse . This could indicate garbage in that file, or another error. This could have side effects of ESTALEs being received by clients. The error should be indicative of the cause. #24: Failed to prune rmtab: Clurmtabd failed to prune unrelated entries in /var/lib/nfs/rmtab. The error should be indicative of the cause. #25: Failed to write /var/lib/nfs/rmtab: Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after merging changes between private cluster data (/.clumanager/rmtab) and the system-wide copy (/var/lib/nfs/rmtab). The error should be indicative of the cause. #26: Failed to write : Clurmtabd failed to atomically write a new copy of its private cluster data file after changes between it and the system-wide copy in (/var/lib/nfs/rmtab). The error should be indicative of the cause. #27: Couldn't initialize - exiting Clurmtabd failed to initialize for one reason or another. A previous error should indicate the reason why. #28: daemonize: Clurmtabd failed to become a daemon. The possible reasons this could happen are documented in fork(2) and setsid(2). #29: rmtab_write_atomic: Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after merging changes between private cluster data (/.clumanager/rmtab) and the system-wide copy (/var/lib/nfs/rmtab). The error should be indicative of the cause. #30: Node defined multiple times in domain This indicates a configuration error where a node was defined more than once in a given failover domain . If this occurs, only the first entry for the node will be used. Remove the duplicate copy and restart rgmanager on all nodes. #31: Domain defined multiple times This indicates a configuration error where a domain was defined more than once. Failover domains may not have duplicate names. If this occurs, the first one found will be used. #32: Code path error: Invalid return from node_in_domain() If this occurs, please file a Bugzilla. #33: Unable to obtain cluster lock: This occurs while evaluating services after a node transition. If this occurs, the current service under evaluation will not be able to be checked for possible starting. Possible reasons obtaining a cluster lock would fail: * Loss of node/cluster quorum * Broken connection to GuLM/DLM * Error in magma-plugins package. #34: Cannot get status for service This occurs while evaluating services after a node transition. The service has an indeterminable state. This could indicate a bug with the data distribution subsystem, or an invalid service. #35: Unable to inform partner to start failback This occurs after a node transition where a node asks other nodes for services (resource groups) which it should own. This generally indicates a communication problem between the cluster nodes. If this occurs, services may not migrate to the node which just came online. #36: Cannot initialize services The resource group manager could not initialize services after a node transition. This is fatal, and rgmanager exits uncleanly afterward. #37: Error receiving message header This occurs after an incoming request causes the resource group manager to accept a new connection. After a new connection is received, there is a short amount of time during which to receive the message header. If this does not occur, the connection is dropped and the message (if any) is rejected. #38: Invalid magic: Wanted 0x, got 0x This occurs after an incoming request causes the resource group manager to accept a new connection. This could indicate a mismatched version between resource group managers or an unauthorized program attempting to communicate with the resource group manager. The connection is dropped and the message is rejected. #39: Error receiving entire request This occurs after an incoming request causes the service manager to accept a new connection. The amount of data received did not match the amount of data specified in the message header. This could indicate a mismatched version between service managers or an unauthorized program attempting to communicate with the service manager. The connection is dropped and the message is rejected. #40: Error replying to action request. A resource action request was received (enable/disable/etc.) while the resource groups were locked and we failed to reply properly to the waiting client connection. #41: Couldn't obtain lock for RG : While trying to report a failed service (resource group) to the other cluster members, we failed to obtain a cluster lock. This could indicate that the cluster quorum has dissolved, communication errors with the lock server, or other problems. The effect of this is, however, minimal; simply put, the #2 and #3 messages won't appear in the logs, so the last owner of the service will be unknown. See #33 for reasons as to why obtaining a lock might fail. #42: Cannot stop service : Invalid State The service could not be stopped. It was in an invalid state. This could indicate a bug in rgmanager. #43: Service has failed on all applicable members; can not start. I don't know how to make this more verbose. The service must be disabled and enabled prior to being allowed to start. See #2 and #3 for more information. #44: Cannot start service : Invalid State The service could not be stopped. It was in an invalid state. This could indicate a bug in rgmanager. #45: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group prior to starting it. If this occurs, the start operation will fail. The error should be indicative of the reason. See #33 for reasons as to why obtaining a lock might fail. #46: Failed getting status for RG This occurs while trying to determine the state of a resource group prior to starting it. If this occurs, the start operation will fail. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved. #47: Failed changing service status This occurs while trying to write out a new ownership state of a resource group prior to starting it. If this occurs, the start operation will fail. Generally, this indicates attempt to write a new view of that resource group's state after quorum has dissolved. #48: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group prior to performing a status operation on it. If this occurs, the status operation will fail. See #33 for reasons as to why obtaining a lock might fail. #49: Failed getting status for RG This occurs while trying to determine the state of a resource group prior to performing a status operation on it. If this occurs, the status operation will fail. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved. #50: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group prior to performing a stop or disable operation on it. If this occurs, the stop operation will fail. If a stop operation fails on a service, the service is marked as 'failed', if possible. See #33 for reasons as to why obtaining a lock might fail. See #2 for steps to take after a resource group has failed. #51: Failed getting status for service This occurs while trying to determine the state of a resource group prior to performing a stop or disable operation on it. If this occurs, the stop operation will fail. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved. If a stop operation fails on a service, the service is marked as 'failed', if possible (if the cluster is not quorate, then this is not possible). See #2 for steps to take after a resource group has failed. #52: Failed changing RG status This occurs while trying to write out a new ownership state of a resource group prior to stopping it. If this occurs, the stop operation will fail. Generally, this indicates attempt to write a new view of that resource group's state after quorum has dissolved. See #2 for steps to take after a resource group has failed. #53: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group after performing a stop or disable operation on it. If this occurs, the stop operation will fail. If a stop operation fails on a service, the service is marked as 'failed', if possible. See #33 for reasons as to why obtaining a lock might fail. See #2 for steps to take after a resource group has failed. #54: Failed getting status for RG This occurs while trying to determine the state of a resource group after to performing a stop or disable operation on it. If this occurs, the stop operation will fail. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved. If a stop operation fails on a service, the service is marked as 'failed', if possible (if the cluster is not quorate, then this is not possible). #55: Failed changing RG status This occurs while trying to write out a new ownership state of a resource group after stopping it. If this occurs, the stop operation will fail. Generally, this indicates attempt to write a new view of that resource group's state after quorum has dissolved. See #2 for steps to take after a resource group has failed. #55: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group after a stop operation has failed, while the cluster is trying to disable the service. If this occurs, the operation will fail. See #33 for reasons as to why obtaining a lock might fail. See #2 for steps to take after a resource group has failed. #56: Failed getting status for RG This occurs while trying to determine the state of a resource group after failing to perform a stop or disable operation on it. If this occurs, the operation to lock the service will fail. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved. #57: Failed changing RG status This occurs while trying to write out a new ownership state of a resource group after marking it as failed. If this occurs, the stop operation will fail. Generally, this indicates attempt to write a new view of that resource group's state after quorum has dissolved. See #2 for steps to take after a resource group has failed. #58: Failed opening connection to member # We attempted to relocate a resource group (service) to another node, but failed to actually connect to that node's resource group manager. This could indicate that rgmanager is not running on that node. In any case, the next node in the cluster member list is tried. #59: Error sending relocate request to member # We attempted to relocate a resource group (service) to another node, but failed to send the relocation message. This could indicate a problem with network connectivity, extremely high local/remote load, or other problems. The next node in the cluster member list is tried. #60: Mangled reply from member # during RG relocate We sent a resource group to another node, but it failed to send us a useful reply. At this point, the state of the resource group is unknown, but we give it the benefit of the doubt and assume it started okay. #61: Invalid reply from member during relocate operation! Similar to #60, but this occurs only after the inital preferred node failed to start the service and/or failed to communicate a proper reply. #62: /var/lib/nfs/rmtab does not exist - creating /var/lib/nfs/rmtab did not exist. Clurmtabd creates it. #63: Couldn't write PID! Clurmtabd failed to write its pid to /.clumanager/pid. This will cause fs.sh to kill it with -9 during a stop operation, preventing it from synchronizing with /var/lib/nfs/rmtab prior to exiting. This increases the risk of ESTALE (Stale NFS file handle) on clients after a relocation. #64: Could not validate #65: NFS failover of will malfunction Clurmtabd failed to initialize the mount point's private cluster rmtab file. This will prevent updating of that mount point's rmtab file, which means that clients will receive ESTALE after a relocation or failover. #66: Domain '' specified for resource group nonexistent! The failover domain does not exist in the current view of the cluster configuration. This is a configuration error. #67: Shutting down uncleanly The node has left the cluster cleanly, but rgmanager was still running. All services are halted as quickly as possible to prevent data corruption. (It may be a good idea to have rgmanager reboot if this is received) #68: Failed to start ; return value: The resource group failed to start and returned the value . This could indicate missing resources on the node or an improperly configured resource group. Check your resource group's configuration against your hardware and software configuration and ensure that it is correct. #69: Unclean [stop|disable] of The resource group is being stopped because of a local node exiting or loss of quorum. The distributed state is left unchanged. #70: Attempting to restart resource group locally. The resource group failed to start on all other applicable nodes during processing of a relocate operation. (A relocate operation occurs either by an administrator manually relocating a service or the service being relocated after a fail-to-restart event.) #71: Relocating failed service The resource group failed a status check and subsequently failed to restart. At this point, we try to send it to another applicable node in the cluster. #72: clunfsops: NFS syscall failed: . #73: clunfsops: Kernel may not have NFS failover enhancements. Required NFS failover enhancements were not present on the host kernel. It is impossible to restart or relocate NFS services without these, but they should properly work in the case of true failover situations (i.e. the node on whicch the NFS service was running has failed and been fenced by the cluster). #74: Unable to obtain cluster lock: This occurs while trying to determine the state of a resource group after an attempt to start it has completed (at the script level). If this occurs, the start operation will fail. See #33 for reasons as to why obtaining a lock might fail. #75: Failed getting status for RG This occurs while trying to determine the state of a resource group after an attempt to start it has completed. If this occurs, the start operation will report a failure. Generally, this indicates attempt to retrieve the current view of that resource group's state after quorum has dissolved.