Service (Resource Group) Manager Errors

This document explains the errors and warnings that you may see while
running the resource group manager.  It is meant to be all-inclusive; if
you encounter an error or warning message that is not explained below,
please file a Bugzilla:

http://bugzilla.redhat.com/bugzilla

#1: Quorum Dissolved

The cluster infrastructure has reported to the resource group manager that
the local node and/or entire cluster is inquorate.  At this point, all
services and resources managed by the resource group manager are stopped
and the resource group manager restarts, waiting for a quorum to form.

If this node was disconnected, it should be evicted and fenced by the rest
of the cluster.  Nodes which become inquorate may reboot themselves.

#2: Service <name> returned failure code.  Last owner: <name>

The resource group named <name> has failed to stop.  This generally means
that it may not be automatically recovered and that the system administrator
must intervene in order to cleanly restart the service.  Services which
fail must first be disabled then re-enabled.  However, be sure that all
resources have been properly cleaned up first.  Generally, a hard reset
of the node on which the service failed will restore it to working order
and is the safest measure to take prior to restarting the failed service.

This, however, is not required.

#3: Service <name> returned failure code.  Last owner:
<integer>

Same as #2, but the node name was not determinable given the node ID.

#4: Administrator intervention required.

This occurs only after #2 or #3.  It states that the administrator must
take action after a service has failed.  See #2.

#5: Couldn't connect to ccsd!

The resource group manager was unable to connect to ccsd.  This generally
means that ccsd was not running at the time the resource group manager
tried to connect.  Starting ccsd generally solves this.  If this does
not solve it, try checking firewall rules to ensure that the connection
to ccsd's port is not blocked.  Additionally, it could be that there is
a conflict between ccsd and the version of the system library (libccs) that
rgmanager was built against.

#6: Error loading services

The resource group manager was unable to load configuration information
from ccsd.  This could mean a communication problem or an invalid
configuration. 

This error is fatal; the resource group manager aborts.

#7: Error building resource tree

The resource group manager was unable to load configuration information
from ccsd.  This could mean a communication problem or an invalid
configuration.

This error is fatal; the resource group manager aborts.

#8: Couldn't initialize services

The resource group manager was unable to load configuration information
from ccsd.  This could mean a communication problem or an invalid
configuration.

#9: Couldn't connect to cluster

The resource group manager was unable to find a plugin which was able to
talk to the cluster infrastructure.  Generally, this occurs when no cluster
infrastructure is running.  Try starting the preferred cluster infrastructure
for your configuration (e.g. CMAN+DLM, GuLM) and restarting rgmanager.

#10: Couldn't set up listen socket

The resource group manager was unable to bind to its listening socket.
This generally happens when there is already a resource group manager running,
but it is possible for other applications to use the port that rgmanager
wants to use for this purpose.
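
To find out whether a port is already taken, you can try to bind to it
yourself.  The following is a minimal sketch, not rgmanager's own code.  It
assumes a TCP socket on the loopback address, and the helper name try_bind is
hypothetical; substitute the port your rgmanager build actually uses.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket; return (socket, None) or (None, reason)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as err:
        sock.close()
        if err.errno == errno.EADDRINUSE:
            # Another process (possibly another rgmanager) holds the port.
            return None, "address already in use"
        return None, err.strerror
    return sock, None
```

If the reason is "address already in use", look for a second resource group
manager or another application bound to the same port.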

#11: Couldn't set up VF listen socket

The resource group manager was unable to bind to its listening socket which
is used for internal state distribution.  This generally happens when there
is already a resource group manager running, but it is possible for other
applications to use the port that rgmanager wants to use for this purpose.

#12: RG <name> failed to stop; intervention required

The resource group manager has failed to cleanly stop a service.
See #2 for actions to take.

#13: Service <name> failed to stop cleanly

The resource group manager has failed to cleanly stop a service after
failing to start the same resource group during an enable or start operation.
See #2 for actions to take.

#14: Failed to send <integer> bytes to <integer>

During a broadcast operation to all nodes, the view-formation library has
failed to send a message to one of the nodes.  This is generally recovered
automatically.

#15: rmtab_modified: stat: <error>

The stat(2) function received an error while scanning for changes to
/var/lib/nfs/rmtab.  Though generally handled automatically, repeats of
this kind of error could indicate a problem with /var/lib/nfs/rmtab.

If repeated errors occur:
 * Ensure /var/lib/nfs exists and is a directory.
 * Ensure /var/lib/nfs/rmtab exists and is a regular file (not a directory).
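
The two checks above can be scripted.  This is a minimal sketch; the function
name check_rmtab_paths and its nfs_dir parameter are hypothetical and exist
only so the check can be pointed at a different directory for testing.

```python
import os
import stat


def check_rmtab_paths(nfs_dir="/var/lib/nfs"):
    """Return a list of problems found with nfs_dir and nfs_dir/rmtab."""
    problems = []
    rmtab = os.path.join(nfs_dir, "rmtab")

    if not os.path.isdir(nfs_dir):
        problems.append(f"{nfs_dir} is missing or is not a directory")
        return problems

    try:
        mode = os.stat(rmtab).st_mode
    except OSError as err:
        # The same class of failure that message #15 reports from stat(2).
        problems.append(f"stat {rmtab}: {err.strerror}")
        return problems

    if not stat.S_ISREG(mode):
        problems.append(f"{rmtab} is not a regular file")
    return problems
```

Calling check_rmtab_paths() with no arguments inspects /var/lib/nfs; an empty
list means both checks passed.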

#16: Failed to reread rmtab: <error>

Clurmtabd had trouble reading or parsing /var/lib/nfs/rmtab.  This could
mean that there is garbage in /var/lib/nfs/rmtab or that the internal
format has changed or some problem with the filesystem.  If the latter is
suspected, please file a Bugzilla.  Ensure you include your
/var/lib/nfs/rmtab and version of rgmanager.

#17: Failed to prune rmtab: <error>

Clurmtabd failed to prune a new copy of /var/lib/nfs/rmtab against its 
specified mount point.

#18: Failed to diff rmtab: <error>

Clurmtabd failed to determine the differences, if any, between a previous
copy of /var/lib/nfs/rmtab and the current data stored there.

#19: (Obsolete)

#20: Failed to set log level

Clurmtabd failed to change its log level to the specified level.  This
is non-fatal.  The side effect is that more or less verbose logging will
be seen depending on whether the log level was increased or decreased.

#21: Couldn't read/create <filename>

Clurmtabd stores service-specific rmtab information in
<mount_point>/.clumanager/rmtab so that when the service fails over,
the new instance of clurmtabd on the other node can determine which hosts
have already mounted any NFS exports.  This error could indicate that
clients will receive ESTALE (Stale NFS file handle) after failover.

#22: Failed to read <filename>: <error>

Clurmtabd failed to read or parse <filename>.  This could indicate garbage
in that file, or another error.  This could have side effects of ESTALEs
being received by clients.  The error should be indicative of the cause.

#23: Failed to read /var/lib/nfs/rmtab: <error>

Clurmtabd failed to read or parse /var/lib/nfs/rmtab.  This could indicate garbage
in that file, or another error.  This could have side effects of ESTALEs
being received by clients.  The error should be indicative of the cause.

#24: Failed to prune rmtab: <error>

Clurmtabd failed to prune unrelated entries in /var/lib/nfs/rmtab.  The
error should be indicative of the cause.

#25: Failed to write /var/lib/nfs/rmtab: <error>

Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after
merging changes between private cluster data (<mount_point>/.clumanager/rmtab)
and the system-wide copy (/var/lib/nfs/rmtab).  The error should be indicative
of the cause.
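
This document does not describe how clurmtabd makes the write atomic.  A
common technique, shown here only as an assumption, is to write a temporary
file in the same directory and then rename it over the original.  The
function name write_atomic is hypothetical.

```python
import os
import tempfile


def write_atomic(path, data):
    """Replace path with data so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rmtab-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # rename(2) is atomic within one filesystem
    except OSError:
        # Leave the original file untouched and remove the temporary copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Failures in this pattern usually come from a full or read-only filesystem or
from missing permissions on the directory.  Note that mkstemp creates the
temporary file with mode 0600, so a real implementation would likely need to
restore the original file's permissions.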

#26: Failed to write <filename>: <error>

Clurmtabd failed to atomically write a new copy of its private cluster data
file after merging changes between it and the system-wide copy
(/var/lib/nfs/rmtab).  The error should be indicative of the cause.

#27: Couldn't initialize - exiting

Clurmtabd failed to initialize for one reason or another.  A previous
error should indicate the reason why.

#28: daemonize: <error>

Clurmtabd failed to become a daemon.  The possible reasons this could happen
are documented in fork(2) and setsid(2).

#29: rmtab_write_atomic: <error>

Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after
merging changes between private cluster data (<mount_point>/.clumanager/rmtab)
and the system-wide copy (/var/lib/nfs/rmtab).  The error should be indicative
of the cause.

#30: Node <name> defined multiple times in domain <domain>

This indicates a configuration error where a node <name> was defined more
than once in a given failover domain <domain>.  If this occurs, only the first
entry for the node <name> will be used. Remove the duplicate copy and restart
rgmanager on all nodes.

#31: Domain <domain> defined multiple times

This indicates a configuration error where a domain <domain> was defined
more than once.  Failover domains may not have duplicate names.  If this
occurs, the first one found will be used.
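
The "first entry wins" rule described in #30 and #31 can be illustrated with a
short sketch.  It works on plain lists of names rather than on the real
cluster configuration, and the function name first_definitions is
hypothetical.

```python
def first_definitions(names):
    """Split names into (kept, duplicates), keeping the first occurrence of each."""
    kept, duplicates, seen = [], [], set()
    for name in names:
        if name in seen:
            # A repeat of an earlier entry; it is ignored, as in #30 and #31.
            duplicates.append(name)
        else:
            seen.add(name)
            kept.append(name)
    return kept, duplicates
```

For example, first_definitions(["node1", "node2", "node1"]) keeps node1 and
node2 and reports the second node1 as a duplicate to remove.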

#32: Code path error: Invalid return from node_in_domain()

If this occurs, please file a Bugzilla.

#33: Unable to obtain cluster lock: <error>

This occurs while evaluating services after a node transition.  If
this occurs, the current service under evaluation will not be able
to be checked for possible starting.

Possible reasons obtaining a cluster lock would fail:
 * Loss of node/cluster quorum
 * Broken connection to GuLM/DLM
 * Error in magma-plugins package.

#34: Cannot get status for service <name>

This occurs while evaluating services after a node transition.  The
service <name> has an indeterminable state.  This could indicate a
bug with the data distribution subsystem, or an invalid service.

#35: Unable to inform partner to start failback

This occurs after a node transition where a node asks other nodes for
services (resource groups) which it should own.  This generally indicates
a communication problem between the cluster nodes.  If this occurs,
services may not migrate to the node which just came online.

#36: Cannot initialize services

The resource group manager could not initialize services after a node
transition.  This is fatal, and rgmanager exits uncleanly afterward.

#37: Error receiving message header

This occurs after an incoming request causes the resource group manager to
accept a new connection.  After a new connection is received, there is a
short amount of time during which to receive the message header.  If this
does not occur, the connection is dropped and the message (if any) is
rejected.

#38: Invalid magic: Wanted 0x<hex>, got 0x<hex>

This occurs after an incoming request causes the resource group manager to
accept a new connection.  This could indicate a mismatched version between
resource group managers or an unauthorized program attempting to communicate
with the resource group manager.  The connection is dropped and the message
is rejected.

#39: Error receiving entire request

This occurs after an incoming request causes the service manager to
accept a new connection.  The amount of data received did not match the amount
of data specified in the message header.  This could indicate a mismatched
version between service managers or an unauthorized program attempting
to communicate with the service manager.  The connection is dropped
and the message is rejected.
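
The checks behind #37, #38, and #39 follow a common pattern: read a fixed-size
header, verify a magic number, then verify the payload length.  The sketch
below illustrates that pattern only.  The header layout (two 32-bit integers
in network byte order) and the MAGIC value are hypothetical placeholders, not
rgmanager's actual wire format.

```python
import struct

HEADER = struct.Struct("!II")  # hypothetical layout: magic, payload length
MAGIC = 0x12345678             # placeholder value, not rgmanager's


def parse_message(data):
    """Validate header, magic, and length; return the payload bytes."""
    if len(data) < HEADER.size:
        raise ValueError("Error receiving message header")
    magic, length = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError(
            f"Invalid magic: Wanted 0x{MAGIC:08x}, got 0x{magic:08x}")
    payload = data[HEADER.size:]
    if len(payload) != length:
        # The header promised a different amount of data than arrived.
        raise ValueError("Error receiving entire request")
    return payload
```

A sender built against a different header definition would fail the magic or
length check, which is why these messages can point to mismatched versions.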

#40: Error replying to action request.

A resource action request was received (enable/disable/etc.) while the 
resource groups were locked and we failed to reply properly to the waiting
client connection.

#41: Couldn't obtain lock for RG <name>: <error>

While trying to report a failed service (resource group) to the other
cluster members, we failed to obtain a cluster lock.  This could indicate
that the cluster quorum has dissolved, communication errors with the lock
server, or other problems.  The effect of this is, however, minimal; simply
put, the #2 and #3 messages won't appear in the logs, so the last owner
of the service will be unknown.

See #33 for reasons as to why obtaining a lock might fail.

#42: Cannot stop service <name>: Invalid State <integer>

The service <name> could not be stopped.  It was in an invalid state.
This could indicate a bug in rgmanager.

#43: Service <name> has failed on all applicable members; can not
start.

The service failed to start on every member that is allowed to run it, so no
further start attempts are made.  The service must be disabled and re-enabled
before it is allowed to start again.  See #2 and #3 for more information.

#44: Cannot start service <name>: Invalid State <integer>

The service <name> could not be started.  It was in an invalid state.
This could indicate a bug in rgmanager.

#45: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group prior
to starting it.  If this occurs, the start operation will fail.  The error
should be indicative of the reason.

See #33 for reasons as to why obtaining a lock might fail.

#46: Failed getting status for RG <name>

This occurs while trying to determine the state of a resource group prior
to starting it.  If this occurs, the start operation will fail.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.

#47: Failed changing service status

This occurs while trying to write out a new ownership state of a resource 
group prior to starting it.  If this occurs, the start operation will fail.

Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.

#48: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group prior
to performing a status operation on it.  If this occurs, the status
operation will fail.

See #33 for reasons as to why obtaining a lock might fail.

#49: Failed getting status for RG <name>

This occurs while trying to determine the state of a resource group prior
to performing a status operation on it.  If this occurs, the status
operation will fail.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.

#50: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group prior
to performing a stop or disable operation on it.  If this occurs, the stop
operation will fail.

If a stop operation fails on a service, the service is marked as 'failed',
if possible.

See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.

#51: Failed getting status for service <name>

This occurs while trying to determine the state of a resource group prior
to performing a stop or disable operation on it.  If this occurs, the stop
operation will fail.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.

If a stop operation fails on a service, the service is marked as 'failed',
if possible (if the cluster is not quorate, then this is not possible).

See #2 for steps to take after a resource group has failed.

#52: Failed changing RG status

This occurs while trying to write out a new ownership state of a resource 
group prior to stopping it.  If this occurs, the stop operation will fail.

Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.

See #2 for steps to take after a resource group has failed.

#53: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group after
performing a stop or disable operation on it.  If this occurs, the stop
operation will fail.

If a stop operation fails on a service, the service is marked as 'failed',
if possible.

See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.

#54: Failed getting status for RG <name>

This occurs while trying to determine the state of a resource group after
performing a stop or disable operation on it.  If this occurs, the stop
operation will fail.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.

If a stop operation fails on a service, the service is marked as 'failed',
if possible (if the cluster is not quorate, then this is not possible).

#55: Failed changing RG status

This occurs while trying to write out a new ownership state of a resource 
group after stopping it.  If this occurs, the stop operation will fail.

Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.

See #2 for steps to take after a resource group has failed.

#55: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group after
a stop operation has failed, while the cluster is trying to disable
the service.  If this occurs, the operation will fail.

See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.

#56: Failed getting status for RG <name>

This occurs while trying to determine the state of a resource group after
failing to perform a stop or disable operation on it.  If this occurs,
the operation to lock the service will fail.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.

#57: Failed changing RG status

This occurs while trying to write out a new ownership state of a resource 
group after marking it as failed.  If this occurs, the stop operation
will fail.

Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.

See #2 for steps to take after a resource group has failed.

#58: Failed opening connection to member #<integer>

We attempted to relocate a resource group (service) to another node, but
failed to actually connect to that node's resource group manager.  This
could indicate that rgmanager is not running on that node.  In any case,
the next node in the cluster member list is tried.

#59: Error sending relocate request to member #<integer>

We attempted to relocate a resource group (service) to another node, but
failed to send the relocation message.  This could indicate a problem with
network connectivity, extremely high local/remote load, or other problems.
The next node in the cluster member list is tried.
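
The "try the next node" behavior described in #58 and #59 can be sketched as a
simple loop.  This is an illustration only; relocate and the send_request
callback are hypothetical names, and connection or send failures are modeled
as OSError.

```python
def relocate(service, members, send_request):
    """Offer service to each member in order; return the first that accepts."""
    for member in members:
        try:
            send_request(member, service)
        except OSError:
            # Could not connect or send (#58/#59); move on to the next member.
            continue
        return member
    # Every member failed; the caller must decide what to do next.
    return None
```

A return value of None means no member in the list could be reached.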

#60: Mangled reply from member #<integer> during RG relocate

We sent a resource group to another node, but it failed to send us a 
useful reply.  At this point, the state of the resource group is unknown, but
we give it the benefit of the doubt and assume it started okay.

#61: Invalid reply from member <integer> during relocate operation!

Similar to #60, but this occurs only after the initial preferred node failed
to start the service and/or failed to communicate a proper reply.

#62: /var/lib/nfs/rmtab does not exist - creating

/var/lib/nfs/rmtab did not exist.  Clurmtabd creates it.

#63: Couldn't write PID!

Clurmtabd failed to write its pid to <mount_point>/.clumanager/pid.  This
will cause fs.sh to kill it with -9 during a stop operation, preventing
it from synchronizing with /var/lib/nfs/rmtab prior to exiting.  This
increases the risk of ESTALE (Stale NFS file handle) on clients after
a relocation.

#64: Could not validate <mount_point>
#65: NFS failover of <mount_point> will malfunction

Clurmtabd failed to initialize the mount point's private cluster rmtab
file.  This will prevent updating of that mount point's rmtab file, which
means that clients will receive ESTALE after a relocation or failover.

#66: Domain '<domain>' specified for resource group <name> nonexistent!

The failover domain <domain> does not exist in the current view of the
cluster configuration.  This is a configuration error.

#67: Shutting down uncleanly

The node has left the cluster cleanly, but rgmanager was still running.
All services are halted as quickly as possible to prevent data corruption.

(It may be a good idea to have rgmanager reboot if this is received)

#68: Failed to start <name>; return value: <integer>

The resource group <name> failed to start and returned the value <integer>.
This could indicate missing resources on the node or an improperly configured
resource group.  Check your resource group's configuration against your 
hardware and software configuration and ensure that it is correct.

#69: Unclean [stop|disable] of <name>

The resource group is being stopped because of a local node exiting or
loss of quorum.  The distributed state is left unchanged.

#70: Attempting to restart resource group <name> locally.

The resource group failed to start on all other applicable nodes during
processing of a relocate operation.  (A relocate operation occurs either
by an administrator manually relocating a service or the service being
relocated after a fail-to-restart event.)

#71: Relocating failed service <name>

The resource group <name> failed a status check and subsequently failed to
restart.  At this point, we try to send it to another applicable node in
the cluster.

#72: clunfsops: NFS syscall <name> failed: <error>.
#73: clunfsops: Kernel may not have NFS failover enhancements.

Required NFS failover enhancements were not present on the host kernel.
It is impossible to restart or relocate NFS services without these, but
they should properly work in the case of true failover situations (i.e.
the node on which the NFS service was running has failed and been
fenced by the cluster).

#74: Unable to obtain cluster lock: <error>

This occurs while trying to determine the state of a resource group after
an attempt to start it has completed (at the script level).  If this occurs,
the start operation will fail.

See #33 for reasons as to why obtaining a lock might fail.

#75: Failed getting status for RG <name>

This occurs while trying to determine the state of a resource group after
an attempt to start it has completed.  If this occurs, the start operation
will report a failure.

Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.