Service (Resource Group) Manager Errors
Herein lie explanations of the various errors and warnings that you may
see while running the resource group manager. This is meant to be
all-inclusive; if you encounter an error or warning message which is not
explained below, please file a Bugzilla:
http://bugzilla.redhat.com/bugzilla
#1: Quorum Dissolved
The cluster infrastructure has reported to the resource group manager that
the local node and/or entire cluster is inquorate. At this point, all
services and resources managed by the resource group manager are stopped
and the resource group manager restarts, waiting for a quorum to form.
If this node was disconnected, it should be evicted and fenced by the rest
of the cluster. Nodes which become inquorate may reboot themselves.
#2: Service <name> returned failure code. Last owner: <name>
The resource group named <name> has failed to stop. This generally means
that it may not be automatically recovered and that the system administrator
must intervene in order to cleanly restart the service. Services which
fail must first be disabled then re-enabled. However, be sure that all
resources have been properly cleaned up first. Generally, a hard reset
of the node on which the service failed will restore it to working order
and is the safest measure to take prior to restarting the failed service.
This, however, is not required.
#3: Service <name> returned failure code. Last owner: <integer>
Same as #2, but the node name was not determinable given the node ID.
#4: Administrator intervention required.
Only occurs after #2 or #3. Complaint stating that the administrator
must take action after a service failed. See #2.
#5: Couldn't connect to ccsd!
The resource group manager was unable to connect to ccsd. This generally
means that ccsd was not running at the time the resource group manager
tried to connect. Starting ccsd generally solves this. If this does
not solve it, try checking firewall rules to ensure that the connection
to ccsd's port is not blocked. Additionally, it could be that there is
a conflict between ccsd and the version of the system library (libccs) that
rgmanager was built against.
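For illustration only, a rough sketch (in C, not rgmanager's actual code) of
the kind of local connection attempt involved; the port number is a
placeholder, and ECONNREFUSED is the errno that usually corresponds to
"ccsd is not running":

    /* Sketch only, not rgmanager's code.  CCSD_PORT is a placeholder;
     * substitute whatever port ccsd actually listens on. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define CCSD_PORT 50006            /* placeholder */

    int try_connect_ccsd(void)
    {
        struct sockaddr_in sin;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(CCSD_PORT);
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
            if (errno == ECONNREFUSED)
                fprintf(stderr, "ccsd does not appear to be running\n");
            else
                fprintf(stderr, "connect: %s\n", strerror(errno));
            close(fd);
            return -1;
        }
        return fd;                     /* caller closes the connection */
    }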
#6: Error loading services
The resource group manager was unable to load configuration information
from ccsd. This could mean a communication problem or an invalid
configuration.
This error is fatal; the resource group manager aborts.
#7: Error building resource tree
The resource group manager was unable to load configuration information
from ccsd. This could mean a communication problem or an invalid
configuration.
This error is fatal; the resource group manager aborts.
#8: Couldn't initialize services
The resource group manager was unable to load configuration information
from ccsd. This could mean a communication problem or an invalid
configuration.
#9: Couldn't connect to cluster
The resource group manager was unable to find a plugin which was able to
talk to the cluster infrastructure. Generally, this occurs when no cluster
infrastructure is running. Try starting the preferred cluster infrastructure
for your configuration (e.g. CMAN+DLM, GuLM) and restarting rgmanager.
#10: Couldn't set up listen socket
The resource group manager was unable to bind to its listening socket.
This generally happens when there is already a resource group manager running,
but it is possible for other applications to use the port that rgmanager
wants to use for this purpose.
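As a rough illustration (not rgmanager's actual code), setting up a listen
socket usually looks like the sketch below; EADDRINUSE from bind(2) is what
produces the "already running, or port taken by another application"
situation described above. The port number is a placeholder:

    /* Sketch only, not rgmanager's code.  LISTEN_PORT is a placeholder. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define LISTEN_PORT 12345          /* placeholder */

    int setup_listen_socket(void)
    {
        struct sockaddr_in sin;
        int on = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(LISTEN_PORT);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
            /* errno == EADDRINUSE: another rgmanager (or some other
             * program) already owns the port. */
            fprintf(stderr, "bind: %s\n", strerror(errno));
            close(fd);
            return -1;
        }
        if (listen(fd, 5) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }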
#11: Couldn't set up VF listen socket
The resource group manager was unable to bind to its listening socket which
is used for internal state distribution. This generally happens when there
is already a resource group manager running, but it is possible for other
applications to use the port that rgmanager wants to use for this purpose.
#12: RG <name> failed to stop; intervention required
The resource group manager has failed to cleanly stop a service.
See #2 for actions to take.
#13: Service <name> failed to stop cleanly
The resource group manager has failed to cleanly stop a service after
failing to start the same resource group during an enable or start operation.
See #2 for actions to take.
#14: Failed to send <integer> bytes to <integer>
During a broadcast operation to all nodes, the view-formation library has
failed to send a message to one of the nodes. This is generally recovered
automatically.
#15: rmtab_modified: stat: <error>
The stat(2) function received an error while scanning for changes to
/var/lib/nfs/rmtab. Though generally handled automatically, repeats of
this kind of error could indicate a problem with /var/lib/nfs/rmtab.
If repeated errors occur, check the following (a sketch of the check itself
appears after this list):
* Ensure /var/lib/nfs exists and is a directory.
* Ensure /var/lib/nfs/rmtab exists and is a regular file (not a directory).
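A minimal sketch of the kind of check involved (not clurmtabd's actual
code); stat(2) is used to notice modification-time changes, and the second
item in the list above corresponds to the S_ISREG test:

    /* Minimal sketch, not clurmtabd's actual code. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/stat.h>

    #define RMTAB "/var/lib/nfs/rmtab"

    /* Returns 1 if RMTAB changed since *last_mtime, 0 if not, -1 on error. */
    int rmtab_modified(time_t *last_mtime)
    {
        struct stat st;

        if (stat(RMTAB, &st) < 0) {
            /* This is the condition behind error #15. */
            fprintf(stderr, "rmtab_modified: stat: %s\n", strerror(errno));
            return -1;
        }
        if (!S_ISREG(st.st_mode))      /* rmtab must be a regular file */
            return -1;
        if (st.st_mtime != *last_mtime) {
            *last_mtime = st.st_mtime;
            return 1;
        }
        return 0;
    }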
#16: Failed to reread rmtab: <error>
Clurmtabd had trouble reading or parsing /var/lib/nfs/rmtab. This could
mean that there is garbage in /var/lib/nfs/rmtab, that the internal
format has changed, or that there is some problem with the filesystem. If the latter is
suspected, please file a Bugzilla. Ensure you include your
/var/lib/nfs/rmtab and version of rgmanager.
#17: Failed to prune rmtab: <error>
Clurmtabd failed to prune a new copy of /var/lib/nfs/rmtab against its
specified mount point.
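For illustration, "pruning" amounts to keeping only the entries that belong
to the service's mount point. The sketch below assumes the conventional
rmtab line format of client:path:count; the real format and clurmtabd's
actual code may differ:

    /* Sketch only; assumes "client:path:count" lines, which may not match
     * the real rmtab format or clurmtabd's actual code. */
    #include <stdio.h>
    #include <string.h>

    /* Copy to `out` only the rmtab entries from `in` whose export path
     * begins with mount_point; drop everything else. */
    int prune_rmtab(FILE *in, FILE *out, const char *mount_point)
    {
        char line[1024];
        size_t mlen = strlen(mount_point);

        while (fgets(line, sizeof(line), in)) {
            char *path = strchr(line, ':');
            if (!path)
                continue;              /* malformed line: skip it */
            path++;
            if (strncmp(path, mount_point, mlen) == 0)
                fputs(line, out);
        }
        return ferror(in) ? -1 : 0;
    }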
#18: Failed to diff rmtab: <error>
Clurmtabd failed to determine the differences, if any, between a previous
copy of /var/lib/nfs/rmtab and the current data stored there.
#19: (Obsolete)
#20: Failed to set log level
Clurmtabd failed to change its log level to the specified level. This
is non-fatal. The side effect is that more or less verbose logging will
be seen depending on whether the log level was increased or decreased.
#21: Couldn't read/create <filename>
Clurmtabd stores service-specific rmtab information in
<mount_point>/.clumanager/rmtab so that when the service fails over,
the new instance of clurmtabd on the other node can pick up where the old
one left off and know which hosts have already mounted the NFS exports.
If this error occurs, clients may receive ESTALE (Stale NFS file handle)
after failover.
#22: Failed to read <filename>: <error>
Clurmtabd failed to read or parse <filename>. This could indicate garbage
in that file, or another error. This could have side effects of ESTALEs
being received by clients. The error should be indicative of the cause.
#23: Failed to read /var/lib/nfs/rmtab: <error>
Clurmtabd failed to read or parse /var/lib/nfs/rmtab. This could indicate garbage
in that file, or another error. This could have side effects of ESTALEs
being received by clients. The error should be indicative of the cause.
#24: Failed to prune rmtab: <error>
Clurmtabd failed to prune unrelated entries in /var/lib/nfs/rmtab. The
error should be indicative of the cause.
#25: Failed to write /var/lib/nfs/rmtab: <error>
Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after
merging changes between private cluster data (<mount_point>/.clumanager/rmtab)
and the system-wide copy (/var/lib/nfs/rmtab). The error should be indicative
of the cause.
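An atomic write here usually means writing a temporary file in the same
directory and rename(2)-ing it over /var/lib/nfs/rmtab, so that readers
never see a half-written file. A minimal sketch, not clurmtabd's actual
code:

    /* Sketch only, not clurmtabd's actual code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write `data` to /var/lib/nfs/rmtab atomically: write a temporary
     * file in the same directory, then rename(2) it over the target. */
    int rmtab_write_atomic(const char *data)
    {
        char tmp[] = "/var/lib/nfs/.rmtab.XXXXXX";
        int fd = mkstemp(tmp);
        FILE *fp;

        if (fd < 0)
            return -1;
        fp = fdopen(fd, "w");
        if (fp == NULL) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (fputs(data, fp) == EOF || fflush(fp) == EOF || fsync(fd) < 0) {
            fclose(fp);
            unlink(tmp);
            return -1;
        }
        fclose(fp);
        if (rename(tmp, "/var/lib/nfs/rmtab") < 0) {
            unlink(tmp);
            return -1;                 /* error #25/#29 territory */
        }
        return 0;
    }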
#26: Failed to write <filename>: <error>
Clurmtabd failed to atomically write a new copy of its private cluster data
file after merging changes between it and the system-wide copy
(/var/lib/nfs/rmtab). The error should be indicative of the cause.
#27: Couldn't initialize - exiting
Clurmtabd failed to initialize for one reason or another. A previous
error should indicate the reason why.
#28: daemonize: <error>
Clurmtabd failed to become a daemon. The possible reasons this could happen
are documented in fork(2) and setsid(2).
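The conventional fork(2)/setsid(2) daemonization sequence looks roughly
like the sketch below (not clurmtabd's actual code); either call can fail
for the reasons given in those man pages:

    /* Minimal daemonize sketch: detach from the controlling terminal. */
    #include <fcntl.h>
    #include <unistd.h>

    int daemonize(void)
    {
        int fd;
        pid_t pid = fork();

        if (pid < 0)
            return -1;                 /* fork failed: see fork(2) */
        if (pid > 0)
            _exit(0);                  /* parent exits; child carries on */

        if (setsid() < 0)              /* become session leader: setsid(2) */
            return -1;
        if (chdir("/") < 0)
            return -1;

        /* Redirect stdio to /dev/null. */
        fd = open("/dev/null", O_RDWR);
        if (fd >= 0) {
            dup2(fd, STDIN_FILENO);
            dup2(fd, STDOUT_FILENO);
            dup2(fd, STDERR_FILENO);
            if (fd > STDERR_FILENO)
                close(fd);
        }
        return 0;
    }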
#29: rmtab_write_atomic: <error>
Clurmtabd failed to atomically write a new copy of /var/lib/nfs/rmtab after
merging changes between private cluster data (<mount_point>/.clumanager/rmtab)
and the system-wide copy (/var/lib/nfs/rmtab). The error should be indicative
of the cause.
#30: Node <name> defined multiple times in domain <domain>
This indicates a configuration error where a node <name> was defined more
than once in a given failover domain <domain>. If this occurs, only the first
entry for the node <name> will be used. Remove the duplicate copy and restart
rgmanager on all nodes.
#31: Domain <domain> defined multiple times
This indicates a configuration error where a domain <domain> was defined
more than once. Failover domains may not have duplicate names. If this
occurs, the first one found will be used.
#32: Code path error: Invalid return from node_in_domain()
If this occurs, please file a Bugzilla.
#33: Unable to obtain cluster lock: <error>
This occurs while evaluating services after a node transition. If
this occurs, the current service under evaluation will not be able
to be checked for possible starting.
Possible reasons obtaining a cluster lock would fail:
* Loss of node/cluster quorum
* Broken connection to GuLM/DLM
* Error in magma-plugins package.
#34: Cannot get status for service <name>
This occurs while evaluating services after a node transition. The
service <name> has an indeterminable state. This could indicate a
bug with the data distribution subsystem, or an invalid service.
#35: Unable to inform partner to start failback
This occurs after a node transition where a node asks other nodes for
services (resource groups) which it should own. This generally indicates
a communication problem between the cluster nodes. If this occurs,
services may not migrate to the node which just came online.
#36: Cannot initialize services
The resource group manager could not initialize services after a node
transition. This is fatal, and rgmanager exits uncleanly afterward.
#37: Error receiving message header
This occurs after an incoming request causes the resource group manager to
accept a new connection. After a new connection is received, there is a
short amount of time during which to receive the message header. If this
does not occur, the connection is dropped and the message (if any) is
rejected.
#38: Invalid magic: Wanted 0x<hex>, got 0x<hex>
This occurs after an incoming request causes the resource group manager to
accept a new connection. This could indicate a mismatched version between
resource group managers or an unauthorized program attempting to communicate
with the resource group manager. The connection is dropped and the message
is rejected.
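For illustration, the check amounts to something like the sketch below. The
header layout and magic value shown are hypothetical placeholders, not
rgmanager's real wire format, and the receive timeout mentioned in #37 is
omitted:

    /* Sketch with a hypothetical header layout and magic value; not
     * rgmanager's real wire format. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>

    #define MSG_MAGIC 0x1234abcdU      /* placeholder value */

    struct msg_header {
        uint32_t magic;                /* must equal MSG_MAGIC */
        uint32_t length;               /* payload length in bytes */
    };

    int check_header(int fd, struct msg_header *hdr)
    {
        if (read(fd, hdr, sizeof(*hdr)) != (ssize_t)sizeof(*hdr))
            return -1;                 /* no (complete) header: see #37 */

        hdr->magic  = ntohl(hdr->magic);
        hdr->length = ntohl(hdr->length);

        if (hdr->magic != MSG_MAGIC) {
            fprintf(stderr, "Invalid magic: Wanted 0x%08x, got 0x%08x\n",
                    MSG_MAGIC, hdr->magic);
            return -1;                 /* drop the connection: see #38 */
        }
        return 0;
    }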
#39: Error receiving entire request
This occurs after an incoming request causes the service manager to
accept a new connection. The amount of data received did not match the amount
of data specified in the message header. This could indicate a mismatched
version between service managers or an unauthorized program attempting
to communicate with the service manager. The connection is dropped
and the message is rejected.
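The "entire request" check is typically a loop that keeps reading until the
number of bytes given in the header has arrived, or the peer goes away
first. A minimal sketch with hypothetical names:

    /* Sketch only.  Read exactly `len` bytes from fd into buf; return 0
     * on success, -1 if the connection closed or an error occurred first. */
    #include <errno.h>
    #include <unistd.h>

    static int read_full(int fd, void *buf, size_t len)
    {
        char *p = buf;

        while (len > 0) {
            ssize_t n = read(fd, p, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;             /* read error */
            }
            if (n == 0)
                return -1;             /* peer closed too early: see #39 */
            p += n;
            len -= (size_t)n;
        }
        return 0;
    }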
#40: Error replying to action request.
A resource action request was received (enable/disable/etc.) while the
resource groups were locked and we failed to reply properly to the waiting
client connection.
#41: Couldn't obtain lock for RG <name>: <error>
While trying to report a failed service (resource group) to the other
cluster members, we failed to obtain a cluster lock. This could indicate
that the cluster quorum has dissolved, communication errors with the lock
server, or other problems. The effect of this is, however, minimal; simply
put, the #2 and #3 messages won't appear in the logs, so the last owner
of the service will be unknown.
See #33 for reasons as to why obtaining a lock might fail.
#42: Cannot stop service <name>: Invalid State <integer>
The service <name> could not be stopped. It was in an invalid state.
This could indicate a bug in rgmanager.
#43: Service <name> has failed on all applicable members; can not start.
The service has failed on every node on which it is allowed to run, so there
is nowhere left to start it. The service must be disabled and re-enabled
prior to being allowed to start. See #2 and #3 for more information.
#44: Cannot start service <name>: Invalid State <integer>
The service <name> could not be started. It was in an invalid state.
This could indicate a bug in rgmanager.
#45: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group prior
to starting it. If this occurs, the start operation will fail. The error
should be indicative of the reason.
See #33 for reasons as to why obtaining a lock might fail.
#46: Failed getting status for RG <name>
This occurs while trying to determine the state of a resource group prior
to starting it. If this occurs, the start operation will fail.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.
#47: Failed changing service status
This occurs while trying to write out a new ownership state of a resource
group prior to starting it. If this occurs, the start operation will fail.
Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.
#48: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group prior
to performing a status operation on it. If this occurs, the status
operation will fail.
See #33 for reasons as to why obtaining a lock might fail.
#49: Failed getting status for RG <name>
This occurs while trying to determine the state of a resource group prior
to performing a status operation on it. If this occurs, the status
operation will fail.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.
#50: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group prior
to performing a stop or disable operation on it. If this occurs, the stop
operation will fail.
If a stop operation fails on a service, the service is marked as 'failed',
if possible.
See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.
#51: Failed getting status for service <name>
This occurs while trying to determine the state of a resource group prior
to performing a stop or disable operation on it. If this occurs, the stop
operation will fail.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.
If a stop operation fails on a service, the service is marked as 'failed',
if possible (if the cluster is not quorate, then this is not possible).
See #2 for steps to take after a resource group has failed.
#52: Failed changing RG status
This occurs while trying to write out a new ownership state of a resource
group prior to stopping it. If this occurs, the stop operation will fail.
Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.
See #2 for steps to take after a resource group has failed.
#53: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group after
performing a stop or disable operation on it. If this occurs, the stop
operation will fail.
If a stop operation fails on a service, the service is marked as 'failed',
if possible.
See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.
#54: Failed getting status for RG <name>
This occurs while trying to determine the state of a resource group after
performing a stop or disable operation on it. If this occurs, the stop
operation will fail.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.
If a stop operation fails on a service, the service is marked as 'failed',
if possible (if the cluster is not quorate, then this is not possible).
#55: Failed changing RG status
This occurs while trying to write out a new ownership state of a resource
group after stopping it. If this occurs, the stop operation will fail.
Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.
See #2 for steps to take after a resource group has failed.
#55: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group after
a stop operation has failed, while the cluster is trying to disable
the service. If this occurs, the operation will fail.
See #33 for reasons as to why obtaining a lock might fail.
See #2 for steps to take after a resource group has failed.
#56: Failed getting status for RG <name>
This occurs while trying to determine the state of a resource group after
failing to perform a stop or disable operation on it. If this occurs,
the operation to lock the service will fail.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.
#57: Failed changing RG status
This occurs while trying to write out a new ownership state of a resource
group after marking it as failed. If this occurs, the stop operation
will fail.
Generally, this indicates an attempt to write a new view of that resource
group's state after quorum has dissolved.
See #2 for steps to take after a resource group has failed.
#58: Failed opening connection to member #<integer>
We attempted to relocate a resource group (service) to another node, but
failed to actually connect to that node's resource group manager. This
could indicate that rgmanager is not running on that node. In any case,
the next node in the cluster member list is tried.
#59: Error sending relocate request to member #<integer>
We attempted to relocate a resource group (service) to another node, but
failed to send the relocation message. This could indicate a problem with
network connectivity, extremely high local/remote load, or other problems.
The next node in the cluster member list is tried.
#60: Mangled reply from member #<integer> during RG relocate
We sent a resource group to another node, but it failed to send us a
useful reply. At this point, the state of the resource group is unknown, but
we give it the benefit of the doubt and assume it started okay.
#61: Invalid reply from member <integer> during relocate operation!
Similar to #60, but this occurs only after the initial preferred node failed
to start the service and/or failed to communicate a proper reply.
#62: /var/lib/nfs/rmtab does not exist - creating
/var/lib/nfs/rmtab did not exist. Clurmtabd creates it.
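The creation step amounts to roughly the following (a sketch, not
clurmtabd's actual code): open the file with O_CREAT so that a missing
rmtab is simply created empty.

    /* Sketch only, not clurmtabd's actual code. */
    #include <fcntl.h>
    #include <unistd.h>

    /* Create /var/lib/nfs/rmtab (empty) if it does not already exist. */
    int ensure_rmtab_exists(void)
    {
        int fd = open("/var/lib/nfs/rmtab", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        close(fd);
        return 0;
    }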
#63: Couldn't write PID!
Clurmtabd failed to write its pid to <mount_point>/.clumanager/pid. This
will cause fs.sh to kill it with -9 during a stop operation, preventing
it from synchronizing with /var/lib/nfs/rmtab prior to exiting. This
increases the risk of ESTALE (Stale NFS file handle) on clients after
a relocation.
#64: Could not validate <mount_point>
#65: NFS failover of <mount_point> will malfunction
Clurmtabd failed to initialize the mount point's private cluster rmtab
file. This will prevent updating of that mount point's rmtab file, which
means that clients will receive ESTALE after a relocation or failover.
#66: Domain '<domain>' specified for resource group <name> nonexistent!
The failover domain <domain> does not exist in the current view of the
cluster configuration. This is a configuration error.
#67: Shutting down uncleanly
The node has left the cluster cleanly, but rgmanager was still running.
All services are halted as quickly as possible to prevent data corruption.
(It may be a good idea to have rgmanager reboot if this is received)
#68: Failed to start <name>; return value: <integer>
The resource group <name> failed to start and returned the value <integer>.
This could indicate missing resources on the node or an improperly configured
resource group. Check your resource group's configuration against your
hardware and software configuration and ensure that it is correct.
#69: Unclean [stop|disable] of <name>
The resource group is being stopped because of a local node exiting or
loss of quorum. The distributed state is left unchanged.
#70: Attempting to restart resource group <name> locally.
The resource group failed to start on all other applicable nodes during
processing of a relocate operation. (A relocate operation occurs either
by an administrator manually relocating a service or the service being
relocated after a fail-to-restart event.)
#71: Relocating failed service <name>
The resource group <name> failed a status check and subsequently failed to
restart. At this point, we try to send it to another applicable node in
the cluster.
#72: clunfsops: NFS syscall <name> failed: <error>.
#73: clunfsops: Kernel may not have NFS failover enhancements.
Required NFS failover enhancements were not present in the host kernel.
It is impossible to restart or relocate NFS services without these, but
NFS services should still work properly in true failover situations (i.e.
the node on which the NFS service was running has failed and been
fenced by the cluster).
#74: Unable to obtain cluster lock: <error>
This occurs while trying to determine the state of a resource group after
an attempt to start it has completed (at the script level). If this occurs,
the start operation will fail.
See #33 for reasons as to why obtaining a lock might fail.
#75: Failed getting status for RG <name>
This occurs while trying to determine the state of a resource group after
an attempt to start it has completed. If this occurs, the start operation
will report a failure.
Generally, this indicates an attempt to retrieve the current view of that
resource group's state after quorum has dissolved.