These are items that may or may not be problems, and need to be investigated. Anything that turns out to be a real problem should be moved to its own task.
Some of these may refer to code that has since changed significantly. Just delete them if you can't figure out what they apply to. Similarly, delete anything that can't be reproduced with current code.
Serious
- peer_update_callback() "Ignoring peer status change because stopping": a bad idea if the local node is DC
- in a long-ago test, admin_epoch was bumped on all but one node (which was not in the cluster at the time), but that node's CIB was used anyway when it joined (try to reproduce)
- A long-ago "endian mismatch" thread on the user list said that Pacemaker Remote got flaky around 40 remote nodes; others have successfully had more nodes than that, so it may be dependent on some aspect of the environment; in any case, it would be nice to know what reasonable limits are and what happens when various capacities approach their limits
- peer_update_callback() only updates node_state on DC; what happens if there is no DC?
- What happens to lrm_state->resource_history when a remote connection moves to a new node? Would the new host be unable to (for example) stop an orphan resource on the remote?
Annoying
- cts-lab still getting occasional "Could not get executable for PID" (maybe just lower ENOENT to info)
- Pending fencing actions are shown in status as "last pending"
Minor or unconfirmed
- a closed RHBZ, but try to reproduce with current code: create a guest node, assign 2 resources to it, then live-migrate the guest node to another cluster node (user saw an "Error in the push function" message on 7.9)
- see if can reproduce issue in closed RHBZ#1447916 with current code (see bz notes)
- pcs cluster node add reloads corosync, but pacemaker doesn't pick up the change
- test whether a resource agent whose metadata has a comment before <resource-agent> is parsed correctly
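For the comment-before-root test above, a minimal metadata file might look like the following (Dummy is just a stand-in agent name, and the content is a sketch, not a complete OCF metadata example):

```xml
<?xml version="1.0"?>
<!-- a license header or other comment before the document element:
     does the metadata parser handle this correctly? -->
<resource-agent name="Dummy" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">Minimal agent metadata for the parsing test</longdesc>
  <shortdesc lang="en">Test agent</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
  </actions>
</resource-agent>
```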
- Pacemaker Remote
  - in AirFrance's RHBZ#1322822 and its reproducers, fencing a remote node's connection host leads to the remote node's resources being scheduled for a Restart; their state is then inferred as Stopped without an actual stop action, and even if the re-probe finds them running, a start action is sent. The resources should probably be left as-is until the re-probe.
  - does failed fencing of a remote node prevent resource recovery elsewhere? QE investigation of the dynamic recheck interval bz suggested no (->remote_was_fenced should prevent the recheck interval from kicking in for recurring monitor failures, but it doesn't affect a failed start in this case)
  - trying to set a node attribute for a remote node results in no attribute being set (-l reboot) or the attribute being set on the uname (-l forever) when run on the remote node without -N (may not apply anymore)
- libcrmcommon
  - Determine whether it's worthwhile to enable QB_LOG_CONF_USE_JOURNAL (automatically or optionally), which makes libqb use sd_journal_send() with metadata rather than syslog() (mainly, determine what metadata is logged, and whether it is or could be made useful)
  - acl.c:pcmk__check_acl() changes create requests to write requests if the requested name is an attribute; does this mean an attribute and a child element cannot have the same name?
  - pcmk__update_acl_user()'s doxygen says the return value is the actually used username, but it's always requested_user -- should it be user instead? The return value should always be checked for NULL (assert?) (do_lrm_invoke() and handle_lrm_delete() are the only callers that use the return value)
  - glib documents that fork() is not valid while using GMainContext; we use GMainContext when draining mainloops, so investigate whether we might fork() during that time
- libcrmservice
  - create_override_dir() creates the systemd override file in <agent>.service.d -- what about sockets, and agents explicitly specified with ".service"?
  - services_os_action_execute(): a sigchld_setup() failure should probably call services_handle_exec_error()
- controller
  - Should we set a timeout on the controller's scheduler connection attempt (controld_schedulerd.c:do_pe_invoke())?
  - "Peer (slave) node deleting master's transient_attributes" (2021-02-01 users list post): possible race condition in DC election
  - from the users list: peer_update_callback() with down==NULL and appeared==FALSE should probably handle the case where expected==down and the node is DC, but the scheduler hasn't been run, so there is no down action to be found (i.e. a new DC is elected between when the old DC leaves the CPG and the membership change); however, will the scheduler know not to schedule fencing? Why is the node state not updated after a change in the peer's expected join state?
  - Should controld_record_action_timeout() add reload/secure digests?
  - Should clone post-notifications still go out when a transition is aborted, or will/can they be rescheduled?
  - action_timer_callback() assumes the local node is still DC (transition_graph is not NULL); is that guaranteed?
  - peer_update_callback(): Should "Sending hello" instead be done with "Fence history will be synchronized", to avoid a bunch of broadcasts when multiple nodes join (such as at startup)?
  - does search_conflicting_node_callback() leak user_data? And is it redundant with remove_conflicting_peer()?
  - 7d6cdb8 may have made abort_transition(..., "pacemaker_remote node integrated", ...) unnecessary for guest nodes, and something similar may be possible for remote nodes
  - maybe remember that we haven't seen a node since it was last fenced, and the first time it rejoins, erase #node-unfenced iff its re-synced history says it's not running any already-recovered resources (so fabric-fenced nodes can automatically be unfenced after rejoining when it is safe to do so)
  - in a test, killall -9 pacemaker-controld pacemaker-fenced caused the DC to schedule fencing of the node, but the node respawned and started the join process, which led to an "Input I_PE_SUCCESS received in state S_INTEGRATION from handle_response" warning being logged, and the new transition did not schedule fencing
  - when a node leaves, the peer update callback on the DC will call check_join_state(), which will set I_NODE_JOIN, which will log "An unknown node joined" and re-offer membership to all nodes; maybe replace the log with "Membership changed", or maybe we don't need to re-offer
  - controld_trigger_delete_refresh() should be done after a successful delete is confirmed (in the CIB callback) rather than after the delete request is sent
  - maybe replace crm_resource --wait functionality with a new controller request that would be forwarded to the DC when the DC's version is different from the local node's, to avoid issues in mixed-version clusters (currently a warning in pcs); however, consider possible trouble scenarios such as DC absence/failure/re-election and network issues
  - do_lrm_delete() should probably set unregister to FALSE for user cleanups of stonith devices, as it does for remote connections, to avoid delete_resource() unregistering the resource; otherwise cleanup is equivalent to stop
  - possible bug in controld_execd.c:cancel_op(): the final if block (after lrm_state_cancel()) returns TRUE (probably incorrectly, since the cancellation is done, not in progress), which makes most callers return FALSE, which makes do_lrm_rsc_op() not count it in a log message
  - possible bug in controld_membership.c:populate_cib_nodes(): fsa_cib_update() leaves call_id untouched (= 0) on error, but fsa_register_cib_callback() treats 0 as a valid call ID; maybe treat 0 as an error, or initialize call_id to a negative value; probably not a big deal, because fsa_cib_conn will be NULL only before do_startup() and after crmd_exit()
  - investigate clearing stonith fail counts (st_fail_rec) when any fence device is added or changes configuration; see do_cib_updated(); when any resource with class stonith is added, removed, or changed, maybe clear all, and if a fencing topology for a node changes, clear that node (compare RHBZ#1430112); also see CLPR#836 comments (maybe clear when a node rejoins after a fence is expected)
- executor
  - in a RHEL 9 test, sending SIGSTOP to the executor hung the cluster indefinitely (the controller didn't get a timeout until the executor was sent SIGCONT)
  - The executor serializes operations in some fashion, but libcrmservice was long ago updated to do that, so investigate where the executor does it and whether we can remove it
- fencer
  - The fencer currently requires devices to be API-registered (not just CIB-registered) for monitors; investigate whether we can drop that requirement (it may have been needed only for functionality that was dropped)
  - Should all nodes (not just remote nodes) use the unfencing attributes? Otherwise, when an unfencing device is changed, only the cluster node running it gets re-unfenced (see check_action_definition())
  - If a node chosen to execute fencing times out but eventually replies, an "Already sent notification" error is logged because a notification was already sent for the timeout; that log should be downgraded and/or worded differently
  - stonith_device_execute() appears not to know about the remaining timeout for complex requests, and so schedules each action with its full usual timeout
  - in a remapped reboot, any automatic unfencing will be left for usual execution while other devices perform the unfencing (as is appropriate), but when the automatic unfencing does happen, all devices will unfence rather than just the automatic ones; should we remember that the non-automatic ones were already done?
- attribute manager
  - If a node leaves and rejoins when there's no writer, attributes won't be sync'd until a writer is elected; investigate whether this is acceptable for protocol detection purposes using #attrd-protocol; if not, maybe every node could send a targeted #attrd-protocol peer update to any joining peer (from a rejoining node's perspective, all the other nodes left and came back)
  - attrd_cpg_destroy() and attrd_cib_destroy_cb() set attrd_exit_status, but attrd_shutdown() will exit OK if there is no main loop; this shouldn't be a problem because the main loop should always exist at that point, but it would be more future-proof to make attrd_shutdown() use attrd_exit_status
- CIB manager
  - do_local_notify(): pcmk__ipc_send_xml() can return EAGAIN (likely from qb_ipcs_event_sendv()); should we try again in one of those places, or does libqb do that automatically (in which case just log differently)?
- scheduler
  - why doesn't the scheduler info-log " capacity:" for each node?
  - do standby nodes affect utilization at all?
  - what happens if an invalid resource (i.e. one that fails unpacking) is referenced in the rest of the configuration (constraints, id-refs, etc.)?
  - test what happens if you unmanage a remote resource and then stop the cluster on its connection host (we might at least want to exclude remote resources from the ->active() managed check)
  - old report: setting enabled=false on a recurring monitor while the resource is unmanaged or in maintenance mode does not disable the monitor until normal mode is resumed, and setting it back to true (not sure whether this was while unmanaged or after) did not enable it until the resource was restarted
  - what happens if actions need to be done sequentially, a later action has a short failure timeout, and the transition is aborted after its failure timeout but before it is initiated -- do we know to schedule the action again? (hopefully this is the "Re-initiated expired calculated failure" code)
  - native_add_running() should probably call ban_from_all_nodes() for multiply_active_stop
  - figure out what the scheduler does for a primitive converted to a group or clone of the same name, and add a regression test if none exists already
  - op configs accept migrate as a synonym for migrate_to or migrate_from (see pcmk__find_action_config()); this should be either documented or deprecated
  - The order-mandatory regression test is substantially identical to order-required, except that order-mandatory has a resource history entry for starting rsc4 with rc=0 (and the wrong ID) whereas order-required has a probe; determine whether these are redundant, and if so, drop order-mandatory
  - Determine the intent of the coloc-negative-group test ("negative colocation with a group"); it has a group anti-colocated with a primitive while the second group member is unmanaged and orphaned monitors need to be stopped, but it's unclear what exactly is being tested
  - Does add_collective_constraints() (which uses pcmk__is_everywhere()) wrongly assume all clones are anonymous?
  - Is it a good thing that best_node_score_matching_attr() ignores -INFINITY scores?
  - In set_instance_priority(), by the "Add relevant location constraint scores for promoted role" comment, would it miss location constraints configured explicitly for a particular member of a cloned group?
  - Should resource-discovery-enabled be ignored in unpack_handle_remote_attrs() if the guest's host is unclean and fencing is disabled?
  - The bundle-order-stop-clone regression test has an ordering for storage-clone then galera-bundle, and storage-clone is stopping on metal-1, but galera-bundle-master has start and running pseudo-actions (in addition to a stopped pseudo-action) even though its bundled resource and container are both stopping; figure out whether that's a bug
- cibadmin
  - figure out why resource history digests can't be updated with cibadmin
  - when running a cibadmin command to test alerts, comments were moved to the end of the section
- crm_resource
  - crm_resource_runtime.c:find_matching_attr_resource(): the "Could be a cloned group" comment is wrong, but it could be a cloned primitive; checking the first child makes sense for a cloned primitive but not for a group (a group shouldn't check its children, just set on the group); also, update/delete attribute handles utilization attributes inconsistently (should they be handled like instance attributes in update?)
  - crm_resource -R gets a response from each node but appears to expect only one (trace the IPC)