These are items that may or may not be problems, and need to be investigated. Anything that turns out to be a real problem should be moved to its own task.
Some of these may refer to code that has since changed significantly. Just delete them if you can't figure out what they apply to. Similarly, delete anything that can't be reproduced with current code.
## Serious
* `peer_update_callback()` "Ignoring peer status change because stopping": a bad idea if the local node is DC
* in a long-ago test, `admin_epoch` was bumped on all but one node (which was not in the cluster at the time), but that node's CIB was used anyway when it joined (try to reproduce)
* A long-ago "endian mismatch" thread on the users list said that Pacemaker Remote got flaky around 40 remote nodes; others have successfully run more nodes than that, so it may depend on some aspect of the environment; in any case, it would be nice to know what reasonable limits are and what happens as various capacities approach their limits
* `peer_update_callback()` only updates node_state on DC; what happens if there is no DC?
* What happens to `lrm_state->resource_history` when a remote connection moves to a new node? Would the new host be unable to (for example) stop an orphan resource on the remote?
## Annoying
* cts-lab is still getting the occasional "Could not get executable for PID" (maybe just lower the ENOENT case to info level)
* Pending fencing actions are shown in status as "last pending"
## Minor or unconfirmed
* a closed RHBZ, but try to reproduce with current code: create a guest node, assign 2 resources to it, then live-migrate the guest node to another cluster node (the user saw an "Error in the push function" message on 7.9)
* see if can reproduce issue in closed [[https://bugzilla.redhat.com/show_bug.cgi?id=1447916 | RHBZ#1447916]] with current code (see bz notes)
* `pcs cluster node add` reloads corosync, but Pacemaker doesn't pick up the change
* test whether an RA with a comment before `<resource-agent>` is parsed correctly
* Pacemaker Remote
* in AirFrance's RHBZ#1322822 and reproducers, fencing a remote node's connection host leads to the remote node's resources being scheduled for a restart; their state is then inferred as Stopped without an actual stop action, and even if the re-probe finds them running, a start action is sent. The resources should probably be left as-is until the re-probe.
* does failed fencing of a remote node prevent resource recovery elsewhere? Investigation of a QE issue for the dynamic recheck interval bz suggested no (`->remote_was_fenced` should prevent the recheck interval from kicking in for recurring monitor failures, but it doesn't affect a failed start in this case)
* trying to set a node attribute for a remote node, when run on the remote node without `-N`, results in no attribute being set (`-l reboot`) or the attribute being set on the uname (`-l forever`) (may no longer apply)
* libcrmcommon
* Determine whether it's worthwhile to enable `QB_LOG_CONF_USE_JOURNAL` (automatically or optionally), which makes libqb use `sd_journal_send()` with metadata rather than `syslog()` (mainly, determine what metadata is logged, and whether it's useful or could be made useful)
* `acl.c:pcmk__check_acl()` changes create requests to write requests if the requested name is an attribute; does this mean an attribute and a child element cannot have the same name?
* `pcmk__update_acl_user()` doxygen says the return value is the actually used username, but it's always `requested_user` -- should it be `user` instead? The return value should always be checked for NULL (assert?) (`do_lrm_invoke()` and `handle_lrm_delete()` are the only callers that use the return value)
* libcrmservice
* `create_override_dir()` creates the systemd override file in `<agent>.service.d` -- what about sockets, and agents explicitly specified with ".service"?
* `services_os_action_execute()`: a `sigchld_setup()` failure should probably call `services_handle_exec_error()`
* controller
* Should we set a timeout on the controller's scheduler connection attempt (`controld_schedulerd.c:do_pe_invoke()`)?
* [[https://lists.clusterlabs.org/pipermail/users/2021-February/028397.html | "Peer (slave) node deleting master's transient_attributes"]] (2021-02-01 users list post): possible race condition in DC election
* from the users list: the `peer_update_callback()` case with `down == NULL` and `appeared == FALSE` should probably handle the case where `expected == down`, the node is DC, but the scheduler hasn't been run, so there is no down event to be found (i.e. a new DC is elected between when the old DC leaves the CPG and the membership change); however, will the scheduler know not to schedule fencing? Why is the node state not updated after a change in the peer's expected join state?
* Should `controld_record_action_timeout()` add reload/secure digests?
* Should clone post-notifications still go out when a transition is aborted, or will/can they be rescheduled?
* `action_timer_callback()` assumes the local node is still DC (`transition_graph` is not NULL); is that guaranteed?
* `peer_update_callback()`: Should "Sending hello" instead be done with "Fence history will be synchronized" to avoid a flood of broadcasts when multiple nodes join (such as at startup)?
* does `search_conflicting_node_callback()` leak `user_data`? And is it redundant with `remove_conflicting_peer()`?
* commit 7d6cdb8 may have made `abort_transition(..., "pacemaker_remote node integrated", ...)` unnecessary for guest nodes, and something similar may be possible for remote nodes
* executor
* in a RHEL 9 test, sending SIGSTOP to `execd` hangs the cluster forever (the controller doesn't get a timeout until `execd` receives SIGCONT)
* The executor serializes operations in some fashion, but libcrmservice was long ago updated to do that itself; investigate where the executor does this and whether it can be removed
* fencer
* The fencer currently requires devices to be API-registered (not just CIB-registered) for monitors; investigate whether we can drop that requirement (it may have been needed only for functionality that has since been dropped)
* Should all nodes (not just remote nodes) use the unfencing attributes? Otherwise, when an unfencing device is changed, only the cluster node running it gets re-unfenced (see `check_action_definition()`)
* attribute manager
* If a node leaves and rejoins while there is no writer, attributes won't be synced until a writer is elected; investigate whether this is acceptable for protocol detection via `#attrd-protocol`; if not, maybe every node could send a targeted `#attrd-protocol` peer update to any joining peer (from a rejoining node's perspective, all the other nodes left and came back)
* CIB manager
* `do_local_notify()`: `pcmk__ipc_send_xml()` can return EAGAIN (likely from `qb_ipcs_event_sendv()`); should we retry in one of those places, or does libqb do that automatically (in which case, just log it differently)?
* scheduler
* why doesn't the scheduler info-log " capacity:" for each node?
* do standby nodes affect utilization at all?
* what happens if an invalid resource (i.e. one that fails unpacking) is referenced elsewhere in the configuration (constraints, `id-ref`s, etc.)?
* test what happens if you unmanage a remote resource and then stop the cluster on its connection host (we might at least want to exclude remote resources from the `->active()` managed check)
* old report: setting `enabled=false` on a recurring monitor while the resource is unmanaged or in maintenance mode does not disable the monitor until normal mode is resumed, and setting it back to `true` (unclear whether this was while unmanaged or after) did not enable it until the resource was restarted
* what happens if actions need to be done sequentially, a later action has a short failure timeout, and the transition is aborted after that failure timeout expires but before the action is initiated -- do we know to schedule the action again? (hopefully this is the "Re-initiated expired calculated failure" code)
* `native_add_running()` should probably call `ban_from_all_nodes()` for `multiply_active_stop`
* cibadmin
* figure out why resource history digests can't be updated with cibadmin
* when running a cibadmin command to test alerts, comments were moved to the end of the section
* crm_resource
* `crm_resource_runtime.c:find_matching_attr_resource()`: the "Could be a cloned group" comment is wrong, but it could be a cloned primitive; checking the first child makes sense for a cloned primitive but not for a group (a group shouldn't check its children, just set the attribute on the group itself); also, attribute update and delete handle utilization attributes inconsistently (should update handle them like instance attributes?)
* `crm_resource -R` gets a response from each node but appears to expect only one (trace the IPC)