Change Details

These are items that may or may not be problems, and need to be investigated. Anything that turns out to be a real problem should be moved to its own task. Some of these may refer to code that has since changed significantly. Just delete them if you can't figure out what they apply to. Similarly, delete anything that can't be reproduced with current code.

## Serious

* in a RHEL 9 test, SIGSTOP to execd hangs forever (controller doesn't get a timeout until SIGCONT execd)
* peer_update_callback() "Ignoring peer status change because stopping": bad idea if DC
* in a long-ago test, bumped admin_epoch on all but one node (not in cluster at the time), but the one node's CIB became used anyway when it joined (try to reproduce)
* A long-ago "endian mismatch" thread on the users list said that Pacemaker Remote got flaky around 40 remote nodes; others have successfully had more nodes than that, so it may be dependent on some aspect of the environment; in any case, it would be nice to know what reasonable limits are and what happens when various capacities approach their limits
* `peer_update_callback()` only updates node_state on DC; what happens if there is no DC?
* What happens to `lrm_state->resource_history` when a remote connection moves to a new node? Would the new host be unable to (for example) stop an orphan resource on the remote?

## Annoying

* cts-lab still getting occasional "Could not get executable for PID" (maybe just lower ENOENT to info)

## Minor or unconfirmed

* [[https://lists.clusterlabs.org/pipermail/users/2021-February/028397.html | "Peer (slave) node deleting master's transient_attributes"]] (2021-02-01 users list post): possible race condition in DC election
* do standby nodes affect utilization at all?
* does start-delay affect unrelated resources?
  * a users list post reported that if there are two resources with no relationship between them, one having a start-delay on its monitor, then if the delayed resource starts first, the second can't start until the first resource's delay expires
* old report: setting enabled=false on a recurring monitor while the resource is unmanaged or in maintenance mode does not disable the monitor until normal mode is resumed, and setting it back to true (not sure if this was while unmanaged or after) did not enable it until the resource was restarted
* ACLs: is pre-2.0 schema compat code needed? (only scheduler upgrades schema, tools need to be able to work during rolling upgrade; also consider newer remote nodes)
* why doesn't scheduler info-log " capacity:" for each node?
* figure out why resource history digests can't be updated with cibadmin
* what happens if an invalid resource (i.e. one that fails unpacking) is referenced in the rest of the configuration (constraints, id-ref's, etc.)?
* test what happens if you unmanage a remote resource then stop the cluster on the connection host (might at least want to exclude remote resources from ->active() managed check)
* from users list: peer_update_callback() down==NULL appeared==FALSE should probably handle the case where expected==down, node is DC, but scheduler hasn't been run so there is no down to be found (i.e. new DC is elected between when old DC leaves CPG and membership); however, will scheduler know not to schedule fencing? why is node state not updated after change in peer's expected join state?
* does failed fencing of a remote node prevent resource recovery elsewhere?
  * dynamic recheck interval bz QE issue investigation suggested no (->remote_was_fenced should prevent recheck interval from kicking in for recurring monitor failures, but it doesn't affect failed start in this case)
* closed rhbz, but try to reproduce with current code: create a guest node, assign 2 resources to it, then live-migrate the guest node to another cluster node (user saw "Error in the push function" message on 7.9)
* see if the issue in closed [[https://bugzilla.redhat.com/show_bug.cgi?id=1447916 | RHBZ#1447916]] can be reproduced with current code (see bz notes)
* what happens if: actions need to be done sequentially, later action has short failure timeout, transition is aborted after its failure timeout but before it is initiated -- do we know to schedule the action again? (hopefully this is the "Re-initiated expired calculated failure" code)
* "pcs cluster node add" reloads corosync, but pacemaker doesn't pick up the change
* when running a cibadmin command for testing alerts, comments were moved to the end of the section
* 7d6cdb8 may have made abort_transition(..., "pacemaker_remote node integrated", ...) unnecessary for guest nodes, and similar may be possible for remote nodes
* trying to set a node attribute for a remote node results in no attribute being set (-l reboot) or the attribute being set on the uname (-l forever) when run on the remote node without -N (may not apply anymore)
* create_override_dir() creates systemd override file in <agent>.service.d -- what about sockets, and agents explicitly specified with ".service"?
* test whether RA with comment before <resource-agent> is parsed correctly
* services_os_action_execute(): sigchld_setup() failure should probably call services_handle_exec_error
* crm_resource_runtime.c:find_matching_attr_resource(): "Could be a cloned group" comment is wrong, but it could be a cloned primitive; checking first child makes sense for cloned primitive, but not group (group shouldn't check children, just set on group); also, update/delete attribute handles utilization attributes inconsistently (should be handled like instance attributes in update?)
* Determine whether it's worthwhile to enable QB_LOG_CONF_USE_JOURNAL (automatically or optionally), which makes libqb use `sd_journal_send()` with metadata rather than `syslog()` (mainly, determine what metadata is logged, and whether it's useful or could be made useful)
* Should we set a timeout on the controller's scheduler connection attempt (`controld_schedulerd.c:do_pe_invoke()`)?
* The executor serializes operations in some fashion, but libcrmservice was long ago updated to do that, so investigate where the executor does that and whether we can remove it
* The fencer currently requires devices be API-registered (not just CIB-registered) for monitors; investigate whether we can drop that requirement (it may have only been needed for functionality that was dropped)
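
For the item about a comment preceding <resource-agent>, a test fixture could be built along the following lines. This is only a sketch: the agent name `dummy-with-comment`, its single parameter, and the helpers `root_tag()` and `action_names()` are made up for illustration. The check uses Python's standard-library XML parser, not the libxml2-based code Pacemaker itself uses, so it confirms only that the fixture is well-formed XML whose root element is `resource-agent`. Whether Pacemaker's own metadata handling accepts it still has to be tested against the real tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCF-style metadata with a comment ahead of the root element.
# Names and values are illustrative only, not taken from a real agent.
METADATA = """<?xml version="1.0"?>
<!-- comment placed before the root element on purpose -->
<resource-agent name="dummy-with-comment">
  <version>1.0</version>
  <parameters>
    <parameter name="state">
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
  </actions>
</resource-agent>
"""


def root_tag(text):
    """Return the tag of the document's root element."""
    return ET.fromstring(text).tag


def action_names(text):
    """Return the action names in document order."""
    return [action.get("name") for action in ET.fromstring(text).iter("action")]


if __name__ == "__main__":
    # A leading comment is legal XML, so the root should still be found.
    print(root_tag(METADATA))
    print(action_names(METADATA))
```

The same string can be written out as the metadata output of a throwaway test agent, so the behavior of the actual tools can be compared against this baseline.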