These are items that may or may not be problems, and need to be investigated. Anything that turns out to be a real problem should be moved to its own task.
Some of these may refer to code that has since changed significantly. Just delete them if you can't figure out what they apply to. Similarly, delete anything that can't be reproduced with current code.
## Serious
* in a RHEL 9 test, SIGSTOP to pacemaker-execd hangs the cluster forever (the controller doesn't get a timeout until execd is sent SIGCONT)
* peer_update_callback() "Ignoring peer status change because stopping": bad idea if DC
* in a long-ago test, admin_epoch was bumped on all but one node (which was not in the cluster at the time), but that node's CIB was used anyway when it joined (try to reproduce)
* A long-ago "endian mismatch" thread on the user list said that Pacemaker Remote got flaky around 40 remote nodes; others have successfully had more nodes than that, so it may be dependent on some aspect of the environment; in any case, it would be nice to know what reasonable limits are and what happens when various capacities approach their limits
* `peer_update_callback()` only updates node_state on DC; what happens if there is no DC?
* What happens to `lrm_state->resource_history` when a remote connection moves to a new node? Would the new host be unable to (for example) stop an orphan resource on the remote?
## Annoying
* cts-lab still gets occasional "Could not get executable for PID" messages (maybe just lower the ENOENT case to info level)
* Pending fencing actions are shown in status as "last pending"
## Minor or unconfirmed
* closed RHBZ, but try to reproduce with current code: create a guest node, assign 2 resources to it, then live-migrate the guest node to another cluster node (the user saw an "Error in the push function" message on 7.9)
* see whether the issue in closed [[https://bugzilla.redhat.com/show_bug.cgi?id=1447916 | RHBZ#1447916]] can be reproduced with current code (see bz notes)
* does failed fencing of a remote node prevent resource recovery elsewhere? investigation of a QE issue with the dynamic recheck interval bz suggested no (->remote_was_fenced should prevent the recheck interval from kicking in for recurring monitor failures, but it doesn't affect a failed start in this case)
* does start-delay affect unrelated resources? a users list post reported that if there are two resources with no relationship between them, one having a start-delay on its monitor, then if the delayed resource starts first, the second can't start until the first resource's delay expires
* `pcs cluster node add` reloads corosync, but Pacemaker doesn't pick up the change
* trying to set a node attribute for a remote node results in no attribute being set (-l reboot) or the attribute being set on the uname (-l forever) when run on the remote node without -N (may not apply anymore)
* test whether a resource agent with a comment before <resource-agent> is parsed correctly (a standalone libxml2 probe is sketched after this list)
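A minimal standalone probe for the resource-agent comment question above, assuming only that metadata parsing ultimately goes through libxml2; this is not Pacemaker code, just a quick way to confirm what the parser does with a comment before the root element:

```c
/*
 * Standalone probe: does a comment before <resource-agent> confuse parsing?
 * Assumption: Pacemaker's metadata handling ultimately goes through libxml2,
 * so xmlDocGetRootElement() should still return the <resource-agent> element.
 * Build: gcc ra-comment-probe.c $(pkg-config --cflags --libs libxml-2.0)
 */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    const char *xml =
        "<?xml version=\"1.0\"?>\n"
        "<!-- vendor boilerplate before the root element -->\n"
        "<resource-agent name=\"Dummy\" version=\"1.0\">\n"
        "  <version>1.0</version>\n"
        "</resource-agent>\n";

    xmlDocPtr doc = xmlReadMemory(xml, (int) strlen(xml), "meta.xml", NULL, 0);

    if (doc == NULL) {
        fprintf(stderr, "parse failed\n");
        return 1;
    }

    /* Comments in the prolog are document-level siblings, not the root, so
     * this should print "resource-agent" if parsing behaves as hoped.
     */
    xmlNodePtr root = xmlDocGetRootElement(doc);

    printf("root element: %s\n",
           (root != NULL)? (const char *) root->name : "(none)");
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
```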
* libcrmcommon:
** Determine whether it's worthwhile to enable QB_LOG_CONF_USE_JOURNAL (automatically or optionally), which makes libqb use `sd_journal_send()` with metadata rather than `syslog()` (mainly, determine what metadata is logged, and whether it's useful or could be made useful); see the sketch after this group
** acl.c:pcmk__check_acl() changes create requests to write requests if the requested name is an attribute; does this mean an attribute and a child element cannot have the same name?
** pcmk__update_acl_user() doxygen says the return value is the actually used username, but it's always requested_user -- should it be user instead? the return value should always be checked for NULL (assert?) (do_lrm_invoke() and handle_lrm_delete() are the only callers that use the return value)
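A throwaway sketch of what turning on journal output could look like, using only public libqb calls (`qb_log_init()`, `qb_log_ctl()`, `qb_log()`); this is not how Pacemaker configures logging today, just a way to see what metadata ends up in the journal:

```c
/*
 * Throwaway sketch only: enable libqb's journal output and emit one message,
 * so the resulting journal metadata can be inspected with journalctl.
 * Assumption: libqb was built with systemd journal support; "journal-probe"
 * is an arbitrary name, and this is not how Pacemaker sets up logging today.
 * Build: gcc journal-probe.c $(pkg-config --cflags --libs libqb)
 */
#include <syslog.h>
#include <qb/qbdefs.h>
#include <qb/qblog.h>

int main(void)
{
    qb_log_init("journal-probe", LOG_DAEMON, LOG_INFO);

    /* Route the syslog target through sd_journal_send() instead of syslog() */
    qb_log_ctl(QB_LOG_SYSLOG, QB_LOG_CONF_USE_JOURNAL, QB_TRUE);
    qb_log_ctl(QB_LOG_SYSLOG, QB_LOG_CONF_ENABLED, QB_TRUE);

    qb_log(LOG_INFO, "journal metadata probe");

    qb_log_fini();
    return 0;
}
```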
* libcrmservice:
** create_override_dir() creates the systemd override file in <agent>.service.d -- what about sockets, and agents explicitly specified with ".service"? (the naming question is sketched after this group)
** services_os_action_execute(): a sigchld_setup() failure should probably call services_handle_exec_error()
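To make the naming question concrete, here is a hypothetical helper (not the real create_override_dir(); the base directory is assumed for illustration) showing how bare names, explicit ".service" names, and ".socket" units could map to different override directories:

```c
/*
 * Hypothetical helper, not the real create_override_dir(): it only shows how
 * the three cases (bare name, explicit ".service", ".socket") could map to
 * override directories.  The base directory is assumed for illustration.
 */
#include <stdio.h>
#include <string.h>

static void override_dir(const char *agent, char *buf, size_t len)
{
    const char *dot = strrchr(agent, '.');

    if ((dot != NULL)
        && ((strcmp(dot, ".service") == 0) || (strcmp(dot, ".socket") == 0))) {
        /* Unit type given explicitly: reuse it rather than appending ".service" */
        snprintf(buf, len, "/run/systemd/system/%s.d", agent);
    } else {
        /* Bare name: assume a service unit */
        snprintf(buf, len, "/run/systemd/system/%s.service.d", agent);
    }
}

int main(void)
{
    const char *agents[] = { "httpd", "httpd.service", "sshd.socket" };
    char path[256];

    for (size_t i = 0; i < sizeof(agents) / sizeof(agents[0]); i++) {
        override_dir(agents[i], path, sizeof(path));
        printf("%-15s -> %s\n", agents[i], path);
    }
    return 0;
}
```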
* controller:
** Should we set a timeout on the controller's scheduler connection attempt (`controld_schedulerd.c:do_pe_invoke()`)?
** [[https://lists.clusterlabs.org/pipermail/users/2021-February/028397.html | "Peer (slave) node deleting master's transient_attributes"]] (2021-02-01 users list post): possible race condition in DC election
** from the users list: peer_update_callback() with down==NULL and appeared==FALSE should probably handle the case where expected==down, the node is DC, but the scheduler hasn't been run, so there is no down to be found (i.e. a new DC is elected between when the old DC leaves the CPG and membership); however, how will the scheduler know not to schedule fencing? why is node state not updated after a change in the peer's expected join state?
** Should controld_record_action_timeout() add reload/secure digests?
** Should clone post-notifications still go out when a transition is aborted, or will/can they be rescheduled?
** action_timer_callback() assumes the local node is still DC (transition_graph is not NULL); is that guaranteed?
** peer_update_callback(): Should "Sending hello" instead be done with "Fence history will be synchronized" to avoid a bunch of broadcasts when multiple nodes join (such as at startup)?
** does search_conflicting_node_callback() leak user_data? and is it redundant with remove_conflicting_peer()?
** 7d6cdb8 may have made abort_transition(..., "pacemaker_remote node integrated", ...) unnecessary for guest nodes, and similar may be possible for remote nodes
* executor:
** The executor serializes operations in some fashion, but libcrmservice was long ago updated to do that, so investigate where the executor does it and whether we can remove it
* fencer:
** The fencer currently requires devices to be API-registered (not just CIB-registered) for monitors; investigate whether we can drop that requirement (it may have been needed only for functionality that has since been dropped)
* attribute manager:
** If a node leaves and rejoins when there's no writer, attributes won't be sync'd until a writer is elected; investigate whether this is acceptable for protocol detection purposes using `#attrd-protocol`; if not, maybe every node could send a targeted `#attrd-protocol` peer update to any joining peer (from a rejoining node's perspective, all the other nodes left and came back)
* CIB manager:
** do_local_notify(): pcmk__ipc_send_xml() can return EAGAIN (likely from qb_ipcs_event_sendv()); should we retry there, or does libqb do that automatically (in which case just log differently)? (one retry option is sketched after this group)
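If retrying in Pacemaker turns out to be appropriate, a bounded retry is one option; this is a generic sketch with made-up names (send_fn stands in for whatever eventually calls qb_ipcs_event_sendv()), not the CIB manager's actual code:

```c
/*
 * Generic sketch of one option (bounded retry on EAGAIN).  All names here are
 * made up for illustration; send_fn stands in for whatever eventually calls
 * qb_ipcs_event_sendv().  This is not the CIB manager's actual code.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

typedef int (*notify_send_fn)(void *client, void *msg);

static int send_with_retry(notify_send_fn send_fn, void *client, void *msg)
{
    const int max_tries = 5;

    for (int tries = 1; tries <= max_tries; tries++) {
        int rc = send_fn(client, msg);

        if ((rc != EAGAIN) && (rc != -EAGAIN)) {
            return rc;              /* success, or a hard error */
        }
        /* EAGAIN typically means the client's event queue is momentarily
         * full, so back off briefly and try again.
         */
        usleep(1000 * tries);
    }
    fprintf(stderr, "notification dropped after %d EAGAIN retries\n", max_tries);
    return -EAGAIN;
}

/* Fake sender that fails twice before succeeding, to exercise the wrapper */
static int attempts = 0;

static int fake_send(void *client, void *msg)
{
    (void) client;
    (void) msg;
    return (++attempts < 3)? -EAGAIN : 0;
}

int main(void)
{
    int rc = send_with_retry(fake_send, NULL, NULL);

    printf("send_with_retry() returned %d after %d attempts\n", rc, attempts);
    return (rc == 0)? 0 : 1;
}
```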
* scheduler:
** why doesn't the scheduler info-log " capacity:" for each node?
** do standby nodes affect utilization at all?
** what happens if an invalid resource (i.e. one that fails unpacking) is referenced in the rest of the configuration (constraints, id-refs, etc.)?
** test what happens if you unmanage a remote resource and then stop the cluster on the connection host (we might at least want to exclude remote resources from the ->active() managed check)
** old report: setting enabled=false on a recurring monitor while the resource is unmanaged or in maintenance mode does not disable the monitor until normal mode is resumed, and setting it back to true (not sure whether this was while unmanaged or afterward) did not enable it until the resource was restarted
** what happens if actions need to be done sequentially, a later action has a short failure timeout, and the transition is aborted after that failure timeout but before the action is initiated -- do we know to schedule the action again? (hopefully this is the "Re-initiated expired calculated failure" code)
* cibadmin:
** figure out why resource history digests can't be updated with cibadmin
** when running a cibadmin command for testing alerts, comments were moved to the end of the section
* crm_resource:
** crm_resource_runtime.c:find_matching_attr_resource(): the "Could be a cloned group" comment is wrong, but it could be a cloned primitive; checking the first child makes sense for a cloned primitive but not for a group (a group shouldn't check children, just set on the group itself); also, update/delete attribute handles utilization attributes inconsistently (should they be handled like instance attributes in update?)
** crm_resource -R gets a response for each node but appears to be expecting only one (trace the IPC)