High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (4/4)
[4/4: CPG users to be careful about now-more-probable rival processes]
In essence, this comes down to pacemaker mistaking the at-node CPG
members for effectively the only processes that can plausibly co-exist
at a particular node, which doesn't hold and asks for a wider
reconciliation with this reality check.
However, in practical terms, since there are two factors lowering the
priority of doing so:
1/ possibly the only non-self-inflicted scenario is that some of the
cluster stack processes fail -- this is a problem that should rather
be deferred to arranged node disarming/fencing to stay on the safe
side with 100% certainty, at the cost of a possibly long-lasting
failover process at other nodes (as for the other possibility,
someone running some of these by accident so that they effectively
become rival processes, it's like getting one's hands cut when
playing with a lawnmower in an unintended way)
2/ for state tracking of the peer nodes, it may possibly cause trouble
in case the process observed as left wasn't the last one for the
particular node, even if presumably just temporarily, since the
situation may eventually resolve with imposed serialization of the
rival processes via the API end-point singleton restriction (this is
also the most likely cause of why such a non-final leave gets
observed in the first place), except in one case -- the legitimate
API end-point carrier possibly won't be acknowledged as returned by
its peers, at least not immediately, unless it tries to join anew,
which verges on undefined behaviour (at least per corosync
documentation)
we make do just with a light code change so as to
- limit 1/ some more with an in-daemon self-check for a pre-existing
  end-point (this is to complement the checks already made in the
  parent daemon prior to spawning new instances, only some moments
  later; note that we don't have any lock file etc. mechanisms to
  prevent parallel runs of the same daemons, and people could run
  these at their own discretion), and to
- guard against interference from the rivals at the same node per 2/
  by ignoring their non-final leave messages altogether.
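The second point can be pictured with a minimal sketch. It is not the
code from this change: `struct member` and `leave_is_final()` are
hypothetical names, and the sketch assumes the member list passed in
is the group membership after the change (so the departed process is
no longer in it). A leave counts as final only when no remaining
member shares the departed process's node ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one CPG membership entry (node + process). */
struct member {
    uint32_t nodeid;
    uint32_t pid;
};

/*
 * Return true when `left` was the last group member on its node, i.e.
 * no entry in the post-change member list shares its nodeid.  A false
 * result means a rival process remains on that node, so the leave is
 * non-final and the peer's tracked state should be left untouched.
 */
bool leave_is_final(const struct member *left,
                    const struct member *members, size_t n_members)
{
    assert(left != NULL);
    assert(members != NULL || n_members == 0);

    for (size_t i = 0; i < n_members; i++) {
        if (members[i].nodeid == left->nodeid) {
            return false;   /* a rival from the same node is still present */
        }
    }
    return true;
}
```

A caller would walk the list of departed processes and update the
peer's state only for those where `leave_is_final()` returns true.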
Note that CPG at this point is already expected to be authenticity-safe.
Regarding the now-more-probable part, we actually traded the inherently
racy procfs scanning for something (exactly that singleton mentioned
above) rather firm (and unfakeable), but we admittedly lost track of
processes that are after CPG membership (that is, another form of
a shared state) prior to (or in non-deterministic order allowing for
the same) caring about publishing the end-point.
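To illustrate why a singleton end-point is firmer than scanning
procfs, here is a rough analogy rather than pacemaker's actual IPC
mechanism: it swaps in a Linux abstract-namespace unix socket, where
the kernel itself refuses a second bind to the same name. The function
`claim_endpoint()` and the end-point name are hypothetical. The
ordering problem described above corresponds to a process joining the
shared group before making a call like this one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Try to become the sole holder of the named end-point (Linux-only,
 * abstract socket namespace).  Returns a descriptor >= 0 on success;
 * returns -1 with errno set on failure, where EADDRINUSE means another
 * process already holds the name.
 */
int claim_endpoint(const char *name)
{
    struct sockaddr_un addr;
    size_t len;
    int fd;

    assert(name != NULL);
    len = strlen(name);
    if (len == 0 || len > sizeof(addr.sun_path) - 1) {
        errno = EINVAL;
        return -1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* A leading NUL byte selects the abstract namespace. */
    memcpy(addr.sun_path + 1, name, len);

    if (bind(fd, (struct sockaddr *) &addr,
             (socklen_t) (offsetof(struct sockaddr_un, sun_path) + 1 + len)) < 0) {
        int saved = errno;

        close(fd);
        errno = saved;      /* keep the bind() error for the caller */
        return -1;
    }
    return fd;
}
```

The name is released when the descriptor is closed or the holder
exits, so no stale lock file needs to be cleaned up.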
Big thanks are owed to Yan Gao of SUSE for the early discovery and
reporting of this discrepancy arising from the earlier commits in the
set.