Feature: resources, tools: Drop SystemHealth, ipmiservicelogd, ...

Description

Feature: resources, tools: Drop SystemHealth, ipmiservicelogd, ...

...and notifyServicelogEvent. These three tools are designed to be used
together. ipmiservicelogd listens for events in the IPMI SEL and writes
them to the servicelog (an SQLite database). notifyServicelogEvent gets
notified of new events in the servicelog and updates the #health-ipmi
attribute based on the event. The SystemHealth resource agent manages
both of these and sets up the notifications.

These tools can only be used on ppc architectures, since that's the only
place servicelog is supported.

A number of issues were found while trying to bring these into
conformance with best practices:

  • ipmiservicelogd uses dmidecode, which doesn't exist on ppc. This causes errors to be logged but no other issues since there's a fallback.
  • SystemHealth is limited to managing IPMI connections on SMI 0. It can't handle LAN connections currently.
  • SystemHealth passes an argument containing spaces to ps -p, which causes the PID not to be found. As a result, every monitor operation fails.
  • notifyServicelogEvent can't create the #health-ipmi attribute if it doesn't already exist, because it uses pcmk__node_attr_pattern (regex) instead of pcmk__node_attr_value (exact) in its pcmk__attrd_api_update() call.
  • ipmiservicelogd logs every event with WARNING severity and RECOVERABLE disposition. But notifyServicelogEvent sets status "red" only if the disposition is unrecoverable. Further, it sets status "green" only if the severity is not WARNING, which never happens. Finally, if the severity is above WARNING and the disposition is not unrecoverable, the event is ignored.
  • ipmiservicelogd logs every event as closed and as a deassert ("problem resolved"). The deassert value never varies because of a pointer mistake (&bmc_data vs. bmc_data).
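
The regex-vs-exact issue in the list above can be illustrated with a toy
model. This is only a sketch and not Pacemaker's attribute manager: the
dict-based store and the helpers update_by_pattern() and update_by_name()
are hypothetical, and the model assumes a pattern update touches only
attributes that already exist.

```python
import re


def update_by_pattern(attrs, pattern, value):
    """Toy model: update every EXISTING attribute whose name matches the regex."""
    matched = [name for name in attrs if re.fullmatch(pattern, name)]
    for name in matched:
        attrs[name] = value
    return matched


def update_by_name(attrs, name, value):
    """Toy model: set the named attribute, creating it if it is absent."""
    attrs[name] = value
    return [name]


attrs = {}  # hypothetical attribute store for one node

# Nothing matches yet, so a pattern update has nothing to change.
update_by_pattern(attrs, r"#health-ipmi", "red")
print(attrs)  # {}

# An exact-name update creates the attribute.
update_by_name(attrs, "#health-ipmi", "red")
print(attrs)  # {'#health-ipmi': 'red'}
```

Under that assumption, a pattern update is a no-op until something else
creates the attribute, which matches the symptom described above.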

However, the entire paradigm is unsuitable for its intended use case.
These tools set node health based on the latest log message rather than
the current state. It would be possible to rewrite them to keep track of
node state by tracking which events are resolved vs. still open. Aside
from the effort and size of the change involved, there are non-trivial
issues with concurrent access to the servicelog DB that get in the way.
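
To make the difference between "latest log message" and "current state"
concrete, here is a minimal sketch. The event fields ("id", "resolved"),
the red/green mapping, and both function names are hypothetical and are
not taken from the dropped tools. The sketch also ignores the concurrent
servicelog access problem mentioned above.

```python
def health_from_latest(events):
    """Health based only on the most recent event."""
    if not events:
        return "green"
    return "green" if events[-1]["resolved"] else "red"


def health_from_open_events(events):
    """Health based on which events are still open after replaying the log."""
    open_ids = set()
    for event in events:
        if event["resolved"]:
            open_ids.discard(event["id"])
        else:
            open_ids.add(event["id"])
    return "red" if open_ids else "green"


# Hypothetical log: two problems are raised, only one is resolved.
events = [
    {"id": "fan1", "resolved": False},
    {"id": "psu0", "resolved": False},
    {"id": "psu0", "resolved": True},
]

print(health_from_latest(events))       # green
print(health_from_open_events(events))  # red
```

The latest-message approach reports a healthy node as soon as any one
problem is resolved, even though another is still open.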

Here we drop these tools since they may have never worked reliably in
the past, they can be used only on ppc with particular libraries
installed, and rewriting and maintaining them would require significant
effort.

Full discussion at https://projects.clusterlabs.org/T101.

Closes T53
Closes T101
Closes T102

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Sep 6 2022, 4:18 PM
Parents
rPdf90a631eb6a: Merge pull request #2807 from clumens/attrd-req