
Support easier debugging of timeouts
Open, Low, Public

Assigned To
None
Authored By
kgaillot
Dec 12 2023, 12:44 PM
Tags
  • Restricted Project
  • Restricted Project
Referenced Files
None

Description

Resource agent timeouts can be difficult to debug, especially if they are not reproducible.

It could be helpful to collect system statistics just before the timeout (CPU load, process listing, stack trace, etc.). T203 and T355 could help as well, and any user interface (possibly an operation meta-attribute) should take all three tasks into account.
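As a rough illustration of the kind of data meant here, the following Python sketch gathers a load average and a process listing. The function name `collect_snapshot` and its `ps_timeout` parameter are hypothetical and not part of Pacemaker; the sketch assumes a POSIX-style `ps` is on the PATH and does not attempt stack traces.

```python
import os
import subprocess
import time


def collect_snapshot(ps_timeout=2.0):
    """Return a dict of cheap system statistics (hypothetical helper)."""
    snapshot = {"time": time.time()}
    try:
        # 1-, 5- and 15-minute load averages; not available on every platform.
        snapshot["loadavg"] = os.getloadavg()
    except (AttributeError, OSError):
        snapshot["loadavg"] = None
    try:
        # POSIX ps: -e selects all processes, -o picks the output columns.
        result = subprocess.run(
            ["ps", "-e", "-o", "pid,ppid,pcpu,etime,args"],
            capture_output=True, text=True, timeout=ps_timeout, check=False,
        )
        snapshot["processes"] = result.stdout.splitlines()
    except (OSError, subprocess.TimeoutExpired):
        # ps is missing or too slow: record that nothing was collected.
        snapshot["processes"] = None
    return snapshot
```

Each statistic degrades to `None` rather than raising, on the assumption that a partial snapshot is more useful than none when the system is already struggling.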

Challenges:

  • The exact statistics desired could vary arbitrarily and be platform-specific. We could let the user specify an arbitrary binary to run just before a timeout, or support a new OCF action for collecting relevant system info that resource agents would have to define (which would lessen the burden on users, but would limit the scope to OCF resources). Alternatively, we could define a list of supported statistics and let the user choose which ones they want.
  • If the timeout is due to severe load issues, launching another script or child process could make the problem worse. A compiled binary would be lighter-weight, though it could still be a problem.
  • The collector would need its own timeout, and we'd have to ensure that the collector either finishes or is killed before the agent process is killed.
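The ordering constraint in the last point can be sketched in Python as follows. This is only an illustration under stated assumptions, not how Pacemaker's executor works: `run_with_collector` and `collector_budget` are hypothetical names, and the collector's time budget is simply carved out of the end of the agent's timeout.

```python
import subprocess
import time


def run_with_collector(agent_cmd, collector_cmd, timeout, collector_budget):
    """Run agent_cmd; if it nears its timeout, run collector_cmd first.

    Returns (agent_returncode, collector_output_or_None, timed_out).
    """
    deadline = time.monotonic() + timeout
    agent = subprocess.Popen(agent_cmd)
    try:
        # Wait until only the collector's budget is left before the deadline.
        agent.wait(timeout=max(timeout - collector_budget, 0))
        return agent.returncode, None, False
    except subprocess.TimeoutExpired:
        pass

    collector = subprocess.Popen(collector_cmd, stdout=subprocess.PIPE, text=True)
    try:
        output, _ = collector.communicate(timeout=collector_budget)
    except subprocess.TimeoutExpired:
        # The collector overran its own timeout: kill and reap it,
        # keeping whatever partial output it produced.
        collector.kill()
        output, _ = collector.communicate()

    # The collector is gone by now; give the agent the rest of its time.
    try:
        agent.wait(timeout=max(deadline - time.monotonic(), 0))
        return agent.returncode, output, False
    except subprocess.TimeoutExpired:
        agent.kill()
        agent.wait()
        return agent.returncode, output, True
```

A real implementation would also have to deal with collectors that spawn their own children (for example by killing a whole process group), which this sketch ignores.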

If the cluster collects stats itself, they should be stored in a rolling binary file with tools for querying and analysis. This could be done via a child process or even potentially a new subdaemon.
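One possible shape for such a rolling file is a fixed-size ring of binary records, sketched below in Python. The layout (magic bytes `PCST`, a 16-byte header, four doubles per record) and the function names are invented for illustration; nothing here reflects an existing Pacemaker format.

```python
import struct

HEADER = struct.Struct("<4sIQ")  # magic, capacity in records, total records written
RECORD = struct.Struct("<dddd")  # timestamp, 1/5/15-minute load averages
MAGIC = b"PCST"                  # made-up magic bytes for this sketch


def create(path, capacity):
    """Create an empty stats file with room for `capacity` records."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, capacity, 0))
        f.write(b"\0" * (RECORD.size * capacity))


def _read_header(f):
    f.seek(0)
    magic, capacity, total = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a stats file")
    return capacity, total


def append(path, timestamp, load1, load5, load15):
    """Write one record, overwriting the oldest once the file is full."""
    with open(path, "r+b") as f:
        capacity, total = _read_header(f)
        f.seek(HEADER.size + (total % capacity) * RECORD.size)
        f.write(RECORD.pack(timestamp, load1, load5, load15))
        f.seek(0)
        f.write(HEADER.pack(MAGIC, capacity, total + 1))


def read_all(path):
    """Return the stored records as tuples, oldest first."""
    with open(path, "rb") as f:
        capacity, total = _read_header(f)
        records = []
        for n in range(total - min(total, capacity), total):
            f.seek(HEADER.size + (n % capacity) * RECORD.size)
            records.append(RECORD.unpack(f.read(RECORD.size)))
        return records
```

The file never grows past its initial size, so the collector cannot fill the disk, and a query tool only needs the header to find the oldest record. The sketch does no locking, so it assumes a single writer.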

Existing system collection projects (such as sar or collectd) may well be the best solution, though we might be able to integrate with such a tool more closely.

If possible, it might also be helpful on Linux to run Pacemaker processes in their own cgroup for analyzing CPU and memory usage, so the collector could focus on (for example) /sys/fs/cgroup/pids/pacemaker/tasks. However, it might not be desirable to run resource agents in that cgroup (the user may already have a cgroup for a particular service), which would limit its usefulness for agent timeouts.
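A collector could narrow its scope along these lines. The Python sketch below uses the example path from above as a default; `read_cgroup_pids` is a hypothetical helper, and the `pacemaker` cgroup itself is an assumption, since no such cgroup exists unless something creates it.

```python
def read_cgroup_pids(tasks_path="/sys/fs/cgroup/pids/pacemaker/tasks"):
    """Return the IDs listed in a cgroup tasks file, or [] if it is unreadable."""
    try:
        with open(tasks_path) as f:
            # The file holds one numeric ID per line.
            return [int(line) for line in f if line.strip()]
    except OSError:
        # No such cgroup (or not Linux): nothing to focus on.
        return []
```

The `tasks` file belongs to the cgroup v1 layout; on a host using the unified cgroup v2 hierarchy the path and file name would differ, so a real collector would need to detect which layout is in use.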

See:

Event Timeline

kgaillot created this task.
kgaillot created this object with edit policy "Restricted Project (Project)".