
Properly check on-fail when getting fail counts
Open, NormalPublic

Assigned To
None
Authored By
kgaillot
Nov 14 2024, 2:48 PM
Tags
  • Restricted Project
  • Restricted Project
Referenced Files
None

Description

When getting a resource's fail count, the scheduler checks whether failures are expired. As part of that, it checks whether the resource has a failed action with on-fail="block", in which case the failure never expires.

Currently, this uses an XPath expression to search the configuration for on-fail="block". However, that search does not detect on-fail="block" when it is set via id-ref, operation-specific meta-attributes, or op_defaults, and it does not evaluate rules. It also determines failure by simply comparing the actual exit status against the expected exit status, which does not account for all the situations handled by remap_operation().
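
To illustrate the gap, here is a minimal Python sketch (not Pacemaker code). The CIB fragment, the resource IDs, and the XPath expression are all hypothetical, and the scheduler's actual XPath may differ. It shows a literal attribute search finding on-fail="block" when it is set directly on an op element, but missing the same setting when it is inherited from op_defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment: rsc1 sets on-fail directly on the op,
# while rsc2 only inherits it from op_defaults.
CIB = """
<configuration>
  <resources>
    <primitive id="rsc1" class="ocf" provider="pacemaker" type="Dummy">
      <operations>
        <op id="rsc1-monitor" name="monitor" interval="10s" on-fail="block"/>
      </operations>
    </primitive>
    <primitive id="rsc2" class="ocf" provider="pacemaker" type="Dummy">
      <operations>
        <op id="rsc2-monitor" name="monitor" interval="10s"/>
      </operations>
    </primitive>
  </resources>
  <op_defaults>
    <meta_attributes id="op-defaults">
      <nvpair id="op-defaults-on-fail" name="on-fail" value="block"/>
    </meta_attributes>
  </op_defaults>
</configuration>
"""


def literal_block_search(root, rsc_id):
    """Return True if an op of rsc_id carries on-fail="block" as an attribute.

    This is an assumed stand-in for a literal XPath search; it looks only
    at the op element's own attributes.
    """
    path = f".//primitive[@id='{rsc_id}']/operations/op[@on-fail='block']"
    return root.find(path) is not None


root = ET.fromstring(CIB)
found = {rsc: literal_block_search(root, rsc) for rsc in ("rsc1", "rsc2")}
print(found)
```

In this sketch the search reports a match for rsc1 only, even though rsc2's monitor would also block on failure through op_defaults.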

Instead, we should temporarily instantiate an action, which takes all of these sources into account when calculating on-fail, and remap the exit status appropriately.
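
As a rough illustration of why a resolved value is preferable to a literal search, the Python sketch below computes an effective on-fail from several sources. It is not Pacemaker's implementation: the helper names `nvpair_value` and `effective_on_fail`, the CIB fragment, and the lookup order (op attribute, then the op's own meta-attributes, then op_defaults) are assumptions made for the example, and it deliberately ignores id-ref and rules, which an instantiated action would handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment covering three ways on-fail can be set.
CIB = """
<configuration>
  <resources>
    <primitive id="rsc1" class="ocf" provider="pacemaker" type="Dummy">
      <operations>
        <op id="rsc1-monitor" name="monitor" interval="10s">
          <meta_attributes id="rsc1-monitor-meta">
            <nvpair id="rsc1-monitor-on-fail" name="on-fail" value="block"/>
          </meta_attributes>
        </op>
      </operations>
    </primitive>
    <primitive id="rsc2" class="ocf" provider="pacemaker" type="Dummy">
      <operations>
        <op id="rsc2-monitor" name="monitor" interval="10s"/>
      </operations>
    </primitive>
    <primitive id="rsc3" class="ocf" provider="pacemaker" type="Dummy">
      <operations>
        <op id="rsc3-monitor" name="monitor" interval="10s" on-fail="restart"/>
      </operations>
    </primitive>
  </resources>
  <op_defaults>
    <meta_attributes id="op-defaults">
      <nvpair id="op-defaults-on-fail" name="on-fail" value="block"/>
    </meta_attributes>
  </op_defaults>
</configuration>
"""


def nvpair_value(parent, name):
    """Return the first matching nvpair value under parent's meta_attributes."""
    if parent is None:
        return None
    node = parent.find(f"./meta_attributes/nvpair[@name='{name}']")
    return None if node is None else node.get("value")


def effective_on_fail(root, rsc_id, op_name):
    """Resolve on-fail for one operation (assumed order, no id-ref or rules)."""
    op = root.find(
        f".//primitive[@id='{rsc_id}']/operations/op[@name='{op_name}']"
    )
    if op is not None:
        # 1. Attribute set directly on the op element
        direct = op.get("on-fail")
        if direct is not None:
            return direct
        # 2. Operation-specific meta-attributes
        meta = nvpair_value(op, "on-fail")
        if meta is not None:
            return meta
    # 3. Cluster-wide operation defaults
    return nvpair_value(root.find("./op_defaults"), "on-fail")


root = ET.fromstring(CIB)
resolved = {
    rsc: effective_on_fail(root, rsc, "monitor")
    for rsc in ("rsc1", "rsc2", "rsc3")
}
print(resolved)
```

Here rsc1 and rsc2 both resolve to "block" without having it as a literal op attribute, while rsc3's explicit "restart" takes precedence over the default under the assumed ordering.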

Event Timeline

kgaillot triaged this task as Normal priority. Nov 14 2024, 2:48 PM
kgaillot created this task.
kgaillot created this object with edit policy "Restricted Project (Project)".