Investigate CIB transaction handling during rolling upgrade
Open, Normal, Public

Authored By
kgaillot
Oct 14 2024, 4:12 PM

Description

In a rolling upgrade from 2.1.0 to a recent version supporting CIB transactions, the final node logged errors like:

pacemaker-based     [23579] (cib_get_operation_id)  error: Operation cib_commit_transact is not valid
pacemaker-based     [23579] (cib_process_request)   error: Pre-processing of command failed: Invalid argument

I believe transactions were implemented under the assumption that the pacemaker-attrd writer, the pacemaker-controld DC, and the pacemaker-based primary instance will always have the same feature set, due to the election algorithm. However, that is not necessarily the case, because the subdaemons on a node exit at different times.

When pacemaker-attrd exits on the last (older) node in a rolling upgrade, the remaining (newer) cluster nodes elect a new attribute writer, which then writes out its known attributes using a CIB transaction. The pacemaker-based primary instance is still on the older node (even if it is about to exit) and does not recognize the transaction operation, so the request fails with the errors above.
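
The sequence above can be sketched as a toy model. This is not Pacemaker code: the feature-set tuples, the `Node` class, the "lowest feature set wins" election rule, and the exit order passed in are all simplifying assumptions made for illustration. It only shows how a gap opens between the moment the old node's pacemaker-attrd exits and the moment its pacemaker-based exits.

```python
from dataclasses import dataclass, field

# Hypothetical feature-set values; not the real numbers of any release.
OLD = (1, 0)
NEW = (2, 0)
TRANSACTIONS_SINCE = (2, 0)  # assumed minimum feature set with transactions


@dataclass
class Node:
    name: str
    feature_set: tuple
    # Only the three subdaemons discussed in this task are modeled.
    running: set = field(
        default_factory=lambda: {"controld", "attrd", "based"}
    )


def holder(nodes, daemon):
    """Toy election: among nodes still running `daemon`, the lowest
    feature set wins (ties broken by name). This is an assumption."""
    candidates = [n for n in nodes if daemon in n.running]
    return min(candidates, key=lambda n: (n.feature_set, n.name))


def mismatch_windows(nodes, old_node, exit_order):
    """Stop the old node's subdaemons one at a time and record each step
    at which the attribute writer would use transactions while the CIB
    primary does not support them."""
    windows = []
    for daemon in exit_order:
        old_node.running.discard(daemon)
        writer = holder(nodes, "attrd")
        primary = holder(nodes, "based")
        if writer.feature_set >= TRANSACTIONS_SINCE > primary.feature_set:
            windows.append((daemon, writer.name, primary.name))
    return windows


old = Node("old", OLD)
cluster = [old, Node("new1", NEW), Node("new2", NEW)]
# Assumed order: controller first, then attrd, then based.
print(mismatch_windows(cluster, old, ["controld", "attrd", "based"]))
```

In this model the only mismatch appears right after the old node's attrd exits: a newer node becomes writer while the primary is still the old node. The window closes as soon as the old node's based exits.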

The main question is what problems this can cause other than log messages. I suspect it can only happen on the last node in a rolling upgrade in a tight window (probably hundredths of a second in most cases). A similar issue likely exists for the controller's DC election since the controller is the first subdaemon to exit.

The next question would be whether it's worth doing anything about it.

Event Timeline

kgaillot triaged this task as Normal priority. Oct 14 2024, 4:12 PM
kgaillot created this task.
kgaillot created this object with edit policy "Restricted Project (Project)".