Fix disk removal deadlock during rolling upgrade #254

Open
dobrerazvan wants to merge 3 commits into master from fix-disk-removal-deadlock-during-rolling-upgrade

dobrerazvan commented May 18, 2026

When a broker pod is deleted during a rolling upgrade while a disk removal is pending (GracefulDiskRemovalScheduled), the operator deadlocks: reconcileKafkaPvc aborts the entire reconcile with "Disk removal pending", so reconcileKafkaPod never recreates the missing pod. Meanwhile, Cruise Control cannot complete the disk removal because the broker is not running.

Fix: build the runningBrokers map before calling reconcileKafkaPvc and pass it in. Before returning the blocking error, reconcileKafkaPvc checks whether any broker with a pending disk removal has a missing pod. If so, the reconcile is allowed to proceed so the pod can be recreated. The disk removal check is re-evaluated on the next reconcile cycle, once the broker is back up.
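
For reviewers who want the control flow at a glance, the check described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `checkDiskRemoval`, `errDiskRemovalPending`, and the map shapes (broker ID to mount paths, broker ID to presence) are hypothetical stand-ins for whatever reconcileKafkaPvc and the runningBrokers map actually use.

```go
package main

import "errors"

// errDiskRemovalPending is a hypothetical stand-in for the operator's
// "Disk removal pending" blocking error.
var errDiskRemovalPending = errors.New("disk removal pending")

// checkDiskRemoval is a hypothetical helper, not the operator's real function.
// pendingRemoval maps a broker ID to the mount paths awaiting removal;
// runningBrokers holds the IDs of brokers that currently have a pod.
func checkDiskRemoval(pendingRemoval map[string][]string, runningBrokers map[string]struct{}) error {
	if len(pendingRemoval) == 0 {
		// No disk removal in flight, so there is nothing to block on.
		return nil
	}
	for brokerID := range pendingRemoval {
		if _, running := runningBrokers[brokerID]; !running {
			// The broker's pod is missing, so Cruise Control cannot finish the
			// removal. Let the reconcile continue so the pod can be recreated;
			// the check runs again on the next cycle.
			return nil
		}
	}
	// Every affected broker is running: keep blocking until removal completes.
	return errDiskRemovalPending
}
```

With brokers "0" and "1" running and a removal pending on broker "1", this sketch returns the blocking error. Once broker "1" is dropped from runningBrokers, it returns nil and the reconcile can proceed to pod creation.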

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

dobrerazvan and others added 3 commits May 18, 2026 18:52
When a broker pod is deleted during rolling upgrade and a disk removal is
pending (GracefulDiskRemovalScheduled), the operator enters a deadlock:
reconcileKafkaPvc blocks the entire reconcile with "Disk removal pending",
preventing reconcileKafkaPod from recreating the missing pod. Meanwhile,
Cruise Control cannot complete the disk removal because the broker isn't
running.

Fix: move runningBrokers map building before reconcileKafkaPvc and pass it
in. Before returning the blocking error, check if any broker with pending
disk removal has a missing pod. If so, allow the reconcile to proceed so
the pod can be recreated. The disk removal check is re-evaluated on the
next cycle once the broker is back up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… add tests

- Fix #1 (HIGH): Override now checks IsDiskRebalance() in addition to
  IsDiskRemoval(), closing the same deadlock vector for rebalance states
- Fix #2 (LOW): Include mountPath in the bypass log message for
  consistency with other disk-removal log messages
- Fix #3 (LOW): Add tests for rebalance-state deadlock bypass and for
  newly-marked-for-removal with missing pod

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Proposal, design, and task tracking for the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>