Skip to content

🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out#2566

Open
camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
camilamacedo86:fix-bug-analyse
Open

🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out#2566
camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
camilamacedo86:fix-bug-analyse

Conversation

@camilamacedo86
Copy link
Contributor

@camilamacedo86 camilamacedo86 commented Mar 16, 2026

What's wrong

When a revision gets stuck rolling out (e.g., probe failure), the controller ignores everything — spec changes, progression deadlines — and keeps reusing the stuck revision forever. The admin has no way to recover without manually deleting revisions.

What this PR does

Adds two triggers that tell the controller to call the resolver again during a stuck rollout:

  • Progression deadline exceeded — if the revision controller marks a revision as ProgressDeadlineExceeded, re-resolve automatically. This works even if the admin hasn't changed the spec.

  • Spec change — if the CE consumer changes the requested version and no rolling-out revision matches the new spec, re-resolve immediately.

If neither trigger fires, the controller keeps using the current revision (same as today).

Decision Tree

Today:

Rolling out? → always reuse, never re-resolve

After:

Rolling out?
→ Deadline exceeded? → YES → re-resolve
→ Revision matches spec? → YES → reuse
→ No match → re-resolve

What's tested

  • 9 unit tests covering: spec mismatch, spec match, deadline exceeded, empty version, version range, multiple revisions, "from" version verification, no installed revision
  • 2 e2e scenarios: spec-change trigger and deadline trigger

Motivated by: https://redhat.atlassian.net/browse/OCPBUGS-77829

Copilot AI review requested due to automatic review settings March 16, 2026 10:29
@netlify
Copy link

netlify bot commented Mar 16, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 7252d79
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69b972c14fb92c00086a2d4d
😎 Deploy Preview https://deploy-preview-2566--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes operator-controller (Boxcutter runtime) behavior where a ClusterExtension spec change could be ignored while a prior ClusterExtensionRevision is still rolling out, by re-resolving from the catalog when the rolling-out revision no longer matches the current spec.

Changes:

  • Update ResolveBundle to reuse the rolling-out revision only when it still matches the current package/version constraints; otherwise re-resolve from the catalog.
  • Add rollingOutRevisionMatchesSpec helper to compare rolling-out revision metadata against spec version constraints.
  • Add unit and E2E coverage for “spec change while rolling out” behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
internal/operator-controller/controllers/clusterextension_reconcile_steps.go Implements spec-vs-rolling-out compatibility check and re-resolve behavior.
internal/operator-controller/controllers/clusterextension_controller_test.go Updates an existing Boxcutter test fixture and adds 2 new tests for re-resolve vs reuse behavior.
test/e2e/features/update.feature Adds E2E scenario verifying version change during a stuck rollout triggers re-resolution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@codecov
Copy link

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 79.31034% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.64%. Comparing base (147cfa8) to head (7252d79).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...er/controllers/clusterextension_reconcile_steps.go 79.31% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2566      +/-   ##
==========================================
+ Coverage   64.33%   68.64%   +4.31%     
==========================================
  Files         131      131              
  Lines        9333     9380      +47     
==========================================
+ Hits         6004     6439     +435     
+ Misses       2853     2444     -409     
- Partials      476      497      +21     
Flag Coverage Δ
e2e 39.19% <25.86%> (-3.02%) ⬇️
experimental-e2e 51.75% <79.31%> (+39.88%) ⬆️
unit 53.96% <79.31%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

IMHO, the added e2e test could be made faster.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2026
Copilot AI review requested due to automatic review settings March 16, 2026 17:31
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2026
@openshift-ci
Copy link

openshift-ci bot commented Mar 16, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci
Copy link

openshift-ci bot commented Mar 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from pedjak. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Boxcutter ClusterExtension reconciliation so spec updates (notably version changes) aren’t silently ignored when there are existing rolling-out revisions (e.g., rollout stuck on probe failure). This ensures the controller can re-resolve from the catalog to honor updated desired state.

Changes:

  • Update ResolveBundle to reuse a rolling-out revision only if it still matches the current spec; otherwise re-resolve from the catalog.
  • Add helper logic to compare a rolling-out revision’s package/version against the current catalog constraints.
  • Add/adjust unit and E2E coverage for “spec changed while rolling out” behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
internal/operator-controller/controllers/clusterextension_reconcile_steps.go Conditional short-circuit: reuse rolling-out revision only when it matches spec; otherwise re-resolve.
internal/operator-controller/controllers/clusterextension_controller_test.go Adds unit tests for re-resolve on mismatch and no re-resolve on match; adjusts an existing test setup.
test/e2e/features/update.feature Adds E2E scenario covering version change while rollout is stuck.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@camilamacedo86 camilamacedo86 changed the title 🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out WIP 🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out Mar 16, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 16, 2026
@camilamacedo86 camilamacedo86 force-pushed the fix-bug-analyse branch 2 times, most recently from 392ae57 to d952dcf Compare March 17, 2026 14:42
@camilamacedo86 camilamacedo86 changed the title WIP 🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out 🐛 fix(Boxcutter): Fix ClusterExtension spec changes ignored while revision is rolling out Mar 17, 2026
@camilamacedo86 camilamacedo86 requested a review from Copilot March 17, 2026 14:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the ClusterExtension controller’s bundle-resolution behavior during active (and potentially stuck) rollouts so that it can recover instead of indefinitely reusing a rolling-out revision.

Changes:

  • Add logic to conditionally reuse a rolling-out revision vs re-resolving, based on (a) ProgressDeadlineExceeded on the latest rolling-out revision and (b) whether any rolling-out revision still matches the requested package/version constraint.
  • Add unit tests covering the new “reuse vs re-resolve” decision logic for rolling-out revisions.
  • Add e2e scenarios intended to validate re-resolution on spec change and on progression deadline exceeded.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/operator-controller/controllers/clusterextension_reconcile_steps.go Introduces reuseRollingOutRevision and helpers to decide whether to reuse an in-flight revision or call the resolver again.
internal/operator-controller/controllers/clusterextension_controller_test.go Adds unit tests for re-resolution triggers and matching logic during rollouts.
test/e2e/features/update.feature Adds e2e coverage for spec-change and (intended) progress-deadline-based re-resolution while a rollout is stuck.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@camilamacedo86 camilamacedo86 force-pushed the fix-bug-analyse branch 2 times, most recently from 259bebc to 9c7214a Compare March 17, 2026 15:06
Copilot AI review requested due to automatic review settings March 17, 2026 15:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Boxcutter rollout behavior so that, during an active/stuck rollout, the controller can re-engage bundle resolution instead of always reusing the existing rolling revision—enabling recovery from stuck rollouts and honoring certain spec changes.

Changes:

  • Update ResolveBundle logic to conditionally reuse rolling-out revisions vs re-resolve based on (a) ProgressDeadlineExceeded and (b) whether a rolling revision matches the requested version constraint.
  • Add unit tests covering spec/version mismatch and progress-deadline-triggered re-resolution behaviors (including version ranges and multiple rolling revisions).
  • Add E2E scenarios for version-change re-resolution during stuck rollout and a progress-deadline-related scenario.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
internal/operator-controller/controllers/clusterextension_reconcile_steps.go Introduces “reuse vs re-resolve” decision logic while revisions are rolling out, including deadline and version-constraint matching.
internal/operator-controller/controllers/clusterextension_controller_test.go Adds extensive unit coverage for the new rolling-out reuse / re-resolve decision paths.
test/e2e/features/update.feature Adds E2E scenarios intended to validate spec-change and progress-deadline triggers during stuck rollouts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a Boxcutter rollout behavior where the ClusterExtension controller would indefinitely reuse a stuck rolling-out revision, ignoring spec changes and rollout progression deadlines. It introduces targeted re-resolution triggers so the controller can recover by re-engaging the resolver when appropriate.

Changes:

  • Add rollout-time decision logic to either reuse an existing rolling-out revision or re-resolve when (a) ProgressDeadlineExceeded is observed or (b) no rolling-out revision matches the current spec version constraint.
  • Update resolution-error handling to use semver range satisfaction (not string equality) when deciding whether fallback to the installed revision is allowed.
  • Add unit and e2e coverage for spec-mismatch and progress-deadline re-resolution behaviors.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
test/e2e/features/update.feature Adds e2e scenarios for re-resolution during stuck rollouts (spec-change trigger + deadline-related scenario).
internal/operator-controller/controllers/clusterextension_reconcile_steps.go Implements reuseRollingOutRevision and semver-range matching for spec compatibility; updates resolution error decision logic accordingly.
internal/operator-controller/controllers/clusterextension_controller_test.go Adds unit tests covering stuck-rollout re-resolution decision paths and updates an existing Boxcutter failure test setup.
Comments suppressed due to low confidence (1)

internal/operator-controller/controllers/clusterextension_reconcile_steps.go:353

  • In handleResolutionError, this branch now triggers whenever the installed version does not satisfy the spec’s version constraint (including invalid/unparseable constraints), but the log/error text still says “spec requests version change”. Consider updating the log message (and possibly the formatted status message) to reflect “spec version constraint not satisfied by installed bundle” so it’s accurate for range constraints and parse failures, not only explicit version changes.
	// Check if the spec is requesting a version that differs from what's installed.
	// Uses the same semver range matching as reuseRollingOutRevision so that ranges
	// like ">=1.0.0, <2.0.0" are correctly recognized as satisfied by "1.0.0".
	if !versionMatchesSpec(state.revisionStates.Installed.Version, ext) {
		specVersion := ""
		if ext.Spec.Source.Catalog != nil {
			specVersion = ext.Spec.Source.Catalog.Version
		}
		installedVersion := state.revisionStates.Installed.Version
		msg := fmt.Sprintf("unable to upgrade to version %s: %v (currently installed: %s)", specVersion, err, installedVersion)
		l.Error(err, "resolution failed and spec requests version change - cannot fall back",
			"requestedVersion", specVersion,
			"installedVersion", installedVersion)
		setStatusProgressing(ext, err)
		setInstalledStatusFromRevisionStates(ext, state.revisionStates)
		ensureFailureConditionsWithReason(ext, ocv1.ReasonRetrying, msg)
		return nil, err

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +265 to +272
version: 1.0.2
upgradeConstraintPolicy: SelfCertified
"""
And ClusterExtensionRevision "${NAME}-1" reports Available as False with Reason ProbeFailure
When ClusterExtension is updated to version "1.0.1"
Then ClusterExtension is rolled out
And ClusterExtension is available
And bundle "test-operator.1.0.1" is installed in version "1.0.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential downgrade from 1.0.2 to 1.0.1. What if the deployment for 1.0.2 included an initContainer that performed a data migration, which succeeded, but then the container that runs the 1.0.2 controller fails the probe because of a silly typo the user made in an envvar in their deploymentConfig?

We'd downgrade to 1.0.1, which would be unable to read the migrated data, and maybe 1.0.1 would make matters worse.

If we instead required human intervention, a human would find the log in the failing deployment, see it complaining about the envvar typo, and then go and fix it in the CE.spec.config.deploymentConfig. And then 1.0.2 would progress cleanly.

Comment on lines +268 to +276
func rollingOutRevisionMatchesSpec(rm *RevisionMetadata, ext *ocv1.ClusterExtension) bool {
if ext.Spec.Source.Catalog == nil {
return false
}
if rm.Package != ext.Spec.Source.Catalog.PackageName {
return false
}
return versionMatchesSpec(rm.Version, ext)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this check feels way too implicit and unintuitive. If I understand it correctly, it requires that I specifically craft an updated version range to exclude all of the versions of the active releases?

But if I expand the range to include a new version, no dice.

@joelanford
Copy link
Member

/hold
this one please!

I'm concerned that this could cause unsafe roll-forward or roll-back without the explicit acknowledgement from the user that forcing a re-resolution is an unsafe thing to do. IMO, default behavior needs to remain "wait for human intervention".

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants