
Conversation

@camilamacedo86 (Contributor) commented Jan 12, 2026

Description

Implements catalog-deletion resilience for the Helm runtime, per the three assumptions below.

Assumptions Addressed

When a catalog is removed:

  1. Workload remains untouched; no reconciliation occurs until resolution succeeds

    • Extensions continue running with installed bundle
    • Controller watches ClusterCatalog → auto-resumes when available
  2. Managed objects don't break

    • Helm reconciliation maintains resources
    • Watchers restore if externally deleted/changed
  3. Full reconciliation when resolution succeeds

    • Normal resolution and upgrades resume automatically

How It Works

When resolution fails:

  • Check if catalogs exist in cluster
  • If deleted → Use installed bundle, maintain resources (fallback)
  • If exist → Retry (transient error)

For Helm:

  • reconcileExistingRelease() uses Helm's reconciliation
  • Maintains resources via Server-Side Apply
  • Sets up watchers for restoration

Tests

E2E (4 scenarios):

  1. Extension continues running after catalog deletion
  2. Resources restored after catalog deletion
  3. Config changes work without catalog
  4. Version upgrade blocked without catalog

These tests verify what should happen when catalogs are deleted
while Helm extensions are running.

4 Test Scenarios Added:

Scenario 1: Extension continues running after catalog deletion
  Given: Helm extension is installed and running
  When: Catalog is deleted
  Then: Extension keeps running, resources stay available

Scenario 2: Resources are restored after catalog deletion
  Given: Helm extension is installed
  When: Catalog is deleted AND someone deletes a resource
  Then: Resource is automatically restored

Scenario 3: Config changes work without catalog
  Given: Helm extension is installed
  When: Catalog is deleted AND user changes preflight config
  Then: Config change is accepted

Scenario 4: Version upgrade blocked without catalog
  Given: Helm extension v1.0.0 is installed
  When: Catalog is deleted AND user requests upgrade to v1.0.1
  Then: Upgrade is blocked (Retrying status)
  And: Extension stays on v1.0.0

Test Infrastructure Added:
- CatalogIsDeleted() step function
- Proper timeout and cleanup handling

Note: These tests will FAIL until the fix is applied.
      This commit captures the desired behavior first.
Copilot AI review requested due to automatic review settings January 12, 2026 19:42
netlify bot commented Jan 12, 2026

Deploy Preview for olmv1 ready!

🔨 Latest commit: ad6d701
🔍 Latest deploy log: https://app.netlify.com/projects/olmv1/deploys/6965520243a7020008e79ee4
😎 Deploy Preview: https://deploy-preview-2442--olmv1.netlify.app

openshift-ci bot commented Jan 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pedjak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@camilamacedo86 camilamacedo86 changed the title Fix helm only 🐛 Ensures Helm-based extensions keep running when their catalog is deleted or unavailable Jan 12, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements catalog deletion resilience for Helm-based ClusterExtensions, allowing installed extensions to continue functioning even when their source catalog is deleted. The implementation adds logic to detect catalog deletion and fall back to the currently installed bundle content, while still blocking version upgrades when catalogs are unavailable.

Changes:

  • Added catalog deletion detection logic in the ResolveBundle reconcile step
  • Implemented fallback to installed bundle when catalogs are deleted and no version change is requested
  • Added nil contentFS handling in Helm applier to reconcile existing releases without catalog access
  • Added E2E tests to verify workload resilience during catalog deletion

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:
• internal/operator-controller/controllers/clusterextension_reconcile_steps.go: Core logic for handling resolution errors, checking catalog existence, and skipping bundle unpacking when using installed content
• internal/operator-controller/applier/helm.go: Added reconcileExistingRelease() to maintain Helm releases without catalog access
• cmd/operator-controller/main.go: Updated ResolveBundle calls to pass client for catalog existence checks
• internal/operator-controller/controllers/suite_test.go: Updated test setup to pass client to ResolveBundle
• internal/operator-controller/controllers/clusterextension_controller_test.go: Added comprehensive unit tests for fallback scenarios
• internal/operator-controller/controllers/clusterextension_controller.go: Added nolint comment for ensureAllConditionsWithReason
• test/e2e/features/catalog-deletion-resilience.feature: E2E test scenarios for catalog deletion resilience
• test/e2e/steps/steps.go: CatalogIsDeleted step implementation for E2E tests


Comment on lines 260 to 262
// imageFS will remain nil; applier implementations MUST handle nil imageFS by using
// existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
// for nil contentFS handling.
Copilot AI commented Jan 12, 2026

The comment states that Boxcutter.apply() handles nil contentFS, but the current implementation of Boxcutter.Apply does not check for nil contentFS. When contentFS is nil, Boxcutter.apply() will call RevisionGenerator.GenerateRevision(ctx, contentFS, ...) which will fail when it tries to read from the nil filesystem.

Only Helm.Apply has been updated to handle nil contentFS by checking if contentFS is nil and calling reconcileExistingRelease. The Boxcutter applier needs similar handling or this code path will fail for boxcutter-based installations during catalog deletion scenarios.

Suggested change
// imageFS will remain nil; applier implementations MUST handle nil imageFS by using
// existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
// for nil contentFS handling.
// imageFS will remain nil; appliers that support a nil imageFS must fall back to
// existing installed content (see Helm.reconcileExistingRelease() for an example).
// Other applier implementations must not assume imageFS is non-nil in this path.

Comment on lines 255 to 265
    if state.revisionStates != nil &&
        state.revisionStates.Installed != nil &&
        state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
        state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
        l.Info("skipping unpack - using installed bundle content")
        // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
        // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
        // for nil contentFS handling.
        return nil, nil
    }

Copilot AI commented Jan 12, 2026

Potential nil pointer dereference. The code accesses state.resolvedRevisionMetadata.Name and state.resolvedRevisionMetadata.Version without first checking if state.resolvedRevisionMetadata is non-nil. While this may be safe in the fallback path (line 200 sets it), it's risky if UnpackBundle is called in other contexts or if the logic changes. Consider adding a nil check: if state.resolvedRevisionMetadata != nil && state.revisionStates != nil && ...

Suggested change
    if state.revisionStates != nil &&
        state.revisionStates.Installed != nil &&
        state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
        state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
        l.Info("skipping unpack - using installed bundle content")
        // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
        // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
        // for nil contentFS handling.
        return nil, nil
    }

    if state.revisionStates != nil && state.revisionStates.Installed != nil {
        if state.resolvedRevisionMetadata != nil &&
            state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
            state.resolvedRevisionMetadata.Version == state.revisionStates.Installed.Version {
            l.Info("skipping unpack - using installed bundle content")
            // imageFS will remain nil; applier implementations MUST handle nil imageFS by using
            // existing installed content. See Helm.reconcileExistingRelease() and Boxcutter.apply()
            // for nil contentFS handling.
            return nil, nil
        }
        if state.resolvedRevisionMetadata == nil {
            // We don't have resolved metadata, but we do have an installed revision.
            // Rely on the existing installed content and skip unpacking.
            l.Info("skipping unpack - using installed bundle content (no resolved metadata)")
            return nil, nil
        }
    }
    if state.resolvedRevisionMetadata == nil {
        return nil, fmt.Errorf("resolved revision metadata is nil; cannot unpack bundle")
    }

Comment on lines 679 to 699
    deadline := time.Now().Add(timeout)
    for {
        select {
        case <-ctx.Done():
            // Return the context error if we timed out or were cancelled while waiting.
            return ctx.Err()
        case <-ticker.C:
            // Check if we've exceeded the timeout
            if time.Now().After(deadline) {
                return fmt.Errorf("timeout waiting for catalog %s to be deleted", catalogFullName)
            }

            _, err := k8sClient("get", "clustercatalog", catalogFullName)
            if err != nil {
                // Catalog is deleted - verification succeeded
                return nil
            }
        }
Copilot AI commented Jan 12, 2026

There's a potential race condition or timing issue in the deletion check logic. The ticker fires first at time T, checks if the catalog exists, then the next check happens at T+tick. However, the timeout deadline check happens after the ticker fires, meaning if the timeout expires between ticks, we won't detect it until the next tick fires. This could result in waiting up to one additional tick interval beyond the intended timeout.

Consider checking the deadline before waiting on the ticker, or restructuring the loop to check the timeout immediately after each catalog existence check.

@camilamacedo86 camilamacedo86 changed the title 🐛 Ensures Helm-based extensions keep running when their catalog is deleted or unavailable WIP 🐛 Ensures Helm-based extensions keep running when their catalog is deleted or unavailable Jan 12, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2026
This fix resolves the issues discovered by the e2e tests, ensuring
Helm extensions keep running when catalogs are deleted.

Issues Fixed:

Issue 1: Extensions fail when catalog is deleted
  Problem: Resolution failure causes complete failure
  Fix: Fall back to installed bundle when catalog unavailable

Issue 2: Resources break when catalog is deleted
  Problem: No reconciliation without catalog content
  Fix: Helm reconcileExistingRelease() maintains resources

Issue 3: Config changes rejected without catalog
  Problem: Any spec change triggers resolution failure
  Fix: Allow non-version changes to proceed with installed bundle

Issue 4: Version upgrades fail silently
  Problem: Can't distinguish catalog deletion from other errors
  Fix: Check if catalogs exist, block upgrades explicitly

How the Fix Works:

When bundle resolution fails, the controller now:
1. Checks if a bundle is already installed
2. Checks if spec is requesting a version change
3. Checks if catalogs still exist in the cluster

Decision Logic:
- Installed + No version change + Catalog deleted → Use installed bundle
- Installed + Version change + Catalog deleted → Block and retry
- Installed + No version change + Catalog exists → Retry (transient error)
- No installed bundle → Fail (can't proceed)
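The decision table above can be sketched as a pure function (all names and string outcomes are illustrative, not the controller's real types):

```go
package main

import "fmt"

// decide mirrors the four-way decision logic for a failed resolution.
func decide(installed, versionChanged, catalogExists bool) string {
	switch {
	case !installed:
		return "fail" // nothing installed, nothing to fall back to
	case catalogExists:
		return "retry" // catalog still present, treat the failure as transient
	case versionChanged:
		return "block-and-retry" // upgrades require catalog content
	default:
		return "use-installed-bundle" // keep the workload running as-is
	}
}

func main() {
	fmt.Println(decide(true, false, false))  // use-installed-bundle
	fmt.Println(decide(true, true, false))   // block-and-retry
	fmt.Println(decide(true, false, true))   // retry
	fmt.Println(decide(false, false, false)) // fail
}
```

Keeping the branch order explicit (no installed bundle first, then catalog existence, then version change) makes each row of the table map to exactly one case.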

For Helm Runtime:
- reconcileExistingRelease() uses Helm's built-in reconciliation
- Maintains resources via Server-Side Apply
- Sets up watchers to monitor and restore resources
- Returns success if reconcile succeeds (even if watch setup fails)

Controller watches ClusterCatalog resources, so reconciliation
automatically resumes when catalogs return.

Unit Tests Added:
- TestResolutionFallbackToInstalledBundle (3 scenarios)
- TestCheckCatalogsExist (5 scenarios)

All e2e tests now pass.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.



Comment on lines +255 to +258
    if state.resolvedRevisionMetadata != nil &&
        state.revisionStates != nil &&
        state.revisionStates.Installed != nil &&
        state.resolvedRevisionMetadata.Name == state.revisionStates.Installed.Name &&
Copilot AI commented Jan 12, 2026

There's a potential nil pointer dereference here. The code checks if state.revisionStates and state.revisionStates.Installed are not nil, but does not check if state.resolvedRevisionMetadata is nil before accessing its Name and Version fields on lines 257-258. Add a nil check for state.resolvedRevisionMetadata at the beginning of the condition to prevent a panic.

codecov bot commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 74.48980% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.31%. Comparing base (1fa4169) to head (ad6d701).

Files with missing lines (patch coverage):
• internal/operator-controller/applier/helm.go: 39.13% (7 missing, 7 partials) ⚠️
• ...er/controllers/clusterextension_reconcile_steps.go: 89.23% (6 missing, 1 partial) ⚠️
• internal/operator-controller/applier/boxcutter.go: 50.00% (2 missing, 2 partials) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2442      +/-   ##
==========================================
+ Coverage   73.00%   73.31%   +0.31%     
==========================================
  Files         100      100              
  Lines        7641     7728      +87     
==========================================
+ Hits         5578     5666      +88     
+ Misses       1625     1620       -5     
- Partials      438      442       +4     
Flag coverage:
• e2e: 47.90% <60.20%> (+1.05%) ⬆️
• experimental-e2e: 49.67% <55.10%> (+0.98%) ⬆️
• unit: 57.05% <53.06%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

