Skip to content

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164

Open
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix
Open

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix

Conversation

@sg00dwin
Copy link
Copy Markdown
Member

@sg00dwin sg00dwin commented May 27, 2026

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
    • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
    • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability by adding automatic retry logic for transient Kubernetes API errors across many console controllers and operator sync flows (create, update, delete, apply), reducing spurious failures and improving reconciliation robustness.
  • Tests

    • Added comprehensive unit tests validating retry/backoff behavior and retryable vs non-retryable error handling.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e79befcb-4ce2-4ebe-9672-3e5e23d8b71a

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8cda1 and 58cf789.

📒 Files selected for processing (13)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/operator/retry.go
  • pkg/console/operator/sync_v400.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/console/operator/retry.go
🚧 Files skipped from review as they are similar to previous changes (11)
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/controllers/util/retry.go
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Follow Go coding standards and patterns documented in CONVENTIONS.md
Organize imports according to conventions documented in CONVENTIONS.md
Use gofmt to format Go code with standard formatting
Run go vet checks on all Go packages

Follow Go coding standards and patterns as documented in CONVENTIONS.md, including proper import organization

Organize Go code following the repository structure: main entry point in cmd/console/main.go, API constants in pkg/api/, operator command setup in pkg/cmd/operator/, and version command in pkg/cmd/version/

**/*.go: Use gofmt for formatting Go code
Follow standard Go naming conventions
Group imports in order: standard lib, 3rd party, kube/openshift, internal (marked with comments)
Use meaningful error messages with context in Go code
Set status conditions using status.Handle* functions with type prefixes (*Degraded, *Progressing, *Available, *Upgradeable)
Use typed errors and wrap errors to preserve stack context

Files:

  • pkg/console/operator/sync_v400.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

  • Controller struct types (e.g., type *Controller struct)
  • func New*Controller( factory functions
  • factory.New().WithFilteredEventsInformers( pattern
  • .ToController( method calls
  • Sync(ctx context.Context, controllerContext factory.SyncContext) methods
  • operatorConfig.Spec.ManagementState checks
  • status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

  • Main operator sync functions (e.g., sync_v400.go content)
  • Sequential resource syncing with early returns
  • Incremental reconciliation loops
  • Multiple resourceapply.Apply*() calls in sequence
  • Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
  • Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

  • Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll
  • Deprecated patterns: Dial without DialContext
  • Error handling: missing %w in fmt.Errorf
  • Code smells: deep nesting (4+ levels), functions >100 lines
  • Magic values: unexplained numbers/strings
  • Context propagation: context.Background() instead of passed ctx
  • Missing godoc on exported functions

Files:

  • pkg/console/operator/sync_v400.go
{pkg,cmd}/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Use gofmt for code formatting on pkg and cmd directories

{pkg,cmd}/**/*.go: Format code using gofmt -w ./pkg ./cmd
Run go vet checks on all Go packages in ./pkg and ./cmd

Files:

  • pkg/console/operator/sync_v400.go
**/*sync*.go

📄 CodeRabbit inference engine (CONVENTIONS.md)

Implement sync loops (sync_v400) incrementally: start from zero, create/update missing requirements, and return to continue on next loop

Files:

  • pkg/console/operator/sync_v400.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

  • SQL: parameterized queries only; no string concatenation
  • Command: no shell=True, os.system, or backtick exec with user input
  • LDAP/XPath: escape special characters in filters
  • Path traversal: canonicalize paths, reject ../
  • Deserialization: no pickle/yaml.load()/eval on untrusted data
  • Prototype pollution: no recursive merge of untrusted objects
  • Validate at trust boundaries with allow-lists, not deny-lists
  • Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

  • pkg/console/operator/sync_v400.go
🔇 Additional comments (11)
pkg/console/operator/sync_v400.go (11)

9-9: LGTM!


265-271: LGTM!


278-286: LGTM!


324-335: LGTM!


435-442: LGTM!


518-528: LGTM!


541-551: LGTM!


558-568: LGTM!


581-591: LGTM!


786-789: LGTM!


849-854: LGTM!


Walkthrough

Adds a transient-error retry utility and applies it to controller and operator Kubernetes write operations (create/update/apply/delete), replacing single-attempt calls with retries and propagating final retry errors into existing failure-reason flows.

Changes

Transient Error Retry Integration

Layer / File(s) Summary
Retry infrastructure
pkg/console/controllers/util/retry.go, pkg/console/controllers/util/retry_test.go
Defines TransientBackoff, IsRetryableError, and RetryOnTransientError(fn) with V(4) logging; adds unit tests for classification and retry behavior.
Operator retry helper & sync changes
pkg/console/operator/retry.go, pkg/console/operator/sync_v400.go
Adds an operator-local retryOnTransientError delegating to the controller retry util; wraps ConfigMap/Secret/Deployment apply and status update calls in retry wrappers and removes prior conflict-retry import.
Controller write-paths wrapped with retry
pkg/console/controllers/... (clidownloads, downloadsdeployment, oauthclients, oauthclientsecret, poddisruptionbudget, service, serviceaccounts, storageversionmigration, upgradenotification)
Nine controllers replace direct create/update/apply/delete calls with util.RetryOnTransientError wrappers, capturing and returning the retried error in existing failure-reason paths.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 11

❌ Failed checks (1 warning, 10 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Stable And Deterministic Test Names ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Microshift Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ote Binary Stdout Contract ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: adding retry logic for transient API errors to prevent Degraded status flips, directly matching the primary objective of the PR.
Description check ✅ Passed The PR description covers Analysis/Root cause (intermittent Degraded flips during upgrades), Solution description (shared retry utility wrapping API calls), and Testing (unit tests pass, make test-unit clean). However, it lacks Browser conformance section, explicit Test setup/cases, and Reviewers/assignees section as required by the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jhadvig and spadgett May 27, 2026 18:31
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sg00dwin
Once this PR has been reviewed and has the lgtm label, please assign jhadvig for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sg00dwin
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.

  • Tests

  • Added comprehensive test coverage validating transient error retry behavior and backoff policies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/console/controllers/service/controller.go (1)

144-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate redirect sync errors instead of shadowing svcErr.

At Line 144, redirectSvcErrReason, svcErr := ... shadows the outer svcErr. Line 148 then returns the old outer value, so redirect failures are not returned from Sync.

Suggested fix
-	if c.serviceName == api.OpenShiftConsoleServiceName {
-		redirectSvcErrReason, svcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
-		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, svcErr))
-	}
-
-	return statusHandler.FlushAndReturn(svcErr)
+	if c.serviceName == api.OpenShiftConsoleServiceName {
+		redirectSvcErrReason, redirectSvcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
+		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, redirectSvcErr))
+		if redirectSvcErr != nil {
+			return statusHandler.FlushAndReturn(redirectSvcErr)
+		}
+	}
+
+	return statusHandler.FlushAndReturn(svcErr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/controllers/service/controller.go` around lines 144 - 149, The
redirect sync call is shadowing the outer svcErr (redirectSvcErrReason, svcErr
:= c.SyncRedirectService...), causing the function to return the old svcErr;
change the declaration to assign to the existing outer variable (use = instead
of :=) or use a distinct temporary name and then set the outer svcErr = the
returned error; ensure the call to SyncRedirectService (and the variables
redirectSvcErrReason and svcErr) updates the same svcErr that FlushAndReturn
uses and keep statusHandler.AddConditions using the updated svcErr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/console/operator/sync_v400.go`:
- Around line 517-527: The Create retry may succeed server-side then later
return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and
SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the
createErr handling after retryOnTransientError (the block that sets actual and
createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
- Around line 264-270: The retry logic reuses stale object instances causing
Conflict retries to fail; inside retryOnTransientError closures for
SyncConsoleConfig (UpdateStatus) and for SyncServiceCAConfigMap /
SyncTrustedCAConfigMap (ConfigMaps Update) refetch the latest resource from the
API (not reuse the original updated/existing struct) before calling UpdateStatus
or Update so each attempt uses current resourceVersion; additionally, in the
Create wrappers used in those Sync* functions treat AlreadyExists as success by
fetching the existing object and proceeding (instead of returning FailedCreate),
and keep IsRetryableError behavior but ensure Update/UpdateStatus retries
operate on freshly fetched objects rather than the initial stale instance.

---

Outside diff comments:
In `@pkg/console/controllers/service/controller.go`:
- Around line 144-149: The redirect sync call is shadowing the outer svcErr
(redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function
to return the old svcErr; change the declaration to assign to the existing outer
variable (use = instead of :=) or use a distinct temporary name and then set the
outer svcErr = the returned error; ensure the call to SyncRedirectService (and
the variables redirectSvcErrReason and svcErr) updates the same svcErr that
FlushAndReturn uses and keep statusHandler.AddConditions using the updated
svcErr.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 12887e8d-b1d9-4882-b7b9-b035820dd2e0

📥 Commits

Reviewing files that changed from the base of the PR and between c54e288 and 1a8cda1.

📒 Files selected for processing (13)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/operator/retry.go
  • pkg/console/operator/sync_v400.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{yaml,yml,go}

📄 CodeRabbit inference engine (Custom checks)

Topology-Aware Scheduling Compatibility: Deployment manifests, operator code, and controllers should not introduce scheduling constraints that assume a standard HA topology with 3+ control-plane nodes and dedicated workers. Flag required pod anti-affinity without topology checks, topology spread constraints with DoNotSchedule that exceed schedulable node counts, replica counts derived from node count without considering SNO/TNF/TNA, nodeSelector targeting control-plane nodes, broad tolerations that could schedule to arbiter nodes on TNA, code targeting only worker nodes, or PodDisruptionBudgets designed for 3+ nodes.

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

  • Controller struct types (e.g., type *Controller struct)
  • func New*Controller( factory functions
  • factory.New().WithFilteredEventsInformers( pattern
  • .ToController( method calls
  • Sync(ctx context.Context, controllerContext factory.SyncContext) methods
  • operatorConfig.Spec.ManagementState checks
  • status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

  • Main operator sync functions (e.g., sync_v400.go content)
  • Sequential resource syncing with early returns
  • Incremental reconciliation loops
  • Multiple resourceapply.Apply*() calls in sequence
  • Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
  • Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

  • Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll
  • Deprecated patterns: Dial without DialContext
  • Error handling: missing %w in fmt.Errorf
  • Code smells: deep nesting (4+ levels), functions >100 lines
  • Magic values: unexplained numbers/strings
  • Context propagation: context.Background() instead of passed ctx
  • Missing godoc on exported functions

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

  • SQL: parameterized queries only; no string concatenation
  • Command: no shell=True, os.system, or backtick exec with user input
  • LDAP/XPath: escape special characters in filters
  • Path traversal: canonicalize paths, reject ../
  • Deserialization: no pickle/yaml.load()/eval on untrusted data
  • Prototype pollution: no recursive merge of untrusted objects
  • Validate at trust boundaries with allow-lists, not deny-lists
  • Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*_test.go

📄 CodeRabbit inference engine (Custom checks)

**/*_test.go: IPv6 and Disconnected Network Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that contain IPv4 assumptions such as hardcoded IPv4 addresses, IPv4-only parsing logic, IPv4-specific service/endpoint creation, or hardcoded IPv4 network policies. Also flag tests requiring external connectivity to public internet services, pulling images from public registries, or DNS resolution of public hostnames.
MicroShift Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that reference OpenShift-specific APIs unavailable on MicroShift (except Route and SecurityContextConstraints), such as Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM, MachineSet/Machine, ClusterAutoscaler, Console, Monitoring components, ImageRegistry operator, Samples operator, OperatorHub, CloudCredential, or Storage/Network operators.
OTE Binary Stdout Contract: OpenShift Tests Extension (OTE) binaries must communicate with openshift-tests via JSON on stdout only. Flag any stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers). Common violations include klog writing to stdout (must redirect to stderr via klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), and fmt.Print/Println/Printf to stdout in main or suite setup.
Single Node OpenShift (SNO) Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that make multi-node or HA assumptions such as expecting multiple control-plane nodes, multiple worker nodes, pod anti-affinity across nodes, node-to-node communication patterns, leader election failover across replicas, pod rescheduling to different nodes, node scaling operations, or separate node roles.
Stable and Deterministic Te...

Files:

  • pkg/console/controllers/util/retry_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Review test code for quality and patterns.

Refer to /unit-test-review when test is in pkg//*_test.go:**

  • Table-driven test structure with test cases
  • Use of go-test/deep for struct comparisons
  • Test naming conventions (TestFunctionName)
  • Error handling with wantErr pattern
  • Edge case coverage (nil, empty, boundary values)
  • Proper assertions with helpful error messages
  • Test isolation (no shared mutable state)

Refer to /e2e-test-review when test contains:

  • framework.MustNewClientset(t, nil) or similar e2e framework usage
  • wait.Poll or wait.PollImmediate patterns
  • retry.RetryOnConflict for updates
  • Cleanup via defer functions
  • Console/operator CR manipulations
  • Test assertions on cluster state

Suggest to use /e2e-test-review when:

  • PR adds new feature requiring e2e coverage
  • Test file is empty or skeleton
  • Comments indicate "TODO: add test"

Review for common issues:

  • Missing cleanup (defer statements)
  • Using time.Sleep instead of wait.Poll
  • Missing context timeouts
  • Vague error messages in assertions
  • Tests without table-driven structure when testing multiple cases
  • Ignoring errors with _
  • Tests without assertions

Files:

  • pkg/console/controllers/util/retry_test.go
🔇 Additional comments (12)
pkg/console/controllers/util/retry.go (1)

13-51: LGTM!

pkg/console/controllers/util/retry_test.go (1)

11-165: LGTM!

pkg/console/controllers/clidownloads/controller.go (1)

231-240: LGTM!

Also applies to: 257-266

pkg/console/controllers/downloadsdeployment/controller.go (1)

121-134: LGTM!

pkg/console/controllers/oauthclients/oauthclients.go (1)

229-233: LGTM!

pkg/console/controllers/oauthclientsecret/oauthclientsecret.go (1)

164-167: LGTM!

pkg/console/controllers/poddisruptionbudget/controller.go (1)

100-103: LGTM!

pkg/console/controllers/serviceaccounts/controller.go (1)

155-166: LGTM!

pkg/console/controllers/storageversionmigration/controller.go (1)

17-17: LGTM!

Also applies to: 107-109

pkg/console/controllers/upgradenotification/controller.go (1)

134-147: LGTM!

pkg/console/operator/retry.go (1)

8-10: LGTM!

pkg/console/operator/sync_v400.go (1)

277-285: LGTM!

Also applies to: 323-334, 434-441, 844-849

Comment thread pkg/console/operator/sync_v400.go
Comment on lines +517 to +527
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify create retry call sites.
rg -n "retryOnTransientError\\(func\\(\\) error \\{[\\s\\S]*?Create\\(" pkg/console/operator/sync_v400.go -U

# Verify how AlreadyExists is classified in retry utility.
rg -n "AlreadyExists|IsAlreadyExists|IsRetryableError" pkg/console/controllers/util/retry.go -C3

Repository: openshift/console-operator

Length of output: 11743


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show exact code around the flagged create retry blocks.
sed -n '500,590p' pkg/console/operator/sync_v400.go

# Locate and show retry helper definitions (name + implementation).
rg -n "retryOnTransientError|RetryOnTransientError" pkg/console/operator/sync_v400.go pkg/console/controllers/util/retry.go
sed -n '1,120p' pkg/console/controllers/util/retry.go

Repository: openshift/console-operator

Length of output: 5929


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "retryOnTransientError" pkg/console/operator/sync_v400.go pkg/console -S

Repository: openshift/console-operator

Length of output: 1718


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,160p' pkg/console/operator/retry.go

Repository: openshift/console-operator

Length of output: 276


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "\"FailedCreate\"|FailedCreate" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 233


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "SyncServiceCAConfigMap\\(|SyncTrustedCAConfigMap\\(" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 756


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '120,220p' pkg/console/operator/sync_v400.go

Repository: openshift/console-operator

Length of output: 4300


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func HandleProgressingOrDegraded\\(" -S pkg/console-operator pkg || true
rg -n "HandleProgressingOrDegraded\\(" -S pkg/console/operator pkg/console/controllers pkg | head

Repository: openshift/console-operator

Length of output: 2161


🏁 Script executed:

sed -n '80,180p' pkg/console/status/status.go

Repository: openshift/console-operator

Length of output: 4251


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func IsSyncError\\(|IsSyncError\\(" pkg/console -S

Repository: openshift/console-operator

Length of output: 205


🏁 Script executed:

sed -n '1,120p' pkg/console/errors/sync_errors.go

Repository: openshift/console-operator

Length of output: 920


Avoid Degraded blips when ConfigMap Create succeeded but retry hits AlreadyExists

  • SyncServiceCAConfigMap and SyncTrustedCAConfigMap retry ConfigMaps(...).Create() via retryOnTransientError; apierrors.IsAlreadyExists is treated as non-retryable, so a second attempt that returns AlreadyExists leads to returning "FailedCreate".
  • With a transient failure after server-side success, this returns a non-nil error into status.HandleProgressingOrDegraded(...), setting Degraded=True for ServiceCASync/TrustedCASync even though the ConfigMap exists.
Suggested fix pattern
  createErr := retryOnTransientError(func() error {
    var e error
    actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
    return e
  })
+ if apierrors.IsAlreadyExists(createErr) {
+   actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
+   if createErr == nil {
+     return actual, "", nil
+   }
+ }
  if createErr == nil {
    return actual, "", nil
  }

Apply the same AlreadyExists -> Get handling to both the service-ca and trusted-ca Create retries in pkg/console/operator/sync_v400.go.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if apierrors.IsAlreadyExists(createErr) {
actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
if createErr == nil {
return actual, "", nil
}
}
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", nil
}
return actual, "FailedCreate", createErr
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/operator/sync_v400.go` around lines 517 - 527, The Create retry
may succeed server-side then later return apierrors.IsAlreadyExists, causing
SyncServiceCAConfigMap and SyncTrustedCAConfigMap to surface a create error and
flip Degraded; update the createErr handling after retryOnTransientError (the
block that sets actual and createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.

Copy link
Copy Markdown
Member Author

@sg00dwin sg00dwin May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] Same reasoning as the reply above — the Create paths here are only reached when the lister returned NotFound. AlreadyExists is already classified as non-retryable in IsRetryableError, so the retry stops immediately on that error with no wasted attempts.

The phantom-success scenario (server-side Create succeeds but client gets a timeout, retry hits AlreadyExists) is a narrow race that self-heals on the next 60-second reconcile. Adding a Get-on-AlreadyExists fallback would add complexity for an edge case that resolves itself within one sync cycle. The original code before this PR had the same behavior — AlreadyExists on Create was always returned as a failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sg00dwin, fair point — since AlreadyExists is already classified as non-retryable, the retry wrapper exits immediately on that error without looping, and the phantom-success race is both narrow and self-correcting within one reconcile cycle. The added complexity of a Get-on-AlreadyExists fallback isn't justified for such an edge case. Understood, withdrawing the suggestion.

@sg00dwin
Copy link
Copy Markdown
Member Author

/test e2e-aws-console

1 similar comment
@sg00dwin
Copy link
Copy Markdown
Member Author

/test e2e-aws-console

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
  blips

  Wrap all API write operations (Apply, Create, Update, Delete) with
  retry logic to absorb transient errors (conflicts, timeouts, connection
  refused) during upgrades before they reach status reporting. Without
  retry, a single transient failure sets Degraded=True for ~1 minute
  until the next sync resolves it, causing CI test failures.

  Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max).
  Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried.

  Assisted by Claude Code (Opus 4.6)
@sg00dwin sg00dwin force-pushed the OCPBUGS-38676-degraded-blips-fix branch from 1a8cda1 to 58cf789 Compare May 29, 2026 13:58
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@sg00dwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn 58cf789 link true /test e2e-gcp-ovn
ci/prow/e2e-aws-console 58cf789 link true /test e2e-aws-console

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -263,14 +262,27 @@ func (co *consoleOperator) SyncConsoleConfig(ctx context.Context, consoleConfig
klog.V(4).Infof("updating console.config.openshift.io with url: %v", consoleURL)
updated := consoleConfig.DeepCopy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be in the closure, otherwise e.g on a 409 every retry will use the stale (conflicting data)

var actual *corev1.ConfigMap
updateErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I think if existing is stale we will fail on 409s

var actual *corev1.ConfigMap
updateErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update can also be stale on retry here

var actualCLIDownloads *v1.ConsoleCLIDownload
updateErr := controllersutil.RetryOnTransientError(func() error {
var e error
actualCLIDownloads, e = consoleClient.Update(ctx, existingCLIDownloadsCopy, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything changes this underneath us we will 409, and the retry doesn't support another get :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants