
CCO-738, CCO-739, CCO-740: Overhaul status logic to correctly match spec#1008

Open
dlom wants to merge 1 commit into openshift:master from dlom:CCO-738-redux

Conversation

@dlom
Contributor

@dlom dlom commented Apr 1, 2026

xref: CCO-738
xref: CCO-739
xref: CCO-740

This PR is a combination of #977 and #978

Progressing:

  • Detect progress via generation and resource version changes, not provisioning status
  • Report Progressing=True on operator version change
  • Track root secret and infrastructure resource version changes

Degraded:

  • Suppress Degraded when Progressing is active
  • Enforce 20-minute ProgressingTooLong timeout in status controller

Available:

  • Only report Available=False on pod-identity-webhook when not progressing

(based on the spec from https://github.com/openshift/api/blob/2757d677e9aaac40a9985e5a6a17b31c7c577ba8/config/v1/types_cluster_operator.go#L151)

Assisted-By: Claude Opus 4.6 noreply@anthropic.com
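The generation-mismatch rule for Progressing described above can be sketched as follows. This is a minimal Go sketch with simplified stand-in types; the real fields live on the CredentialsRequest API type, and `isProgressing` is a hypothetical helper, not the PR's actual function:

```go
package main

// CredentialsRequestStatus is a simplified stand-in for the status fields this
// PR's logic compares; the real types live in the operator's API packages.
type CredentialsRequestStatus struct {
	LastSyncGeneration int64
	Provisioned        bool
}

type CredentialsRequest struct {
	Generation int64
	Status     CredentialsRequestStatus
}

// isProgressing detects progress from a spec-generation mismatch rather than
// from the provisioning status: a CR whose spec changed since the last
// successful sync is progressing even if it was previously provisioned.
func isProgressing(cr CredentialsRequest) bool {
	return cr.Generation != cr.Status.LastSyncGeneration
}
```

Under this rule, a CR with `Generation: 4` and `LastSyncGeneration: 3` reports progressing even with `Provisioned: true`, while a fully synced CR does not, regardless of its provisioning flag.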

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 1, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 1, 2026

@dlom: This pull request references CCO-738 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

This pull request references CCO-739 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

This pull request references CCO-740 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

xref: CCO-738
xref: CCO-739
xref: CCO-740

Progressing:

  • Detect progress via generation and resource version changes, not provisioning status
  • Report Progressing=True on operator version change
  • Track root secret and infrastructure resource version changes

Degraded:

  • Suppress Degraded when Progressing is active
  • Enforce 20-minute ProgressingTooLong timeout in status controller

Available:

  • Only report Available=False on pod-identity-webhook when not progressing

(based on the spec from https://github.com/openshift/api/blob/2757d677e9aaac40a9985e5a6a17b31c7c577ba8/config/v1/types_cluster_operator.go#L151)

Assisted-By: Claude Opus 4.6 noreply@anthropic.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Apr 1, 2026

Walkthrough

Status logic for CredentialsRequest now treats generation mismatches as progressing; pod-identity reconciler derives operator conditions from deployment rollout state and preserves deployment-derived conditions on errors; status controller tracks version-change progressing with timeout-to-degraded and suppresses degraded/available during cluster upgrades. Multiple tests updated/added and clarifying comments inserted.

Changes

Cohort / File(s) Summary
CredentialsRequest Status & Tests
pkg/operator/credentialsrequest/status.go, pkg/operator/credentialsrequest/status_test.go, pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
Progressing is now computed from CredentialsRequest.Generation != Status.LastSyncGeneration (generation-mismatch counter). Removed unused reasonStaleCredentials. Tests updated to expect Upgradeable by default, added generation-aware scenarios and helpers, and removed an asserted OperatorDegraded in one in-progress test case.
Pod Identity Webhook Reconciler & Tests
pkg/operator/podidentity/podidentitywebhook_controller.go, pkg/operator/podidentity/podidentitywebhook_controller_test.go
Added lastDeployment field, podIdentityWebhookDeploymentName constant, rollout helpers (isDeploymentProgressing, hasDeploymentFinishedProgressing), deploymentConditions() and changed ReconcileResources to return the applied Deployment. Reconcile now computes/preserves deployment-derived conditions and appends Degraded on resource errors. Extensive new tests cover rollout/condition behaviors and fallback/fresh-fetch semantics.
Status Controller & Tests
pkg/operator/status/status_controller.go, pkg/operator/status/status_controller_test.go
Controller watches ClusterVersion, forces sticky OperatorProgressing on version changes (VersionChanged reason), enforces maxProgressingDuration to set OperatorDegraded=ProgressingTooLong, preserves prior degraded reasons/transition times, and suppresses Degraded/sets Available during cluster upgrades. Tests refactored and greatly expanded to cover transition-time semantics, timeouts, upgrade suppression, handler errors, and related behaviors.
CredentialsRequest Controller Comment
pkg/operator/credentialsrequest/credentialsrequest_controller.go
Added an explanatory multi-line comment describing that unlabeled root secrets produce OperatorProgressing with reason LabelsMissing as a spec-aligned steady-state while labeling completes. No functional change.
Actuator & SecretAnnotator Comments
pkg/operator/credentialsrequest/actuator/actuator.go, pkg/operator/secretannotator/status/status.go
Added clarifying comments on Upgradeable=True semantics and on degraded-mode explanations; no behavioral changes.
Other small test & helper additions
pkg/operator/podidentity/..._test.go, pkg/operator/status/..._test.go (many new tests and helpers)
Introduced new test helpers (e.g., errorHandler, pointer helpers, assertion helpers) and numerous new/modified unit and integration-style tests to validate the changed condition logic across controllers.
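The rollout helpers summarized above follow the standard Kubernetes deployment-progress test. A sketch with a simplified local struct standing in for the `appsv1.Deployment` fields (the struct and field names here are illustrative stand-ins, not the PR's actual types):

```go
package main

// DeploymentState is a simplified stand-in for the appsv1.Deployment fields a
// rollout check consults: metadata generation, observed generation, and the
// replica counters from the deployment status.
type DeploymentState struct {
	Generation         int64 // metadata.generation (latest spec)
	ObservedGeneration int64 // status.observedGeneration
	DesiredReplicas    int32 // spec.replicas
	UpdatedReplicas    int32 // status.updatedReplicas
	AvailableReplicas  int32 // status.availableReplicas
}

// isDeploymentProgressing reports whether a rollout is still underway: the
// controller has not yet observed the latest spec, or not every replica has
// been updated and become available.
func isDeploymentProgressing(d DeploymentState) bool {
	if d.ObservedGeneration < d.Generation {
		return true
	}
	return d.UpdatedReplicas < d.DesiredReplicas || d.AvailableReplicas < d.UpdatedReplicas
}
```

Deriving operator conditions from this check lets the reconciler report Progressing during a webhook rollout and defer Available=False until the rollout has settled.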

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes



@openshift-ci
Contributor

openshift-ci bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dlom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 1, 2026
@dlom
Contributor Author

dlom commented Apr 1, 2026

@jianping-shu this PR combines my two original PRs into one unit. It is ready for you to test.


@dlom
Contributor Author

dlom commented Apr 1, 2026

/assign @jstuever


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/operator/credentialsrequest/status_test.go (1)

98-153: ⚠️ Potential issue | 🟠 Major

These new cases still don't assert the absence rules they describe.

The table harness only checks that listed conditions exist; it never fails on extras, and empty expectedConditions becomes a no-op. These cases can still pass if computeStatusConditions returns unexpected OperatorDegraded or OperatorProgressing. Please assert the exact condition set, or add explicit absence checks for the disallowed conditions.

Also applies to: 165-193, 240-249
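The exact-set assertion this comment asks for can be sketched like this. `assertExactConditions` is a hypothetical helper with a simplified condition type; the real tests would compare the `configv1.ClusterOperatorStatusCondition` values returned by `computeStatusConditions`:

```go
package main

import (
	"fmt"
	"sort"
)

// Condition is a simplified stand-in for configv1.ClusterOperatorStatusCondition.
type Condition struct {
	Type   string
	Status string
}

// assertExactConditions fails unless got and want contain exactly the same
// condition set. A plain subset check would still pass when an unexpected
// OperatorDegraded or OperatorProgressing sneaks into the result.
func assertExactConditions(got, want []Condition) error {
	key := func(c Condition) string { return c.Type + "=" + c.Status }
	g := make([]string, 0, len(got))
	for _, c := range got {
		g = append(g, key(c))
	}
	w := make([]string, 0, len(want))
	for _, c := range want {
		w = append(w, key(c))
	}
	sort.Strings(g)
	sort.Strings(w)
	if fmt.Sprint(g) != fmt.Sprint(w) {
		return fmt.Errorf("condition set mismatch: got %v, want %v", g, w)
	}
	return nil
}
```

Because both sides are canonicalized and compared whole, an empty expected set now asserts that *no* conditions were returned instead of degenerating into a no-op.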

🧹 Nitpick comments (1)
pkg/operator/podidentity/podidentitywebhook_controller_test.go (1)

494-513: Exercise Reconcile instead of restating its branches.

The blocks at Line 494 and Line 534 duplicate the production condition logic, so they can still pass while staticResourceReconciler.Reconcile regresses on stateful behavior like carrying an old condition slice into the next reconcile. Calling Reconcile with a fake setup, or extracting the condition builder into a shared helper, would make this coverage meaningful.

Also applies to: 534-577

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 298-314: The current logic sets r.conditions when
!deploymentProgressing but leaves r.conditions unchanged when
deploymentProgressing is true, which can leave stale
OperatorDegraded/OperatorAvailable states; update the reconcile logic around
deploymentProgressing (and references to r.lastDeployment and
isDeploymentProgressing) so that when deploymentProgressing is true you
explicitly clear or remove any existing configv1.OperatorDegraded and
configv1.OperatorAvailable entries from r.conditions (or reset r.conditions to
nil/empty) to avoid returning stale conditions from prior reconciles.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6527227e-c8ea-49e5-a026-ef4b7f81e53b

📥 Commits

Reviewing files that changed from the base of the PR and between 798098a and 987683c.

📒 Files selected for processing (7)
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/credentialsrequest/status.go
  • pkg/operator/credentialsrequest/status_test.go
  • pkg/operator/podidentity/podidentitywebhook_controller.go
  • pkg/operator/podidentity/podidentitywebhook_controller_test.go
  • pkg/operator/status/status_controller.go
  • pkg/operator/status/status_controller_test.go

Comment on lines +165 to +170
rootSecretChanged := rootSecretResourceVersion != "" &&
	cr.Status.LastSyncCloudCredsSecretResourceVersion != "" &&
	rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
infraChanged := infraResourceVersion != "" &&
	cr.Status.LastSyncInfrastructureResourceVersion != "" &&
	infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion

⚠️ Potential issue | 🟠 Major

Legacy CRs with empty last-sync resource versions won't report Progressing.

These guards diverge from pkg/operator/credentialsrequest/credentialsrequest_controller.go:839-927, which reconciles when currentRV != cr.Status.LastSync... even if the stored value is empty. Existing CredentialsRequests that haven't backfilled these fields yet will resync after a root-secret or infrastructure update while GetConditions still reports steady state.

Suggested alignment
-		rootSecretChanged := rootSecretResourceVersion != "" &&
-			cr.Status.LastSyncCloudCredsSecretResourceVersion != "" &&
-			rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
-		infraChanged := infraResourceVersion != "" &&
-			cr.Status.LastSyncInfrastructureResourceVersion != "" &&
-			infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion
+		rootSecretChanged := rootSecretResourceVersion != "" &&
+			rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
+		infraChanged := infraResourceVersion != "" &&
+			infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 93.83562% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.83%. Comparing base (798098a) to head (e8288a2).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
pkg/operator/credentialsrequest/status.go 86.66% 3 Missing and 1 partial ⚠️
...rator/podidentity/podidentitywebhook_controller.go 94.23% 3 Missing ⚠️
pkg/operator/status/status_controller.go 96.87% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1008      +/-   ##
==========================================
+ Coverage   46.26%   46.83%   +0.56%     
==========================================
  Files          98       98              
  Lines       12264    12391     +127     
==========================================
+ Hits         5674     5803     +129     
+ Misses       5939     5938       -1     
+ Partials      651      650       -1     
Files with missing lines Coverage Δ
...g/operator/credentialsrequest/actuator/actuator.go 39.13% <ø> (ø)
...redentialsrequest/credentialsrequest_controller.go 45.86% <ø> (ø)
pkg/operator/secretannotator/status/status.go 0.00% <ø> (ø)
pkg/operator/status/status_controller.go 65.14% <96.87%> (+13.57%) ⬆️
...rator/podidentity/podidentitywebhook_controller.go 40.17% <94.23%> (+12.95%) ⬆️
pkg/operator/credentialsrequest/status.go 69.29% <86.66%> (+2.62%) ⬆️


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/operator/status/status_controller.go (1)

262-307: ⚠️ Potential issue | 🟠 Major

VersionChanged progressing does not persist long enough for the timeout to work.

This block sets OperatorProgressing=True in the same update that writes the new co.Status.Versions, so the next ClusterOperator reconcile sees versions equal again and clears the synthetic progressing state unless another handler keeps it true. That means version-only upgrades never age into ProgressingTooLong. Line 273 also rewrites LastTransitionTime unconditionally, which grants a fresh 20-minute window if the operator was already progressing for some other reason.

This needs a sticky rollout signal for version changes, and LastTransitionTime should only move when OperatorProgressing actually flips to True.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 301-304: The check that computes deploymentProgressing reads
r.lastDeployment (via isDeploymentProgressing) which may be stale because
r.lastDeployment is only refreshed after ApplyDeployment succeeds; update the
code to fetch or refresh the current Deployment before computing
deploymentProgressing (or refresh r.lastDeployment immediately before any
early-return fallible steps) so isDeploymentProgressing uses up-to-date state;
specifically locate uses of r.lastDeployment and the call to
isDeploymentProgressing (and the ApplyDeployment call) and ensure you call the
current Deployment getter/update r.lastDeployment prior to the error-prone
reconciliation steps that may return early so Degraded/Progressing aren’t
suppressed by stale data.
- Around line 305-317: The current code overwrites r.conditions with either an
empty slice or only Degraded=True, which drops deployment-derived conditions
like Progressing and Available; instead, preserve/build the
rollout/unavailability conditions first (the existing conditions that reflect
deploymentProgressing/unavailable state) and then append the Degraded condition
only when deploymentProgressing is false: do not assign r.conditions = [] or
replace it wholesale; use the existing r.conditions (or construct the
rollout/unavailable conditions) and conditionally append the
configv1.ClusterOperatorStatusCondition with Reason
reasonStaticResourceReconcileFailed and the error message when
!deploymentProgressing so Progressing/Available are retained.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 70db941a-900b-4b1b-8215-e90b3bd5a8e2

📥 Commits

Reviewing files that changed from the base of the PR and between 987683c and 926cdbe.

📒 Files selected for processing (8)
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/credentialsrequest/status.go
  • pkg/operator/credentialsrequest/status_test.go
  • pkg/operator/podidentity/podidentitywebhook_controller.go
  • pkg/operator/podidentity/podidentitywebhook_controller_test.go
  • pkg/operator/status/status_controller.go
  • pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/podidentity/podidentitywebhook_controller_test.go

@jianping-shu

Testing in progress...

@jianping-shu

jianping-shu commented Apr 2, 2026

@dlom I tested with OCP-88454. Please review whether the outputs are expected.
In particular, I have two findings to confirm with you:

  1. In step "Delete 1 wrong CR" of Case 3, the Degraded condition was NOT changed, but its lastTransitionTime was updated. Similar to OCP-45975, it seems it should NOT be updated.
  2. In Case 5, the logic detected that the root credential is insufficient to mint the component credential, and the Degraded condition became True immediately (without the 20-minute delay). Is that expected?

@dlom
Contributor Author

dlom commented Apr 2, 2026

@jianping-shu

  1. This was a bug, I believe I have fixed it
  2. This is expected behavior

I'll be pushing up an updated version shortly


@jstuever jstuever left a comment


Partial review

Comment on lines +52 to +58
var infraResourceVersion string
infra, err := utils.GetInfrastructure(r.Client)
if err != nil {
	logger.WithError(err).Debug("error retrieving infrastructure for status")
} else {
	infraResourceVersion = infra.ResourceVersion
}

It is not clear why we are watching infra version.

Comment on lines +41 to +50
var rootSecretResourceVersion string
rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
rootSecret := &corev1.Secret{}
if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
if !apierrors.IsNotFound(err) {
logger.WithError(err).Debug("error retrieving root credentials secret for status")
}
} else {
rootSecretResourceVersion = rootSecret.ResourceVersion
}


Can we expect a root credential in all situations?

// the controller treats a non-nil root secret with a resource version
// different from the stored one as a trigger to resync, even when the
// stored version is empty (legacy CR that hasn't backfilled yet).
rootSecretChanged := rootSecretResourceVersion != "" &&


Is this for passthrough mode? I don't think rootSecret is relevant to other modes. What about Mint mode?

if failingCredRequests > 0 {
// Spec: must not report Degraded during a normal upgrade. When progressing,
// the 20-minute ProgressingTooLong timeout handles persistent failures.
if failingCredRequests > 0 && !isProgressing {


isProgressing != isUpdating. The cluster can be updating beyond the scope of the operator's progressing timeline.

progressingCondition.Message = fmt.Sprintf(
"%d of %d credentials requests provisioned, %d reporting errors.",
len(validCredRequests)-credRequestsNotProvisioned, len(validCredRequests), failingCredRequests)
"%d of %d credentials requests are progressing.",


I have concerns about this specific message changing. I realize the underlying structure is different. I just want to ensure we don't significantly change user experience here (we are all used to looking for this) without proven improvements.

Comment on lines +306 to +308
// Clear stale conditions from a previous reconcile (e.g. Degraded=True
// or Available=False) so they aren't reported while progressing.
r.conditions = []configv1.ClusterOperatorStatusCondition{}


Progressing != upgrading here too.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/operator/status/status_controller_test.go (1)

390-391: Minor typo in test data: "handerB" should be "handlerB".

This won't affect test correctness since the reason isn't asserted (due to non-deterministic handler order), but fixing it improves consistency.

✏️ Suggested fix
-						Reason: "handerB reasons",
+						Reason: "handlerB reasons",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status/status_controller_test.go` around lines 390 - 391, Fix
the minor typo in the test data by changing the Reason string literal from
"handerB reasons" to "handlerB reasons" (replace the occurrence of Reason:
"handerB reasons" in the test that constructs the handlerB status object so the
Reason reads "handlerB reasons").
pkg/operator/credentialsrequest/status.go (1)

163-175: Share this stale/progress predicate with the reconciler.

This block duplicates the resync trigger from pkg/operator/credentialsrequest/credentialsrequest_controller.go:848-860. The earlier empty-last-sync RV mismatch is exactly the kind of drift this invites again, so a shared helper would make status and reconciliation safer to evolve together.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/credentialsrequest/status.go` around lines 163 - 175, Extract
the resync predicate used here (the specChanged/rootSecretChanged/infraChanged
logic in the loop over validCredRequests) into a shared function (e.g.,
ShouldResyncCredentialsRequest or needsResync) in the credentialsrequest package
and call that helper from both this status loop and the reconciler (the logic
currently duplicated in credentialsrequest_controller.go). The helper should
accept the CredentialsRequest object and the rootSecretResourceVersion and
infraResourceVersion strings and return a boolean; replace the three local bools
and the combined if with a single call to that helper to determine when to
increment credRequestsProgressing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 44-57: The status code currently swallows read errors from
AdminClient.Get (rootSecret) and utils.GetInfrastructure and proceeds with empty
resource versions so computeStatusConditions can incorrectly clear
OperatorProgressing; update the logic in status.go so that when AdminClient.Get
or utils.GetInfrastructure return an error (other than NotFound for rootSecret)
you record that failure into the status path instead of silently continuing —
e.g., set a temporary "Progressing"/"Degraded" condition or return the error so
computeStatusConditions knows inputs are incomplete; specifically modify the
branches around AdminClient.Get (rootSecret) and utils.GetInfrastructure (infra)
to set an appropriate status condition (or propagate the error) before calling
computeStatusConditions and ensure computeStatusConditions receives nil/invalid
indicators only when intentionally absent.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 308-326: Startup retries are bypassing the Reconcile()
condition-update path because podIdentityController.Start() calls
ReconcileResources() directly; change Start() to route bootstrap reconciliation
through the same helper used by Reconcile() (so the deployment-derived
conditions from deploymentConditions() and the OperatorDegraded append logic run
on startup failures), i.e., have Reconcile() delegate to a shared reconcile
helper that performs ReconcileResources(), sets r.conditions =
r.deploymentConditions(), and appends the Degraded condition on error, then call
that helper from Start() instead of calling ReconcileResources() directly;
update or add a unit/integration test for startup failure to assert
GetConditions() reflects the Degraded condition during bootstrap retries.

---

Nitpick comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 163-175: Extract the resync predicate used here (the
specChanged/rootSecretChanged/infraChanged logic in the loop over
validCredRequests) into a shared function (e.g., ShouldResyncCredentialsRequest
or needsResync) in the credentialsrequest package and call that helper from both
this status loop and the reconciler (the logic currently duplicated in
credentialsrequest_controller.go). The helper should accept the
CredentialsRequest object and the rootSecretResourceVersion and
infraResourceVersion strings and return a boolean; replace the three local bools
and the combined if with a single call to that helper to determine when to
increment credRequestsProgressing.

In `@pkg/operator/status/status_controller_test.go`:
- Around line 390-391: Fix the minor typo in the test data by changing the
Reason string literal from "handerB reasons" to "handlerB reasons" (replace the
occurrence of Reason: "handerB reasons" in the test that constructs the handlerB
status object so the Reason reads "handlerB reasons").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7ea65a0-4c0f-40f2-91d0-aad06cdd1d02

📥 Commits

Reviewing files that changed from the base of the PR and between 926cdbe and e8288a2.

📒 Files selected for processing (10)
  • pkg/operator/credentialsrequest/actuator/actuator.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/credentialsrequest/status.go
  • pkg/operator/credentialsrequest/status_test.go
  • pkg/operator/podidentity/podidentitywebhook_controller.go
  • pkg/operator/podidentity/podidentitywebhook_controller_test.go
  • pkg/operator/secretannotator/status/status.go
  • pkg/operator/status/status_controller.go
  • pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (4)
  • pkg/operator/credentialsrequest/actuator/actuator.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/secretannotator/status/status.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/credentialsrequest/status_test.go
  • pkg/operator/status/status_controller.go

Comment on lines +44 to +57
if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
if !apierrors.IsNotFound(err) {
logger.WithError(err).Debug("error retrieving root credentials secret for status")
}
} else {
rootSecretResourceVersion = rootSecret.ResourceVersion
}

var infraResourceVersion string
infra, err := utils.GetInfrastructure(r.Client)
if err != nil {
logger.WithError(err).Debug("error retrieving infrastructure for status")
} else {
infraResourceVersion = infra.ResourceVersion


⚠️ Potential issue | 🟠 Major

Don’t collapse failed status reads into “no progress”.

Unexpected root-secret or infrastructure read failures are only logged here, then computeStatusConditions runs with empty resource versions. That can suppress OperatorProgressing even though the status inputs are incomplete.

Suggested fix
-	var rootSecretResourceVersion string
-	rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
-	rootSecret := &corev1.Secret{}
-	if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
-		if !apierrors.IsNotFound(err) {
-			logger.WithError(err).Debug("error retrieving root credentials secret for status")
-		}
-	} else {
-		rootSecretResourceVersion = rootSecret.ResourceVersion
-	}
-
-	var infraResourceVersion string
-	infra, err := utils.GetInfrastructure(r.Client)
-	if err != nil {
-		logger.WithError(err).Debug("error retrieving infrastructure for status")
-	} else {
-		infraResourceVersion = infra.ResourceVersion
-	}
+	var rootSecretResourceVersion string
+	var infraResourceVersion string
+	if mode != operatorv1.CloudCredentialsModeManual {
+		rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
+		if rootSecretLocation.Name != "" {
+			rootSecret := &corev1.Secret{}
+			if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
+				if !apierrors.IsNotFound(err) {
+					return []configv1.ClusterOperatorStatusCondition{}, fmt.Errorf("failed to get root credentials secret for status: %w", err)
+				}
+			} else {
+				rootSecretResourceVersion = rootSecret.ResourceVersion
+			}
+		}
+
+		infra, err := utils.GetInfrastructure(r.Client)
+		if err != nil {
+			return []configv1.ClusterOperatorStatusCondition{}, fmt.Errorf("failed to get infrastructure for status: %w", err)
+		}
+		infraResourceVersion = infra.ResourceVersion
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/credentialsrequest/status.go` around lines 44 - 57, The status
code currently swallows read errors from AdminClient.Get (rootSecret) and
utils.GetInfrastructure and proceeds with empty resource versions so
computeStatusConditions can incorrectly clear OperatorProgressing; update the
logic in status.go so that when AdminClient.Get or utils.GetInfrastructure
return an error (other than NotFound for rootSecret) you record that failure
into the status path instead of silently continuing — e.g., set a temporary
"Progressing"/"Degraded" condition or return the error so
computeStatusConditions knows inputs are incomplete; specifically modify the
branches around AdminClient.Get (rootSecret) and utils.GetInfrastructure (infra)
to set an appropriate status condition (or propagate the error) before calling
computeStatusConditions and ensure computeStatusConditions receives nil/invalid
indicators only when intentionally absent.

Comment on lines +308 to +326
err = r.ReconcileResources(ctx)

// Build deployment-derived conditions first (Progressing, Available) so they
// are always reported regardless of whether ReconcileResources succeeded.
// Without this, a transient error would drop Progressing=True during rollout
// or flip Available=False back to the default True.
r.conditions = r.deploymentConditions()

if err != nil {
r.logger.Errorf("reconciliation failed, retrying in %s", retryInterval.String())
r.conditions = []configv1.ClusterOperatorStatusCondition{
{
Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: reasonStaticResourceReconcileFailed,
Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
},
}
// Spec: Degraded means the component does not match its desired state.
// A reconciliation failure is a persistent error requiring investigation.
// Suppression during cluster upgrades is handled centrally in syncStatus.
r.conditions = append(r.conditions, configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorDegraded,
Status: configv1.ConditionTrue,
Reason: reasonStaticResourceReconcileFailed,
Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
})


⚠️ Potential issue | 🟠 Major

Route bootstrap retries through this same condition-update path.

Because podIdentityController.Start() still calls ReconcileResources() directly at Line 124, this block never runs while startup retries are failing. Since pkg/operator/status/status_controller.go:205-250 rebuilds defaults when a handler returns no abnormal conditions, GetConditions() can leave the ClusterOperator looking healthy until the cache starts and a watch event finally reaches Reconcile().

🔧 Suggested direction
+func (r *staticResourceReconciler) reconcileAndUpdateConditions(ctx context.Context) error {
+	currentDeployment, err := r.clientset.AppsV1().Deployments(operatorNamespace).Get(ctx, podIdentityWebhookDeploymentName, metav1.GetOptions{})
+	if err == nil {
+		r.lastDeployment = currentDeployment
+	}
+
+	err = r.ReconcileResources(ctx)
+	r.conditions = r.deploymentConditions()
+	if err != nil {
+		r.conditions = append(r.conditions, configv1.ClusterOperatorStatusCondition{
+			Type:    configv1.OperatorDegraded,
+			Status:  configv1.ConditionTrue,
+			Reason:  reasonStaticResourceReconcileFailed,
+			Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
+		})
+	}
+	return err
+}
+
 func (c *podIdentityController) Start(ctx context.Context) error {
 	retryTimer := time.NewTimer(retryInterval)
 	for {
-		err := c.reconciler.ReconcileResources(ctx)
+		err := c.reconciler.reconcileAndUpdateConditions(ctx)
 		if err != nil {
 			retryTimer.Reset(retryInterval)
 			<-retryTimer.C
 		} else {
 			break
 		}
 	}

Then have Reconcile() delegate to the same helper, and add a startup-failure test for that path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 308 -
326, Startup retries are bypassing the Reconcile() condition-update path because
podIdentityController.Start() calls ReconcileResources() directly; change
Start() to route bootstrap reconciliation through the same helper used by
Reconcile() (so the deployment-derived conditions from deploymentConditions()
and the OperatorDegraded append logic run on startup failures), i.e., have
Reconcile() delegate to a shared reconcile helper that performs
ReconcileResources(), sets r.conditions = r.deploymentConditions(), and appends
the Degraded condition on error, then call that helper from Start() instead of
calling ReconcileResources() directly; update or add a unit/integration test for
startup failure to assert GetConditions() reflects the Degraded condition during
bootstrap retries.

Progressing:
- Detect progress via generation and resource version changes, not provisioning status
- Report Progressing=True on operator version change
- Track root secret and infrastructure resource version changes
- Set Progressing=True with reason VersionChanged on operator version updates

Degraded:
- Suppress Degraded when Progressing is active
- Enforce 20-minute ProgressingTooLong timeout in status controller

Available:
- Only report Available=False on pod-identity-webhook when not progressing
- Clear stale Degraded/Available conditions on pod-identity-webhook during active rollout

(based on the spec from
https://github.com/openshift/api/blob/2757d677e9aaac40a9985e5a6a17b31c7c577ba8/config/v1/types_cluster_operator.go#L151)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
pkg/operator/podidentity/podidentitywebhook_controller.go (2)

121-125: ⚠️ Potential issue | 🟠 Major

Startup retries still never publish failure conditions.

Start() is still calling ReconcileResources() directly. While bootstrap retries are failing, r.conditions never gets updated, so GetConditions() can leave the operator looking healthy until a watch event finally reaches Reconcile().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 121 -
125, Start() currently calls reconciler.ReconcileResources() which on bootstrap
failure does not update r.conditions, so operator health remains stale; change
Start() to invoke the reconciler path that publishes conditions (call
c.reconciler.Reconcile(ctx) instead of c.reconciler.ReconcileResources(ctx)) so
failures update r.conditions and GetConditions() reflects the error, and if
Reconcile(ctx) is not available add an explicit condition update on error (e.g.
set a failure condition on the controller’s r.conditions or call the
reconciler’s condition-publish helper) before sleeping/retrying.

298-305: ⚠️ Potential issue | 🟠 Major

Most reconcile errors still erase deployment-derived state.

deploymentConditions(d) only helps if ReconcileResources() returns a deployment, but every error branch here still returns nil. Failures before or after ApplyDeployment() therefore still drop the rollout/unavailability context this change is trying to preserve.

Also applies to: 355-423

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 298 -
305, ReconcileResources may return a deployment along with an error, but current
code throws that away and sets r.conditions = deploymentConditions(d) only when
err == nil; update the reconcile path so you always call deploymentConditions(d)
with the deployment value returned by ReconcileResources (even if err != nil)
before returning, and avoid replacing d with nil on error paths (including
failures before/after ApplyDeployment and the other block covering lines
355-423); specifically preserve and use the returned deployment from
ReconcileResources and call deploymentConditions(deployment) to set r.conditions
prior to any early returns so rollout/Available state is not lost.
pkg/operator/credentialsrequest/status.go (1)

128-140: ⚠️ Potential issue | 🟠 Major

Generation-only progress detection drops external resyncs.

This now only treats Generation != LastSyncGeneration as progress. Root credential rotations and infrastructure changes do not bump CredentialsRequest.Generation, so those resyncs can reconcile and still report steady state here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/credentialsrequest/status.go` around lines 128 - 140, The
current progress check only bumps credRequestsProgressing when cr.Generation !=
cr.Status.LastSyncGeneration, which misses external resyncs; update the loop in
status.go that iterates validCredRequests to also consider external-resync
indicators (e.g., cr.Status.LastSyncTimestamp or
cr.Status.LastSyncResourceVersion versus the CR's current
metadata.resourceVersion/lastObservedTimestamp) so that
credRequestsProgressing++ happens when either the Generation differs OR the
last-sync timestamp/resourceVersion indicates a newer reconciliation; adjust
references to cr.Generation, cr.Status.LastSyncGeneration,
cr.Status.LastSyncTimestamp (or LastSyncResourceVersion) and
metadata.resourceVersion accordingly and ensure isProgressing remains
credRequestsProgressing > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 326-349: The Available=False condition is emitted even while the
rollout is still progressing; modify deploymentConditions so it checks the
result of isDeploymentProgressing (the progressing boolean from
isDeploymentProgressing(deployment)) and only append the
configv1.OperatorAvailable ConditionFalse when progressing is false and
deployment.Status.AvailableReplicas == 0. Update the logic in
deploymentConditions (and maintain the existing Message/Reason text for the
Available condition) so Progressing and Available are not both true/false
simultaneously during an active rollout.

In `@pkg/operator/status/status_controller.go`:
- Around line 352-372: The suppression block in syncStatus (using
isClusterUpgrading and findClusterOperatorCondition to flip co.Status.Conditions
for OperatorDegraded and OperatorAvailable) mutates conditions after
LastTransitionTime was already calculated, causing incorrect timestamps; fix by
either performing the upgrade-suppression before computing LastTransitionTime
(i.e., apply the UpgradeInProgress changes to co.Status.Conditions prior to the
transition-time calculation) or by carrying over the previous transition times
from oldConditions when you flip statuses (look up matching condition in
oldConditions and preserve its LastTransitionTime when you set
Status/Reason/Message for Degraded/Available).
- Around line 269-318: The branch that handles wasVersionProgressing should
clear the synthetic VersionChanged signal when no handler takes over: in the
else where handlerProgressing is false (currently reasserting Progressing=True),
set progressing.Status = configv1.ConditionFalse (or remove the progressing
condition), update progressing.LastTransitionTime = metav1.Now() and clear or
update Reason/Message so the synthetic VersionChanged no longer claims
Progressing; modify the block that references progressing, oldProgressing,
reasonVersionChanged and findClusterOperatorCondition accordingly to stop
reasserting the sticky Progressing=True when no handler is reporting progress.

---

Duplicate comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 128-140: The current progress check only bumps
credRequestsProgressing when cr.Generation != cr.Status.LastSyncGeneration,
which misses external resyncs; update the loop in status.go that iterates
validCredRequests to also consider external-resync indicators (e.g.,
cr.Status.LastSyncTimestamp or cr.Status.LastSyncResourceVersion versus the CR's
current metadata.resourceVersion/lastObservedTimestamp) so that
credRequestsProgressing++ happens when either the Generation differs OR the
last-sync timestamp/resourceVersion indicates a newer reconciliation; adjust
references to cr.Generation, cr.Status.LastSyncGeneration,
cr.Status.LastSyncTimestamp (or LastSyncResourceVersion) and
metadata.resourceVersion accordingly and ensure isProgressing remains
credRequestsProgressing > 0.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 121-125: Start() currently calls reconciler.ReconcileResources()
which on bootstrap failure does not update r.conditions, so operator health
remains stale; change Start() to invoke the reconciler path that publishes
conditions (call c.reconciler.Reconcile(ctx) instead of
c.reconciler.ReconcileResources(ctx)) so failures update r.conditions and
GetConditions() reflects the error, and if Reconcile(ctx) is not available add
an explicit condition update on error (e.g. set a failure condition on the
controller’s r.conditions or call the reconciler’s condition-publish helper)
before sleeping/retrying.
- Around line 298-305: ReconcileResources may return a deployment along with an
error, but current code throws that away and sets r.conditions =
deploymentConditions(d) only when err == nil; update the reconcile path so you
always call deploymentConditions(d) with the deployment value returned by
ReconcileResources (even if err != nil) before returning, and avoid replacing d
with nil on error paths (including failures before/after ApplyDeployment and the
other block covering lines 355-423); specifically preserve and use the returned
deployment from ReconcileResources and call deploymentConditions(deployment) to
set r.conditions prior to any early returns so rollout/Available state is not
lost.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 924d76ff-10df-47b7-bebb-d68a24f99dcc

📥 Commits

Reviewing files that changed from the base of the PR and between e8288a2 and 44f4799.

📒 Files selected for processing (10)
  • pkg/operator/credentialsrequest/actuator/actuator.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/credentialsrequest/status.go
  • pkg/operator/credentialsrequest/status_test.go
  • pkg/operator/podidentity/podidentitywebhook_controller.go
  • pkg/operator/podidentity/podidentitywebhook_controller_test.go
  • pkg/operator/secretannotator/status/status.go
  • pkg/operator/status/status_controller.go
  • pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/operator/credentialsrequest/actuator/actuator.go
  • pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
  • pkg/operator/secretannotator/status/status.go

Comment on lines +326 to +349
func deploymentConditions(deployment *appsv1.Deployment) []configv1.ClusterOperatorStatusCondition {
var conditions []configv1.ClusterOperatorStatusCondition
if deployment != nil {
progressing, msg := isDeploymentProgressing(deployment)
if progressing {
// Spec: report Progressing=True when actively rolling out new code.
conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorProgressing,
Status: configv1.ConditionTrue,
Reason: "Deploying",
Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
})
}
if deployment.Status.AvailableReplicas == 0 {
// Spec: Available=False means the component is non-functional and requires
// immediate administrator intervention. Zero available pods after the
// deployment has finished progressing indicates a real availability problem.
// Suppression during cluster upgrades is handled centrally in syncStatus.
conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorAvailable,
Status: configv1.ConditionFalse,
Reason: "DeploymentUnavailable",
Message: "pod-identity-webhook deployment has no available pods",
})


⚠️ Potential issue | 🟠 Major

Only emit Available=False after rollout stops progressing.

This branch still runs during a normal zero-ready rollout, so it can publish Progressing=True and Available=False together for the initial deployment. The intended behavior here is to suppress Available=False until the deployment is no longer progressing.

Suggested fix
 	progressing, msg := isDeploymentProgressing(deployment)
 	if progressing {
 		// Spec: report Progressing=True when actively rolling out new code.
 		conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
 			Type:    configv1.OperatorProgressing,
 			Status:  configv1.ConditionTrue,
 			Reason:  "Deploying",
 			Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
 		})
 	}
-	if deployment.Status.AvailableReplicas == 0 {
+	if !progressing && deployment.Status.AvailableReplicas == 0 {
 		// Spec: Available=False means the component is non-functional and requires
 		// immediate administrator intervention. Zero available pods after the
 		// deployment has finished progressing indicates a real availability problem.
 		// Suppression during cluster upgrades is handled centrally in syncStatus.
 		conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (diff; the only change is the added `!progressing &&` guard):

func deploymentConditions(deployment *appsv1.Deployment) []configv1.ClusterOperatorStatusCondition {
	var conditions []configv1.ClusterOperatorStatusCondition
	if deployment != nil {
		progressing, msg := isDeploymentProgressing(deployment)
		if progressing {
			// Spec: report Progressing=True when actively rolling out new code.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorProgressing,
				Status:  configv1.ConditionTrue,
				Reason:  "Deploying",
				Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
			})
		}
-		if deployment.Status.AvailableReplicas == 0 {
+		if !progressing && deployment.Status.AvailableReplicas == 0 {
			// Spec: Available=False means the component is non-functional and requires
			// immediate administrator intervention. Zero available pods after the
			// deployment has finished progressing indicates a real availability problem.
			// Suppression during cluster upgrades is handled centrally in syncStatus.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorAvailable,
				Status:  configv1.ConditionFalse,
				Reason:  "DeploymentUnavailable",
				Message: "pod-identity-webhook deployment has no available pods",
			})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 326-349: the Available=False condition is emitted even while the rollout is still progressing. Modify deploymentConditions so it checks the progressing boolean returned by isDeploymentProgressing(deployment) and only appends the configv1.OperatorAvailable ConditionFalse entry when progressing is false and deployment.Status.AvailableReplicas == 0. Keep the existing Reason/Message text for the Available condition, so that Progressing=True and Available=False are never asserted simultaneously during an active rollout.

Comment on lines +269 to +318
// Spec: report Progressing=True when actively rolling out a version change.
// The signal is made sticky via the condition itself: when Progressing has
// Reason=VersionChanged, it persists across reconciles until a handler
// reports its own Progressing=True (indicating it has taken over tracking)
// or the rollout completes (no handler reports Progressing=True on the
// subsequent reconcile after the version change was first detected).
progressing, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorProgressing)
if !reflect.DeepEqual(oldVersions, co.Status.Versions) {
	// Version just changed — set Progressing=True with VersionChanged reason.
	logger.WithFields(log.Fields{
		"old": oldVersions,
		"new": co.Status.Versions,
-	}).Info("version has changed, updating progressing condition lastTransitionTime")
-	progressing, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorProgressing)
-	// We know this should be there.
-	progressing.LastTransitionTime = metav1.Now()
+	}).Info("version has changed, setting Progressing=True")
+	if progressing != nil {
+		if progressing.Status != configv1.ConditionTrue {
+			// Only update LastTransitionTime when actually transitioning
+			// from False/Unknown to True — don't grant a fresh 20-minute
+			// window if already progressing for another reason.
+			progressing.LastTransitionTime = metav1.Now()
+		}
+		progressing.Status = configv1.ConditionTrue
+		progressing.Reason = reasonVersionChanged
+		progressing.Message = "Operator version is updating"
+	}
} else if progressing != nil {
	// Version matches — check if the previous reconcile set VersionChanged.
	oldProgressing, _ := findClusterOperatorCondition(oldConditions, configv1.OperatorProgressing)
	wasVersionProgressing := oldProgressing != nil &&
		oldProgressing.Status == configv1.ConditionTrue &&
		oldProgressing.Reason == reasonVersionChanged

	if wasVersionProgressing {
		handlerProgressing := progressing.Status == configv1.ConditionTrue
		if handlerProgressing {
			// A handler has taken over Progressing — let its reason through.
			// The LastTransitionTime is preserved by setLastTransitionTime
			// since the status hasn't changed (True → True).
		} else {
			// No handler is reporting progress — keep the sticky VersionChanged
			// signal. This ensures the Progressing=True condition persists
			// across reconciles so ProgressingTooLong can fire.
			progressing.Status = configv1.ConditionTrue
			progressing.Reason = reasonVersionChanged
			progressing.Message = "Operator version is updating"
			// Preserve the original transition time from when the version
			// change was first detected.
			progressing.LastTransitionTime = oldProgressing.LastTransitionTime
		}
	}
}

⚠️ Potential issue | 🟠 Major

Clear the synthetic VersionChanged signal when no handler takes over.

The comment here says the synthetic version-change signal should end once a later reconcile sees no handler still progressing, but this branch does the opposite and reasserts Progressing=True. That can leave the operator permanently progressing after a no-op version bump and eventually trip ProgressingTooLong after an otherwise complete rollout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status/status_controller.go` around lines 269-318: the branch that handles wasVersionProgressing should clear the synthetic VersionChanged signal when no handler takes over. In the else arm where handlerProgressing is false (which currently reasserts Progressing=True), set progressing.Status = configv1.ConditionFalse (or remove the progressing condition), update progressing.LastTransitionTime = metav1.Now(), and clear or update the Reason/Message so the synthetic VersionChanged no longer claims Progressing. Adjust the surrounding uses of progressing, oldProgressing, reasonVersionChanged, and findClusterOperatorCondition so the sticky Progressing=True is no longer reasserted when no handler reports progress.

Comment on lines +352 to +372
// Spec: "A component must not report Degraded during the course of a normal upgrade."
// Spec: "A component must not report Available=False during the course of a normal upgrade."
// The upgrade window extends beyond this operator's own Progressing period —
// other operators may still be progressing. Use the ClusterVersion Progressing
// condition as the authoritative signal for whether a cluster upgrade is active.
// This runs last so it overrides all sources, including ProgressingTooLong.
if isClusterUpgrading(kubeClient, logger) {
	degraded, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorDegraded)
	if degraded != nil && degraded.Status == configv1.ConditionTrue {
		logger.Info("suppressing Degraded condition during cluster upgrade")
		degraded.Status = configv1.ConditionFalse
		degraded.Reason = "UpgradeInProgress"
		degraded.Message = "Degraded is suppressed during cluster upgrade"
	}
	available, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorAvailable)
	if available != nil && available.Status == configv1.ConditionFalse {
		logger.Info("suppressing Available=False condition during cluster upgrade")
		available.Status = configv1.ConditionTrue
		available.Reason = "UpgradeInProgress"
		available.Message = "Available is reported as True during cluster upgrade"
	}

⚠️ Potential issue | 🟠 Major

Upgrade suppression is mutating the published status after transition times were computed.

Because LastTransitionTime was already derived earlier in syncStatus, suppressing Degraded/Available here can refresh timestamps from the intermediate unsuppressed state even when the final published truth value stays suppressed across reconciles. Calculate transition times from the final condition set, or explicitly preserve/update the suppressed timestamps against oldConditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status/status_controller.go` around lines 352-372: the suppression block in syncStatus (which uses isClusterUpgrading and findClusterOperatorCondition to flip co.Status.Conditions for OperatorDegraded and OperatorAvailable) mutates conditions after LastTransitionTime has already been calculated, producing incorrect timestamps. Fix this either by performing the upgrade suppression before computing LastTransitionTime (apply the UpgradeInProgress changes to co.Status.Conditions prior to the transition-time calculation), or by carrying over the previous transition times from oldConditions when flipping statuses (look up the matching condition in oldConditions and preserve its LastTransitionTime when setting Status/Reason/Message for Degraded/Available).

@openshift-ci
Contributor

openshift-ci bot commented Apr 3, 2026

@dlom: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Details  Required  Rerun command
ci/prow/e2e-upgrade           44f4799  link     true      /test e2e-upgrade
ci/prow/e2e-aws-cco-parallel  44f4799  link     false     /test e2e-aws-cco-parallel
ci/prow/unit                  44f4799  link     true      /test unit
ci/prow/coverage              44f4799  link     true      /test coverage
ci/prow/security              44f4799  link     true      /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


4 participants