CCO-738, CCO-739, CCO-740: Overhaul status logic to correctly match spec#1008
dlom wants to merge 1 commit into openshift:master
Conversation
@dlom: This pull request references CCO-738, CCO-739, and CCO-740, which are valid Jira issues. Warning: each referenced issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough
Status logic for CredentialsRequest now treats generation mismatches as progressing; the pod-identity reconciler derives operator conditions from deployment rollout state and preserves deployment-derived conditions on errors; the status controller tracks version-change progressing with a timeout-to-degraded and suppresses Degraded/Available during cluster upgrades. Multiple tests were updated or added, and clarifying comments were inserted.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: dlom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@jianping-shu this PR combines the two PRs I had originally into one unit. It is ready for you to test |
/assign @jstuever |
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/credentialsrequest/status_test.go (1)
98-153: ⚠️ Potential issue | 🟠 Major: These new cases still don't assert the absence rules they describe.
The table harness only checks that listed conditions exist; it never fails on extras, and an empty `expectedConditions` becomes a no-op. These cases can still pass if `computeStatusConditions` returns an unexpected `OperatorDegraded` or `OperatorProgressing`. Please assert the exact condition set, or add explicit absence checks for the disallowed conditions.
Also applies to: 165-193, 240-249
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/credentialsrequest/status_test.go` around lines 98 - 153, The table-driven tests call computeStatusConditions but only assert presence of listed conditions (via expectedConditions), allowing unwanted extras; update the tests (cases like "not progressing - unprovisioned but generation matches", "degraded but not progressing - ...", and "progressing with errors - not degraded during active rollout") to assert the exact returned condition set from computeStatusConditions instead of subset checks—either compare the full slice of returned conditions to expectedConditions (ensuring same length and same elements) or add explicit absence assertions for disallowed conditions (e.g., ensure OperatorDegraded or OperatorProgressing are not present when expectedConditions is empty or when a specific state forbids them); reference computeStatusConditions and expectedConditions in your changes and apply the same fix to the other mentioned test blocks (the ones around the commented ranges).
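The exact-set check the review asks for can be sketched as follows. This is a minimal illustration with simplified stand-in types (the real tests compare `configv1.ClusterOperatorStatusCondition` values; `sameConditionSet` is a hypothetical helper):

```go
package main

// Condition is a simplified stand-in for the operator's condition type.
type Condition struct {
	Type   string
	Status string
}

// sameConditionSet reports whether got contains exactly the conditions in
// want: same length and every expected condition present. With this check an
// empty want fails whenever any condition is returned, closing the no-op gap
// the review describes.
func sameConditionSet(got, want []Condition) bool {
	if len(got) != len(want) {
		return false
	}
	for _, w := range want {
		found := false
		for _, g := range got {
			if g.Type == w.Type && g.Status == w.Status {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```

A table case could then assert `sameConditionSet(got, tc.expectedConditions)`, so an empty expected list fails whenever any condition is returned.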
🧹 Nitpick comments (1)
pkg/operator/podidentity/podidentitywebhook_controller_test.go (1)
494-513: Exercise `Reconcile` instead of restating its branches.
The blocks at Line 494 and Line 534 duplicate the production condition logic, so they can still pass while `staticResourceReconciler.Reconcile` regresses on stateful behavior like carrying an old condition slice into the next reconcile. Calling `Reconcile` with a fake setup, or extracting the condition builder into a shared helper, would make this coverage meaningful.
Also applies to: 534-577
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/podidentity/podidentitywebhook_controller_test.go` around lines 494 - 513, The test is duplicating production condition-building logic instead of exercising staticResourceReconciler.Reconcile; remove the duplicated blocks that manually set r.conditions/inspect r.lastDeployment and either (a) call staticResourceReconciler.Reconcile with a fake client/Deployment fixture and assert the resulting r.conditions, or (b) extract the condition-building code into a shared helper (e.g., buildConditions or isDeploymentProgressing usage) used by both the controller and the test; ensure the test constructs the same fake Deployment states, invokes Reconcile (or the extracted helper), and asserts on r.conditions (referencing Reconcile, staticResourceReconciler.Reconcile, isDeploymentProgressing, r.conditions and r.lastDeployment to locate code).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 165-170: The checks for rootSecretChanged and infraChanged
short-circuit when the stored last-sync RV is empty, so legacy
CredentialsRequest objects won't flip to Progressing; update the boolean logic
in the calculations of rootSecretChanged and infraChanged (used in
GetConditions/status.go) to consider a change whenever the current resource
version (rootSecretResourceVersion or infraResourceVersion) differs from
cr.Status.LastSyncCloudCredsSecretResourceVersion or
cr.Status.LastSyncInfrastructureResourceVersion respectively, even if the stored
value is empty, matching the reconciliation behavior in
credentialsrequest_controller.go that compares currentRV !=
cr.Status.LastSync...; adjust only those boolean expressions (rootSecretChanged,
infraChanged) to remove the extra non-empty guards so old CRs resync and report
Progressing.
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 298-314: The current logic sets r.conditions when
!deploymentProgressing but leaves r.conditions unchanged when
deploymentProgressing is true, which can leave stale
OperatorDegraded/OperatorAvailable states; update the reconcile logic around
deploymentProgressing (and references to r.lastDeployment and
isDeploymentProgressing) so that when deploymentProgressing is true you
explicitly clear or remove any existing configv1.OperatorDegraded and
configv1.OperatorAvailable entries from r.conditions (or reset r.conditions to
nil/empty) to avoid returning stale conditions from prior reconciles.
---
Outside diff comments:
In `@pkg/operator/credentialsrequest/status_test.go`:
- Around line 98-153: The table-driven tests call computeStatusConditions but
only assert presence of listed conditions (via expectedConditions), allowing
unwanted extras; update the tests (cases like "not progressing - unprovisioned
but generation matches", "degraded but not progressing - ...", and "progressing
with errors - not degraded during active rollout") to assert the exact returned
condition set from computeStatusConditions instead of subset checks—either
compare the full slice of returned conditions to expectedConditions (ensuring
same length and same elements) or add explicit absence assertions for disallowed
conditions (e.g., ensure OperatorDegraded or OperatorProgressing are not present
when expectedConditions is empty or when a specific state forbids them);
reference computeStatusConditions and expectedConditions in your changes and
apply the same fix to the other mentioned test blocks (the ones around the
commented ranges).
---
Nitpick comments:
In `@pkg/operator/podidentity/podidentitywebhook_controller_test.go`:
- Around line 494-513: The test is duplicating production condition-building
logic instead of exercising staticResourceReconciler.Reconcile; remove the
duplicated blocks that manually set r.conditions/inspect r.lastDeployment and
either (a) call staticResourceReconciler.Reconcile with a fake client/Deployment
fixture and assert the resulting r.conditions, or (b) extract the
condition-building code into a shared helper (e.g., buildConditions or
isDeploymentProgressing usage) used by both the controller and the test; ensure
the test constructs the same fake Deployment states, invokes Reconcile (or the
extracted helper), and asserts on r.conditions (referencing Reconcile,
staticResourceReconciler.Reconcile, isDeploymentProgressing, r.conditions and
r.lastDeployment to locate code).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6527227e-c8ea-49e5-a026-ef4b7f81e53b
📒 Files selected for processing (7)
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/credentialsrequest/status.go
- pkg/operator/credentialsrequest/status_test.go
- pkg/operator/podidentity/podidentitywebhook_controller.go
- pkg/operator/podidentity/podidentitywebhook_controller_test.go
- pkg/operator/status/status_controller.go
- pkg/operator/status/status_controller_test.go
```go
rootSecretChanged := rootSecretResourceVersion != "" &&
	cr.Status.LastSyncCloudCredsSecretResourceVersion != "" &&
	rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
infraChanged := infraResourceVersion != "" &&
	cr.Status.LastSyncInfrastructureResourceVersion != "" &&
	infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion
```
Legacy CRs with empty last-sync resource versions won't report Progressing.
These guards diverge from pkg/operator/credentialsrequest/credentialsrequest_controller.go:839-927, which reconciles when currentRV != cr.Status.LastSync... even if the stored value is empty. Existing CredentialsRequests that haven't backfilled these fields yet will resync after a root-secret or infrastructure update while GetConditions still reports steady state.
Suggested alignment
```diff
- rootSecretChanged := rootSecretResourceVersion != "" &&
-     cr.Status.LastSyncCloudCredsSecretResourceVersion != "" &&
-     rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
- infraChanged := infraResourceVersion != "" &&
-     cr.Status.LastSyncInfrastructureResourceVersion != "" &&
-     infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion
+ rootSecretChanged := rootSecretResourceVersion != "" &&
+     rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
+ infraChanged := infraResourceVersion != "" &&
+     infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
rootSecretChanged := rootSecretResourceVersion != "" &&
	rootSecretResourceVersion != cr.Status.LastSyncCloudCredsSecretResourceVersion
infraChanged := infraResourceVersion != "" &&
	infraResourceVersion != cr.Status.LastSyncInfrastructureResourceVersion
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/credentialsrequest/status.go` around lines 165 - 170, The checks
for rootSecretChanged and infraChanged short-circuit when the stored last-sync
RV is empty, so legacy CredentialsRequest objects won't flip to Progressing;
update the boolean logic in the calculations of rootSecretChanged and
infraChanged (used in GetConditions/status.go) to consider a change whenever the
current resource version (rootSecretResourceVersion or infraResourceVersion)
differs from cr.Status.LastSyncCloudCredsSecretResourceVersion or
cr.Status.LastSyncInfrastructureResourceVersion respectively, even if the stored
value is empty, matching the reconciliation behavior in
credentialsrequest_controller.go that compares currentRV !=
cr.Status.LastSync...; adjust only those boolean expressions (rootSecretChanged,
infraChanged) to remove the extra non-empty guards so old CRs resync and report
Progressing.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master    #1008      +/-   ##
==========================================
+ Coverage   46.26%   46.83%    +0.56%
==========================================
  Files          98       98
  Lines       12264    12391      +127
==========================================
+ Hits         5674     5803      +129
+ Misses       5939     5938        -1
+ Partials      651      650        -1
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/status/status_controller.go (1)
262-307: ⚠️ Potential issue | 🟠 Major: `VersionChanged` progressing does not persist long enough for the timeout to work.
This block sets `OperatorProgressing=True` in the same update that writes the new `co.Status.Versions`, so the next `ClusterOperator` reconcile sees versions equal again and clears the synthetic progressing state unless another handler keeps it true. That means version-only upgrades never age into `ProgressingTooLong`. Line 273 also rewrites `LastTransitionTime` unconditionally, which grants a fresh 20-minute window if the operator was already progressing for some other reason.
This needs a sticky rollout signal for version changes, and `LastTransitionTime` should only move when `OperatorProgressing` actually flips to `True`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status/status_controller.go` around lines 262 - 307, The synthetic VersionChanged progressing is overwritten on the same update that sets co.Status.Versions so it never ages into ProgressingTooLong; also LastTransitionTime is unconditionally reset. Change the logic in the version-diff block (the code that compares oldVersions vs co.Status.Versions and accesses OperatorProgressing via findClusterOperatorCondition) to only set progressing.LastTransitionTime and set Status=True/Reason="VersionChanged" when the existing progressing condition was not already True (i.e. detect a status flip), and otherwise leave LastTransitionTime untouched; alternatively introduce a sticky rollout marker in the ClusterOperator status (e.g. a VersionsRollout or rolloutInProgress boolean) that you set when versions change and use that marker when computing progressing timeouts instead of transiently toggling OperatorProgressing.
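One way to make the transition time sticky can be sketched as follows; the `setProgressing` helper and the plain `Condition` struct (with an int64 unix timestamp standing in for `metav1.Time`) are hypothetical, not the operator's actual API:

```go
package main

// Condition is a simplified stand-in for a ClusterOperator status condition.
type Condition struct {
	Status             string
	Reason             string
	LastTransitionTime int64 // unix seconds
}

// setProgressing returns the updated Progressing condition. It stamps
// LastTransitionTime only when the condition actually flips to True, so an
// operator that was already progressing keeps aging toward the
// ProgressingTooLong timeout. existing may be nil when no prior condition
// was recorded.
func setProgressing(existing *Condition, reason string, now int64) Condition {
	updated := Condition{Status: "True", Reason: reason, LastTransitionTime: now}
	if existing != nil && existing.Status == "True" {
		// Already progressing: preserve the original transition time instead
		// of granting a fresh timeout window on every reconcile.
		updated.LastTransitionTime = existing.LastTransitionTime
	}
	return updated
}
```

With this shape, a reconcile loop that re-stamps Progressing every pass no longer resets the 20-minute window, so a `ProgressingTooLong` check can eventually fire.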
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 301-304: The check that computes deploymentProgressing reads
r.lastDeployment (via isDeploymentProgressing) which may be stale because
r.lastDeployment is only refreshed after ApplyDeployment succeeds; update the
code to fetch or refresh the current Deployment before computing
deploymentProgressing (or refresh r.lastDeployment immediately before any
early-return fallible steps) so isDeploymentProgressing uses up-to-date state;
specifically locate uses of r.lastDeployment and the call to
isDeploymentProgressing (and the ApplyDeployment call) and ensure you call the
current Deployment getter/update r.lastDeployment prior to the error-prone
reconciliation steps that may return early so Degraded/Progressing aren’t
suppressed by stale data.
- Around line 305-317: The current code overwrites r.conditions with either an
empty slice or only Degraded=True, which drops deployment-derived conditions
like Progressing and Available; instead, preserve/build the
rollout/unavailability conditions first (the existing conditions that reflect
deploymentProgressing/unavailable state) and then append the Degraded condition
only when deploymentProgressing is false: do not assign r.conditions = [] or
replace it wholesale; use the existing r.conditions (or construct the
rollout/unavailable conditions) and conditionally append the
configv1.ClusterOperatorStatusCondition with Reason
reasonStaticResourceReconcileFailed and the error message when
!deploymentProgressing so Progressing/Available are retained.
---
Outside diff comments:
In `@pkg/operator/status/status_controller.go`:
- Around line 262-307: The synthetic VersionChanged progressing is overwritten
on the same update that sets co.Status.Versions so it never ages into
ProgressingTooLong; also LastTransitionTime is unconditionally reset. Change the
logic in the version-diff block (the code that compares oldVersions vs
co.Status.Versions and accesses OperatorProgressing via
findClusterOperatorCondition) to only set progressing.LastTransitionTime and set
Status=True/Reason="VersionChanged" when the existing progressing condition was
not already True (i.e. detect a status flip), and otherwise leave
LastTransitionTime untouched; alternatively introduce a sticky rollout marker in
the ClusterOperator status (e.g. a VersionsRollout or rolloutInProgress boolean)
that you set when versions change and use that marker when computing progressing
timeouts instead of transiently toggling OperatorProgressing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 70db941a-900b-4b1b-8215-e90b3bd5a8e2
📒 Files selected for processing (8)
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/credentialsrequest/status.go
- pkg/operator/credentialsrequest/status_test.go
- pkg/operator/podidentity/podidentitywebhook_controller.go
- pkg/operator/podidentity/podidentitywebhook_controller_test.go
- pkg/operator/status/status_controller.go
- pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/podidentity/podidentitywebhook_controller_test.go
Testing in progress...
@dlom I tested with OCP-88454. Please review whether the outputs are as expected.
I'll be pushing up an updated version shortly |
```go
var infraResourceVersion string
infra, err := utils.GetInfrastructure(r.Client)
if err != nil {
	logger.WithError(err).Debug("error retrieving infrastructure for status")
} else {
	infraResourceVersion = infra.ResourceVersion
}
```
It is not clear why we are watching the infrastructure resource version.
```go
var rootSecretResourceVersion string
rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
rootSecret := &corev1.Secret{}
if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
	if !apierrors.IsNotFound(err) {
		logger.WithError(err).Debug("error retrieving root credentials secret for status")
	}
} else {
	rootSecretResourceVersion = rootSecret.ResourceVersion
}
```
Can we expect the root credential to exist in all situations?
```go
// the controller treats a non-nil root secret with a resource version
// different from the stored one as a trigger to resync, even when the
// stored version is empty (legacy CR that hasn't backfilled yet).
rootSecretChanged := rootSecretResourceVersion != "" &&
```
Is this for passthrough mode? I don't think rootSecret is relevant to other modes. What about Mint mode?
```diff
- if failingCredRequests > 0 {
+ // Spec: must not report Degraded during a normal upgrade. When progressing,
+ // the 20-minute ProgressingTooLong timeout handles persistent failures.
+ if failingCredRequests > 0 && !isProgressing {
```
isProgressing != isUpdating. The cluster can be updating beyond the scope of the operator's progressing timeline.
```diff
  progressingCondition.Message = fmt.Sprintf(
-     "%d of %d credentials requests provisioned, %d reporting errors.",
-     len(validCredRequests)-credRequestsNotProvisioned, len(validCredRequests), failingCredRequests)
+     "%d of %d credentials requests are progressing.",
```
I have concerns about this specific message changing. I realize the underlying structure is different. I just want to ensure we don't significantly change user experience here (we are all used to looking for this) without proven improvements.
```go
// Clear stale conditions from a previous reconcile (e.g. Degraded=True
// or Available=False) so they aren't reported while progressing.
r.conditions = []configv1.ClusterOperatorStatusCondition{}
```
Progressing != upgrading here too.
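The condition-preserving ordering the review asks for can be sketched like this; `buildConditions` and the simplified struct are hypothetical stand-ins, not the controller's actual API:

```go
package main

// Condition is a simplified stand-in for configv1.ClusterOperatorStatusCondition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// buildConditions derives the rollout conditions from deployment state first,
// then appends Degraded only when the deployment is not progressing, rather
// than overwriting the condition slice wholesale.
func buildConditions(deploymentProgressing, reconcileFailed bool) []Condition {
	conds := []Condition{}
	if deploymentProgressing {
		// Deployment-derived conditions are kept even when a later step fails.
		conds = append(conds,
			Condition{Type: "Progressing", Status: "True", Reason: "DeploymentRollout"},
			Condition{Type: "Available", Status: "False", Reason: "DeploymentRollout"},
		)
	}
	if reconcileFailed && !deploymentProgressing {
		// Suppress Degraded during an active rollout; a progressing timeout
		// covers persistent failures.
		conds = append(conds, Condition{Type: "Degraded", Status: "True", Reason: "StaticResourceReconcileFailed"})
	}
	return conds
}
```

The point of the ordering is that Degraded is appended after, and gated on, the rollout state instead of replacing it.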
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/operator/status/status_controller_test.go (1)
390-391: Minor typo in test data: "handerB" should be "handlerB".
This won't affect test correctness since the reason isn't asserted (due to non-deterministic handler order), but fixing it improves consistency.
✏️ Suggested fix
```diff
- Reason: "handerB reasons",
+ Reason: "handlerB reasons",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status/status_controller_test.go` around lines 390 - 391, Fix the minor typo in the test data by changing the Reason string literal from "handerB reasons" to "handlerB reasons" (replace the occurrence of Reason: "handerB reasons" in the test that constructs the handlerB status object so the Reason reads "handlerB reasons").
pkg/operator/credentialsrequest/status.go (1)
163-175: Share this stale/progress predicate with the reconciler.
This block duplicates the resync trigger from pkg/operator/credentialsrequest/credentialsrequest_controller.go:848-860. The earlier empty-last-sync RV mismatch is exactly the kind of drift this invites again, so a shared helper would make status and reconciliation safer to evolve together.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/credentialsrequest/status.go` around lines 163 - 175, Extract the resync predicate used here (the specChanged/rootSecretChanged/infraChanged logic in the loop over validCredRequests) into a shared function (e.g., ShouldResyncCredentialsRequest or needsResync) in the credentialsrequest package and call that helper from both this status loop and the reconciler (the logic currently duplicated in credentialsrequest_controller.go). The helper should accept the CredentialsRequest object and the rootSecretResourceVersion and infraResourceVersion strings and return a boolean; replace the three local bools and the combined if with a single call to that helper to determine when to increment credRequestsProgressing.
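A possible shape for that shared predicate, assuming simplified stand-in types (the struct fields mirror the names in the review discussion; `needsResync` itself is hypothetical):

```go
package main

// CredentialsRequestStatus holds the last-sync markers, simplified from the
// operator's API types for illustration.
type CredentialsRequestStatus struct {
	LastSyncGeneration                      int64
	LastSyncCloudCredsSecretResourceVersion string
	LastSyncInfrastructureResourceVersion   string
}

// CredentialsRequest is a simplified stand-in for the CRD type.
type CredentialsRequest struct {
	Generation int64
	Status     CredentialsRequestStatus
}

// needsResync returns true when the spec generation, the root secret, or the
// infrastructure object has changed since the last sync. Note that an empty
// stored resource version still counts as a change, matching the reconciler,
// so legacy CRs that haven't backfilled these fields report Progressing.
func needsResync(cr CredentialsRequest, rootSecretRV, infraRV string) bool {
	specChanged := cr.Generation != cr.Status.LastSyncGeneration
	rootSecretChanged := rootSecretRV != "" &&
		rootSecretRV != cr.Status.LastSyncCloudCredsSecretResourceVersion
	infraChanged := infraRV != "" &&
		infraRV != cr.Status.LastSyncInfrastructureResourceVersion
	return specChanged || rootSecretChanged || infraChanged
}
```

Both the reconciler and the status computation could then call this one function, so the empty-last-sync-RV behavior cannot drift between them again.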
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 44-57: The status code currently swallows read errors from
AdminClient.Get (rootSecret) and utils.GetInfrastructure and proceeds with empty
resource versions so computeStatusConditions can incorrectly clear
OperatorProgressing; update the logic in status.go so that when AdminClient.Get
or utils.GetInfrastructure return an error (other than NotFound for rootSecret)
you record that failure into the status path instead of silently continuing —
e.g., set a temporary "Progressing"/"Degraded" condition or return the error so
computeStatusConditions knows inputs are incomplete; specifically modify the
branches around AdminClient.Get (rootSecret) and utils.GetInfrastructure (infra)
to set an appropriate status condition (or propagate the error) before calling
computeStatusConditions and ensure computeStatusConditions receives nil/invalid
indicators only when intentionally absent.
In `@pkg/operator/podidentity/podidentitywebhook_controller.go`:
- Around line 308-326: Startup retries are bypassing the Reconcile()
condition-update path because podIdentityController.Start() calls
ReconcileResources() directly; change Start() to route bootstrap reconciliation
through the same helper used by Reconcile() (so the deployment-derived
conditions from deploymentConditions() and the OperatorDegraded append logic run
on startup failures), i.e., have Reconcile() delegate to a shared reconcile
helper that performs ReconcileResources(), sets r.conditions =
r.deploymentConditions(), and appends the Degraded condition on error, then call
that helper from Start() instead of calling ReconcileResources() directly;
update or add a unit/integration test for startup failure to assert
GetConditions() reflects the Degraded condition during bootstrap retries.
---
Nitpick comments:
In `@pkg/operator/credentialsrequest/status.go`:
- Around line 163-175: Extract the resync predicate used here (the
specChanged/rootSecretChanged/infraChanged logic in the loop over
validCredRequests) into a shared function (e.g., ShouldResyncCredentialsRequest
or needsResync) in the credentialsrequest package and call that helper from both
this status loop and the reconciler (the logic currently duplicated in
credentialsrequest_controller.go). The helper should accept the
CredentialsRequest object and the rootSecretResourceVersion and
infraResourceVersion strings and return a boolean; replace the three local bools
and the combined if with a single call to that helper to determine when to
increment credRequestsProgressing.
In `@pkg/operator/status/status_controller_test.go`:
- Around line 390-391: Fix the minor typo in the test data by changing the
Reason string literal from "handerB reasons" to "handlerB reasons" (replace the
occurrence of Reason: "handerB reasons" in the test that constructs the handlerB
status object so the Reason reads "handlerB reasons").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f7ea65a0-4c0f-40f2-91d0-aad06cdd1d02
📒 Files selected for processing (10)
- pkg/operator/credentialsrequest/actuator/actuator.go
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/credentialsrequest/status.go
- pkg/operator/credentialsrequest/status_test.go
- pkg/operator/podidentity/podidentitywebhook_controller.go
- pkg/operator/podidentity/podidentitywebhook_controller_test.go
- pkg/operator/secretannotator/status/status.go
- pkg/operator/status/status_controller.go
- pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (4)
- pkg/operator/credentialsrequest/actuator/actuator.go
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/secretannotator/status/status.go
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/operator/credentialsrequest/status_test.go
- pkg/operator/status/status_controller.go
```go
if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
	if !apierrors.IsNotFound(err) {
		logger.WithError(err).Debug("error retrieving root credentials secret for status")
	}
} else {
	rootSecretResourceVersion = rootSecret.ResourceVersion
}

var infraResourceVersion string
infra, err := utils.GetInfrastructure(r.Client)
if err != nil {
	logger.WithError(err).Debug("error retrieving infrastructure for status")
} else {
	infraResourceVersion = infra.ResourceVersion
```
Don’t collapse failed status reads into “no progress”.
Unexpected root-secret or infrastructure read failures are only logged here, then computeStatusConditions runs with empty resource versions. That can suppress OperatorProgressing even though the status inputs are incomplete.
Suggested fix
```diff
- var rootSecretResourceVersion string
- rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
- rootSecret := &corev1.Secret{}
- if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
-     if !apierrors.IsNotFound(err) {
-         logger.WithError(err).Debug("error retrieving root credentials secret for status")
-     }
- } else {
-     rootSecretResourceVersion = rootSecret.ResourceVersion
- }
-
- var infraResourceVersion string
- infra, err := utils.GetInfrastructure(r.Client)
- if err != nil {
-     logger.WithError(err).Debug("error retrieving infrastructure for status")
- } else {
-     infraResourceVersion = infra.ResourceVersion
- }
+ var rootSecretResourceVersion string
+ var infraResourceVersion string
+ if mode != operatorv1.CloudCredentialsModeManual {
+     rootSecretLocation := r.Actuator.GetCredentialsRootSecretLocation()
+     if rootSecretLocation.Name != "" {
+         rootSecret := &corev1.Secret{}
+         if err := r.AdminClient.Get(context.TODO(), rootSecretLocation, rootSecret); err != nil {
+             if !apierrors.IsNotFound(err) {
+                 return []configv1.ClusterOperatorStatusCondition{}, fmt.Errorf("failed to get root credentials secret for status: %w", err)
+             }
+         } else {
+             rootSecretResourceVersion = rootSecret.ResourceVersion
+         }
+     }
+
+     infra, err := utils.GetInfrastructure(r.Client)
+     if err != nil {
+         return []configv1.ClusterOperatorStatusCondition{}, fmt.Errorf("failed to get infrastructure for status: %w", err)
+     }
+     infraResourceVersion = infra.ResourceVersion
+ }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/credentialsrequest/status.go` around lines 44 - 57, The status
code currently swallows read errors from AdminClient.Get (rootSecret) and
utils.GetInfrastructure and proceeds with empty resource versions so
computeStatusConditions can incorrectly clear OperatorProgressing; update the
logic in status.go so that when AdminClient.Get or utils.GetInfrastructure
return an error (other than NotFound for rootSecret) you record that failure
into the status path instead of silently continuing — e.g., set a temporary
"Progressing"/"Degraded" condition or return the error so
computeStatusConditions knows inputs are incomplete; specifically modify the
branches around AdminClient.Get (rootSecret) and utils.GetInfrastructure (infra)
to set an appropriate status condition (or propagate the error) before calling
computeStatusConditions and ensure computeStatusConditions receives nil/invalid
indicators only when intentionally absent.
```diff
 err = r.ReconcileResources(ctx)

+// Build deployment-derived conditions first (Progressing, Available) so they
+// are always reported regardless of whether ReconcileResources succeeded.
+// Without this, a transient error would drop Progressing=True during rollout
+// or flip Available=False back to the default True.
+r.conditions = r.deploymentConditions()
+
 if err != nil {
 	r.logger.Errorf("reconciliation failed, retrying in %s", retryInterval.String())
-	r.conditions = []configv1.ClusterOperatorStatusCondition{
-		{
-			Type:    configv1.OperatorDegraded,
-			Status:  configv1.ConditionTrue,
-			Reason:  reasonStaticResourceReconcileFailed,
-			Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
-		},
-	}
+	// Spec: Degraded means the component does not match its desired state.
+	// A reconciliation failure is a persistent error requiring investigation.
+	// Suppression during cluster upgrades is handled centrally in syncStatus.
+	r.conditions = append(r.conditions, configv1.ClusterOperatorStatusCondition{
+		Type:    configv1.OperatorDegraded,
+		Status:  configv1.ConditionTrue,
+		Reason:  reasonStaticResourceReconcileFailed,
+		Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
+	})
```
Route bootstrap retries through this same condition-update path.
Because podIdentityController.Start() still calls ReconcileResources() directly at Line 124, this block never runs while startup retries are failing. Since pkg/operator/status/status_controller.go:205-250 rebuilds defaults when a handler returns no abnormal conditions, GetConditions() can leave the ClusterOperator looking healthy until the cache starts and a watch event finally reaches Reconcile().
🔧 Suggested direction

```diff
+func (r *staticResourceReconciler) reconcileAndUpdateConditions(ctx context.Context) error {
+	currentDeployment, err := r.clientset.AppsV1().Deployments(operatorNamespace).Get(ctx, podIdentityWebhookDeploymentName, metav1.GetOptions{})
+	if err == nil {
+		r.lastDeployment = currentDeployment
+	}
+
+	err = r.ReconcileResources(ctx)
+	r.conditions = r.deploymentConditions()
+	if err != nil {
+		r.conditions = append(r.conditions, configv1.ClusterOperatorStatusCondition{
+			Type:    configv1.OperatorDegraded,
+			Status:  configv1.ConditionTrue,
+			Reason:  reasonStaticResourceReconcileFailed,
+			Message: fmt.Sprintf("static resource reconciliation failed: %v", err),
+		})
+	}
+	return err
+}
+
 func (c *podIdentityController) Start(ctx context.Context) error {
 	retryTimer := time.NewTimer(retryInterval)
 	for {
-		err := c.reconciler.ReconcileResources(ctx)
+		err := c.reconciler.reconcileAndUpdateConditions(ctx)
 		if err != nil {
 			retryTimer.Reset(retryInterval)
 			<-retryTimer.C
 		} else {
 			break
 		}
 	}
```

Then have Reconcile() delegate to the same helper, and add a startup-failure test for that path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 308 -
326, Startup retries are bypassing the Reconcile() condition-update path because
podIdentityController.Start() calls ReconcileResources() directly; change
Start() to route bootstrap reconciliation through the same helper used by
Reconcile() (so the deployment-derived conditions from deploymentConditions()
and the OperatorDegraded append logic run on startup failures), i.e., have
Reconcile() delegate to a shared reconcile helper that performs
ReconcileResources(), sets r.conditions = r.deploymentConditions(), and appends
the Degraded condition on error, then call that helper from Start() instead of
calling ReconcileResources() directly; update or add a unit/integration test for
startup failure to assert GetConditions() reflects the Degraded condition during
bootstrap retries.
Progressing:
- Detect progress via generation and resource version changes, not provisioning status
- Report Progressing=True on operator version change
- Track root secret and infrastructure resource version changes
- Set Progressing=True with reason VersionChanged on operator version updates

Degraded:
- Suppress Degraded when Progressing is active
- Enforce 20-minute ProgressingTooLong timeout in status controller

Available:
- Only report Available=False on pod-identity-webhook when not progressing
- Clear stale Degraded/Available conditions on pod-identity-webhook during active rollout

(based on the spec from https://github.com/openshift/api/blob/2757d677e9aaac40a9985e5a6a17b31c7c577ba8/config/v1/types_cluster_operator.go#L151)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
♻️ Duplicate comments (3)
pkg/operator/podidentity/podidentitywebhook_controller.go (2)
121-125: ⚠️ Potential issue | 🟠 Major — Startup retries still never publish failure conditions.
`Start()` is still calling `ReconcileResources()` directly. While bootstrap retries are failing, `r.conditions` never gets updated, so `GetConditions()` can leave the operator looking healthy until a watch event finally reaches `Reconcile()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 121 - 125, Start() currently calls reconciler.ReconcileResources() which on bootstrap failure does not update r.conditions, so operator health remains stale; change Start() to invoke the reconciler path that publishes conditions (call c.reconciler.Reconcile(ctx) instead of c.reconciler.ReconcileResources(ctx)) so failures update r.conditions and GetConditions() reflects the error, and if Reconcile(ctx) is not available add an explicit condition update on error (e.g. set a failure condition on the controller’s r.conditions or call the reconciler’s condition-publish helper) before sleeping/retrying.
298-305: ⚠️ Potential issue | 🟠 Major — Most reconcile errors still erase deployment-derived state.
`deploymentConditions(d)` only helps if `ReconcileResources()` returns a deployment, but every error branch here still returns `nil`. Failures before or after `ApplyDeployment()` therefore still drop the rollout/unavailability context this change is trying to preserve.

Also applies to: 355-423
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 298 - 305, ReconcileResources may return a deployment along with an error, but current code throws that away and sets r.conditions = deploymentConditions(d) only when err == nil; update the reconcile path so you always call deploymentConditions(d) with the deployment value returned by ReconcileResources (even if err != nil) before returning, and avoid replacing d with nil on error paths (including failures before/after ApplyDeployment and the other block covering lines 355-423); specifically preserve and use the returned deployment from ReconcileResources and call deploymentConditions(deployment) to set r.conditions prior to any early returns so rollout/Available state is not lost.

pkg/operator/credentialsrequest/status.go (1)
128-140: ⚠️ Potential issue | 🟠 Major — Generation-only progress detection drops external resyncs.
This now only treats `Generation != LastSyncGeneration` as progress. Root credential rotations and infrastructure changes do not bump `CredentialsRequest.Generation`, so those resyncs can reconcile and still report steady state here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/credentialsrequest/status.go` around lines 128 - 140, The current progress check only bumps credRequestsProgressing when cr.Generation != cr.Status.LastSyncGeneration, which misses external resyncs; update the loop in status.go that iterates validCredRequests to also consider external-resync indicators (e.g., cr.Status.LastSyncTimestamp or cr.Status.LastSyncResourceVersion versus the CR's current metadata.resourceVersion/lastObservedTimestamp) so that credRequestsProgressing++ happens when either the Generation differs OR the last-sync timestamp/resourceVersion indicates a newer reconciliation; adjust references to cr.Generation, cr.Status.LastSyncGeneration, cr.Status.LastSyncTimestamp (or LastSyncResourceVersion) and metadata.resourceVersion accordingly and ensure isProgressing remains credRequestsProgressing > 0.
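The distinction this comment draws — spec changes (generation bumps) versus external-input changes (a rotated root secret) — can be sketched with stand-in types. The struct and field names below are illustrative assumptions, not the operator's actual API:

```go
package main

import "fmt"

// credRequest is a minimal stand-in for the CredentialsRequest fields
// involved; the real types live in the cloud-credential-operator API.
type credRequest struct {
	Generation         int64
	LastSyncGeneration int64
	// LastSyncSecretRV records which root-secret revision the last sync
	// used (hypothetical field name for illustration).
	LastSyncSecretRV string
}

// isProgressing treats either a spec change (generation bump) or an
// external input change (root secret rotated) as in-flight work.
func isProgressing(cr credRequest, currentSecretRV string) bool {
	if cr.Generation != cr.LastSyncGeneration {
		return true // spec changed since last sync
	}
	if currentSecretRV != "" && cr.LastSyncSecretRV != currentSecretRV {
		return true // root credentials rotated; a resync is still owed
	}
	return false
}

func main() {
	fmt.Println(isProgressing(credRequest{Generation: 2, LastSyncGeneration: 1}, ""))
	fmt.Println(isProgressing(credRequest{Generation: 2, LastSyncGeneration: 2, LastSyncSecretRV: "100"}, "101"))
	fmt.Println(isProgressing(credRequest{Generation: 2, LastSyncGeneration: 2, LastSyncSecretRV: "101"}, "101"))
}
```

A generation-only check would return false for the middle case, which is exactly the external-resync gap the comment describes.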
📒 Files selected for processing (10)
- pkg/operator/credentialsrequest/actuator/actuator.go
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/credentialsrequest/status.go
- pkg/operator/credentialsrequest/status_test.go
- pkg/operator/podidentity/podidentitywebhook_controller.go
- pkg/operator/podidentity/podidentitywebhook_controller_test.go
- pkg/operator/secretannotator/status/status.go
- pkg/operator/status/status_controller.go
- pkg/operator/status/status_controller_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/operator/credentialsrequest/actuator/actuator.go
- pkg/operator/credentialsrequest/credentialsrequest_controller.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/operator/credentialsrequest/credentialsrequest_controller_test.go
- pkg/operator/secretannotator/status/status.go
```go
func deploymentConditions(deployment *appsv1.Deployment) []configv1.ClusterOperatorStatusCondition {
	var conditions []configv1.ClusterOperatorStatusCondition
	if deployment != nil {
		progressing, msg := isDeploymentProgressing(deployment)
		if progressing {
			// Spec: report Progressing=True when actively rolling out new code.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorProgressing,
				Status:  configv1.ConditionTrue,
				Reason:  "Deploying",
				Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
			})
		}
		if deployment.Status.AvailableReplicas == 0 {
			// Spec: Available=False means the component is non-functional and requires
			// immediate administrator intervention. Zero available pods after the
			// deployment has finished progressing indicates a real availability problem.
			// Suppression during cluster upgrades is handled centrally in syncStatus.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorAvailable,
				Status:  configv1.ConditionFalse,
				Reason:  "DeploymentUnavailable",
				Message: "pod-identity-webhook deployment has no available pods",
			})
```
Only emit Available=False after rollout stops progressing.
This branch still runs during a normal zero-ready rollout, so it can publish Progressing=True and Available=False together for the initial deployment. The intended behavior here is to suppress Available=False until the deployment is no longer progressing.
Suggested fix

```diff
 progressing, msg := isDeploymentProgressing(deployment)
 if progressing {
 	// Spec: report Progressing=True when actively rolling out new code.
 	conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
 		Type:    configv1.OperatorProgressing,
 		Status:  configv1.ConditionTrue,
 		Reason:  "Deploying",
 		Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
 	})
 }
-if deployment.Status.AvailableReplicas == 0 {
+if !progressing && deployment.Status.AvailableReplicas == 0 {
 	// Spec: Available=False means the component is non-functional and requires
 	// immediate administrator intervention. Zero available pods after the
 	// deployment has finished progressing indicates a real availability problem.
 	// Suppression during cluster upgrades is handled centrally in syncStatus.
 	conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func deploymentConditions(deployment *appsv1.Deployment) []configv1.ClusterOperatorStatusCondition {
	var conditions []configv1.ClusterOperatorStatusCondition
	if deployment != nil {
		progressing, msg := isDeploymentProgressing(deployment)
		if progressing {
			// Spec: report Progressing=True when actively rolling out new code.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorProgressing,
				Status:  configv1.ConditionTrue,
				Reason:  "Deploying",
				Message: strings.Replace(msg, "Deployment", "pod-identity-webhook deployment", 1),
			})
		}
		if !progressing && deployment.Status.AvailableReplicas == 0 {
			// Spec: Available=False means the component is non-functional and requires
			// immediate administrator intervention. Zero available pods after the
			// deployment has finished progressing indicates a real availability problem.
			// Suppression during cluster upgrades is handled centrally in syncStatus.
			conditions = append(conditions, configv1.ClusterOperatorStatusCondition{
				Type:    configv1.OperatorAvailable,
				Status:  configv1.ConditionFalse,
				Reason:  "DeploymentUnavailable",
				Message: "pod-identity-webhook deployment has no available pods",
			})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/podidentity/podidentitywebhook_controller.go` around lines 326 -
349, The Available=False condition is emitted even while the rollout is still
progressing; modify deploymentConditions so it checks the result of
isDeploymentProgressing (the progressing boolean from
isDeploymentProgressing(deployment)) and only append the
configv1.OperatorAvailable ConditionFalse when progressing is false and
deployment.Status.AvailableReplicas == 0. Update the logic in
deploymentConditions (and maintain the existing Message/Reason text for the
Available condition) so Progressing and Available are not both true/false
simultaneously during an active rollout.
```diff
+// Spec: report Progressing=True when actively rolling out a version change.
+// The signal is made sticky via the condition itself: when Progressing has
+// Reason=VersionChanged, it persists across reconciles until a handler
+// reports its own Progressing=True (indicating it has taken over tracking)
+// or the rollout completes (no handler reports Progressing=True on the
+// subsequent reconcile after the version change was first detected).
+progressing, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorProgressing)
 if !reflect.DeepEqual(oldVersions, co.Status.Versions) {
+	// Version just changed — set Progressing=True with VersionChanged reason.
 	logger.WithFields(log.Fields{
 		"old": oldVersions,
 		"new": co.Status.Versions,
-	}).Info("version has changed, updating progressing condition lastTransitionTime")
-	progressing, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorProgressing)
-	// We know this should be there.
-	progressing.LastTransitionTime = metav1.Now()
+	}).Info("version has changed, setting Progressing=True")
+	if progressing != nil {
+		if progressing.Status != configv1.ConditionTrue {
+			// Only update LastTransitionTime when actually transitioning
+			// from False/Unknown to True — don't grant a fresh 20-minute
+			// window if already progressing for another reason.
+			progressing.LastTransitionTime = metav1.Now()
+		}
+		progressing.Status = configv1.ConditionTrue
+		progressing.Reason = reasonVersionChanged
+		progressing.Message = "Operator version is updating"
+	}
+} else if progressing != nil {
+	// Version matches — check if the previous reconcile set VersionChanged.
+	oldProgressing, _ := findClusterOperatorCondition(oldConditions, configv1.OperatorProgressing)
+	wasVersionProgressing := oldProgressing != nil &&
+		oldProgressing.Status == configv1.ConditionTrue &&
+		oldProgressing.Reason == reasonVersionChanged
+
+	if wasVersionProgressing {
+		handlerProgressing := progressing.Status == configv1.ConditionTrue
+		if handlerProgressing {
+			// A handler has taken over Progressing — let its reason through.
+			// The LastTransitionTime is preserved by setLastTransitionTime
+			// since the status hasn't changed (True → True).
+		} else {
+			// No handler is reporting progress — keep the sticky VersionChanged
+			// signal. This ensures the Progressing=True condition persists
+			// across reconciles so ProgressingTooLong can fire.
+			progressing.Status = configv1.ConditionTrue
+			progressing.Reason = reasonVersionChanged
+			progressing.Message = "Operator version is updating"
+			// Preserve the original transition time from when the version
+			// change was first detected.
+			progressing.LastTransitionTime = oldProgressing.LastTransitionTime
+		}
+	}
+}
```
Clear the synthetic VersionChanged signal when no handler takes over.
The comment here says the synthetic version-change signal should end once a later reconcile sees no handler still progressing, but this branch does the opposite and reasserts Progressing=True. That can leave the operator permanently progressing after a no-op version bump and eventually trip ProgressingTooLong after an otherwise complete rollout.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/status/status_controller.go` around lines 269 - 318, The branch
that handles wasVersionProgressing should clear the synthetic VersionChanged
signal when no handler takes over: in the else where handlerProgressing is false
(currently reasserting Progressing=True), set progressing.Status =
configv1.ConditionFalse (or remove the progressing condition), update
progressing.LastTransitionTime = metav1.Now() and clear or update Reason/Message
so the synthetic VersionChanged no longer claims Progressing; modify the block
that references progressing, oldProgressing, reasonVersionChanged and
findClusterOperatorCondition accordingly to stop reasserting the sticky
Progressing=True when no handler is reporting progress.
```go
// Spec: "A component must not report Degraded during the course of a normal upgrade."
// Spec: "A component must not report Available=False during the course of a normal upgrade."
// The upgrade window extends beyond this operator's own Progressing period —
// other operators may still be progressing. Use the ClusterVersion Progressing
// condition as the authoritative signal for whether a cluster upgrade is active.
// This runs last so it overrides all sources, including ProgressingTooLong.
if isClusterUpgrading(kubeClient, logger) {
	degraded, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorDegraded)
	if degraded != nil && degraded.Status == configv1.ConditionTrue {
		logger.Info("suppressing Degraded condition during cluster upgrade")
		degraded.Status = configv1.ConditionFalse
		degraded.Reason = "UpgradeInProgress"
		degraded.Message = "Degraded is suppressed during cluster upgrade"
	}
	available, _ := findClusterOperatorCondition(co.Status.Conditions, configv1.OperatorAvailable)
	if available != nil && available.Status == configv1.ConditionFalse {
		logger.Info("suppressing Available=False condition during cluster upgrade")
		available.Status = configv1.ConditionTrue
		available.Reason = "UpgradeInProgress"
		available.Message = "Available is reported as True during cluster upgrade"
	}
```
Upgrade suppression is mutating the published status after transition times were computed.
Because LastTransitionTime was already derived earlier in syncStatus, suppressing Degraded/Available here can refresh timestamps from the intermediate unsuppressed state even when the final published truth value stays suppressed across reconciles. Calculate transition times from the final condition set, or explicitly preserve/update the suppressed timestamps against oldConditions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/status/status_controller.go` around lines 352 - 372, The
suppression block in syncStatus (using isClusterUpgrading and
findClusterOperatorCondition to flip co.Status.Conditions for OperatorDegraded
and OperatorAvailable) mutates conditions after LastTransitionTime was already
calculated, causing incorrect timestamps; fix by either performing the
upgrade-suppression before computing LastTransitionTime (i.e., apply the
UpgradeInProgress changes to co.Status.Conditions prior to the transition-time
calculation) or by carrying over the previous transition times from
oldConditions when you flip statuses (look up matching condition in
oldConditions and preserve its LastTransitionTime when you set
Status/Reason/Message for Degraded/Available).
@dlom: The following tests failed, say
Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
xref: CCO-738
xref: CCO-739
xref: CCO-740
This PR is a combination of #977 and #978
Progressing:
- Detect progress via generation and resource version changes, not provisioning status
- Report Progressing=True on operator version change
- Track root secret and infrastructure resource version changes
- Set Progressing=True with reason VersionChanged on operator version updates

Degraded:
- Suppress Degraded when Progressing is active
- Enforce 20-minute ProgressingTooLong timeout in status controller

Available:
- Only report Available=False on pod-identity-webhook when not progressing
- Clear stale Degraded/Available conditions on pod-identity-webhook during active rollout
(based on the spec from https://github.com/openshift/api/blob/2757d677e9aaac40a9985e5a6a17b31c7c577ba8/config/v1/types_cluster_operator.go#L151)
Assisted-By: Claude Opus 4.6 noreply@anthropic.com