OCPBUGS-65896: controller/workload: Rework Degraded/Progressing conditions#2128

Open
tchap wants to merge 4 commits into openshift:master from tchap:workload-condition-overwrites

Conversation

@tchap
Contributor

@tchap tchap commented Feb 18, 2026

To keep scaling from automatically setting Progressing/Degraded,
condition handling is aligned so that the Deployment generation is no
longer consulted.

Degraded is only set when the Deployment is not Available or it times
out progressing. This is less strict than before, but checking available
replicas would mean going Degraded automatically on scaling.

Progressing is set to True when the deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

I also improved the overall test case names in a separate commit.

This affects authentication-operator and openshift-apiserver-operator.
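The rules described above can be sketched roughly as follows. This is an illustrative simplification, not the actual library-go code: the types are minimal stand-ins for the apps/v1 Deployment status, and helper names like `computeConditions` are assumed for the sketch (only `hasDeploymentProgressed` and `hasDeploymentTimedOutProgressing` appear in the PR itself).

```go
package main

import "fmt"

// Minimal stand-ins for the Deployment status fields the controller
// consults; condition names follow the upstream apps/v1 API.
type DeploymentCondition struct {
	Type   string // e.g. "Progressing", "Available"
	Status string // "True" or "False"
	Reason string
}

type DeploymentStatus struct {
	Conditions []DeploymentCondition
}

func findCondition(status DeploymentStatus, condType string) *DeploymentCondition {
	for i := range status.Conditions {
		if status.Conditions[i].Type == condType {
			return &status.Conditions[i]
		}
	}
	return nil
}

// hasDeploymentProgressed mirrors the described check: the rollout is done
// once the deployment controller reports Progressing=True with reason
// NewReplicaSetAvailable.
func hasDeploymentProgressed(status DeploymentStatus) bool {
	c := findCondition(status, "Progressing")
	return c != nil && c.Status == "True" && c.Reason == "NewReplicaSetAvailable"
}

// hasDeploymentTimedOutProgressing checks the deployment controller's native
// deadline signal: Progressing=False with reason ProgressDeadlineExceeded.
func hasDeploymentTimedOutProgressing(status DeploymentStatus) bool {
	c := findCondition(status, "Progressing")
	return c != nil && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded"
}

// computeConditions applies the rules from the description: Degraded when the
// Deployment is not Available or timed out progressing; Progressing while the
// rollout is neither finished nor timed out.
func computeConditions(status DeploymentStatus) (progressing, degraded bool) {
	timedOut := hasDeploymentTimedOutProgressing(status)
	available := findCondition(status, "Available")
	degraded = timedOut || available == nil || available.Status != "True"
	progressing = !hasDeploymentProgressed(status) && !timedOut
	return progressing, degraded
}

func main() {
	rollingOut := DeploymentStatus{Conditions: []DeploymentCondition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}}
	p, d := computeConditions(rollingOut)
	fmt.Println(p, d) // rollout in flight: Progressing=true, Degraded=false
}
```

Note how scaling alone never flips either condition here: replica counts and the Deployment generation are not consulted, only the controller-reported conditions.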

Summary by CodeRabbit

  • Bug Fixes

    • More accurate rollout/progress tracking with improved timeout detection and clearer Progressing reasons (including timed-out progressing message).
    • Improved Degraded reporting: concise unavailable-instance messaging and clearer failing-pod detection; retains "AsExpected" where appropriate and special-cases timeout/degraded states.
    • Stricter gating for recording new versions—only recorded when generation/observed-generation and replica counts align.
    • Pod-list errors now surface as SyncError in status.
  • Tests

    • Expanded and reworked status scenarios and assertions, including version-recording and pod-list error validations.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Feb 18, 2026
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the Deployment is not Available or it times
out progressing.

Progressing is set to True when the Deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tchap
Contributor Author

tchap commented Feb 18, 2026

/cc @atiratree

@openshift-ci openshift-ci Bot requested a review from atiratree February 18, 2026 10:53
// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
Contributor Author

We don't use IsUpdatingTooLong any more; the check is now based on ProgressDeadlineExceeded from the deployment controller.

Member

let's also set .spec.progressDeadlineSeconds to 15m on the consumer workloads to honor the superseded progressingConditionTimeout value

Contributor Author

I don't think we can do that. Sync from the delegate returns the synchronized deployment already. So we would have to send another update request, I guess.

Contributor Author

We can align this in auth when updating library-go. I will actually open a PR there once this is accepted to run CI anyway.

Member

Yup, I meant that we should update it in the consumer repos like auth.

Member

Might be better to open a proof PR before the library-go changes are merged.

Comment thread pkg/operator/apiserver/controller/workload/workload.go Outdated
// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
Member

let's also set .spec.progressDeadlineSeconds to 15m on the consumer workloads to honor the superseded progressingConditionTimeout value

Comment thread pkg/operator/apiserver/controller/workload/workload_test.go Outdated
Comment thread pkg/operator/apiserver/controller/workload/workload_test.go
Member

@atiratree atiratree left a comment

Just one nit, otherwise LGTM, thanks!

Comment thread pkg/operator/apiserver/controller/workload/workload.go Outdated
// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
Member

Yup, I meant that we should update it in the consumer repos like auth.

// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
Member

Might be better to open a proof PR before the library-go changes are merged.

@atiratree
Member

/lgtm
/hold
for the proof PR openshift/cluster-authentication-operator#843

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2026
To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the Deployment is not Available or it times
out progressing.

Progressing is set to True when the Deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.
@tchap tchap force-pushed the workload-condition-overwrites branch from 0b782a2 to 08c8c15 on April 14, 2026 08:21
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2026
@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Walkthrough

Refactors workload controller condition evaluation: replaces generation-based Progressing checks with timeout/progress checks, simplifies Degraded detection to pod-level failures and availability, tightens version-recording gating, and adds helpers for error messages, failing-pod detection, and timeout detection. Tests expanded to cover new flows and version-recording assertions.

Changes

Cohort / File(s) Summary
Controller logic
pkg/operator/apiserver/controller/workload/workload.go
Refactored Progressing/Degraded computation: added hasDeploymentTimedOutProgressing, hasFailingPods, isFailureWaitingReason, and errMessage. Replaced generation-based progression checks and external deployment helper usage with timeout/progress helpers. Tightened version-recording gating to require matching Generation/ObservedGeneration and desired/updated/available replica equality. Adjusted messages formatting.
Tests
pkg/operator/apiserver/controller/workload/workload_test.go
Reworked TestUpdateOperatorStatus scenarios and fixtures. Updated expected Condition Reason/Message/Status values for rollout, timeout, recovery, and pod-failure cases. Added pod-list error injection (podListErr) and fakeVersionRecorder with per-scenario validateVersionRecorder checks to assert when SetVersion is called.
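The per-scenario version-recording assertions described above imply a small fake that records SetVersion calls. A sketch of such a fixture follows; the names fakeVersionRecorder and validateVersionRecorder come from the summary, but the exact shape of the real test fixture is assumed:

```go
package main

import "fmt"

// fakeVersionRecorder records SetVersion calls so a scenario's
// validateVersionRecorder hook can assert whether (and with what) the
// controller recorded a version. Hypothetical sketch of the test fixture.
type fakeVersionRecorder struct {
	setVersionCalls map[string]string // operand name -> recorded version
}

func (f *fakeVersionRecorder) SetVersion(operand, version string) {
	if f.setVersionCalls == nil {
		f.setVersionCalls = map[string]string{}
	}
	f.setVersionCalls[operand] = version
}

func main() {
	r := &fakeVersionRecorder{}
	r.SetVersion("operator", "4.22.0")
	// a positive scenario asserts the call happened; a negative one asserts
	// that setVersionCalls stayed empty
	fmt.Println(len(r.setVersionCalls)) // 1
}
```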

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant Deployment as DeploymentStatus
  participant PodLister
  participant VersionRecorder
  Controller->>Deployment: read deployment status (generation, observedGeneration, available/updated/desired replicas, conditions)
  Controller->>PodLister: list pods for deployment
  alt pod list returns error
    Controller->>Controller: collect error -> errs
    Controller->>Controller: set Degraded = True (SyncError)
  else pod list OK
    Controller->>Controller: hasFailingPods(pods) -> failingPods
    alt failingPods non-empty
      Controller->>Controller: set Degraded = True (X of Y instances unavailable...)
    else
      Controller->>Controller: hasDeploymentTimedOutProgressing(status) -> (msg, timedOut)
      alt timedOut
        Controller->>Controller: set Progressing = False (ProgressDeadlineExceeded)
        Controller->>Controller: set Degraded = True (ProgressDeadlineExceeded)
      else hasDeploymentProgressed?
        Controller->>Controller: set Progressing = True (PodsUpdating)
      else
        Controller->>Controller: set Progressing = False (AsExpected)
        Controller->>Controller: set Degraded = False (AsExpected)
      end
    end
  end
  alt generation/replica gating satisfied
    Controller->>VersionRecorder: SetVersion(operand, targetVersion)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Ipv6 And Disconnected Network Test Compatibility ❓ Inconclusive Unable to execute shell commands to analyze test files and verify Ginkgo e2e test patterns. Provide direct access to test file contents or output from the specified shell commands for analysis.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main changes: refactoring of Degraded and Progressing condition logic in the workload controller.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The custom check regarding Ginkgo test names is not applicable to this pull request, as the codebase uses standard Go testing with subtests rather than the Ginkgo framework.
Test Structure And Quality ✅ Passed The pull request contains standard Go table-driven tests with solid quality, not Ginkgo framework tests, making the Ginkgo-specific check inapplicable.
Microshift Test Compatibility ✅ Passed The pull request modifies unit tests in Go's standard testing package, not Ginkgo e2e tests. The custom check applies only to new Ginkgo e2e tests, which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR modifies unit test file using standard Go testing framework, not Ginkgo e2e tests with multi-node cluster topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed PR refactors deployment status condition logic without introducing topology-aware scheduling constraints, pod affinity rules, node selectors, topology spread constraints, or PodDisruptionBudgets.
Ote Binary Stdout Contract ✅ Passed Modified files are a Go controller library and unit test file with no OTE test binary characteristics, no main() or suite-level initialization, and no stdout violations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

Details

In response to this:

To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the deployment is not Available or it times
out progressing. This is less strict than before, but checking available
replicas would mean we go Degraded on scaling automatically.

Progressing is set to True when the deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

I also improved the overall test case names in a separate commit.

This affects authentication-operator and openshift-apiserver-operator.

Summary by CodeRabbit

  • Bug Fixes
  • Improved deployment progress tracking and timeout detection to provide more accurate status reporting
  • Enhanced degradation state identification to better reflect unavailable pod conditions
  • Refined version recording requirements to ensure proper state validation before marking updates as complete

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)

54-67: Please add explicit coverage for the new version-recording gate.

pkg/operator/apiserver/controller/workload/workload.go Lines 358-361 changed when SetVersion fires, but this table still never exercises that path: no scenario sets operatorConfigAtHighestRevision = true, and the test controller leaves versionRecorder unset. A small fake recorder plus one positive and one negative case would keep this behavior from regressing.

Also applies to: 846-878

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/apiserver/controller/workload/workload_test.go` around lines 54
- 67, Add explicit scenarios to TestUpdateOperatorStatus that exercise the
version-recording gate by (1) injecting a fake versionRecorder into the test
controller and (2) adding two table entries: one with
operatorConfigAtHighestRevision = true and one with it = false; verify that when
operatorConfigAtHighestRevision is true the fake recorder's SetVersion method is
invoked (and not invoked for the false case). Locate the test table in
TestUpdateOperatorStatus and the controller instance used to run scenarios, set
controller.versionRecorder to a simple fake recorder that records calls, and add
assertions in each scenario's validateOperatorStatus to check whether SetVersion
was called as expected to cover the SetVersion path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/operator/apiserver/controller/workload/workload_test.go`:
- Around line 54-67: Add explicit scenarios to TestUpdateOperatorStatus that
exercise the version-recording gate by (1) injecting a fake versionRecorder into
the test controller and (2) adding two table entries: one with
operatorConfigAtHighestRevision = true and one with it = false; verify that when
operatorConfigAtHighestRevision is true the fake recorder's SetVersion method is
invoked (and not invoked for the false case). Locate the test table in
TestUpdateOperatorStatus and the controller instance used to run scenarios, set
controller.versionRecorder to a simple fake recorder that records calls, and add
assertions in each scenario's validateOperatorStatus to check whether SetVersion
was called as expected to cover the SetVersion path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 15f9a88c-d378-463c-ad2b-87fa88d3a8d7

📥 Commits

Reviewing files that changed from the base of the PR and between d2db42c and 08c8c15.

📒 Files selected for processing (2)
  • pkg/operator/apiserver/controller/workload/workload.go
  • pkg/operator/apiserver/controller/workload/workload_test.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)

847-889: Add negative SetVersion() cases for the other readiness gates.

SetVersion() is also gated on Generation == ObservedGeneration, AvailableReplicas == desiredReplicas, and UpdatedReplicas == desiredReplicas in pkg/operator/apiserver/controller/workload/workload.go:355-364. Right now only the operatorConfigAtHighestRevision gate is exercised, so a regression that records the version during an incomplete rollout would still pass this suite.
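The gates named in this comment can be sketched as a single predicate. The field names are taken from the comment; the function name shouldRecordVersion and the flattened struct are illustrative, not the actual workload.go code:

```go
package main

import "fmt"

// Workload flattens the fields the stricter version-recording gate compares
// (Deployment generation/observedGeneration plus replica counts).
type Workload struct {
	Generation         int64
	ObservedGeneration int64
	DesiredReplicas    int32
	UpdatedReplicas    int32
	AvailableReplicas  int32
}

// shouldRecordVersion returns true only when the deployment controller has
// observed the latest spec and every desired replica is both updated and
// available, per the tightened gating described in the review comment.
func shouldRecordVersion(w Workload) bool {
	return w.Generation == w.ObservedGeneration &&
		w.UpdatedReplicas == w.DesiredReplicas &&
		w.AvailableReplicas == w.DesiredReplicas
}

func main() {
	midRollout := Workload{Generation: 7, ObservedGeneration: 7, DesiredReplicas: 3, UpdatedReplicas: 2, AvailableReplicas: 3}
	fmt.Println(shouldRecordVersion(midRollout)) // false: one replica not yet updated
}
```

Each of the three conjuncts is an independent gate, which is why the comment asks for one negative test case per gate.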

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/apiserver/controller/workload/workload_test.go` around lines 847
- 889, Add negative test cases in the workload_test table that exercise the
other readiness gates which should prevent SetVersion(): create entries similar
to the existing cases but where (a) Generation != ObservedGeneration, (b)
Status.AvailableReplicas < Spec.Replicas, and (c) Status.UpdatedReplicas <
Spec.Replicas; each should set operatorConfigAtHighestRevision=true (so only the
other gate blocks) use the same fakeVersionRecorder and validateVersionRecorder
to assert r.setVersionCalls remains empty, and keep validateOperatorStatus as a
no-op; locate tests in workload_test.go and mirror the structure/fields used by
the existing "version ..." cases so the assertions against setVersionCalls catch
regressions in workload.go’s SetVersion gating logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/operator/apiserver/controller/workload/workload_test.go`:
- Around line 847-889: Add negative test cases in the workload_test table that
exercise the other readiness gates which should prevent SetVersion(): create
entries similar to the existing cases but where (a) Generation !=
ObservedGeneration, (b) Status.AvailableReplicas < Spec.Replicas, and (c)
Status.UpdatedReplicas < Spec.Replicas; each should set
operatorConfigAtHighestRevision=true (so only the other gate blocks) use the
same fakeVersionRecorder and validateVersionRecorder to assert r.setVersionCalls
remains empty, and keep validateOperatorStatus as a no-op; locate tests in
workload_test.go and mirror the structure/fields used by the existing "version
..." cases so the assertions against setVersionCalls catch regressions in
workload.go’s SetVersion gating logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f83fb810-ff6d-40fd-9f77-c2365c142fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 08c8c15 and 882ecf8.

📒 Files selected for processing (1)
  • pkg/operator/apiserver/controller/workload/workload_test.go

@tchap tchap force-pushed the workload-condition-overwrites branch from 882ecf8 to 3ae8d51 on April 14, 2026 09:08
@atiratree
Member

/hold cancel
/lgtm

@openshift-ci openshift-ci Bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 20, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree, tchap
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -4,7 +4,6 @@ import (
"context"
Contributor

Am I right that those are the three (at least) main behavioral changes introduced in this PR ?

1. Scaling the workload (up/down) won't report Progressing=True 

2. A stuck rollout eventually stops reporting Progressing=True 

3. A workload missing some replicas (2/3) no longer triggers Degraded=True

Contributor Author

Yeah, right. We only degrade on Available == 0.

Also, the Progressing timeout check now depends on the native Deployment ProgressDeadlineExceeded condition instead of a helper, and once the progress deadline is hit, we go Degraded.

Contributor

I would like to get @benluddy opinion on that.

Contributor Author

Ok, I added another commit that actually makes Degraded work as expected, i.e. Degraded when not all replicas are available. This is done by checking the relevant pods and counting the failing ones, so that pods that are still starting do not count toward the Degraded condition.

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 24, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

New changes are detected. LGTM label has been removed.

@tchap tchap force-pushed the workload-condition-overwrites branch 2 times, most recently from dee9075 to 405d2b1 on April 24, 2026 14:31
tchap added 2 commits April 24, 2026 16:34
Previously, the controller only reported DeploymentDegraded when zero
replicas were available. This missed partial outages where some pods
were crashlooping or stuck.

Now the controller degrades whenever AvailableReplicas < desired and at
least one pod shows a concrete failure: non-zero exit, crashloop
(RestartCount > 1), or a non-transient waiting reason. Pods that are
still starting (ContainerCreating, PodInitializing) are not treated as
failures to avoid false positives during normal rollouts and scale-ups.

Additionally, the SyncError degraded condition is now applied via defer
to ensure it is always set regardless of the return path.
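The per-container failure heuristic in this commit message can be sketched as follows. The ContainerStatus stand-in mirrors the upstream corev1 field names, but the function isFailing and the exact set of transient reasons are illustrative, assumed from the prose above rather than copied from the merged code:

```go
package main

import "fmt"

// ContainerStatus is a minimal stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	RestartCount  int32
	WaitingReason string // empty when the container is not waiting
	ExitCode      int32  // last termination exit code, 0 if none
}

// Waiting reasons that occur during normal startup and must not count as
// failures, so rollouts and scale-ups do not trip Degraded.
var transientReasons = map[string]bool{
	"ContainerCreating": true,
	"PodInitializing":   true,
}

// isFailing applies the rules from the commit message: non-zero exit,
// crashloop (RestartCount > 1), or a non-transient waiting reason.
func isFailing(cs ContainerStatus) bool {
	if cs.ExitCode != 0 {
		return true
	}
	if cs.RestartCount > 1 {
		return true
	}
	return cs.WaitingReason != "" && !transientReasons[cs.WaitingReason]
}

func main() {
	fmt.Println(isFailing(ContainerStatus{WaitingReason: "ContainerCreating"})) // false: still starting
	fmt.Println(isFailing(ContainerStatus{WaitingReason: "CrashLoopBackOff"}))  // true: concrete failure
}
```

Under the commit's rule, Degraded then requires both AvailableReplicas < desired and at least one container for which a check like this returns true.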
@tchap tchap force-pushed the workload-condition-overwrites branch from 405d2b1 to b94b8e2 on April 24, 2026 14:35
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 24, 2026
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the deployment is not Available or it times
out progressing. This is less strict than before, but checking available
replicas would mean we go Degraded on scaling automatically.

Progressing is set to True when the deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

I also improved the overall test case names in a separate commit.

This affects authentication-operator and openshift-apiserver-operator.

Summary by CodeRabbit

  • Bug Fixes

  • More accurate rollout/progress tracking with improved timeout detection and clearer Progressing reasons (including timed-out progressing message).

  • Improved Degraded reporting: concise unavailable-instance messaging and clearer failing-pod detection; retains "AsExpected" where appropriate and special-cases timeout/degraded states.

  • Stricter gating for recording new versions—only recorded when generation/observed-generation and replica counts align.

  • Pod-list errors now surface as SyncError in status.

  • Tests

  • Expanded and reworked status scenarios and assertions, including version-recording and pod-list error validations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)

1217-1236: Make the fake pod lister honor namespace and selectors.

The new degraded-path logic is selector-sensitive, but this fake always returns every injected pod. That lets these tests pass even if production code queries the wrong namespace or builds the wrong selector.

♻️ Proposed fix
 type fakePodNamespaceLister struct {
-	lister *fakePodLister
+	lister    *fakePodLister
+	namespace string
 }

 func (f *fakePodNamespaceLister) List(selector labels.Selector) (ret []*corev1.Pod, err error) {
-	return f.lister.pods, f.lister.err
+	if f.lister.err != nil {
+		return nil, f.lister.err
+	}
+	for _, pod := range f.lister.pods {
+		if pod.Namespace != f.namespace {
+			continue
+		}
+		if selector != nil && !selector.Matches(labels.Set(pod.Labels)) {
+			continue
+		}
+		ret = append(ret, pod)
+	}
+	return ret, nil
 }

 func (f *fakePodLister) List(selector labels.Selector) (ret []*corev1.Pod, err error) {
-	return f.pods, f.err
+	if f.err != nil {
+		return nil, f.err
+	}
+	for _, pod := range f.pods {
+		if selector != nil && !selector.Matches(labels.Set(pod.Labels)) {
+			continue
+		}
+		ret = append(ret, pod)
+	}
+	return ret, nil
 }

 func (f *fakePodLister) Pods(namespace string) corev1listers.PodNamespaceLister {
 	return &fakePodNamespaceLister{
-		lister: f,
+		lister:    f,
+		namespace: namespace,
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/apiserver/controller/workload/workload_test.go` around lines
1217 - 1236, The fake pod lister currently ignores namespace and selectors
causing tests to pass incorrectly; update fakePodNamespaceLister and
fakePodLister methods so listing and getting respect namespace and label
selectors: have fakePodNamespaceLister store the target namespace (set in
fakePodLister.Pods(namespace)), implement fakePodNamespaceLister.List(selector)
to iterate f.lister.pods and return only pods whose Namespace equals that stored
namespace and whose labels match selector.Matches(labels.Set(pod.Labels)), and
implement fakePodNamespaceLister.Get(name) to return the pod with matching
Namespace and Name (or the existing error if not found). Ensure
fakePodLister.Pods(namespace) returns a fakePodNamespaceLister with the
namespace populated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49d5cedd-2955-4826-a170-eb740f971b22

📥 Commits

Reviewing files that changed from the base of the PR and between 3ae8d51 and 6d8858b.

📒 Files selected for processing (2)
  • pkg/operator/apiserver/controller/workload/workload.go
  • pkg/operator/apiserver/controller/workload/workload_test.go

Comment on lines +417 to +420
func hasFailingPods(workload *appsv1.Deployment, podsLister corev1listers.PodLister) ([]string, error) {
pods, err := podsLister.Pods(workload.Namespace).List(labels.SelectorFromSet(workload.Spec.Template.Labels))
if err != nil {
return nil, err

⚠️ Potential issue | 🟠 Major

Use the Deployment selector when discovering failing pods.

Line 418 matches on workload.Spec.Template.Labels, which can be stricter than the Deployment’s selector and can exclude older ReplicaSet pods during a rollout. That can hide failing owned pods and incorrectly clear DeploymentDegraded while AvailableReplicas < desiredReplicas.

🐛 Proposed fix
 func hasFailingPods(workload *appsv1.Deployment, podsLister corev1listers.PodLister) ([]string, error) {
-	pods, err := podsLister.Pods(workload.Namespace).List(labels.SelectorFromSet(workload.Spec.Template.Labels))
+	if workload.Spec.Selector == nil {
+		return nil, fmt.Errorf("deployment/%s.%s is missing spec.selector", workload.Name, workload.Namespace)
+	}
+	selector, err := metav1.LabelSelectorAsSelector(workload.Spec.Selector)
+	if err != nil {
+		return nil, err
+	}
+	pods, err := podsLister.Pods(workload.Namespace).List(selector)
 	if err != nil {
 		return nil, err
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/apiserver/controller/workload/workload.go` around lines 417 -
420, The pod lookup in hasFailingPods currently uses
workload.Spec.Template.Labels which can miss older ReplicaSet pods; instead
build a selector from the Deployment's selector (workload.Spec.Selector) and use
that to list pods via podsLister.Pods(workload.Namespace).List(...). Convert the
LabelSelector to a labels.Selector (e.g. via metav1.LabelSelectorAsSelector),
handle/return any conversion error and nil selector, and then use that selector
when calling podsLister to ensure all pods owned by the Deployment are
considered.

Comment on lines +423 to +443
var failureMessages []string
for _, pod := range pods {
for _, c := range slices.Concat(pod.Status.ContainerStatuses, pod.Status.InitContainerStatuses) {
if c.Ready {
continue
}

var msg string
switch {
case c.State.Terminated != nil && c.State.Terminated.ExitCode != 0:
msg = fmt.Sprintf("container %s terminated with exit code %d in pod %s", c.Name, c.State.Terminated.ExitCode, pod.Name)
case c.RestartCount > 1:
msg = fmt.Sprintf("container %s is crashlooping in pod %s", c.Name, pod.Name)
case c.State.Waiting != nil && isFailureWaitingReason(c.State.Waiting.Reason):
msg = fmt.Sprintf("container %s is waiting (%s) in pod %s", c.Name, c.State.Waiting.Reason, pod.Name)
}
if len(msg) > 0 {
failureMessages = append(failureMessages, msg)
}
}
}

⚠️ Potential issue | 🟠 Major

Treat pre-container pod failures as degraded too.

This helper only looks at container statuses. Pods stuck before container creation—most notably PodScheduled=False / Unschedulable—won’t produce any message here, so a partially available rollout can stay DeploymentDegraded=False until ProgressDeadlineExceeded fires.

🐛 Proposed fix
 	var failureMessages []string
 	for _, pod := range pods {
+		podFailedBeforeStart := false
+		for _, cond := range pod.Status.Conditions {
+			if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse && cond.Reason == corev1.PodReasonUnschedulable {
+				failureMessages = append(failureMessages, fmt.Sprintf("pod %s is unschedulable: %s", pod.Name, cond.Message))
+				podFailedBeforeStart = true
+				break
+			}
+		}
+		if podFailedBeforeStart {
+			continue
+		}
+
 		for _, c := range slices.Concat(pod.Status.ContainerStatuses, pod.Status.InitContainerStatuses) {
 			if c.Ready {
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/apiserver/controller/workload/workload.go` around lines 423 -
443, The loop that builds failureMessages only inspects container statuses and
misses pod-level pre-container failures (e.g.,
PodScheduled=False/Unschedulable), so detect those and append a failure message
when no container-based msg was added for a pod: inside the pods loop (where
failureMessages is collected), after iterating container statuses and before
continuing, check pod.Status.Conditions for a PodScheduled condition with Status
== corev1.ConditionFalse or Reason == "Unschedulable" (or commonly used
scheduling failure reasons) and if present append a descriptive message like
"pod X not scheduled: <reason/message>"; ensure this runs per-pod so pods stuck
before container creation mark the rollout as degraded.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 24, 2026

@tchap: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | 6d8858b | link | true | /test verify |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


4 participants