
OCPBUGS-65623: cluster-olm-operator sets Progressing=True during upgrade#173

Merged
openshift-merge-bot[bot] merged 8 commits intoopenshift:mainfrom
rashmigottipati:OCPBUGS-65623-fix-progressing-condition
Feb 26, 2026

Conversation

@rashmigottipati
Member

Description

Add a wrapper around the config client to detect version changes during upgrades. The wrapper intercepts ClusterOperator status updates and checks whether RELEASE_VERSION matches the version currently stored in etcd. If the versions don't match, an upgrade is in progress, so the wrapper sets Progressing=True.

Changes:

  • Add wrapper types to intercept ClusterOperator status updates
  • Add RELEASE_VERSION environment variable to the deployment manifest
  • Pass the wrapper to the status controller instead of the raw client

Motivation

cluster-olm-operator doesn't reliably report Progressing=True during upgrades because library-go's deployment-based check can miss deployments that roll out quickly (common in patch upgrades with retagged images).
Since we use library-go's status controller and can't modify it, we wrap the client that gets passed to library-go and add the version check there.

Fixes: https://issues.redhat.com/browse/OCPBUGS-65623

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Feb 5, 2026
@openshift-ci-robot

@rashmigottipati: This pull request references Jira Issue OCPBUGS-65623, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@rashmigottipati
Member Author

/assign @jianzhangbjz

@jianzhangbjz
Contributor

/payload-job periodic-ci-openshift-multiarch-master-nightly-4.22-upgrade-from-nightly-4.21-ocp-ovn-remote-s2s-libvirt-multi-p-p

@openshift-ci
Contributor

openshift-ci Bot commented Feb 6, 2026

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-multiarch-master-nightly-4.22-upgrade-from-nightly-4.21-ocp-ovn-remote-s2s-libvirt-multi-p-p

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d67f9fc0-02f4-11f1-99e5-20319970b13e-0

@jianzhangbjz
Contributor

/payload-job periodic-ci-openshift-release-master-ci-4.22-e2e-aws-upgrade-ovn-single-node

@openshift-ci
Contributor

openshift-ci Bot commented Feb 6, 2026

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.22-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2ba881f0-02f6-11f1-996c-5088ca1319bb-0

@pedjak

pedjak commented Feb 6, 2026

@jianzhangbjz it looks like [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade was not executed in this test run?

@rashmigottipati
Member Author

@pedjak it looks like the [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade test was skipped because this CI job ran a patch-level upgrade. See the message below:
<skipped message="Test skipped in a patch-level upgrade test"></skipped>

@jianzhangbjz This CI job won't validate this particular fix because it's a patch upgrade. We will be able to validate the changes when the test actually runs on minor/major upgrades. Can you please find a different CI job that would trigger this validation?
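The skip hinges on telling a patch-level upgrade apart from a minor one. As a rough sketch of that distinction, assuming it comes down to whether the major.minor prefix changes (this is not openshift/origin's actual logic; `majorMinor` and `isPatchUpgrade` are hypothetical helpers):

```go
package main

import "strings"

// majorMinor returns the "X.Y" prefix of an "X.Y.Z[-suffix]" version string.
func majorMinor(v string) string {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return v
	}
	return parts[0] + "." + parts[1]
}

// isPatchUpgrade reports whether an upgrade from -> to stays within one minor release.
func isPatchUpgrade(from, to string) bool {
	return majorMinor(from) == majorMinor(to)
}
```

Under this assumption a 4.22 to 4.22 job is patch-level and skips the invariant, while the 4.21 to 4.22 `upgrade-from-stable-4.21` jobs used later in this thread exercise it.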

Comment thread cmd/cluster-olm-operator/main.go Outdated
Comment thread cmd/cluster-olm-operator/main.go Outdated
@pedjak

pedjak commented Feb 6, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 6, 2026

@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9b7f3b60-0380-11f1-90eb-7e61d802a20d-0

@pedjak

pedjak commented Feb 7, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 7, 2026

@pedjak: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@pedjak

pedjak commented Feb 7, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 7, 2026

@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ac537de0-0400-11f1-946e-e7f93b1f53c3-0

@pedjak

pedjak commented Feb 7, 2026

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9b7f3b60-0380-11f1-90eb-7e61d802a20d-0

The test passed in this run.

@pedjak

pedjak commented Feb 7, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 7, 2026

@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/097f30a0-0419-11f1-8f81-1778f85345a1-0

@pedjak

pedjak commented Feb 8, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 8, 2026

@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d94663b0-051f-11f1-983b-28296fef8a6f-0

@pedjak

pedjak commented Feb 8, 2026

@jianzhangbjz I started a few runs of periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade and it looks like the test is passing. Can you confirm?

@jianzhangbjz
Contributor

jianzhangbjz commented Feb 9, 2026

Hi @pedjak, I didn't find the [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade test in the Prow job you created. As I described in https://issues.redhat.com/browse/OCPBUGS-65623?focusedId=28990653&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28990653, the key point is that this test case passed in the HA cluster, so we don't need any fix.

@pedjak

pedjak commented Feb 9, 2026

> Hi @pedjak , I didn't find [Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade test in this Prow job, which you created.

The test is there, see https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30754-openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade/2020572421811081216

{
  "name": "[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade test",
  "lifecycle": "blocking",
  "duration": 4464906,
  "startTime": null,
  "endTime": null,
  "result": "passed",
  "output": "clusteroperator/olm became Progressing=True at 2026-02-08T20:49:21Z during the upgrade window from 2026-02-08T20:13:18Z to 2026-02-08T21:27:42Z"
}
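The monitor output reduces to an interval-membership check: did the Progressing=True transition fall inside the upgrade window? A small sketch using the timestamps from the output above; `withinWindow` and `progressingDuringUpgrade` are hypothetical helpers, not openshift/origin's monitor code, and inclusive bounds are an assumption.

```go
package main

import "time"

// withinWindow reports whether t lies in [start, end], bounds inclusive (assumed).
func withinWindow(t, start, end time.Time) bool {
	return !t.Before(start) && !t.After(end)
}

// progressingDuringUpgrade parses RFC 3339 timestamps such as
// "2026-02-08T20:49:21Z" and applies withinWindow.
func progressingDuringUpgrade(transition, start, end string) (bool, error) {
	parsed := make([]time.Time, 0, 3)
	for _, s := range []string{transition, start, end} {
		t, err := time.Parse(time.RFC3339, s)
		if err != nil {
			return false, err
		}
		parsed = append(parsed, t)
	}
	return withinWindow(parsed[0], parsed[1], parsed[2]), nil
}
```

With the values above, 20:49:21Z lies between 20:13:18Z and 21:27:42Z, which matches the reported `passed` result.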

Also, it passes in this run.

> the key point is that this test case passed in the HA cluster, so we don't need any fix.

I do think we need to fix it, because deriving the Progressing condition from Deployment status can be unreliable: we can miss the window in which the Deployment is actually progressing. And how do we know the test passes on HA clusters, when openshift/origin currently has an exception set for the OLM component so that the test does not fail?

The runs I triggered here were executed together with openshift/origin#30754. Once this PR merges, we can lift the exception as well.

@openshift-ci
Contributor

openshift-ci Bot commented Feb 24, 2026

@rashmigottipati: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@pedjak

pedjak commented Feb 24, 2026

/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 24, 2026

@pedjak: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@jianzhangbjz
Contributor

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade

@openshift-ci
Contributor

openshift-ci Bot commented Feb 25, 2026

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/28e16ab0-11f4-11f1-8854-54759b749696-0

@jianzhangbjz
Contributor

Test failed, more: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade/2026488903468322816

0 unexpected clusteroperator state transitions during (upgrade=true) e2e test run, as desired.
6 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Feb 25 04:41:22.565 E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:41:22.565 - 78s   E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:41.167 W clusteroperator/olm condition/Available reason/AsExpected status/True OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available\nCatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case)
Feb 25 04:42:42.267 E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:42.267 - 14s   E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:57.144 W clusteroperator/olm condition/Available reason/AsExpected status/True OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available\nCatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case)

@jianzhangbjz
Contributor

/payload-job-with-prs periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 25, 2026

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3aec6070-121c-11f1-9264-bf3a34a90d5d-0

Comment thread pkg/clients/wrappers.go Outdated
Comment thread pkg/clients/wrappers_test.go
@pedjak

pedjak commented Feb 25, 2026

> Test failed, more: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade/2026488903468322816

The test we are fixing with this PR did not fail:

[screenshot]

Signed-off-by: Rashmi Gottipati <rgottipa@redhat.com>
Member Author

@rashmigottipati rashmigottipati left a comment


@jianzhangbjz I updated the PR to address the latest review comments. Can you PTAL and help add the verified label? Thanks.

@jianzhangbjz
Contributor


The lint check fails:

pkg/clients/wrappers_test.go:90:1: File is not properly formatted (gofmt)
			name:              "lister error - should return error",
^
pkg/clients/wrappers.go:52:1: `if w.releaseVersion != ""` has complex nested blocks (complexity: 5) (nestif)
	if w.releaseVersion != "" {
^
2 issues:
* gofmt: 1
* nestif: 1
make: *** [Makefile:27: lint] Error 1

Signed-off-by: Rashmi Gottipati <rgottipa@redhat.com>
@jianzhangbjz
Contributor

/payload-job-with-prs periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade openshift/origin#30754

@openshift-ci
Contributor

openshift-ci Bot commented Feb 26, 2026

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fb47d90-12bd-11f1-861f-6b7c251340a7-0

@jianzhangbjz
Contributor

/retest-required

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 26, 2026
@openshift-ci-robot

@jianzhangbjz: This PR has been marked as verified by @jianzhangbjz.


In response to this:

Test passed https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30754-openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade/2026851607269871616
/verified by @jianzhangbjz


@jianzhangbjz
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@pedjak

pedjak commented Feb 26, 2026

/lgtm

@jianzhangbjz we also need to merge openshift/origin#30754; without it, merging this one has no impact.

@openshift-ci
Contributor

openshift-ci Bot commented Feb 26, 2026

@rashmigottipati: all tests passed!


@openshift-merge-bot openshift-merge-bot Bot merged commit 791eaec into openshift:main Feb 26, 2026
11 checks passed
@openshift-ci-robot

@rashmigottipati: Jira Issue OCPBUGS-65623: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-65623 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.


@rashmigottipati
Member Author

/cherry-pick release-4.21

@openshift-cherrypick-robot

@rashmigottipati: new pull request created: #177



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR verified Signifies that the PR passed pre-merge verification criteria
