OCPBUGS-65623: cluster-olm-operator sets Progressing=True during upgrade #173
|
@rashmigottipati: This pull request references Jira Issue OCPBUGS-65623, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/assign @jianzhangbjz |
|
/payload-job periodic-ci-openshift-multiarch-master-nightly-4.22-upgrade-from-nightly-4.21-ocp-ovn-remote-s2s-libvirt-multi-p-p |
|
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d67f9fc0-02f4-11f1-99e5-20319970b13e-0 |
|
/payload-job periodic-ci-openshift-release-master-ci-4.22-e2e-aws-upgrade-ovn-single-node |
|
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2ba881f0-02f6-11f1-996c-5088ca1319bb-0 |
|
@jianzhangbjz it looks that |
|
@pedjak it looks like @jianzhangbjz This CI job won't validate this particular fix because it's a patch upgrade. We will be able to validate the changes when the test actually runs on minor/major upgrades. Can you please find a different CI job that would trigger this validation? |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9b7f3b60-0380-11f1-90eb-7e61d802a20d-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ac537de0-0400-11f1-946e-e7f93b1f53c3-0 |
The test passed in this run. |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/097f30a0-0419-11f1-8f81-1778f85345a1-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
@pedjak: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d94663b0-051f-11f1-983b-28296fef8a6f-0 |
|
@jianzhangbjz I started a few runs of |
|
Hi @pedjak , I didn't find |
The test is there, see https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30754-openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade/2020572421811081216
{
"name": "[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm must go Progressing=True during an upgrade test",
"lifecycle": "blocking",
"duration": 4464906,
"startTime": null,
"endTime": null,
"result": "passed",
"output": "clusteroperator/olm became Progressing=True at 2026-02-08T20:49:21Z during the upgrade window from 2026-02-08T20:13:18Z to 2026-02-08T21:27:42Z"
}
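For readers unfamiliar with the monitor test, the `output` field above boils down to a timestamp-in-window check. The following is a minimal sketch of that check only, not the openshift/origin implementation; `withinWindow` and `mustParse` are hypothetical names introduced here, and the timestamps are copied from the result above.

```go
package main

import (
	"fmt"
	"time"
)

// withinWindow reports whether t falls inside [start, end], inclusive.
// Hypothetical helper for illustration; not taken from openshift/origin.
func withinWindow(t, start, end time.Time) bool {
	return !t.Before(start) && !t.After(end)
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// Upgrade window and transition time from the test output above.
	start := mustParse("2026-02-08T20:13:18Z")
	end := mustParse("2026-02-08T21:27:42Z")
	progressing := mustParse("2026-02-08T20:49:21Z")

	fmt.Println(withinWindow(progressing, start, end)) // prints "true"
}
```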
Also, it passes in this run
I do think we need to fix it, because aggregating the Progressing condition from the deployments might be unreliable, i.e. we can miss the right time window. How do we know that the tests are passing in HA clusters when there is currently an exception in the openshift/origin repo for the OLM component, so that the test does not actually fail? The runs I triggered here are executed together with openshift/origin#30754. Once this PR gets merged, we can lift the exception as well. |
|
@rashmigottipati: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info. |
|
/payload-job-with-prs periodic-ci-openshift-release-master-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade openshift/origin#30754 |
|
@pedjak: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade |
|
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/28e16ab0-11f4-11f1-8854-54759b749696-0 |
0 unexpected clusteroperator state transitions during (upgrade=true) e2e test run, as desired.
6 unwelcome but acceptable clusteroperator state transitions during e2e test run. These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:
Feb 25 04:41:22.565 E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:41:22.565 - 78s E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:41.167 W clusteroperator/olm condition/Available reason/AsExpected status/True OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available\nCatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case)
Feb 25 04:42:42.267 E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:42.267 - 14s E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment (exception: https://issues.redhat.com/browse/OCPBUGS-62517)
Feb 25 04:42:57.144 W clusteroperator/olm condition/Available reason/AsExpected status/True OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available\nCatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case) |
|
/payload-job-with-prs periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade openshift/origin#30754 |
|
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3aec6070-121c-11f1-9264-bf3a34a90d5d-0 |
Signed-off-by: Rashmi Gottipati <rgottipa@redhat.com>
rashmigottipati
left a comment
@jianzhangbjz I updated the PR to address the latest review comments. Can you PTAL and help add the verify label? Thanks.
That's an exception. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30754-openshift-cluster-olm-operator-173-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade/2026561105756688384 test passed |
|
Test fail:
pkg/clients/wrappers_test.go:90:1: File is not properly formatted (gofmt)
name: "lister error - should return error",
^
pkg/clients/wrappers.go:52:1: `if w.releaseVersion != ""` has complex nested blocks (complexity: 5) (nestif)
if w.releaseVersion != "" {
^
2 issues:
* gofmt: 1
* nestif: 1
make: *** [Makefile:27: lint] Error 1 |
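One common way to satisfy `nestif` is to replace the nested blocks with guard clauses that return early. The sketch below shows the pattern on a hypothetical `upgradeDetected` helper; it is not the code from pkg/clients/wrappers.go, and the `"operator"` key and the treatment of a missing stored version are assumptions made for illustration.

```go
package main

import "fmt"

// upgradeDetected is a hypothetical helper, not the code from
// pkg/clients/wrappers.go. Each guard clause returns early, so no `if`
// is nested inside another, which is what nestif measures.
func upgradeDetected(releaseVersion string, storedVersions map[string]string) bool {
	if releaseVersion == "" {
		return false // RELEASE_VERSION is unset: nothing to compare against
	}
	stored, ok := storedVersions["operator"]
	if !ok || stored == "" {
		return false // assumption: no recorded version is not treated as an upgrade
	}
	return stored != releaseVersion
}

func main() {
	fmt.Println(upgradeDetected("4.22.0", map[string]string{"operator": "4.21.3"})) // prints "true"
	fmt.Println(upgradeDetected("4.22.0", map[string]string{"operator": "4.22.0"})) // prints "false"
	fmt.Println(upgradeDetected("", map[string]string{"operator": "4.21.3"}))       // prints "false"
}
```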
Signed-off-by: Rashmi Gottipati <rgottipa@redhat.com>
|
/payload-job-with-prs periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade openshift/origin#30754 |
|
@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fb47d90-12bd-11f1-861f-6b7c251340a7-0 |
|
/retest-required |
|
@jianzhangbjz: This PR has been marked as verified by |
|
/lgtm |
@jianzhangbjz we also need to merge openshift/origin#30754; without it, merging this one has no impact. |
|
@rashmigottipati: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@rashmigottipati: Jira Issue OCPBUGS-65623: Some pull requests linked via external trackers have merged: The following pull request, linked via external tracker, has not merged:
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-65623 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload. |
|
/cherry-pick release-4.21 |
|
@rashmigottipati: new pull request created: #177 |


Description
Add a wrapper around the configclient to detect version changes during upgrades. The wrapper intercepts status updates and checks whether RELEASE_VERSION matches the version that is currently stored in etcd. If the versions don't match, an upgrade is in progress, so it sets Progressing=True.
Changes:
Motivation
cluster-olm-operator doesn't report Progressing=True during upgrades because library-go's deployment check can miss fast deployment rollouts (which are common in patch upgrades with retagged images).
Since we use library-go's status controller and can't modify it, we wrap the client that gets passed to library-go and add the version check there.
Fixes: https://issues.redhat.com/browse/OCPBUGS-65623