Add upgrade workflow for kueue operator testing #74857
sohankunkerkar wants to merge 1 commit into openshift:main
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sohankunkerkar
Pull request overview
Adds a new CI upgrade test workflow for the kueue-operator that installs the operator after an OCP upgrade and runs a small workload-based smoke test, and wires that workflow into presubmit/periodic jobs for multiple upgrade paths.
Changes:
- Added a new step-registry chain for upgrade testing, plus install + workload smoke-test refs.
- Added ci-operator config variants for upgrades 4.18→4.19, 4.19→4.20, and 4.20→4.21 (each with two component flavors).
- Added new presubmit and periodic Prow jobs to execute the upgrade suites and variant images jobs.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ci-operator/step-registry/kueue-operator/test/upgrade/workload/kueue-operator-test-upgrade-workload-ref.yaml | New step ref definition for the post-upgrade workload smoke test. |
| ci-operator/step-registry/kueue-operator/test/upgrade/workload/kueue-operator-test-upgrade-workload-ref.metadata.json | Metadata for the workload smoke-test step. |
| ci-operator/step-registry/kueue-operator/test/upgrade/workload/kueue-operator-test-upgrade-workload-commands.sh | Implements the workload smoke test (creates queues/flavor + submits Job + checks admission/completion). |
| ci-operator/step-registry/kueue-operator/test/upgrade/kueue-operator-test-upgrade-chain.yaml | New upgrade test chain combining env setup, cert-manager, operator install, and workload smoke test. |
| ci-operator/step-registry/kueue-operator/test/upgrade/kueue-operator-test-upgrade-chain.metadata.json | Metadata for the upgrade chain. |
| ci-operator/step-registry/kueue-operator/test/upgrade/install/kueue-operator-test-upgrade-install-ref.yaml | New step ref to install the operator bundle via operator-sdk. |
| ci-operator/step-registry/kueue-operator/test/upgrade/install/kueue-operator-test-upgrade-install-ref.metadata.json | Metadata for the install step. |
| ci-operator/step-registry/kueue-operator/test/upgrade/install/kueue-operator-test-upgrade-install-commands.sh | Implements the bundle install using operator-sdk run bundle. |
| ci-operator/jobs/openshift/kueue-operator/openshift-kueue-operator-main-presubmits.yaml | Adds presubmit upgrade and images jobs for the new variants/targets. |
| ci-operator/jobs/openshift/kueue-operator/openshift-kueue-operator-main-periodics.yaml | Adds periodic upgrade jobs for the new variants/targets. |
| ci-operator/config/openshift/kueue-operator/openshift-kueue-operator-main__upgrade-from-4.18.yaml | New ci-operator variant defining 4.18→4.19 upgrade tests (kueue 1.1/1.2). |
| ci-operator/config/openshift/kueue-operator/openshift-kueue-operator-main__upgrade-from-4.19.yaml | New ci-operator variant defining 4.19→4.20 upgrade tests (kueue 1.1/1.2). |
| ci-operator/config/openshift/kueue-operator/openshift-kueue-operator-main__upgrade-from-4.20.yaml | New ci-operator variant defining 4.20→4.21 upgrade tests (kueue 1.1/1.2). |

    echo "Verifying workload finished status..."
    FINISHED=$(oc get workloads -n kueue-upgrade-test -o jsonpath='{.items[0].status.conditions[?(@.type=="Finished")].status}' 2>/dev/null || true)
    if [ "$FINISHED" = "True" ]; then
      echo "Kueue workload completed and finished successfully on upgraded cluster!"
    else
      echo "WARNING: Workload Finished condition not set, but job completed."
      oc get workloads -n kueue-upgrade-test -o yaml

The Finished condition check also uses .items[0], which can end up checking a different Workload than the one admitted/created for this Job. Reuse the same resolved Workload name from the admission phase to verify Finished on the correct object.
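
A minimal sketch of that suggestion, assuming the admission phase has already stored the resolved Workload name in a hypothetical `WORKLOAD_NAME` variable. The helper names `workload_condition_status` and `verify_workload_finished` are illustrative and not part of the PR; only the namespace and the condition type come from the snippet above.

```shell
# Sketch only: check the Finished condition on one named Workload
# instead of on .items[0] of the namespace-wide list.

# Hypothetical helper: prints the status of one condition type, or nothing
# if the Workload or the condition does not exist yet.
workload_condition_status() {
  local ns="$1" name="$2" type="$3"
  oc get workloads "$name" -n "$ns" \
    -o jsonpath="{.status.conditions[?(@.type==\"${type}\")].status}" \
    2>/dev/null || true
}

# Hypothetical helper: returns 0 when Finished=True, otherwise dumps the
# Workload for debugging and returns 1.
verify_workload_finished() {
  local ns="$1" name="$2" finished
  finished=$(workload_condition_status "$ns" "$name" Finished)
  if [ "$finished" = "True" ]; then
    echo "Workload ${name} finished successfully on upgraded cluster."
    return 0
  fi
  echo "WARNING: Finished condition not set on workload ${name}."
  oc get workloads "$name" -n "$ns" -o yaml
  return 1
}

# Example call (WORKLOAD_NAME is assumed to be set by the admission phase):
#   verify_workload_finished kueue-upgrade-test "$WORKLOAD_NAME" || true
```

The trailing `|| true` in the example call keeps the current warn-only behavior; dropping it would turn a missing Finished condition into a step failure.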

    echo "Waiting for Job to complete..."
    oc wait --for=condition=complete job/kueue-smoke-test-job -n kueue-upgrade-test --timeout=300s

oc wait ... --timeout=300s may be too short for an upgraded cluster where image pulls and scheduling can be slower, leading to intermittent failures even when the system is healthy. Consider increasing the timeout (and/or making it configurable) to reduce flakiness.
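
One way this could look, as a sketch rather than the PR's actual change: the timeout is read from a hypothetical `JOB_WAIT_TIMEOUT` environment variable, and the 600s default is an illustrative value, not a measured one. The helper name `wait_for_job_complete` is also an assumption.

```shell
# Sketch only: make the Job wait timeout configurable.
# JOB_WAIT_TIMEOUT is a hypothetical variable; 600s is an illustrative default.
JOB_WAIT_TIMEOUT="${JOB_WAIT_TIMEOUT:-600s}"

# Hypothetical helper: waits for the Job's Complete condition.
# An explicit third argument overrides the environment variable.
wait_for_job_complete() {
  local ns="$1" job="$2" timeout="${3:-${JOB_WAIT_TIMEOUT}}"
  echo "Waiting up to ${timeout} for job/${job} to complete..."
  oc wait --for=condition=complete "job/${job}" -n "$ns" --timeout="$timeout"
}

# Example call:
#   wait_for_job_complete kueue-upgrade-test kueue-smoke-test-job
```

In a step-registry ref, such a variable would typically be declared in the ref's `env` list with a default, so individual jobs can override it without editing the script.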

    {
      "path": "ci-operator/step-registry/kueue-operator/test/upgrade",
      "owners": "openshift/kueue-operator",
      "description": "Chain that installs kueue operator with dependencies on an upgraded cluster, runs e2e tests and a workload smoke test."

The metadata description says this chain "runs e2e tests", but the chain YAML only installs dependencies/operator and runs the workload smoke test. Please update the description to match what the chain actually does (or move the e2e reference to the CI config/workflow description where openshift-e2e-test is invoked).

Suggested change:

    - "description": "Chain that installs kueue operator with dependencies on an upgraded cluster, runs e2e tests and a workload smoke test."
    + "description": "Chain that installs kueue operator with dependencies on an upgraded cluster and runs a workload smoke test."


    echo "Waiting for workload to be admitted by kueue..."
    for i in $(seq 1 30); do
      ADMITTED=$(oc get workloads -n kueue-upgrade-test -o jsonpath='{.items[0].status.conditions[?(@.type=="Admitted")].status}' 2>/dev/null || true)
      if [ "$ADMITTED" = "True" ]; then
        echo "Workload admitted by kueue successfully!"

The admission check reads only the first Workload in the namespace (.items[0]). If the workload list is empty initially or multiple Workloads exist, this can cause false negatives/positives and flaky behavior. Consider first determining the specific Workload created for this Job (e.g., wait until exactly one exists and capture its name, or select by a label/owner reference), then query conditions on that named Workload.
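
The label-based variant of this suggestion might be sketched as follows. It assumes the installed Kueue version labels each Workload with `kueue.x-k8s.io/job-uid=<Job UID>`; if that label is absent, selecting by owner reference would be needed instead. The helper name `resolve_workload_name` and its retry parameters are hypothetical.

```shell
# Sketch only: resolve the Workload that Kueue created for one specific Job.
# Assumption: Workloads carry the label kueue.x-k8s.io/job-uid=<Job UID>.

# Hypothetical helper: prints the Workload name once exactly one match
# exists; returns 1 if the Job is missing or no unique match appears.
resolve_workload_name() {
  local ns="$1" job="$2" attempts="${3:-30}" delay="${4:-5}"
  local uid name i
  uid=$(oc get job "$job" -n "$ns" -o jsonpath='{.metadata.uid}') || return 1
  for i in $(seq 1 "$attempts"); do
    name=$(oc get workloads -n "$ns" -l "kueue.x-k8s.io/job-uid=${uid}" \
      -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true)
    # Accept only an unambiguous result: non-empty and a single name
    # (multiple names would be separated by spaces).
    if [ -n "$name" ] && [ "$name" = "${name%% *}" ]; then
      echo "$name"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example call:
#   WORKLOAD_NAME=$(resolve_workload_name kueue-upgrade-test kueue-smoke-test-job)
```

The captured name can then replace `.items[0]` in the admission query, for example `oc get workloads "$WORKLOAD_NAME" -n kueue-upgrade-test -o jsonpath=...`, so every later check targets the same object.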
Force-pushed from 2b1ae14 to 48626fa

/pj-rehearse

@sohankunkerkar: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Force-pushed from 2b03054 to ce5f097

/pj-rehearse

@sohankunkerkar: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Force-pushed from ce5f097 to a446766

/pj-rehearse

@sohankunkerkar: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Force-pushed from a446766 to 8584b38

/pj-rehearse

@sohankunkerkar: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

[REHEARSALNOTIFIER]

Interacting with pj-rehearse

Comment: Once you are satisfied with the results of the rehearsals, comment:

@sohankunkerkar: The following tests failed, say
No description provided.