
OCPQE-31838: add NUM_WORKERS=4 to hypershift 4.19/4.20 disconnected agent tests #77296

Open
zhfeng wants to merge 1 commit into openshift:main from zhfeng:fix-hypershift-420-num-workers


@zhfeng
Contributor

zhfeng commented Apr 2, 2026

Summary

  • Add NUM_WORKERS=4 to e2e-agent-disconnected-ovn-dualstack-metal-conformance and e2e-agent-disconnected-ovn-ipv6-metal-conformance tests in both 4.19 and 4.20 periodics-mce configs
  • Aligns with the 4.21 config which already has this setting

Problem

The KCM (kube-controller-manager) deployment in hypershift hosted control planes uses replicas: 2 with strict pod anti-affinity (required across zones and hosts) and maxUnavailable: 0. During a rolling update a surge pod must be scheduled, but with only the default 2 worker nodes no eligible node is available: the 2 existing KCM pods block the anti-affinity-compatible nodes, and the remaining nodes either have insufficient memory or carry untolerated taints.

This caused the hypershift-agent-check-conditions step to fail with Degraded: True, with the message "kube-controller-manager deployment has 1 unavailable replicas".

Example failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance/2037771813567598592

Fix

Setting NUM_WORKERS=4 provides enough worker nodes to accommodate the rolling update surge pod alongside the existing replicas with anti-affinity constraints, matching what 4.21 already does.
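
The scheduling argument above can be sketched as a small model. This is only an illustration under stated assumptions, not HyperShift or Kubernetes scheduler code: the names `RolloutPolicy` and `rollout_can_progress` are hypothetical, maxSurge is assumed to be 1 (the description only mentions "a surge pod"), and required host anti-affinity is simplified to "one KCM pod per eligible node". Zones, memory, and taints are folded into the single `eligible_nodes` count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPolicy:
    replicas: int         # 2 for KCM, per the description above
    max_unavailable: int  # 0, per the description above
    max_surge: int        # assumed 1; the description only mentions "a surge pod"


def rollout_can_progress(eligible_nodes: int, policy: RolloutPolicy) -> bool:
    """Return True if a rolling update can replace at least one pod.

    Simplified model: required host anti-affinity allows one pod per node,
    and eligible_nodes counts only nodes that pass taint and memory checks.
    """
    occupied = min(policy.replicas, eligible_nodes)
    free = eligible_nodes - occupied
    if policy.max_surge > 0 and free > 0:
        return True  # a surge pod fits on an unoccupied eligible node
    # Otherwise progress requires taking an old pod down first.
    return policy.max_unavailable > 0


kcm = RolloutPolicy(replicas=2, max_unavailable=0, max_surge=1)
for nodes in (2, 3, 4):
    print(nodes, rollout_can_progress(nodes, kcm))
```

In this model the rollout is stuck with 2 eligible nodes and can proceed once at least one more is available. Whether every node created by NUM_WORKERS=4 is actually eligible in the real jobs is not something this sketch establishes.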

The disconnected agent tests (dualstack and ipv6) were missing
NUM_WORKERS=4, causing KCM rolling update deadlocks due to insufficient
worker nodes for pod anti-affinity + maxUnavailable=0. Aligns with
the 4.21 config which already has this fix.
openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Apr 2, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 2, 2026

@zhfeng: This pull request references OCPQE-31838 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot requested review from bryan-cox and jparrill April 2, 2026 08:57
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@zhfeng: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance | N/A | periodic | Ci-operator config changed |
| periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance | N/A | periodic | Ci-operator config changed |

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance

@openshift-ci-robot
Contributor

@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels Apr 2, 2026
@bryan-cox
Member

/lgtm cancel

I'll let @jparrill tag with lgtm

openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) Apr 2, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhfeng
Once this PR has been reviewed and has the lgtm label, please ask for approval from bryan-cox. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot removed the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 2, 2026
@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance

@openshift-ci-robot
Contributor

@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance

@openshift-ci-robot
Contributor

@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance

@openshift-ci-robot
Contributor

@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci bot commented Apr 2, 2026

@zhfeng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance | 418f316 | link | unknown | /pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


3 participants