OCPQE-31838: add NUM_WORKERS=4 to hypershift 4.19/4.20 disconnected agent tests#77296
zhfeng wants to merge 1 commit into openshift:main
The disconnected agent tests (dualstack and ipv6) were missing NUM_WORKERS=4, causing KCM rolling update deadlocks due to insufficient worker nodes for pod anti-affinity + maxUnavailable=0. Aligns with the 4.21 config which already has this fix.
|
@zhfeng: This pull request references OCPQE-31838 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
|
[REHEARSALNOTIFIER]
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.
|
/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance |
|
@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/lgtm cancel I'll let @jparrill tag with lgtm |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: zhfeng
|
/pj-rehearse periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance |
|
@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance |
|
@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-openshift-hypershift-release-4.19-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal-conformance |
|
@zhfeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@zhfeng: The following test failed, say
Summary
Add NUM_WORKERS=4 to the e2e-agent-disconnected-ovn-dualstack-metal-conformance and e2e-agent-disconnected-ovn-ipv6-metal-conformance tests in both the 4.19 and 4.20 periodics-mce configs.
Problem
The KCM deployment in hypershift hosted control planes uses replicas: 2 with strict pod anti-affinity (required across zones and hosts) and maxUnavailable: 0. During a rolling update, a surge pod needs to be scheduled, but with only the default 2 worker nodes there are no eligible nodes available: the 2 existing KCM pods block the anti-affinity-compatible nodes, and the remaining nodes either have insufficient memory or untolerated taints.
This caused the hypershift-agent-check-conditions step to fail with Degraded: True due to "kube-controller-manager deployment has 1 unavailable replicas".
Example failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.20-periodics-mce-e2e-agent-disconnected-ovn-dualstack-metal-conformance/2037771813567598592
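The deadlock described above can be illustrated with a small Python model. This is only a sketch, not the Kubernetes scheduler: the `Node` class, the `eligible` flag (standing in for taints or insufficient memory), and the helper names are all hypothetical, and zone-level anti-affinity is ignored in favour of the per-host rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    eligible: bool = True       # False models an untolerated taint or too little memory
    has_kcm_pod: bool = False   # node already runs a kube-controller-manager replica


def place_surge_pod(nodes: list[Node]) -> Optional[str]:
    """Return the node that can host the surge pod, or None if the rollout is stuck.

    With maxUnavailable=0 no old replica may be removed first, and required
    anti-affinity forbids two KCM pods on one host, so a free eligible node
    must exist before the rolling update can make progress.
    """
    for node in nodes:
        if node.eligible and not node.has_kcm_pod:
            return node.name
    return None


def workers(count: int, replicas: int = 2) -> list[Node]:
    """Build `count` workers; the first `replicas` already host a KCM pod."""
    return [Node(f"worker-{i}", has_kcm_pod=i < replicas) for i in range(count)]


print(place_surge_pod(workers(2)))  # None: both workers are blocked by anti-affinity
print(place_surge_pod(workers(4)))  # worker-2: a free node exists for the surge pod
```

In this simplified model, any worker count above the replica count unblocks the rollout as long as the extra nodes are actually eligible.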
Fix
Setting NUM_WORKERS=4 provides enough worker nodes to accommodate the rolling update surge pod alongside the existing replicas with anti-affinity constraints, matching what 4.21 already does.
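As a rough illustration of the change itself, the following Python sketch applies the setting to an in-memory config. It assumes, hypothetically, that each test entry is a mapping with an `as` name and a `steps.env` map, as in a trimmed ci-operator-style config; the actual files touched by this PR are not shown here, and `set_num_workers` is an illustrative helper, not part of any tool.

```python
TARGET_TESTS = {
    "e2e-agent-disconnected-ovn-dualstack-metal-conformance",
    "e2e-agent-disconnected-ovn-ipv6-metal-conformance",
}


def set_num_workers(config: dict, value: str = "4") -> list[str]:
    """Set NUM_WORKERS on the target tests; return the names of tests changed."""
    changed = []
    for test in config.get("tests", []):
        if test.get("as") not in TARGET_TESTS:
            continue
        # Create steps/env if missing (assumed layout, see lead-in).
        env = test.setdefault("steps", {}).setdefault("env", {})
        if env.get("NUM_WORKERS") != value:
            env["NUM_WORKERS"] = value
            changed.append(test["as"])
    return changed


# Hypothetical, heavily trimmed config used only for demonstration.
config = {
    "tests": [
        {"as": "e2e-agent-disconnected-ovn-dualstack-metal-conformance", "steps": {"env": {}}},
        {"as": "e2e-agent-disconnected-ovn-ipv6-metal-conformance", "steps": {}},
        {"as": "some-other-test", "steps": {"env": {}}},
    ]
}

print(set_num_workers(config))
```

Running the helper a second time returns an empty list, since both target tests already carry the value; unrelated tests are left untouched.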