[release-4.22] NVIDIA-554: DPU-host mode: use ConfigMap for OVN feature enablement instead of per-node script gating (#2997)
Conversation
The OVN config map templates had broken conditional logic around enable-multi-network: the self-hosted template used "if not .OVN_MULTI_NETWORK_ENABLE" (inverted), while the managed template had both "if" and "if not" branches — resulting in enable-multi-network=true being emitted regardless of the flag.

Replace these broken conditionals with unconditional enable-multi-network=true, remove the OVN_MULTI_NETWORK_ENABLE template variable from the Go code, and decouple OVN_MULTI_NETWORK_POLICY_ENABLE from DisableMultiNetwork so that UseMultiNetworkPolicy is always respected. DisableMultiNetwork continues to control Multus deployment in render.go / multus.go — only the OVN feature-flag plumbing is removed here.

Made-with: Cursor
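For illustration, a minimal sketch of the rendered ConfigMap after the fix; the ConfigMap shape, section name, and key names other than enable-multi-network are assumptions, not the exact bindata content:

```yaml
# Sketch only: illustrative ovnkube ConfigMap template fragment.
# Before the fix, the self-hosted template wrapped the line in
# {{ if not .OVN_MULTI_NETWORK_ENABLE }} ... {{ end }} (inverted), and the
# managed template carried both an "if" and an "if not" branch, so
# enable-multi-network=true was emitted either way. It is now unconditional.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config
  namespace: openshift-ovn-kubernetes
data:
  ovnkube.conf: |
    [ovnkubernetesfeature]
    # always on; the OVN_MULTI_NETWORK_ENABLE template variable is removed
    enable-multi-network=true
    # follows UseMultiNetworkPolicy alone, no longer tied to DisableMultiNetwork
    enable-multi-networkpolicy={{.OVN_MULTI_NETWORK_POLICY_ENABLE}}
```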
DPU-host mode: use ConfigMap for OVN feature enablement instead of per-node script gating

All OVN-Kubernetes features (egress IP, egress firewall, multicast, multi-network, admin network policy, multi-external-gateway, etc.) are now enabled in DPU-host mode. The OVN controller on DPU-host nodes processes the configuration but does not offload the egress IP datapath — traffic follows the regular kernel path instead.

Because these features are safe to enable cluster-wide, the per-node gating logic in the startup script (008-script-lib.yaml) is no longer needed. Feature flags are managed solely through the cluster-wide ConfigMap (004-config.yaml) passed to ovnkube via --config-file. OVN_NODE_MODE remains used only for DPU-host structural differences: gateway interface selection, the --ovnkube-node-mode flag, and disabling init-ovnkube-controller.

Also removes the redundant network_connect_enabled_flag CLI flag from the node startup script — enable-network-connect is already managed through the ConfigMap.

Made-with: Cursor
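To make the intent concrete, here is a rough sketch of the wiring under these assumptions: the manifest layout, mount path, and feature-flag key names are illustrative, not the exact 004-config.yaml or DaemonSet templates.

```yaml
# Sketch only: cluster-wide feature flags live in the rendered ConfigMap that
# every ovnkube process consumes via --config-file; no per-node script gating.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config            # rendered from 004-config.yaml (illustrative)
  namespace: openshift-ovn-kubernetes
data:
  ovnkube.conf: |
    [ovnkubernetesfeature]
    enable-egress-ip=true
    enable-egress-firewall=true
    enable-multi-network=true
    enable-admin-network-policy=true
    enable-multi-external-gateway=true

# Node startup fragment (illustrative): OVN_NODE_MODE only drives structural
# DPU-host differences such as --ovnkube-node-mode; it no longer gates features.
#
#   exec /usr/bin/ovnkube \
#     --config-file=/run/ovnkube-config/ovnkube.conf \
#     {{ if eq .OVN_NODE_MODE "dpu-host" }}--ovnkube-node-mode dpu-host \{{ end }}
#     ...
```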
Remove enable-multicast=true from ovnkube config maps and pass it directly as --enable-multicast on the ovnkube CLI for node and control plane processes (both self-hosted and managed). Made-with: Cursor
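As a hedged sketch of what that looks like in a rendered manifest (the container layout and surrounding args are assumptions; only the --enable-multicast flag comes from the change itself):

```yaml
# Sketch only: enable-multicast=true is dropped from the ovnkube ConfigMaps and
# passed directly on the CLI for the node and control-plane processes instead.
containers:
  - name: ovnkube-controller
    command: ["/bin/bash", "-c"]
    args:
      - |
        exec /usr/bin/ovnkube \
          --config-file=/run/ovnkube-config/ovnkube.conf \
          --enable-multicast
```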
|
@openshift-cherrypick-robot: Ignoring requests to cherry-pick non-bug issues: NVIDIA-554
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Review skipped: auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the run configuration. Configuration used: openshift/coderabbit/.coderabbit.yaml. Review profile: CHILL. Plan: Enterprise.
 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-1 |
|
/retest |
|
/retest |
PRPQR
|
| Shard | Run ID | Result | Duration | Failure Type | Real (Non-Flake) Blocking Failures |
|---|---|---|---|---|---|
| 0 | 2052372720007516160 | ❌ FAIL | 6h01m | MonitorTest + e2e blocking | [sig-arch][Late] collect certificate data (blocking), [Monitor] Nodes should reach OSUpdateStaged in a timely fashion, pathological events (etcd, kube-apiserver, kube-controller-manager, kube-scheduler), KubePodNotReady in openshift-marketplace |
| 1 | 2052372720460500992 | ❌ FAIL | 5h52m | e2e blocking | [sig-arch][Late] collect certificate data (blocking), [sig-arch][Late] all registered tls artifacts must have no metadata violation regressions, [sig-arch][Late] all tls artifacts must be registered, [sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls |
| 2 | 2052372720913485824 | ✅ PASS | 5h44m | — | — |
| 3 | 2052372721366470656 | ✅ PASS | 5h26m | — | — |
| 4 | 2052372721815261184 | ✅ PASS | 5h34m | — | — |
| 5 | 2052372722272440320 | ✅ PASS | 5h37m | — | — |
| 6 | 2052372722721230848 | ❌ FAIL | 5h53m | MonitorTest | [Monitor:legacy-networking-invariants] pods should successfully create sandboxes by other (28× FailedCreatePodSandBox — no CNI config during upgrade) |
| 7 | 2052372723178409984 | ❌ FAIL | 5h50m | e2e blocking | [sig-node] Pod InPlace Resize Container (limit-ranger) [FeatureGate:InPlacePodVerticalScaling] pod-resize-limit-ranger-test exceed maximum Memory and CPU |
| 8 | 2052372723874664448 | ❌ FAIL | 5h58m | MonitorTest + e2e blocking + upgrade timeout | [sig-arch][Late] collect certificate data (blocking), [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time (85.5m vs 85m limit), [Monitor] Nodes should reach OSUpdateStaged in a timely fashion, KubePodNotReady in ns/default |
| 9 | 2052372724713525248 | ✅ PASS | 5h42m | — | — |
Failure summary
| Real Failure | Shards | Pre-existing on baseline? | PR-related? |
|---|---|---|---|
| [sig-arch][Late] collect certificate data | 0, 1, 8 | ✅ Yes — same certs.go:144 failure on periodic master runs | No |
| [Monitor] pods should successfully create sandboxes by other (no CNI config) | 6 | Transient CNI gaps seen on all runs (passing + baseline); this run crossed the monitor threshold | Unlikely — PR moves feature flags, doesn't change CNI startup sequence |
| [sig-node] InPlacePodVerticalScaling exceed max Memory and CPU | 7 | Not verified | No — sig-node feature gate test, unrelated to networking |
| [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time | 8 | Timing-dependent (85.5m vs 85m limit); MCO was the bottleneck | No — MCO-driven node recycling was the long pole |
| [Monitor] Nodes should reach OSUpdateStaged in a timely fashion | 0, 8 | Common MCO monitor flake | No |
| [Monitor] KubePodNotReady (marketplace, default) | 0, 8 | Common alert flake | No |
| Pathological events (etcd, kube-apiserver, etc.) | 0 | Common upgrade noise | No |
| [sig-arch][Late] tls artifacts / metadata regressions | 0, 1, 8 | Informing (non-blocking), co-occurs with cert data failure | No |
| [sig-node][Late] CRI-O no stuck image pulls | 0, 1, 8 | Informing (non-blocking) | No |
PRPQR
|
| Run | Result | Failure Type | Blocking Failures |
|---|---|---|---|
| 2052372745726988288 | ❌ FAIL | Hypershift e2e test failure | TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle — test tried to exec into a Failed-phase HCP openshift-apiserver pod for 300s. Pod failed due to scheduling: TaintToleration (pod didn't tolerate node taints), not a container crash or OOM. Other apiserver replicas were Running. TestUpgradeControlPlane/ValidateHostedCluster — one transient TLS handshake timeout to guest API that later succeeded (connected in 11s). |
- Guest ClusterOperator/network: Available=True, Degraded=False, Progressing=False — CNO not involved
- Failure is a stale Failed-phase pod from a rollout + the test picking that pod for exec instead of a Running one
PRPQR hypershift e2e-aws-ovn Analysis
| Run | Result | Failure Type | Blocking Failures |
|---|---|---|---|
| 2052372746683289600 | ❌ FAIL | Hypershift e2e test failure (teardown) | TestCreateClusterPrivateWithRouteKAS/Teardown — fixture's wait for AWS resources to disappear hit context deadline with 3 resources still listed: 2 CAPA node EBS volumes + 1 NLB (openshift-ingress/router-default). All functional subtests passed. destroy.log shows the NLB was successfully deleted shortly after infra destroy started — the test fixture's polling deadline expired before async AWS cleanup completed. |
- Stuck NLB is the Ingress Operator's router-default Service LB, not CNO's
- No CNO/OVN failures in the test — CNO not involved
- Failure is AWS async resource cleanup latency vs the test fixture timeout
|
@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Nightly PRPQR
|
| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|---|
| 1 | 2052372781747671040 | e2e blocking (20) | mass-test + 9× oc adm must-gather + collect certificate data. Monitor: OSUpdateStaged | 20 blocking failures triggered the mass-test-failure threshold (>10); must-gather and cert collection failures dominated. |
| 2 | 2052372784276836352 | e2e blocking (19) | mass-test + 8× must-gather + collect certificate data + [sig-storage] CSI Mock volume storage capacity | Same must-gather/cert pattern + one CSI storage test. |
| 3 | 2052372785107308544 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 4 | 2052372786789224448 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern. |
| 5 | 2052372785954557952 | e2e blocking (2) | collect certificate data + 2× TLS artifacts registration. Monitor: OSUpdateStaged | Only cert-related failures; mass-test check passed on this shard. |
| 6 | 2052372778409005056 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 7 | 2052372782586531840 | Install failure | Installer exit code 6 — no e2e tests ran | Workers never provisioned (machine-api: 0/2 running replicas). Without workers: ingress router pods couldn't schedule (untolerated taints on 3 control-plane nodes), cascading to console, monitoring, image-registry. CNO was Progressing (waiting on other operators), not Degraded. GCP machine provisioning failure. |
| 8 | 2052372787623890944 | e2e blocking (2) | CCO metrics test + 3× kube-apiserver TLS/certificate. Monitor: OSUpdateStaged | CCO metrics + cert collection failures. |
| 9 | 2052372780908810240 | e2e blocking (26) + MonitorTest | mass-test + [sig-builds] failures. Monitor: [sig-network-edge] disruption/service-load-balancer-with-pdb | 26 blocking failures (mostly sig-builds) + service LB disruption monitor. Process timed out / exit 127. |
| 10 | 2052372783429586944 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern as shard 1. |
- 0/10 passed — this job is broadly broken. 7/10 shards hit the same must-gather + cert-data pattern suggesting a systemic GCP RT environment issue
- CNO/OVN not observed as the cause of any failure
- Periodic baseline data is from March 2026 (stale) — cannot confirm whether this is pre-existing in May 2026
Nightly PRPQR
|
| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|---|
| 1 | 2052372770850869248 | e2e blocking (2) | [sig-api-machinery] AdmissionWebhook should mutate custom resource with pruning [Conformance], [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Two conformance test failures — AdmissionWebhook mutation and kubelet liveness probe handling. Not networking-related. |
| 5 | 2052372774202118144 | e2e blocking (1) | [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Same liveness probe conformance test as shard 1. Shared failure across 2 shards suggests a flaky conformance test. |
| 6 | 2052372775049367552 | Infra — deprovision timeout | All e2e tests passed (2026 pass, 1 flaky). | ipi-deprovision-deprovision hit the 1h Azure destroy timeout. Tests were green — failure is purely infra cleanup. Caused the aggregator to time out at 7h waiting for this shard. |
| 7 | 2052372775879839744 | e2e blocking (1) | [sig-api-machinery] FieldValidation should detect unknown metadata fields in both the root and embedded object of a CR [Conformance] | Conformance test for CRD field validation. Not networking-related. |
| 8 | 2052372776722894848 | Upgrade MonitorTest | [Monitor:audit-log-analyzer][sig-api-machinery] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients | Audit log analysis detected API load balancers sending requests to kube-apiserver after it reported not-ready during upgrade shutdown. sig-api-machinery issue, not CNO. |
- CNO/OVN not observed in any real failure chain
- Most common shared failure: the [sig-node] liveness probe test (shards 1, 5)
Nightly PRPQR e2e-aws-ovn-techpreview Analysis
| Run | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|
| 2052372803566440448 | MonitorTest | [Monitor:legacy-test-framework-invariants-pathological] events should not repeat pathologically | 21× Back-off pulling image events for the OLM webhook-operator pod (quay.io/openshift/community-e2e-images:...webhook-operator...). Image pull backoff repeated enough to cross the pathological events threshold. OLM image pull issue, not CNO. |
- CNO/OVN not in the failure chain
|
/verified by @tsorya |
|
@tsorya: This PR has been marked as verified by @tsorya.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@danwinship can you please take a look? |
|
clean backport with no changes |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danwinship, openshift-cherrypick-robot
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
This is an automated cherry-pick of #2944
/assign tsorya