[release-4.22] NVIDIA-554: DPU-host mode: use ConfigMap for OVN feature enablement instead of per-node script gating (#2997)
Conversation
The OVN config map templates had broken conditional logic around enable-multi-network: the self-hosted template used "if not .OVN_MULTI_NETWORK_ENABLE" (inverted), while the managed template had both "if" and "if not" branches — resulting in enable-multi-network=true being emitted regardless of the flag.

Replace these broken conditionals with unconditional enable-multi-network=true, remove the OVN_MULTI_NETWORK_ENABLE template variable from the Go code, and decouple OVN_MULTI_NETWORK_POLICY_ENABLE from DisableMultiNetwork so that UseMultiNetworkPolicy is always respected. DisableMultiNetwork continues to control Multus deployment in render.go / multus.go — only the OVN feature-flag plumbing is removed here.

Made-with: Cursor
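For illustration, a minimal sketch of the rendered ConfigMap after the fix; the ConfigMap shape, section name, and key names other than enable-multi-network are assumptions, not the exact bindata content:

```yaml
# Sketch only: illustrative ovnkube ConfigMap template fragment.
# Before the fix, the self-hosted template wrapped the line in
# {{ if not .OVN_MULTI_NETWORK_ENABLE }} ... {{ end }} (inverted), and the
# managed template carried both an "if" and an "if not" branch, so
# enable-multi-network=true was emitted either way. It is now unconditional.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config
  namespace: openshift-ovn-kubernetes
data:
  ovnkube.conf: |
    [ovnkubernetesfeature]
    # always on; the OVN_MULTI_NETWORK_ENABLE template variable is removed
    enable-multi-network=true
    # follows UseMultiNetworkPolicy alone, no longer tied to DisableMultiNetwork
    enable-multi-networkpolicy={{.OVN_MULTI_NETWORK_POLICY_ENABLE}}
```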
DPU-host mode: use ConfigMap for OVN feature enablement instead of per-node script gating

All OVN-Kubernetes features (egress IP, egress firewall, multicast, multi-network, admin network policy, multi-external-gateway, etc.) are now enabled in DPU-host mode. The OVN controller on DPU-host nodes processes the configuration but does not offload the egress IP datapath — traffic follows the regular kernel path instead.

Because these features are safe to enable cluster-wide, the per-node gating logic in the startup script (008-script-lib.yaml) is no longer needed. Feature flags are managed solely through the cluster-wide ConfigMap (004-config.yaml) passed to ovnkube via --config-file. OVN_NODE_MODE remains used only for DPU-host structural differences: gateway interface selection, the --ovnkube-node-mode flag, and disabling init-ovnkube-controller.

Also removes the redundant network_connect_enabled_flag CLI flag from the node startup script — enable-network-connect is already managed through the ConfigMap.

Made-with: Cursor
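To make the intent concrete, here is a rough sketch of the wiring under these assumptions: the manifest layout, mount path, and feature-flag key names are illustrative, not the exact 004-config.yaml or DaemonSet templates.

```yaml
# Sketch only: cluster-wide feature flags live in the rendered ConfigMap that
# every ovnkube process consumes via --config-file; no per-node script gating.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config            # rendered from 004-config.yaml (illustrative)
  namespace: openshift-ovn-kubernetes
data:
  ovnkube.conf: |
    [ovnkubernetesfeature]
    enable-egress-ip=true
    enable-egress-firewall=true
    enable-multi-network=true
    enable-admin-network-policy=true
    enable-multi-external-gateway=true

# Node startup fragment (illustrative): OVN_NODE_MODE only drives structural
# DPU-host differences such as --ovnkube-node-mode; it no longer gates features.
#
#   exec /usr/bin/ovnkube \
#     --config-file=/run/ovnkube-config/ovnkube.conf \
#     {{ if eq .OVN_NODE_MODE "dpu-host" }}--ovnkube-node-mode dpu-host \{{ end }}
#     ...
```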
Remove enable-multicast=true from ovnkube config maps and pass it directly as --enable-multicast on the ovnkube CLI for node and control plane processes (both self-hosted and managed). Made-with: Cursor
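As a hedged sketch of what that looks like in a rendered manifest (the container layout and surrounding args are assumptions; only the --enable-multicast flag comes from the change itself):

```yaml
# Sketch only: enable-multicast=true is dropped from the ovnkube ConfigMaps and
# passed directly on the CLI for the node and control-plane processes instead.
containers:
  - name: ovnkube-controller
    command: ["/bin/bash", "-c"]
    args:
      - |
        exec /usr/bin/ovnkube \
          --config-file=/run/ovnkube-config/ovnkube.conf \
          --enable-multicast
```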
|
@openshift-cherrypick-robot: Ignoring requests to cherry-pick non-bug issues: NVIDIA-554
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Review skipped: auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the run configuration. Configuration used: openshift/coderabbit/.coderabbit.yaml. Review profile: CHILL. Plan: Enterprise.
 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-1 |
|
/retest |
|
/retest |
PRPQR
|
| Shard | Run ID | Result | Duration | Failure Type | Real (Non-Flake) Blocking Failures |
|---|---|---|---|---|---|
| 0 | 2052372720007516160 | ❌ FAIL | 6h01m | MonitorTest + e2e blocking | [sig-arch][Late] collect certificate data (blocking), [Monitor] Nodes should reach OSUpdateStaged in a timely fashion, pathological events (etcd, kube-apiserver, kube-controller-manager, kube-scheduler), KubePodNotReady in openshift-marketplace |
| 1 | 2052372720460500992 | ❌ FAIL | 5h52m | e2e blocking | [sig-arch][Late] collect certificate data (blocking), [sig-arch][Late] all registered tls artifacts must have no metadata violation regressions, [sig-arch][Late] all tls artifacts must be registered, [sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls |
| 2 | 2052372720913485824 | ✅ PASS | 5h44m | — | — |
| 3 | 2052372721366470656 | ✅ PASS | 5h26m | — | — |
| 4 | 2052372721815261184 | ✅ PASS | 5h34m | — | — |
| 5 | 2052372722272440320 | ✅ PASS | 5h37m | — | — |
| 6 | 2052372722721230848 | ❌ FAIL | 5h53m | MonitorTest | [Monitor:legacy-networking-invariants] pods should successfully create sandboxes by other (28× FailedCreatePodSandBox — no CNI config during upgrade) |
| 7 | 2052372723178409984 | ❌ FAIL | 5h50m | e2e blocking | [sig-node] Pod InPlace Resize Container (limit-ranger) [FeatureGate:InPlacePodVerticalScaling] pod-resize-limit-ranger-test exceed maximum Memory and CPU |
| 8 | 2052372723874664448 | ❌ FAIL | 5h58m | MonitorTest + e2e blocking + upgrade timeout | [sig-arch][Late] collect certificate data (blocking), [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time (85.5m vs 85m limit), [Monitor] Nodes should reach OSUpdateStaged in a timely fashion, KubePodNotReady in ns/default |
| 9 | 2052372724713525248 | ✅ PASS | 5h42m | — | — |
Failure summary
| Real Failure | Shards | Pre-existing on baseline? | PR-related? |
|---|---|---|---|
| [sig-arch][Late] collect certificate data | 0, 1, 8 | ✅ Yes — same certs.go:144 failure on periodic master runs | No |
| [Monitor] pods should successfully create sandboxes by other (no CNI config) | 6 | Transient CNI gaps seen on all runs (passing + baseline); this run crossed the monitor threshold | Unlikely — PR moves feature flags, doesn't change CNI startup sequence |
| [sig-node] InPlacePodVerticalScaling exceed max Memory and CPU | 7 | Not verified | No — sig-node feature gate test, unrelated to networking |
| [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time | 8 | Timing-dependent (85.5m vs 85m limit); MCO was the bottleneck | No — MCO-driven node recycling was the long pole |
| [Monitor] Nodes should reach OSUpdateStaged in a timely fashion | 0, 8 | Common MCO monitor flake | No |
| [Monitor] KubePodNotReady (marketplace, default) | 0, 8 | Common alert flake | No |
| Pathological events (etcd, kube-apiserver, etc.) | 0 | Common upgrade noise | No |
| [sig-arch][Late] tls artifacts / metadata regressions | 0, 1, 8 | Informing (non-blocking), co-occurs with cert data failure | No |
| [sig-node][Late] CRI-O no stuck image pulls | 0, 1, 8 | Informing (non-blocking) | No |
PRPQR
|
| Run | Result | Failure Type | Blocking Failures |
|---|---|---|---|
| 2052372745726988288 | ❌ FAIL | Hypershift e2e test failure | TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle — test tried to exec into a Failed-phase HCP openshift-apiserver pod for 300s. Pod failed due to scheduling: TaintToleration (pod didn't tolerate node taints), not a container crash or OOM. Other apiserver replicas were Running. TestUpgradeControlPlane/ValidateHostedCluster — one transient TLS handshake timeout to guest API that later succeeded (connected in 11s). |
- Guest ClusterOperator/network: Available=True, Degraded=False, Progressing=False — CNO not involved
- Failure is a stale Failed-phase pod from a rollout + the test picking that pod for exec instead of a Running one
PRPQR hypershift e2e-aws-ovn Analysis
| Run | Result | Failure Type | Blocking Failures |
|---|---|---|---|
| 2052372746683289600 | ❌ FAIL | Hypershift e2e test failure (teardown) | TestCreateClusterPrivateWithRouteKAS/Teardown — fixture's wait for AWS resources to disappear hit context deadline with 3 resources still listed: 2 CAPA node EBS volumes + 1 NLB (openshift-ingress/router-default). All functional subtests passed. destroy.log shows the NLB was successfully deleted shortly after infra destroy started — the test fixture's polling deadline expired before async AWS cleanup completed. |
- Stuck NLB is the Ingress Operator's router-default Service LB, not CNO's
- No CNO/OVN failures in the test — CNO not involved
- Failure is AWS async resource cleanup latency vs the test fixture timeout
|
@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Nightly PRPQR
|
| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|---|
| 1 | 2052372781747671040 | e2e blocking (20) | mass-test + 9× oc adm must-gather + collect certificate data. Monitor: OSUpdateStaged | 20 blocking failures triggered the mass-test-failure threshold (>10); must-gather and cert collection failures dominated. |
| 2 | 2052372784276836352 | e2e blocking (19) | mass-test + 8× must-gather + collect certificate data + [sig-storage] CSI Mock volume storage capacity | Same must-gather/cert pattern + one CSI storage test. |
| 3 | 2052372785107308544 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 4 | 2052372786789224448 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern. |
| 5 | 2052372785954557952 | e2e blocking (2) | collect certificate data + 2× TLS artifacts registration. Monitor: OSUpdateStaged | Only cert-related failures; mass-test check passed on this shard. |
| 6 | 2052372778409005056 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 7 | 2052372782586531840 | Install failure | Installer exit code 6 — no e2e tests ran | Workers never provisioned (machine-api: 0/2 running replicas). Without workers: ingress router pods couldn't schedule (untolerated taints on 3 control-plane nodes), cascading to console, monitoring, image-registry. CNO was Progressing (waiting on other operators), not Degraded. GCP machine provisioning failure. |
| 8 | 2052372787623890944 | e2e blocking (2) | CCO metrics test + 3× kube-apiserver TLS/certificate. Monitor: OSUpdateStaged | CCO metrics + cert collection failures. |
| 9 | 2052372780908810240 | e2e blocking (26) + MonitorTest | mass-test + [sig-builds] failures. Monitor: [sig-network-edge] disruption/service-load-balancer-with-pdb | 26 blocking failures (mostly sig-builds) + service LB disruption monitor. Process timed out / exit 127. |
| 10 | 2052372783429586944 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern as shard 1. |
- 0/10 passed — this job is broadly broken. 7/10 shards hit the same must-gather + cert-data pattern suggesting a systemic GCP RT environment issue
- CNO/OVN not observed as the cause of any failure
- Periodic baseline data is from March 2026 (stale) — cannot confirm whether this is pre-existing in May 2026
Nightly PRPQR
|
| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|---|
| 1 | 2052372770850869248 | e2e blocking (2) | [sig-api-machinery] AdmissionWebhook should mutate custom resource with pruning [Conformance], [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Two conformance test failures — AdmissionWebhook mutation and kubelet liveness probe handling. Not networking-related. |
| 5 | 2052372774202118144 | e2e blocking (1) | [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Same liveness probe conformance test as shard 1. Shared failure across 2 shards suggests a flaky conformance test. |
| 6 | 2052372775049367552 | Infra — deprovision timeout | All e2e tests passed (2026 pass, 1 flaky). | ipi-deprovision-deprovision hit the 1h Azure destroy timeout. Tests were green — failure is purely infra cleanup. Caused the aggregator to time out at 7h waiting for this shard. |
| 7 | 2052372775879839744 | e2e blocking (1) | [sig-api-machinery] FieldValidation should detect unknown metadata fields in both the root and embedded object of a CR [Conformance] | Conformance test for CRD field validation. Not networking-related. |
| 8 | 2052372776722894848 | Upgrade MonitorTest | [Monitor:audit-log-analyzer][sig-api-machinery] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients | Audit log analysis detected API load balancers sending requests to kube-apiserver after it reported not-ready during upgrade shutdown. sig-api-machinery issue, not CNO. |
- CNO/OVN not observed in any real failure chain
- Most common shared failure: the [sig-node] liveness probe test (shards 1, 5)
Nightly PRPQR e2e-aws-ovn-techpreview Analysis
| Run | Failure Type | Real Blocking Failures | Observed Cause |
|---|---|---|---|
| 2052372803566440448 | MonitorTest | [Monitor:legacy-test-framework-invariants-pathological] events should not repeat pathologically | 21× Back-off pulling image events for the OLM webhook-operator pod (quay.io/openshift/community-e2e-images:...webhook-operator...). Image pull backoff repeated enough to cross the pathological events threshold. OLM image pull issue, not CNO. |
- CNO/OVN not in the failure chain
|
/verified by @tsorya |
|
@tsorya: This PR has been marked as verified by @tsorya.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@danwinship can you please take a look? |
|
clean backport with no changes |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danwinship, openshift-cherrypick-robot
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
This is an automated cherry-pick of #2944
/assign tsorya