[release-4.22] NVIDIA-554: DPU-host mode: use ConfigMap for OVN feature enablement instead of per-node script gating #2997

Open
openshift-cherrypick-robot wants to merge 3 commits into openshift:release-4.22 from openshift-cherrypick-robot:cherry-pick-2944-to-release-4.22

Conversation

@openshift-cherrypick-robot

This is an automated cherry-pick of #2944

/assign tsorya

tsorya added 3 commits May 7, 2026 12:56
The OVN config map templates had broken conditional logic around
enable-multi-network: the self-hosted template used
"if not .OVN_MULTI_NETWORK_ENABLE" (inverted), while the managed
template had both "if" and "if not" branches — resulting in
enable-multi-network=true being emitted regardless of the flag.

Replace these broken conditionals with unconditional
enable-multi-network=true, remove the OVN_MULTI_NETWORK_ENABLE
template variable from the Go code, and decouple
OVN_MULTI_NETWORK_POLICY_ENABLE from DisableMultiNetwork so that
UseMultiNetworkPolicy is always respected.

DisableMultiNetwork continues to control Multus deployment in
render.go / multus.go — only the OVN feature-flag plumbing is
removed here.
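The broken conditional pattern described above can be reproduced with Go's text/template. This is a standalone sketch, not the actual CNO template: only the `OVN_MULTI_NETWORK_ENABLE` variable and the emitted `enable-multi-network=true` line come from the commit message; everything else is illustrative.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render executes a reconstruction of the managed template's broken logic:
// an "if" branch and an "if not" branch that both emit the same line, so the
// flag value cannot change the output. (Illustrative sketch only.)
func render(enabled bool) string {
	const broken = `{{if .OVN_MULTI_NETWORK_ENABLE}}enable-multi-network=true
{{end}}{{if not .OVN_MULTI_NETWORK_ENABLE}}enable-multi-network=true
{{end}}`
	tmpl := template.Must(template.New("cfg").Parse(broken))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, map[string]bool{"OVN_MULTI_NETWORK_ENABLE": enabled}); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// Both flag values emit the line, demonstrating why the variable was dead.
	fmt.Printf("flag=true  -> %q\n", render(true))
	fmt.Printf("flag=false -> %q\n", render(false))
}
```

Since the output is identical for both flag values, removing the `OVN_MULTI_NETWORK_ENABLE` variable and emitting the line unconditionally is a behavior-preserving cleanup.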

Made-with: Cursor
…nstead of per-node script gating

All OVN-Kubernetes features (egress IP, egress firewall, multicast,
multi-network, admin network policy, multi-external-gateway, etc.)
are now enabled in DPU-host mode. The OVN controller on DPU-host
nodes processes the configuration but does not offload the egress IP
datapath; traffic follows the regular kernel path instead.

Because these features are safe to enable cluster-wide, the per-node
gating logic in the startup script (008-script-lib.yaml) is no
longer needed. Feature flags are managed solely through the
cluster-wide ConfigMap (004-config.yaml) passed to ovnkube via
--config-file.
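The cluster-wide wiring described above might look roughly like the following ConfigMap fragment. This is an illustrative sketch only: the ConfigMap name, section header, and key names are assumptions modeled on ovn-kubernetes' INI-style configuration, not the actual contents of 004-config.yaml.

```yaml
# Hypothetical sketch of a feature-flag ConfigMap consumed via --config-file.
# Names and keys are illustrative, not the real 004-config.yaml contents.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovnkube-config
  namespace: openshift-ovn-kubernetes
data:
  ovnkube.conf: |
    [ovnkubernetesfeature]
    enable-egress-ip=true
    enable-egress-firewall=true
    enable-multi-network=true
    enable-admin-network-policy=true
    enable-multi-external-gateway=true
```

Because every ovnkube process reads the same file through --config-file, DPU-host and regular nodes see an identical feature set, which is what makes the per-node script gating redundant.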

OVN_NODE_MODE remains used only for DPU-host structural differences:
gateway interface selection, --ovnkube-node-mode flag, and disabling
init-ovnkube-controller.

Also removes the redundant network_connect_enabled_flag CLI flag
from the node startup script — enable-network-connect is already
managed through the ConfigMap.

Made-with: Cursor
Remove enable-multicast=true from ovnkube config maps and pass it
directly as --enable-multicast on the ovnkube CLI for node and
control plane processes (both self-hosted and managed).
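The shape of that change can be sketched as a container command fragment. This is an illustration only: the binary path and the sibling flags are assumptions; the commit itself only moves enable-multicast from the config map onto the CLI.

```shell
# Illustrative invocation sketch; binary path and other flags are assumptions.
exec /usr/bin/ovnkube \
  --config-file=/run/ovnkube-config/ovnkube.conf \
  --enable-multicast \
  --ovnkube-node-mode "${OVN_NODE_MODE}"
```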

Made-with: Cursor
@openshift-ci-robot
Contributor

openshift-ci-robot commented May 7, 2026

@openshift-cherrypick-robot: Ignoring requests to cherry-pick non-bug issues: NVIDIA-554

Details

In response to this:

This is an automated cherry-pick of #2944

/assign tsorya

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8cbd91b6-8db1-4333-83d8-d4f47f1ca89f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from arghosh93 and arkadeepsen May 7, 2026 12:56
@tsorya
Contributor

tsorya commented May 7, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ff81e80-4a14-11f1-8986-7930f2586794-1

@tsorya
Contributor

tsorya commented May 7, 2026

/retest

1 similar comment
@tsorya
Contributor

tsorya commented May 9, 2026

/retest

@tsorya
Contributor

tsorya commented May 9, 2026

PRPQR e2e-aws-ovn-upgrade Analysis (10 runs, 5 passed / 5 failed — needed 8 to pass)

| Shard | Run ID | Result | Duration | Failure Type | Real (Non-Flake) Blocking Failures |
| --- | --- | --- | --- | --- | --- |
| 0 | 2052372720007516160 | ❌ FAIL | 6h01m | MonitorTest + e2e blocking | [sig-arch][Late] collect certificate data (blocking); [Monitor] Nodes should reach OSUpdateStaged in a timely fashion; pathological events (etcd, kube-apiserver, kube-controller-manager, kube-scheduler); KubePodNotReady in openshift-marketplace |
| 1 | 2052372720460500992 | ❌ FAIL | 5h52m | e2e blocking | [sig-arch][Late] collect certificate data (blocking); [sig-arch][Late] all registered tls artifacts must have no metadata violation regressions; [sig-arch][Late] all tls artifacts must be registered; [sig-node][Late] CRI-O goroutine dump via SIGUSR1 should contain no stuck image pulls |
| 2 | 2052372720913485824 | ✅ PASS | 5h44m | | |
| 3 | 2052372721366470656 | ✅ PASS | 5h26m | | |
| 4 | 2052372721815261184 | ✅ PASS | 5h34m | | |
| 5 | 2052372722272440320 | ✅ PASS | 5h37m | | |
| 6 | 2052372722721230848 | ❌ FAIL | 5h53m | MonitorTest | [Monitor:legacy-networking-invariants] pods should successfully create sandboxes by other (28× FailedCreatePodSandBox — no CNI config during upgrade) |
| 7 | 2052372723178409984 | ❌ FAIL | 5h50m | e2e blocking | [sig-node] Pod InPlace Resize Container (limit-ranger) [FeatureGate:InPlacePodVerticalScaling] pod-resize-limit-ranger-test exceed maximum Memory and CPU |
| 8 | 2052372723874664448 | ❌ FAIL | 5h58m | MonitorTest + e2e blocking + upgrade timeout | [sig-arch][Late] collect certificate data (blocking); [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time (85.5m vs 85m limit); [Monitor] Nodes should reach OSUpdateStaged in a timely fashion; KubePodNotReady in ns/default |
| 9 | 2052372724713525248 | ✅ PASS | 5h42m | | |

Failure summary

| Real Failure | Shards | Pre-existing on baseline? | PR-related? |
| --- | --- | --- | --- |
| [sig-arch][Late] collect certificate data | 0, 1, 8 | ✅ Yes — same certs.go:144 failure on periodic master runs | No |
| [Monitor] pods should successfully create sandboxes by other (no CNI config) | 6 | Transient CNI gaps seen on all runs (passing + baseline); this run crossed the monitor threshold | Unlikely — PR moves feature flags, doesn't change CNI startup sequence |
| [sig-node] InPlacePodVerticalScaling exceed max Memory and CPU | 7 | Not verified | No — sig-node feature gate test, unrelated to networking |
| [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time | 8 | Timing-dependent (85.5m vs 85m limit); MCO was the bottleneck | No — MCO-driven node recycling was the long pole |
| [Monitor] Nodes should reach OSUpdateStaged in a timely fashion | 0, 8 | Common MCO monitor flake | No |
| [Monitor] KubePodNotReady (marketplace, default) | 0, 8 | Common alert flake | No |
| Pathological events (etcd, kube-apiserver, etc.) | 0 | Common upgrade noise | No |
| [sig-arch][Late] tls artifacts / metadata regressions | 0, 1, 8 | Informing (non-blocking), co-occurs with cert data failure | No |
| [sig-node][Late] CRI-O no stuck image pulls | 0, 1, 8 | Informing (non-blocking) | No |

@tsorya
Contributor

tsorya commented May 9, 2026

PRPQR hypershift e2e-aks Analysis

| Run | Result | Failure Type | Blocking Failures |
| --- | --- | --- | --- |
| 2052372745726988288 | ❌ FAIL | Hypershift e2e test failure | TestUpgradeControlPlane/ValidateHostedCluster/EnsureOAPIMountsTrustBundle — test tried to exec into a Failed-phase HCP openshift-apiserver pod for 300s. Pod failed due to scheduling: TaintToleration (pod didn't tolerate node taints), not a container crash or OOM. Other apiserver replicas were Running. TestUpgradeControlPlane/ValidateHostedCluster — one transient TLS handshake timeout to guest API that later succeeded (connected in 11s). |
  • Guest ClusterOperator/network: Available=True, Degraded=False, Progressing=False — CNO not involved
  • Failure is a stale Failed-phase pod from a rollout + test picking that pod for exec instead of a Running one

PRPQR hypershift e2e-aws-ovn Analysis

| Run | Result | Failure Type | Blocking Failures |
| --- | --- | --- | --- |
| 2052372746683289600 | ❌ FAIL | Hypershift e2e test failure (teardown) | TestCreateClusterPrivateWithRouteKAS/Teardown — fixture's wait for AWS resources to disappear hit context deadline with 3 resources still listed: 2 CAPA node EBS volumes + 1 NLB (openshift-ingress/router-default). All functional subtests passed. destroy.log shows the NLB was successfully deleted shortly after infra destroy started — the test fixture's polling deadline expired before async AWS cleanup completed. |
  • Stuck NLB is the Ingress Operator's router-default Service LB, not CNO
  • No CNO/OVN failures in the test — CNO not involved
  • Failure is AWS async resource cleanup latency vs test fixture timeout

@openshift-ci
Contributor

openshift-ci Bot commented May 9, 2026

@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | 9a31008 | link | false | /test security |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tsorya
Contributor

tsorya commented May 10, 2026

Nightly PRPQR e2e-gcp-ovn-rt-upgrade Analysis (10 runs, 0 passed / 10 failed)

| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
| --- | --- | --- | --- | --- |
| 1 | 2052372781747671040 | e2e blocking (20) | mass-test + 9× oc adm must-gather + collect certificate data. Monitor: OSUpdateStaged | 20 blocking failures triggered mass-test-failure threshold (>10). must-gather and cert collection failures dominated. |
| 2 | 2052372784276836352 | e2e blocking (19) | mass-test + 8× must-gather + collect certificate data + [sig-storage] CSI Mock volume storage capacity | Same must-gather/cert pattern + one CSI storage test. |
| 3 | 2052372785107308544 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 4 | 2052372786789224448 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern. |
| 5 | 2052372785954557952 | e2e blocking (2) | collect certificate data + 2× TLS artifacts registration. Monitor: OSUpdateStaged | Only cert-related failures; mass-test check passed on this shard. |
| 6 | 2052372778409005056 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data. Monitor: OSUpdateStaged | Same pattern as shard 1. |
| 7 | 2052372782586531840 | Install failure | Installer exit code 6 — no e2e tests ran | Workers never provisioned (machine-api: 0/2 running replicas). Without workers: ingress router pods couldn't schedule (untolerated taints on 3 control-plane nodes), cascading to console, monitoring, image-registry. CNO was Progressing (waiting on other operators), not Degraded. GCP machine provisioning failure. |
| 8 | 2052372787623890944 | e2e blocking (2) | CCO metrics test + 3× kube-apiserver TLS/certificate. Monitor: OSUpdateStaged | CCO metrics + cert collection failures. |
| 9 | 2052372780908810240 | e2e blocking (26) + MonitorTest | mass-test + [sig-builds] failures. Monitor: [sig-network-edge] disruption/service-load-balancer-with-pdb | 26 blocking failures (mostly sig-builds) + service LB disruption monitor. Process timed out / exit 127. |
| 10 | 2052372783429586944 | e2e blocking (20) | mass-test + 9× must-gather + collect certificate data | Same pattern as shard 1. |
  • 0/10 passed — this job is broadly broken. 7/10 shards hit the same must-gather + cert-data pattern suggesting a systemic GCP RT environment issue
  • CNO/OVN not observed as the cause of any failure
  • Periodic baseline data is from March 2026 (stale) — cannot confirm whether this is pre-existing in May 2026

@tsorya
Contributor

tsorya commented May 10, 2026

Nightly PRPQR e2e-azure-ovn-upgrade Analysis (10 runs, 5 passed / 5 failed)

Aggregator timed out at 7h waiting for shard 6 (deprovision timeout).

| Shard | Run ID | Failure Type | Real Blocking Failures | Observed Cause |
| --- | --- | --- | --- | --- |
| 1 | 2052372770850869248 | e2e blocking (2) | [sig-api-machinery] AdmissionWebhook should mutate custom resource with pruning [Conformance]; [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Two conformance test failures — AdmissionWebhook mutation and kubelet liveness probe handling. Not networking-related. |
| 5 | 2052372774202118144 | e2e blocking (1) | [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe | Same liveness probe conformance test as shard 1. Shared failure across 2 shards suggests a flaky conformance test. |
| 6 | 2052372775049367552 | Infra — deprovision timeout | All e2e tests passed (2026 pass, 1 flaky). ipi-deprovision-deprovision hit 1h Azure destroy timeout. | Tests were green — failure is purely infra cleanup. Caused the aggregator to time out at 7h waiting for this shard. |
| 7 | 2052372775879839744 | e2e blocking (1) | [sig-api-machinery] FieldValidation should detect unknown metadata fields in both the root and embedded object of a CR [Conformance] | Conformance test for CRD field validation. Not networking-related. |
| 8 | 2052372776722894848 | Upgrade MonitorTest | [Monitor:audit-log-analyzer][sig-api-machinery] API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients | Audit log analysis detected API load balancers sending requests to kube-apiserver after it reported not-ready during upgrade shutdown. sig-api-machinery issue, not CNO. |
  • CNO/OVN not observed in any real failure chain
  • Most common shared failure: sig-node liveness probe test (shards 1, 5)

Nightly PRPQR e2e-aws-ovn-techpreview Analysis

| Run | Failure Type | Real Blocking Failures | Observed Cause |
| --- | --- | --- | --- |
| 2052372803566440448 | MonitorTest | [Monitor:legacy-test-framework-invariants-pathological] events should not repeat pathologically | 21× Back-off pulling image events for OLM webhook-operator pod (quay.io/openshift/community-e2e-images:...webhook-operator...). Image pull backoff repeated enough to cross the pathological events threshold. OLM image pull issue, not CNO. |
  • CNO/OVN not in the failure chain

@tsorya
Contributor

tsorya commented May 10, 2026

/verified by @tsorya

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 10, 2026
@openshift-ci-robot
Contributor

@tsorya: This PR has been marked as verified by @tsorya.

Details

In response to this:

/verified by @tsorya

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tsorya
Contributor

tsorya commented May 10, 2026

@danwinship can you please take a look?

@danwinship
Contributor

clean backport with no changes
/lgtm
/label backport-risk-assessed

@openshift-ci openshift-ci Bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label May 11, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 11, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, openshift-cherrypick-robot

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2026

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
