fix(hostagent): bind mlx5_core to DPU PFs before configuring VFs by itailev · Pull Request #54 · NVIDIA/doca-platform

itailev · 2026-05-28T09:32:00Z

On first DPU boot (OCP flow), the host kernel can probe mlx5_core while the DPU FW is still in pre-init, triggering a 120s wait_fw_init timeout that leaves no driver bound to the PFs. After that, processNetworkRequest fails forever in SetNumOfVFs, because writes to sriov_numvfs return an error when no driver is bound (the kernel logs "no driver bound to device; cannot configure SR-IOV" on every reconcile tick).

Add EnsureDriverBoundP0 / EnsureDriverBoundP1 ops at the head of the processNetworkRequest operations slice, plus IsDriverBound and BindDriver helpers on PCIHelper. The reconcile loop's existing 30s retry handles the case where FW is not yet ready when the bind is attempted.

The fix is in the shared reconcile path and covers both vanilla DPF (driven by the DPUHostNetworkConfiguration phase handler) and OCP/HCP-provisioner (driven by DPU-side HTTP POST /configure-host-vfs). No protocol change.

Co-authored-by: Cursor <cursoragent@cursor.com>

itailev · 2026-05-28T10:56:42Z

For visibility — companion safety-net fix on the RH side:
rh-ecosystem-edge/dpf-hcp-provisioner-operator#135

That PR adds Restart=on-failure to setup-vfs-devlink.service so the
DPU side recovers automatically even if anything (this PR included)
ever fails to bind the host driver in time. Defense in depth.

…m transient host issues

The service is Type=oneshot and the script (setup-vfs-devlink.sh create-vfs) has an internal 600s timeout. If VF setup fails (for example, because the host's mlx5_core driver became unbound after a wait_fw_init timeout during host boot), the script exits with status 1 and the service stays in failed state forever. systemd does not retry oneshot services without an explicit Restart= directive.

This is a hard fail for DPU provisioning: setup-vfs-devlink.service gates machine-config-daemon-pull.service via the 10-require-setup-vfs drop-in, which gates firstboot, which installs CRI-O, which gates kubelet. A single failure of this oneshot therefore prevents the DPU from ever joining the hosted cluster, even after the underlying host issue is fixed.

Observed: a DPU was stuck in "DPU Config" phase for 13+ hours. Once the host driver was rebound manually, VFs came up on the host immediately, but setup-vfs-devlink.service had given up at the 600s mark and never tried again. Restarting it manually unblocked the entire MCO/kubelet chain.

With Restart=on-failure and RestartSec=30, systemd retries every 30s. The host's /configure-host-vfs handler is idempotent (short-circuits via nm.reqs[dpu.UID] when a NetworkRequest already exists), so the retry is safe.

Companion fix in NVIDIA/doca-platform#54 makes the host-agent rebind mlx5_core before configuring VFs, so the underlying failure mode should not recur. This change is the safety net.

Co-authored-by: Cursor <cursoragent@cursor.com>
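
The retry is only safe because the host-side handler short-circuits on an existing request. A minimal sketch of that pattern follows. The type names, fields, and the ConfigureHostVFs method are hypothetical simplifications; only the idea of keying pending requests by DPU UID in a `reqs` map comes from the commit message above.

```go
package main

import (
	"fmt"
	"sync"
)

// NetworkRequest is a simplified, hypothetical request record.
type NetworkRequest struct {
	DPUUID string
	NumVFs int
}

// NetworkManager is a hypothetical stand-in for the host-agent component
// that tracks one pending request per DPU UID.
type NetworkManager struct {
	mu   sync.Mutex
	reqs map[string]*NetworkRequest
}

func NewNetworkManager() *NetworkManager {
	return &NetworkManager{reqs: make(map[string]*NetworkRequest)}
}

// ConfigureHostVFs records a request for the DPU unless one already exists.
// It returns true only when a new request was created, so repeated calls
// from a restarting client leave the state unchanged.
func (nm *NetworkManager) ConfigureHostVFs(uid string, numVFs int) bool {
	nm.mu.Lock()
	defer nm.mu.Unlock()
	if _, ok := nm.reqs[uid]; ok {
		return false // short-circuit: request already tracked
	}
	nm.reqs[uid] = &NetworkRequest{DPUUID: uid, NumVFs: numVFs}
	return true
}

func main() {
	nm := NewNetworkManager()
	// Simulate systemd restarting the DPU-side service three times.
	for attempt := 1; attempt <= 3; attempt++ {
		created := nm.ConfigureHostVFs("dpu-1", 16)
		fmt.Printf("attempt %d: created=%t, tracked=%d\n", attempt, created, len(nm.reqs))
	}
}
```

Only the first attempt creates a request. Later attempts return without touching state, which is what lets a 30s restart loop run indefinitely without piling up work on the host.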

itailev mentioned this pull request May 28, 2026

Restart setup-vfs-devlink.service on failure to recover transient host driver issues rh-ecosystem-edge/dpf-hcp-provisioner-operator#135

Merged

itailev wants to merge 1 commit into NVIDIA:public-release-v26.4 from itailev:itailev/bind-mlx5-before-vfs
