fix(hostagent): bind mlx5_core to DPU PFs before configuring VFs #54

Open
itailev wants to merge 1 commit into NVIDIA:public-release-v26.4 from itailev:itailev/bind-mlx5-before-vfs

itailev commented May 28, 2026

On first DPU boot (OCP flow), the host kernel can probe mlx5_core while the DPU FW is still in pre-init, triggering a 120s wait_fw_init timeout that leaves no driver bound to the PFs. After that, processNetworkRequest fails indefinitely in SetNumOfVFs, because writes to sriov_numvfs return an error when no driver is bound (the kernel logs "no driver bound to device; cannot configure SR-IOV" on every reconcile tick).

Add EnsureDriverBoundP0 / EnsureDriverBoundP1 ops at the head of the processNetworkRequest operations slice, plus IsDriverBound and BindDriver helpers on PCIHelper. The reconcile loop's existing 30s retry handles the case where FW is not yet ready when the bind is attempted.

The fix is in the shared reconcile path and covers both vanilla DPF (driven by the DPUHostNetworkConfiguration phase handler) and OCP/HCP-provisioner (driven by DPU-side HTTP POST /configure-host-vfs). No protocol change.

Co-authored-by: Cursor <cursoragent@cursor.com>

itailev commented May 28, 2026

For visibility — companion safety-net fix on the RH side:
rh-ecosystem-edge/dpf-hcp-provisioner-operator#135

That PR adds Restart=on-failure to setup-vfs-devlink.service so the
DPU side recovers automatically even if anything (this PR included)
ever fails to bind the host driver in time. Defense in depth.

openshift-merge-bot Bot pushed a commit to rh-ecosystem-edge/dpf-hcp-provisioner-operator that referenced this pull request May 29, 2026
…m transient host issues

The service is Type=oneshot and the script (setup-vfs-devlink.sh
create-vfs) has an internal 600s timeout. If VF setup fails — for
example because the host's mlx5_core driver became unbound after a
wait_fw_init timeout during host boot — the script exits with status 1
and the service stays in failed state forever. systemd does not retry
oneshot services without an explicit Restart= directive.

This is a hard fail for DPU provisioning: setup-vfs-devlink.service
gates machine-config-daemon-pull.service via the 10-require-setup-vfs
drop-in, which gates firstboot, which installs CRI-O, which gates
kubelet — so a single failure of this oneshot prevents the DPU from
ever joining the hosted cluster, even after the underlying host issue
is fixed.

Observed: a DPU was stuck in "DPU Config" phase for 13+ hours. Once
the host driver was rebound manually, VFs came up on the host
immediately, but setup-vfs-devlink.service had given up at the 600s
mark and never tried again. Restarting it manually unblocked the
entire MCO/kubelet chain.

With Restart=on-failure RestartSec=30, systemd retries every 30s.
The host's /configure-host-vfs handler is idempotent (short-circuits
via nm.reqs[dpu.UID] when a NetworkRequest already exists), so the
retry is safe.
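The idempotency property that makes the 30s retry safe can be sketched as a handler that records one request per DPU UID and short-circuits on repeats. This is a hypothetical illustration, not the host agent's code: the `manager` and `networkRequest` types, the `configureHostVFs` method, and its return values are assumptions; only the idea of a `reqs` map keyed by DPU UID comes from the commit message.

```go
package main

import "sync"

// networkRequest is a hypothetical record of a VF configuration request.
type networkRequest struct {
	numVFs int
}

// manager is a hypothetical stand-in for the host-side network manager,
// holding at most one request per DPU UID.
type manager struct {
	mu   sync.Mutex
	reqs map[string]*networkRequest
}

func newManager() *manager {
	return &manager{reqs: make(map[string]*networkRequest)}
}

// configureHostVFs returns the existing request for uid if there is one,
// and creates a new one otherwise. The boolean reports whether a new
// request was created, so repeated calls have no additional effect.
func (m *manager) configureHostVFs(uid string, numVFs int) (*networkRequest, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if req, ok := m.reqs[uid]; ok {
		return req, false
	}
	req := &networkRequest{numVFs: numVFs}
	m.reqs[uid] = req
	return req, true
}
```

Under this design, a restarted service that posts the same request again receives the existing entry rather than triggering a second configuration.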

Companion fix in NVIDIA/doca-platform#54 makes the host-agent rebind
mlx5_core before configuring VFs, so the underlying failure mode
should not recur. This change is the safety net.

Co-authored-by: Cursor <cursoragent@cursor.com>