fix(hostagent): bind mlx5_core to DPU PFs before configuring VFs #54
Open
itailev wants to merge 1 commit into
On first boot the host kernel can probe mlx5_core while the DPU FW is still in pre-init, triggering a 120s wait_fw_init timeout that leaves no driver bound to the PFs. After that, processNetworkRequest fails forever in SetNumOfVFs because writes to sriov_numvfs return an error when no driver is bound (kernel logs "no driver bound to device; cannot configure SR-IOV" every reconcile tick).

Add EnsureDriverBoundP0 / EnsureDriverBoundP1 ops at the head of the processNetworkRequest operations slice, plus IsDriverBound and BindDriver helpers on PCIHelper. The reconcile loop's existing 30s retry handles the case where FW is not yet ready when the bind is attempted.

The fix is in the shared reconcile path and covers both vanilla DPF (driven by the DPUHostNetworkConfiguration phase handler) and OCP/HCP-provisioner (driven by DPU-side HTTP POST /configure-host-vfs). No protocol change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Author
For visibility: there is a companion safety-net fix on the RH side. That PR adds Restart=on-failure to setup-vfs-devlink.service so the …
openshift-merge-bot (bot) pushed a commit to rh-ecosystem-edge/dpf-hcp-provisioner-operator that referenced this pull request on May 29, 2026:
…m transient host issues

The service is Type=oneshot and the script (setup-vfs-devlink.sh create-vfs) has an internal 600s timeout. If VF setup fails (for example, because the host's mlx5_core driver became unbound after a wait_fw_init timeout during host boot), the script exits with status 1 and the service stays in failed state forever. systemd does not retry oneshot services without an explicit Restart= directive.

This is a hard fail for DPU provisioning: setup-vfs-devlink.service gates machine-config-daemon-pull.service via the 10-require-setup-vfs drop-in, which gates firstboot, which installs CRI-O, which gates kubelet. A single failure of this oneshot therefore prevents the DPU from ever joining the hosted cluster, even after the underlying host issue is fixed.

Observed: a DPU was stuck in "DPU Config" phase for 13+ hours. Once the host driver was rebound manually, VFs came up on the host immediately, but setup-vfs-devlink.service had given up at the 600s mark and never tried again. Restarting it manually unblocked the entire MCO/kubelet chain.

With Restart=on-failure and RestartSec=30, systemd retries every 30s. The host's /configure-host-vfs handler is idempotent (short-circuits via nm.reqs[dpu.UID] when a NetworkRequest already exists), so the retry is safe.

Companion fix in NVIDIA/doca-platform#54 makes the host-agent rebind mlx5_core before configuring VFs, so the underlying failure mode should not recur. This change is the safety net.

Co-authored-by: Cursor <cursoragent@cursor.com>
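To illustrate why a 30s retry against /configure-host-vfs is safe, here is a minimal Go sketch of an idempotent handler that short-circuits when a request for the DPU already exists. The `manager` type, the `NetworkRequest` fields, and the `ConfigureHostVFs` method are hypothetical stand-ins chosen for illustration; only the idea of keying `reqs` by the DPU UID comes from the commit message, and the real host-agent code may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// NetworkRequest is a hypothetical stand-in for the host agent's request type.
type NetworkRequest struct {
	DPUUID string
	NumVFs int
}

// manager is a hypothetical stand-in for the object that owns nm.reqs.
type manager struct {
	mu   sync.Mutex
	reqs map[string]*NetworkRequest // keyed by DPU UID
}

// ConfigureHostVFs records a request for the given DPU unless one already
// exists. It returns true only when a new request was created, so repeated
// calls (for example, systemd restarting the caller every 30s) change nothing.
func (m *manager) ConfigureHostVFs(uid string, numVFs int) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.reqs[uid]; ok {
		return false // short-circuit: request already tracked
	}
	m.reqs[uid] = &NetworkRequest{DPUUID: uid, NumVFs: numVFs}
	return true
}

func main() {
	m := &manager{reqs: map[string]*NetworkRequest{}}
	for attempt := 1; attempt <= 3; attempt++ {
		created := m.ConfigureHostVFs("dpu-1", 16)
		fmt.Printf("attempt %d: created=%v\n", attempt, created)
	}
}
```

Only the first attempt creates state; later attempts return without side effects, which is the property the Restart=on-failure retry relies on.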
On first DPU boot (OCP flow) the host kernel can probe mlx5_core while the DPU FW is still in pre-init, triggering a 120s wait_fw_init timeout that leaves no driver bound to the PFs. After that, processNetworkRequest fails forever in SetNumOfVFs because writes to sriov_numvfs return an error when no driver is bound (kernel logs "no driver bound to device; cannot configure SR-IOV" every reconcile tick).
Add EnsureDriverBoundP0 / EnsureDriverBoundP1 ops at the head of the processNetworkRequest operations slice, plus IsDriverBound and BindDriver helpers on PCIHelper. The reconcile loop's existing 30s retry handles the case where FW is not yet ready when the bind is attempted.
The fix is in the shared reconcile path and covers both vanilla DPF (driven by the DPUHostNetworkConfiguration phase handler) and OCP/HCP-provisioner (driven by DPU-side HTTP POST /configure-host-vfs). No protocol change.