runsc: serialize RDMA and PCI sysfs data at boot #13114

Draft
atoniolo76 wants to merge 1 commit into google:master from modal-labs:alessio/serialize-rdma-sysfs-boot

@atoniolo76

Snapshot host RDMA/InfiniBand and PCI device sysfs attributes before pivot_root (while the host sysfs is still accessible), serialize them as JSON, and reconstruct them as virtual kernfs entries inside the sentry. This gives NCCL and libibverbs the topology information they need for device discovery and PCI distance computation without granting the sandbox access to the real host sysfs.

Key components:

  • pkg/sentry/fsimpl/sys/rdma.go: RDMA device data collection and virtual /sys/class/infiniband{,_verbs}/ + /sys/class/net/ entries
  • pkg/sentry/fsimpl/sys/pci_devices.go: PCI topology collection and virtual /sys/devices/pci*/ hierarchy with symlinks for NCCL
  • runsc/cmd/chroot.go: collection before pivot_root
  • runsc/cmd/boot.go: --rdmaproxy flag and data deserialization
  • runsc/boot/vfs.go: wire sysfs InternalData for RDMA/PCI
  • runsc/config: --rdmaproxy configuration flag
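The collect-then-serialize step described above can be sketched in a few lines of Go. This is a minimal illustration, not the code in this PR: the `DeviceAttrs` type and the `collectDevices` and `roundTrip` helpers are hypothetical names, and the sketch assumes that reading the regular files directly under each device directory is enough, whereas the real sysfs layout also has nested directories (such as per-port data) that a full collector would need to walk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// DeviceAttrs is a hypothetical snapshot of one sysfs device directory:
// attribute file name mapped to its trimmed contents.
type DeviceAttrs struct {
	Name  string            `json:"name"`
	Attrs map[string]string `json:"attrs"`
}

// collectDevices reads the regular files directly under each entry of
// classDir (for example /sys/class/infiniband). Entries or attributes
// that cannot be read are skipped rather than treated as fatal.
func collectDevices(classDir string) ([]DeviceAttrs, error) {
	entries, err := os.ReadDir(classDir)
	if err != nil {
		return nil, err
	}
	var devs []DeviceAttrs
	for _, e := range entries {
		devDir := filepath.Join(classDir, e.Name())
		files, err := os.ReadDir(devDir)
		if err != nil {
			continue // not a directory, or unreadable
		}
		d := DeviceAttrs{Name: e.Name(), Attrs: map[string]string{}}
		for _, f := range files {
			if !f.Type().IsRegular() {
				continue // skip subdirectories and symlinks in this sketch
			}
			data, err := os.ReadFile(filepath.Join(devDir, f.Name()))
			if err != nil {
				continue // e.g. write-only attributes
			}
			d.Attrs[f.Name()] = strings.TrimSpace(string(data))
		}
		devs = append(devs, d)
	}
	return devs, nil
}

// roundTrip serializes the snapshot to JSON and decodes it again,
// standing in for the handoff from the pre-pivot_root collector to
// the sentry.
func roundTrip(devs []DeviceAttrs) ([]DeviceAttrs, error) {
	blob, err := json.Marshal(devs)
	if err != nil {
		return nil, err
	}
	var out []DeviceAttrs
	if err := json.Unmarshal(blob, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	devs, err := collectDevices("/sys/class/infiniband")
	if err != nil {
		fmt.Println("no RDMA devices visible:", err)
		return
	}
	restored, err := roundTrip(devs)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Printf("snapshotted %d device(s)\n", len(restored))
}
```

Storing plain strings keyed by attribute name keeps the snapshot independent of any one vendor's attribute set, at the cost of deferring all parsing to whatever consumes the reconstructed files.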

@google-cla

google-cla Bot commented May 7, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@atoniolo76 force-pushed the alessio/serialize-rdma-sysfs-boot branch 5 times, most recently from dad6dcd to 44bb161 (May 8, 2026 21:59)
@atoniolo76 force-pushed the alessio/serialize-rdma-sysfs-boot branch from 44bb161 to 97a0121 (May 12, 2026 16:07)
@atoniolo76 force-pushed the alessio/serialize-rdma-sysfs-boot branch from 97a0121 to a778e60 (May 12, 2026 16:16)
@trantoji trantoji self-requested a review May 19, 2026 22:09
Contributor

@trantoji trantoji left a comment

Hey @atoniolo76,

This PR is serializing and virtualizing three things: PCIe topology, NUMA layout, and other RDMA info. Are you open to splitting this PR into three?

What are your thoughts on the following structure:

  1. Infra to serialize and virtualize the minimal set of sysfs nodes needed to get ib_write_cuda_bw to work, gated behind a flag. (I'm assuming this tool is dumb and does not need NUMA or PCIe topology information; we can simply target a net interface and a CUDA device.)
  2. Serialize and virtualize the PCIe topology and NUMA layout, keeping the code vendor-agnostic. Applications need the PCIe topology (bridges, NICs, and accelerator locations) so that threads can initiate data flows that take the fast path (PCIe P2P). They need the NUMA layout to optimize execution, i.e., to pin their control threads to the local CPU socket so that copies avoid traversing the CPU socket interconnect.
  3. Infra to add vendor-specific quirks to the virtualized sysfs to get NCCL and CX devices working.

Splitting it this way separates the generic hardware platform from the vendor-specific middleware heuristics.
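To make item 2 concrete, here is a rough Go sketch of how a consumer might use the virtualized topology. It assumes the application compares the resolved sysfs paths of two PCI devices and counts the hops through their nearest common ancestor, and that it reads a device's `numa_node` attribute to choose a CPU socket. The `pciHops` and `parseNUMANode` helpers and the device addresses are made up for illustration; this is not NCCL's actual distance algorithm.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pciHops counts the path components separating two sysfs device paths
// via their deepest common ancestor. A smaller value suggests the
// devices sit closer together in the PCIe hierarchy.
func pciHops(a, b string) int {
	pa := strings.Split(strings.Trim(a, "/"), "/")
	pb := strings.Split(strings.Trim(b, "/"), "/")
	common := 0
	for common < len(pa) && common < len(pb) && pa[common] == pb[common] {
		common++
	}
	return (len(pa) - common) + (len(pb) - common)
}

// parseNUMANode parses the contents of a device's numa_node attribute.
// The kernel reports -1 when no NUMA affinity is known.
func parseNUMANode(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

func main() {
	// Hypothetical addresses: two devices behind the same bridge, and
	// one device under a different root complex.
	nic := "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0"
	gpu := "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1"
	far := "/sys/devices/pci0000:80/0000:80:01.0/0000:81:00.0"

	fmt.Println("nic<->gpu hops:", pciHops(nic, gpu))
	fmt.Println("nic<->far hops:", pciHops(nic, far))

	if node, err := parseNUMANode("1\n"); err == nil {
		fmt.Println("numa node:", node)
	}
}
```

Because both helpers work only on paths and attribute contents, they behave the same whether the files come from the host sysfs or from virtual entries reconstructed inside the sandbox, which is why the directory hierarchy and symlink targets need to be reproduced faithfully.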
