
fix(memory/topology): Fallback to NVML for GPU NUMA node when sysfs reports -1 #119

Open
nirandaperera wants to merge 3 commits into NVIDIA:main from nirandaperera:topology-gpu-numa-node-fix

Conversation

nirandaperera (Contributor) commented May 6, 2026

GPU numa_node was reported as -1 on hosts whose ACPI firmware (SRAT/SLIT tables) does not publish PCIe-to-NUMA affinity data. The kernel writes -1 to /sys/bus/pci/devices/<pci>/numa_node in that case, and topology_discovery::discover() propagated that straight into gpu_topology_info::numa_node — even though the NVIDIA driver knows the correct NUMA node via its own PCI-bridge traversal (the same source nvidia-smi topo -m uses).
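
For illustration, a minimal sketch of the sysfs read described above. The helper name get_numa_node_from_sys comes from the diff below; this standalone version is an assumption about its shape, not the repository's code:

```cpp
#include <fstream>
#include <string>

// Read the kernel's NUMA hint for a PCI device. On hosts whose ACPI
// SRAT/SLIT tables lack PCIe affinity data, the file contains "-1".
// (Hypothetical standalone version of get_numa_node_from_sys.)
int numa_node_from_sysfs(const std::string& pci_bus_id) {
  std::ifstream f("/sys/bus/pci/devices/" + pci_bus_id + "/numa_node");
  int node = -1;
  f >> node;  // stays -1 if the file is missing or unreadable
  return node;
}
```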

This change adds an NVML-driven fallback. When sysfs returns -1, we now query nvmlDeviceGetMemoryAffinity(..., NVML_AFFINITY_SCOPE_NODE) and pick the lowest-numbered NUMA node from the returned bitmask.
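
For reference, a self-contained sketch of this fallback against the raw NVML API. The loader indirection used in the actual patch is omitted; nvmlDeviceGetMemoryAffinity and NVML_AFFINITY_SCOPE_NODE are real NVML symbols, the surrounding harness is illustrative only:

```cpp
#include <bit>     // std::countr_zero (C++20)
#include <cstdio>
#include <nvml.h>  // link with -lnvidia-ml

// Ask NVML which NUMA node(s) the GPU's memory is local to and take the
// lowest set bit of the returned node mask; -1 if NVML has no answer.
int numa_node_from_nvml(nvmlDevice_t device) {
  unsigned long nodeset = 0;  // one word covers up to 64 NUMA nodes
  nvmlReturn_t rc = nvmlDeviceGetMemoryAffinity(
      device, /*nodeSetSize=*/1, &nodeset, NVML_AFFINITY_SCOPE_NODE);
  return (rc == NVML_SUCCESS && nodeset != 0) ? std::countr_zero(nodeset) : -1;
}

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) return 1;
  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS)
    std::printf("GPU 0 NUMA node: %d\n", numa_node_from_nvml(dev));
  nvmlShutdown();
  return 0;
}
```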

Implementation notes

  • nvmlDeviceGetMemoryAffinity is added to the required symbol set in NvmlLoader
  • nodeSetSize = 1 (one unsigned long = 64 NUMA bits), which covers any realistic system: CONFIG_NODES_SHIFT defaults to 6 on x86_64, capping the node count at 2^6 = 64, and the fallback only runs in the rare ACPI-quirk case anyway (see the sketch after this list).
  • nvmlDeviceGetNumaNodeId was considered but rejected: it returns the NUMA node of the GPU itself, which is only meaningful on Grace/GB200-style coherent platforms where the GPU is its own NUMA node.
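
A concrete check of the nodeSetSize = 1 sizing from the list above (illustrative only, assuming an LP64 platform): each bit of the word maps to one NUMA node, so one 64-bit word exactly covers the 2^6 = 64 nodes permitted by the default CONFIG_NODES_SHIFT.

```cpp
#include <bit>
#include <cassert>

int main() {
  // A mask with only NUMA node 3 set: bit 3, i.e. 0b1000.
  unsigned long nodeset = 1ul << 3;
  assert(std::countr_zero(nodeset) == 3);      // lowest set bit -> node 3
  static_assert(sizeof(unsigned long) * 8 == 64,
                "one word = 64 node bits (LP64 assumption)");
  return 0;
}
```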

Signed-off-by: niranda perera <niranda.perera@gmail.com>
Comment thread src/memory/topology_discovery.cpp Outdated
Comment on lines +872 to +884
gpu.numa_node = get_numa_node_from_sys(gpu.pci_bus_id);
// Fallback: if the kernel reports -1 (typical when ACPI SRAT lacks PCIe
// affinity data), ask NVML directly. NVML walks the GPU driver's PCI bridge
// topology rather than relying on firmware tables, so it usually has the
// correct answer (same source as `nvidia-smi topo -m`).
if (gpu.numa_node == -1) {
  unsigned long nodeset = 0;
  if (nvml.p_nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
          NVML_SUCCESS &&
      nodeset != 0) {
    gpu.numa_node = std::countr_zero(nodeset);
  }
}
Contributor

I'm actually considering whether we need get_numa_node_from_sys anyway; do we use it elsewhere? If not, we should probably just remove it completely and rely directly on nvmlDeviceGetMemoryAffinity for this.

Contributor Author

Sure, I was thinking the same TBH.

Comment thread test/memory/test_topology_discovery.cpp Outdated
Comment on lines +201 to +202
if (topology.num_gpus == 0 || topology.num_numa_nodes <= 0) {
  SUCCEED("Skipped: requires at least one GPU and a NUMA-aware host");
Contributor

Does a non-NUMA-aware host even exist, at least in the context that we care about and test? I think not, and we should remove the second condition.

Comment thread test/memory/test_topology_discovery.cpp
Signed-off-by: niranda perera <niranda.perera@gmail.com>
nirandaperera requested a review from pentschev May 6, 2026 22:37
pentschev (Contributor) left a comment

Couple of requests to tidy up docs/comments; after that we should be good.

Comment on lines +185 to +193
// Regression test for the bug where GPU NUMA node was reported as -1 on hosts whose ACPI
// SRAT/SLIT tables do not publish PCIe-to-NUMA affinity data. The previous implementation
// read /sys/bus/pci/devices/<pci>/numa_node, which returns -1 in that case;
// topology_discovery now resolves the NUMA node via nvmlDeviceGetMemoryAffinity, which
// walks the GPU driver's PCI bridge topology and is unaffected by firmware quirks.
//
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node. This test would catch a regression that
// re-introduced the -1 leak.
Contributor

Suggested change
// Regression test for the bug where GPU NUMA node was reported as -1 on hosts whose ACPI
// SRAT/SLIT tables do not publish PCIe-to-NUMA affinity data. The previous implementation
// read /sys/bus/pci/devices/<pci>/numa_node, which returns -1 in that case;
// topology_discovery now resolves the NUMA node via nvmlDeviceGetMemoryAffinity, which
// walks the GPU driver's PCI bridge topology and is unaffected by firmware quirks.
//
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node. This test would catch a regression that
// re-introduced the -1 leak.
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node.

We don't care about details of the previous implementation. Only the expected behavior, which is what is being tested, matters.

Comment on lines +337 to +341
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. NVML walks the
* GPU driver's PCI bridge topology directly, so this is unaffected by ACPI
* SRAT/SLIT firmware quirks that cause `/sys/bus/pci/devices/<pci>/numa_node` to
* report -1 on otherwise NUMA-aware hosts. Same source as `nvidia-smi topo -m`.
Contributor

Suggested change
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. NVML walks the
* GPU driver's PCI bridge topology directly, so this is unaffected by ACPI
* SRAT/SLIT firmware quirks that cause `/sys/bus/pci/devices/<pci>/numa_node` to
* report -1 on otherwise NUMA-aware hosts. Same source as `nvidia-smi topo -m`.
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. Same source as
* `nvidia-smi topo -m`.

Same here: we don't care about the previous implementation.
