fix(memory/topology): Fallback to NVML for GPU NUMA node when sysfs reports -1 #119
nirandaperera wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: niranda perera <niranda.perera@gmail.com>
```cpp
gpu.numa_node = get_numa_node_from_sys(gpu.pci_bus_id);
// Fallback: if the kernel reports -1 (typical when ACPI SRAT lacks PCIe
// affinity data), ask NVML directly. NVML walks the GPU driver's PCI bridge
// topology rather than relying on firmware tables, so it usually has the
// correct answer (same source as `nvidia-smi topo -m`).
if (gpu.numa_node == -1) {
    unsigned long nodeset = 0;
    if (nvml.p_nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
            NVML_SUCCESS &&
        nodeset != 0) {
        gpu.numa_node = std::countr_zero(nodeset);
    }
}
```
I'm actually considering whether we need `get_numa_node_from_sys` at all. Do we use it elsewhere? If not, we should probably just remove it completely and rely directly on `nvmlDeviceGetMemoryAffinity` for this.
Sure, I was thinking the same TBH.
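Something like this, perhaps (a sketch against the diff above, not the final code; `nvml`, `device`, and `gpu` are the names already in scope there, and `-1` stays the "unknown" sentinel):

```cpp
// Hypothetical simplification: resolve the NUMA node from NVML alone,
// skipping the sysfs read entirely.
unsigned long nodeset = 0;
gpu.numa_node = -1;  // "unknown" until NVML answers
if (nvml.p_nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
        NVML_SUCCESS &&
    nodeset != 0) {
    gpu.numa_node = std::countr_zero(nodeset);  // lowest-numbered node in the bitmask
}
```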
```cpp
if (topology.num_gpus == 0 || topology.num_numa_nodes <= 0) {
    SUCCEED("Skipped: requires at least one GPU and a NUMA-aware host");
```
Does a non-NUMA-aware host even exist, at least in the context that we care about and test? I think not, and we should remove the second condition.
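i.e. something like this (a sketch of the simplified guard; the trailing `return` is assumed from the usual skip pattern, not shown in the snippet above):

```cpp
// Only skip when no GPU is present; NUMA awareness is taken for granted.
if (topology.num_gpus == 0) {
    SUCCEED("Skipped: requires at least one GPU");
    return;
}
```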
Signed-off-by: niranda perera <niranda.perera@gmail.com>
pentschev left a comment
Couple of requests to tidy up docs/comments; after that we should be good.
```cpp
// Regression test for the bug where GPU NUMA node was reported as -1 on hosts whose ACPI
// SRAT/SLIT tables do not publish PCIe-to-NUMA affinity data. The previous implementation
// read /sys/bus/pci/devices/<pci>/numa_node, which returns -1 in that case;
// topology_discovery now resolves the NUMA node via nvmlDeviceGetMemoryAffinity, which
// walks the GPU driver's PCI bridge topology and is unaffected by firmware quirks.
//
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node. This test would catch a regression that
// re-introduced the -1 leak.
```
```suggestion
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node.
```
We don't care about details of the previous implementation. Only the expected behavior, which is what is being tested, matters.
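For context, a hypothetical Catch2-style sketch of the invariant stated that way (the `topology.gpus` container and `discover()` signature are assumptions for illustration, not the PR's actual test code):

```cpp
// Sketch only: names beyond topology_discovery::discover(), num_gpus, and
// numa_node are assumed, not taken from the PR.
TEST_CASE("every discovered GPU resolves to a valid NUMA node") {
    auto topology = topology_discovery::discover();
    if (topology.num_gpus == 0) {
        SUCCEED("Skipped: requires at least one GPU");
        return;
    }
    for (const auto& gpu : topology.gpus) {
        CHECK(gpu.numa_node >= 0);  // -1 must never leak out of discovery
    }
}
```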
```cpp
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. NVML walks the
* GPU driver's PCI bridge topology directly, so this is unaffected by ACPI
* SRAT/SLIT firmware quirks that cause `/sys/bus/pci/devices/<pci>/numa_node` to
* report -1 on otherwise NUMA-aware hosts. Same source as `nvidia-smi topo -m`.
```
```suggestion
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. Same source as
* `nvidia-smi topo -m`.
```
Same here; we don't care about the previous implementation.
GPU `numa_node` was reported as `-1` on hosts whose ACPI firmware (SRAT/SLIT tables) does not publish PCIe-to-NUMA affinity data. The kernel writes `-1` to `/sys/bus/pci/devices/<pci>/numa_node` in that case, and `topology_discovery::discover()` propagated that straight into `gpu_topology_info::numa_node`, even though the NVIDIA driver knows the correct NUMA node via its own PCI-bridge traversal (the same source `nvidia-smi topo -m` uses).

This change adds an NVML-driven fallback. When sysfs returns `-1`, we now query `nvmlDeviceGetMemoryAffinity(..., NVML_AFFINITY_SCOPE_NODE)` and pick the lowest-numbered NUMA node from the returned bitmask.

Implementation notes

- `nvmlDeviceGetMemoryAffinity` is added to the required symbol set in `NvmlLoader`.
- `nodeSetSize = 1` (one `unsigned long` = 64 NUMA bits) covers any realistic system; `CONFIG_NODES_SHIFT` defaults to 6 on x86_64, and the fallback only runs in the rare ACPI-quirk case anyway.
- `nvmlDeviceGetNumaNodeId` was considered but rejected: it returns the NUMA node of the GPU itself, which is only meaningful on Grace/GB200-style coherent platforms where the GPU is its own NUMA node.
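For reference, a minimal self-contained sketch of the NVML query pattern described above, linking against NVML directly rather than through the project's loader (device index 0 and the file name are illustrative assumptions):

```cpp
// Minimal sketch: resolve a GPU's NUMA node via nvmlDeviceGetMemoryAffinity.
// Build with: g++ -std=c++20 numa_probe.cpp -lnvidia-ml
#include <nvml.h>

#include <bit>
#include <cstdio>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML init failed\n");
        return 1;
    }

    nvmlDevice_t device{};
    // Device index 0 is hard-coded for illustration only.
    if (nvmlDeviceGetHandleByIndex_v2(0, &device) == NVML_SUCCESS) {
        unsigned long nodeset = 0;  // nodeSetSize = 1: one 64-bit word of NUMA bits
        if (nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
                NVML_SUCCESS &&
            nodeset != 0) {
            // Lowest set bit = lowest-numbered NUMA node in the affinity mask.
            std::printf("GPU 0 -> NUMA node %d\n", std::countr_zero(nodeset));
        } else {
            std::printf("GPU 0 -> NUMA node unknown\n");
        }
    }

    nvmlShutdown();
    return 0;
}
```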