fix(memory/topology): Fallback to NVML for GPU NUMA node when sysfs reports -1 #119
nirandaperera wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: niranda perera <niranda.perera@gmail.com>
```cpp
gpu.numa_node = get_numa_node_from_sys(gpu.pci_bus_id);
// Fallback: if the kernel reports -1 (typical when ACPI SRAT lacks PCIe
// affinity data), ask NVML directly. NVML walks the GPU driver's PCI bridge
// topology rather than relying on firmware tables, so it usually has the
// correct answer (same source as `nvidia-smi topo -m`).
if (gpu.numa_node == -1) {
    unsigned long nodeset = 0;
    if (nvml.p_nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
            NVML_SUCCESS &&
        nodeset != 0) {
        gpu.numa_node = std::countr_zero(nodeset);
    }
}
```
I'm actually considering whether we need `get_numa_node_from_sys` at all. Do we use it elsewhere? If not, we should probably just remove it completely and rely directly on `nvmlDeviceGetMemoryAffinity` for this.
Sure, I was thinking the same TBH.
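Something like this, perhaps (a sketch against the diff above, not the final code; `nvml`, `device`, and `gpu` are the names already in scope there, and `-1` stays the "unknown" sentinel):

```cpp
// Hypothetical simplification: resolve the NUMA node from NVML alone,
// skipping the sysfs read entirely.
unsigned long nodeset = 0;
gpu.numa_node = -1;  // "unknown" until NVML answers
if (nvml.p_nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
        NVML_SUCCESS &&
    nodeset != 0) {
    gpu.numa_node = std::countr_zero(nodeset);  // lowest-numbered node in the bitmask
}
```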
```cpp
if (topology.num_gpus == 0 || topology.num_numa_nodes <= 0) {
    SUCCEED("Skipped: requires at least one GPU and a NUMA-aware host");
```
Does a non-NUMA-aware host even exist, at least in the context that we care about and test? I think not, and we should remove the second condition.
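i.e. something like this (a sketch of the simplified guard; the trailing `return` is assumed from the usual skip pattern, not shown in the snippet above):

```cpp
// Only skip when no GPU is present; NUMA awareness is taken for granted.
if (topology.num_gpus == 0) {
    SUCCEED("Skipped: requires at least one GPU");
    return;
}
```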
Signed-off-by: niranda perera <niranda.perera@gmail.com>
pentschev left a comment
Couple of requests to tidy up docs/comments; after that we should be good.
```cpp
// Regression test for the bug where GPU NUMA node was reported as -1 on hosts whose ACPI
// SRAT/SLIT tables do not publish PCIe-to-NUMA affinity data. The previous implementation
// read /sys/bus/pci/devices/<pci>/numa_node, which returns -1 in that case;
// topology_discovery now resolves the NUMA node via nvmlDeviceGetMemoryAffinity, which
// walks the GPU driver's PCI bridge topology and is unaffected by firmware quirks.
//
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node. This test would catch a regression that
// re-introduced the -1 leak.
```
```suggestion
// Invariant: when the host advertises NUMA topology and GPUs are present, every discovered
// GPU must resolve to a valid NUMA node.
```
We don't care about details of the previous implementation. Only the expected behavior, which is what is being tested, matters.
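For context, a hypothetical Catch2-style sketch of the invariant stated that way (the `topology.gpus` container and `discover()` signature are assumptions for illustration, not the PR's actual test code):

```cpp
// Sketch only: names beyond topology_discovery::discover(), num_gpus, and
// numa_node are assumed, not taken from the PR.
TEST_CASE("every discovered GPU resolves to a valid NUMA node") {
    auto topology = topology_discovery::discover();
    if (topology.num_gpus == 0) {
        SUCCEED("Skipped: requires at least one GPU");
        return;
    }
    for (const auto& gpu : topology.gpus) {
        CHECK(gpu.numa_node >= 0);  // -1 must never leak out of discovery
    }
}
```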
```cpp
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. NVML walks the
* GPU driver's PCI bridge topology directly, so this is unaffected by ACPI
* SRAT/SLIT firmware quirks that cause `/sys/bus/pci/devices/<pci>/numa_node` to
* report -1 on otherwise NUMA-aware hosts. Same source as `nvidia-smi topo -m`.
```
```suggestion
* Queries NVML's `nvmlDeviceGetMemoryAffinity` with `NVML_AFFINITY_SCOPE_NODE` and
* returns the lowest-numbered NUMA node in the resulting bitmask. Same source as
* `nvidia-smi topo -m`.
```
Same here; we don't care about the previous implementation.
GPU `numa_node` was reported as `-1` on hosts whose ACPI firmware (SRAT/SLIT tables) does not publish PCIe-to-NUMA affinity data. The kernel writes `-1` to `/sys/bus/pci/devices/<pci>/numa_node` in that case, and `topology_discovery::discover()` propagated that straight into `gpu_topology_info::numa_node`, even though the NVIDIA driver knows the correct NUMA node via its own PCI-bridge traversal (the same source `nvidia-smi topo -m` uses).

This change adds an NVML-driven fallback. When sysfs returns `-1`, we now query `nvmlDeviceGetMemoryAffinity(..., NVML_AFFINITY_SCOPE_NODE)` and pick the lowest-numbered NUMA node from the returned bitmask.

Implementation notes

- `nvmlDeviceGetMemoryAffinity` is added to the required symbol set in `NvmlLoader`.
- `nodeSetSize = 1` (one `unsigned long` = 64 NUMA bits) covers any realistic system; `CONFIG_NODES_SHIFT` defaults to 6 on x86_64, and the fallback only runs in the rare ACPI-quirk case anyway.
- `nvmlDeviceGetNumaNodeId` was considered but rejected: it returns the NUMA node of the GPU itself, which is only meaningful on Grace/GB200-style coherent platforms where the GPU is its own NUMA node.
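For reference, a minimal self-contained sketch of the NVML query pattern described above, linking against NVML directly rather than through the project's loader (device index 0 and the file name are illustrative assumptions):

```cpp
// Minimal sketch: resolve a GPU's NUMA node via nvmlDeviceGetMemoryAffinity.
// Build with: g++ -std=c++20 numa_probe.cpp -lnvidia-ml
#include <nvml.h>

#include <bit>
#include <cstdio>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML init failed\n");
        return 1;
    }

    nvmlDevice_t device{};
    // Device index 0 is hard-coded for illustration only.
    if (nvmlDeviceGetHandleByIndex_v2(0, &device) == NVML_SUCCESS) {
        unsigned long nodeset = 0;  // nodeSetSize = 1: one 64-bit word of NUMA bits
        if (nvmlDeviceGetMemoryAffinity(device, 1, &nodeset, NVML_AFFINITY_SCOPE_NODE) ==
                NVML_SUCCESS &&
            nodeset != 0) {
            // Lowest set bit = lowest-numbered NUMA node in the affinity mask.
            std::printf("GPU 0 -> NUMA node %d\n", std::countr_zero(nodeset));
        } else {
            std::printf("GPU 0 -> NUMA node unknown\n");
        }
    }

    nvmlShutdown();
    return 0;
}
```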