vNUMA: implement CPUID patch for NPOT domU vCPU counts, implement PVH dom0 vNUMA #12
Draft
tycho wants to merge 13 commits into
Expose the per-vCPU x2APIC identifiers computed by Xen via a new domctl so toolstacks can populate ACPI MADT processor entries with the same APIC IDs Xen reports via CPUID 0xB and the vlapic state. This matters when vNUMA encoding produces non-trivial APIC IDs -- i.e. multi-vnode layouts with non-power-of-two or unbalanced per-vnode vCPU counts, where guest_vcpu_x2apic_id() returns (vnode_index << pkg_shift) | (intra_pkg_offset * 2) rather than the legacy vcpu_id * 2. For POT-balanced and single-vnode layouts the returned values are bit-identical to the legacy encoding, so callers can transparently use this domctl for all layouts without special-casing.
Toolstacks that hardcode vcpu_id * 2 in MADT (libxl's acpi_lapic_id() in tools/libs/light/libxl_x86_acpi.c, and the equivalent in Protect's PVH ACPI generator) produce APIC IDs that disagree with Xen's vlapic state for NPOT vNUMA layouts -- INIT/SIPI fails to reach the intended vCPU because Xen's APIC delivery does not find a vCPU matching the MADT-advertised ID, and secondary CPU bringup hangs. This domctl is the fix: toolstacks call it after XEN_DOMCTL_setvnumainfo and use the returned values to populate MADT entries.
Buffer sizing follows the XEN_DOMCTL_get_vcpu_msrs convention: a NULL handle (or nr_vcpus == 0) is a capacity query; an insufficient buffer returns -ENOBUFS with nr_filled set to the required size. Older Xen builds return -ENOSYS, which callers may use as a capability probe.
Adding a new subop is additive and does not change any existing struct layout, so XEN_DOMCTL_INTERFACE_VERSION is not bumped. Callers identify support via the -ENOSYS pattern rather than the version number, avoiding gratuitous compatibility breakage for clients (xl, libxl, libxc) built against earlier Xen headers.
FLASK uses DOMAIN__GETVCPUCONTEXT for permission, matching the read-only "get vcpu state" pattern of get_vcpu_msrs and getvcpucontext.
Signed-off-by: Steven Noonan <steven@edera.dev>
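The buffer-sizing convention described above can be modelled as a small pure function. This is only a sketch: `copy_apicids_sketch()` and its parameters are hypothetical names, not the domctl handler from this series, and returning 0 from the capacity query is an assumption (the commit message only says that nr_filled reports the required size).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the sizing rules, not Xen code.
 * ids[]       : APIC IDs the hypervisor side would report (max_vcpus entries)
 * buf/nr_vcpus: caller-supplied buffer and its capacity
 * nr_filled   : out parameter, always set to the required entry count
 */
static int copy_apicids_sketch(const uint32_t *ids, uint32_t max_vcpus,
                               uint32_t *buf, uint32_t nr_vcpus,
                               uint32_t *nr_filled)
{
    uint32_t i;

    *nr_filled = max_vcpus;

    /* NULL handle or zero count: capacity query only (assumed to succeed). */
    if (buf == NULL || nr_vcpus == 0)
        return 0;

    /* Buffer too small: report the required size and fail. */
    if (nr_vcpus < max_vcpus)
        return -ENOBUFS;

    for (i = 0; i < max_vcpus; i++)
        buf[i] = ids[i];

    return 0;
}
```

A caller following this pattern would typically query the capacity first, allocate that many entries, and then call again with the real buffer.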
recalculate_vnuma_topo() previously bailed out silently for any vNUMA
layout where d->max_vcpus was not divisible by nr_vnodes or where the
resulting vcpus_per_node was not a power of two. The guest then saw
default (zeroed) topology leaves while CPUID 0x80000008 ECX (AMD) and
leaf 4 LLC (Intel) leaked through the host's package size. Common
toolstack-produced layouts -- e.g. 48 vCPUs split 24/24, or 49 split
25/24 -- triggered this fallback.
Add XEN_DOMCTL_CDF_vnuma_apic_topology, a domain-creation opt-in
that enables a vNUMA-derived APIC ID encoding capable of expressing
NPOT and unbalanced layouts. When set:
- recalculate_vnuma_topo() walks vcpu_to_vnode[] to determine
max_per_node (the largest vnode's vCPU count), no longer
requiring uniform distribution.
- pkg_shift = fls(2 * max_per_node - 1) reserves a power-of-2 APIC
ID window per package big enough for the largest vnode; smaller
vnodes leave the tail of their window unused.
- x2APIC IDs become (vnode_index << pkg_shift) | (intra_pkg_offset
* 2) rather than the legacy vcpu_id * 2, so the package boundary
falls on a clean bit even when consecutive vCPU IDs would
otherwise span packages.
- Advertised counts (leaf 0xB EBX, leaf 4 LLC, AMD 0x80000008 NC)
use max_per_node; unbalanced packages just look underfilled.
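The encoding in the bullets above can be worked through with a short standalone sketch. The helper names (`fls_u32`, `vnuma_pkg_shift`, `vnuma_x2apic_id`) are hypothetical stand-ins for illustration; only the two formulas come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for fls(): 1-based index of the highest set bit, 0 for 0. */
static unsigned int fls_u32(uint32_t x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* pkg_shift = fls(2 * max_per_node - 1): power-of-2 APIC ID window per package. */
static unsigned int vnuma_pkg_shift(unsigned int max_per_node)
{
    assert(max_per_node > 0);
    return fls_u32(2 * max_per_node - 1);
}

/* x2APIC ID = (vnode_index << pkg_shift) | (intra_pkg_offset * 2). */
static uint32_t vnuma_x2apic_id(unsigned int vnode_index,
                                unsigned int intra_pkg_offset,
                                unsigned int pkg_shift)
{
    return ((uint32_t)vnode_index << pkg_shift) | (intra_pkg_offset * 2);
}
```

For 16 vCPUs split 8/8, `vnuma_pkg_shift(8)` is 4, so vCPU 9 (vnode 1, offset 1) gets ID 18, the same as the legacy 9 * 2. For 49 vCPUs split 25/24, `vnuma_pkg_shift(25)` is 6, so vCPU 25 (vnode 1, offset 0) gets ID 64 where the legacy formula would give 50.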
When the flag is unset (default, matches upstream behavior):
- recalculate_vnuma_topo() restores the legacy gates and silently
falls back to default CPUID topology for any NPOT or unbalanced
layout.
- guest_vcpu_x2apic_id() returns vcpu_id * 2 unconditionally.
- arch_domain_update_vnuma() skips the vlapic APIC ID refresh.
- XEN_DOMCTL_get_vcpu_apicids returns vcpu_id * 2 for each vCPU --
still callable, just not interesting.
The opt-in is mandatory because the new encoding can produce APIC
IDs that disagree with the `vcpu_id * 2` formula hardcoded in
existing toolstacks (libxl's acpi_lapic_id() in
tools/libs/light/libxl_x86_acpi.c) for NPOT layouts. Setting the
flag without sourcing MADT APIC IDs from XEN_DOMCTL_get_vcpu_apicids
produces a MADT inconsistent with Xen's vlapic state, hanging
secondary CPU bringup -- a deliberate contract: setting the flag
asserts the toolstack reads APIC IDs back from Xen.
The flag is rejected for non-HVM (PV) at createdomain time. PV does
not have CPUID 0xB or vlapic emulation, so the opt-in is meaningless
there.
For POT-balanced layouts the new encoding is bit-identical to
vcpu_id * 2 even when the flag is set, so opted-in toolstacks see a
change only for the layouts the legacy code couldn't represent at
all.
Because the APIC ID encoding is now derived from vNUMA (when opted
in), every guest-visible APIC ID interface must agree. Add
guest_vcpu_x2apic_id() as a single source of truth and route the
existing call sites through it (cpuid.c leaves 0x1 and 0xB, vlapic
set_x2apic_id() and vlapic_init()). Add vlapic_reinit_apic_id() and
call it from arch_domain_update_vnuma() (gated on the flag) so the
vlapic register state is refreshed once the policy is patched.
Live migration: opted-in domains can only be created on Xen builds
that know the flag, and migration to a build that does not know it
fails cleanly at createdomain (unknown CDF bit returns -EINVAL).
Non-opted-in domains use vcpu_id * 2 throughout and migrate as
today. Both cases avoid the silent topology mismatch the earlier
unconditional encoding would have produced cross-version.
Leaf 0x1F remains unsupported and is still deferred to the broader
topology rework described in the file-level TODOs.
Signed-off-by: Steven Noonan <steven@edera.dev>
Add a detection gate for the dom0 vNUMA topology work that follows.
Enables the gate when all of:
- dom0 is PVH (the only mode where we can expose SRAT/SLIT/MADT
we generate ourselves; PV dom0 sees host ACPI tables filtered
by pvh_acpi_xsdt_table_allowed())
- dom0 vCPUs are hard-pinned 1:1 to pCPUs (dom0_vcpus_pin=1)
- dom0's vCPU count equals num_present_cpus(), so every pCPU has a
corresponding dom0 vCPU
- The host has more than one NUMA node
Under these conditions the vNUMA layout follows directly from the
host's cpu_to_node() map -- no layout decisions to make. Relaxing
the constraints (partial pCPU coverage, unpinned dom0, PV dom0 via
host-table passthrough) is tracked separately.
When the gate passes, set XEN_DOMCTL_CDF_vnuma_apic_topology in
dom0_cfg so the existing per-domain opt-in mechanism applies to
dom0 too. No behavior change yet: d->vnuma is empty for dom0, so
the gated code paths in recalculate_vnuma_topo(),
guest_vcpu_x2apic_id() and arch_domain_update_vnuma() still
short-circuit. Subsequent commits populate the vNUMA layout, emit
SRAT/SLIT, fix MADT APIC IDs, and bind dom0 memory per-node.
Signed-off-by: Steven Noonan <steven@edera.dev>
Build a vNUMA layout for dom0 derived from the host's physical NUMA topology, and install it via the existing per-domain vnuma_info infrastructure. Runs only when the detection gate from the previous commit caused XEN_DOMCTL_CDF_vnuma_apic_topology to be set on dom0; otherwise the helper short-circuits.
Under the first-pass constraint (PVH dom0, hard 1:1 vCPU pinning, dom0 vCPU count == pCPU count), every dom0 vCPU N is pinned to pCPU N, so cpu_to_node(N) gives the host node hosting that vCPU. The set of vnodes is exactly the set of physical nodes the host has; the vnode-to-pnode mapping is the identity over that set; the distance matrix is sliced directly from the host SLIT via __node_distance(). No layout decisions to make.
vmemrange entries are emitted one per (E820_RAM region, vnode) pair, splitting each RAM region equally across vnodes. With the constraint above each vnode has the same proportional vCPU share, so equal memory share is the natural default.
Installed under d->vnuma_rwlock and followed by arch_domain_update_vnuma() to recalculate the CPUID topology policy. init_dom0_cpuid_policy() ran earlier (setup.c) with empty d->vnuma; this recalc overwrites those values with the vNUMA-aware ones.
Call site is between pvh_init_p2m() (which builds d->arch.e820) and pvh_populate_p2m() (which will use the layout in a later phase to drive per-node memory allocation). Also before pvh_setup_acpi() so MADT/SRAT/SLIT generation in subsequent phases can read from d->vnuma.
vnuma_alloc() is exported (was static in common/domctl.c) so the dom0 builder can construct vnuma_info directly without going through the domctl path. arch_domain_update_vnuma() gets a real header declaration (was previously only __weak-defined with no header).
By itself this commit makes CPUID 0xB topology correct for dom0 (and the AMD 0x80000008 / Intel leaf 4 LLC patches that recalculate_vnuma_topo() applies) but does not yet emit SRAT/SLIT or bind dom0 memory to the right physical nodes.
Signed-off-by: Steven Noonan <steven@edera.dev>
pvh_setup_acpi_madt() hardcoded each x2APIC processor entry's local_apic_id as `i * 2`, the legacy encoding that predates vNUMA APIC ID rewriting. Once dom0 has a multi-vnode topology installed (previous commit) and is opted into the vNUMA-derived APIC ID encoding via XEN_DOMCTL_CDF_vnuma_apic_topology, the values from guest_vcpu_x2apic_id() can diverge from `i * 2` -- and the MADT must agree with what Xen's vlapic emulation and CPUID 0xB return, or secondary CPU bringup hits the same MADT-vs-vlapic inconsistency that the corresponding fix for guest MADT generation already addressed in the toolstack.
Replace the hardcoded formula with guest_vcpu_x2apic_id(d, i). When dom0 has no vNUMA topology installed (or the CDF flag is unset), the helper returns `vcpu_id * 2`, preserving the previous behavior bit-for-bit.
For the first-pass dom0 vNUMA constraint (PVH dom0, 1:1 vCPU pinning, dom0 vCPU count == pCPU count) on hosts with power-of-two per-node pCPU counts -- e.g. typical EPYC/Intel NPS configurations -- the new encoding is also bit-identical to `i * 2`. The change becomes observable when the host has non-power-of-two per-node counts (uncommon, but supported by the guest_vcpu_x2apic_id encoding).
Signed-off-by: Steven Noonan <steven@edera.dev>
When dom0 has a vNUMA topology installed (previous commit), generate an ACPI System Resource Affinity Table that describes it: one Processor Local x2APIC Affinity entry per vCPU (proximity_domain = vcpu_to_vnode[i], apic_id from guest_vcpu_x2apic_id()) and one Memory Affinity entry per vmemrange (proximity_domain = nid, base/length from the range).
When dom0 has no vNUMA (single-node host, detection gate unmet, etc.) pvh_setup_acpi_srat() returns with *addr = 0 and the SRAT is omitted from the XSDT.
Extend pvh_setup_acpi_xsdt() to take an additional optional SRAT table address and include it in the table list when non-zero. Size accounting accommodates the extra slot when SRAT is present.
The native host SRAT (if any) is intentionally filtered out -- pvh_acpi_table_allowed() does not include ACPI_SIG_SRAT in its allowlist, so dom0 only sees our generated table. This is correct: the host SRAT describes the host's NUMA topology, but dom0's vNUMA layout (vnode indexing, memory ranges in dom0's GPA space) is a different namespace that the host SRAT cannot represent.
By itself this commit gives dom0 a SRAT but no SLIT, so Linux will use default distance values (10 local / 20 remote regardless of actual host topology). The matching SLIT generation comes in the next commit.
Signed-off-by: Steven Noonan <steven@edera.dev>
When dom0 has a vNUMA topology installed, generate an ACPI System Locality Distance Information Table from d->vnuma->vdistance. The matrix was sliced from the host's __node_distance() in an earlier commit, so the values exposed to dom0 reflect actual host inter-node latency characteristics rather than the default 10/20 fallback Linux substitutes when SLIT is absent.
When dom0 has no vNUMA (single-node host, detection gate unmet, etc.) pvh_setup_acpi_slit() returns *addr = 0 and the SLIT is omitted from the XSDT.
SLIT entries are u8; vdistance is unsigned int. Clamp at 254 because ACPI defines a SLIT value of 255 as "unreachable". In practice host SLIT distances are always well within u8 range (typical values are 10-32), so the clamp is defensive rather than expected to trigger.
Extend pvh_setup_acpi_xsdt() to take an additional optional SLIT table address, matching the pattern just added for SRAT.
After this commit dom0 will see both SRAT (per-node CPU and memory affinity) and SLIT (inter-node distance matrix). Linux's NUMA scheduler can now make distance-aware decisions. Memory placement on the physical nodes still needs memory allocator integration -- until then, the SRAT describes a topology that's only partially honored by the underlying allocator.
Signed-off-by: Steven Noonan <steven@edera.dev>
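The clamp described in this commit is small enough to sketch in isolation. `slit_entry_sketch()` is a hypothetical name used only for illustration; the real table-building code is in the patch itself.

```c
#include <stdint.h>

/* Largest distance emitted: 255 is kept out of the table by clamping below it. */
#define SLIT_MAX_DISTANCE 254u

/* Narrow an unsigned int vdistance value to a u8 SLIT entry, clamping at 254. */
static uint8_t slit_entry_sketch(unsigned int vdistance)
{
    return (uint8_t)(vdistance > SLIT_MAX_DISTANCE ? SLIT_MAX_DISTANCE
                                                   : vdistance);
}
```

Typical distances such as 10 or 32 pass through unchanged; only out-of-range inputs are altered.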
After dom0_setup_vnuma() installs a vNUMA topology and pvh_setup_acpi_srat() publishes it to dom0, the actual page allocations still went through the generic dom0_memflags path, which had no awareness of which vmemrange each GPA belonged to. Result: SRAT advertised one layout, real memory landed wherever the heap allocator felt like.
Per-allocation, look up the vmemrange covering the target GPA, translate its virtual node id through vnode_to_pnode[], and OR MEMF_node(pnode) into the allocation flags. Combined with the MEMF_exact_node already in dom0_memflags this is a strict bind.
If a strict per-node allocation fails at order 0, fall back by disabling node binding for the rest of dom0 construction (with a warning) before falling further back to dropping dom0_memflags entirely. The SRAT will then diverge from physical placement, which is a real degradation, but booting dom0 at all wins over a clean topology.
Signed-off-by: Steven Noonan <steven@edera.dev>
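The per-allocation lookup can be illustrated with a minimal sketch. The struct layout, the `NO_PNODE` sentinel and the function name are assumptions made for this example; only the steps (find the covering vmemrange, then translate its nid through vnode_to_pnode[]) come from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified vmemrange: [start, end) in guest physical space. */
struct vmemrange_sketch {
    uint64_t start;
    uint64_t end;
    unsigned int nid;   /* virtual node id owning this range */
};

#define NO_PNODE (~0u)  /* assumed sentinel: GPA not covered by any range */

/* Map a GPA to the physical node that should back it, or NO_PNODE. */
static unsigned int gpa_to_pnode_sketch(const struct vmemrange_sketch *ranges,
                                        size_t nr_ranges,
                                        const unsigned int *vnode_to_pnode,
                                        uint64_t gpa)
{
    size_t i;

    for (i = 0; i < nr_ranges; i++)
        if (gpa >= ranges[i].start && gpa < ranges[i].end)
            return vnode_to_pnode[ranges[i].nid];

    return NO_PNODE;
}
```

In the commit, the resulting pnode is what gets ORed into the allocation flags via MEMF_node(pnode); the sketch stops at the lookup and leaves the flag handling and the fallback path out.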
Add numa_get_nr_memblks() and numa_get_memblk() to read out the (start, end, nid) triples Xen built from SRAT or device tree at boot. The data is already there in node_memblk_range[] / memblk_nodeid[] but those have been static to common/numa.c; expose them so callers that need to synthesise per-node memory information for a guest -- starting with dom0 vNUMA SRAT generation -- can iterate the canonical layout rather than re-deriving node ownership from less reliable sources (e.g. mfn_to_nid page walks, or intersections with the guest E820). Signed-off-by: Steven Noonan <steven@edera.dev>
The original dom0_setup_vnuma() emitted one vmemrange per (E820 RAM
region, vnode) pair by splitting each region equally across vnodes.
That was wrong on two counts:
1. It lied about node ownership. Every host RAM region physically lives
on exactly one NUMA node; chopping it across vnodes claimed memory
sat on nodes that couldn't host it. Per-page MEMF_node allocation
(driven by dom0_gpa_to_pnode -> vmemrange[]) then disagreed with the
SRAT, defeating the topology guarantee.
2. It under-covered Linux's RAM view. PVH dom0's guest E820 is host-
shaped (XENMEM_memory_map returns d->arch.e820, which mirrors the
full host BIOS E820 even when dom0_mem trims actual ownership), and
Linux's numa_register_memblks() rejects an SRAT whose memory affinity
entries don't fully cover its memblock.memory. Rejection falls back
to a faked single-node layout -- exactly what we observed in dom0
dmesg ("NUMA: no nodes coverage for 170175MB of 261441MB RAM").
Rewrite the builder to emit one vmemrange per physical NUMA memblk
(numa_get_memblk()), filtered to nodes dom0's vCPUs actually span.
This guarantees:
- full coverage of the host physical RAM layout, so the SRAT/Linux
coverage check passes;
- one vmemrange per (node, contiguous physical range) tuple, so
dom0_gpa_to_pnode hands MEMF_node the same node the SRAT advertises;
- no more equal-split lie about ownership.
No other functional change.
Signed-off-by: Steven Noonan <steven@edera.dev>
Xen renumbers PXMs in SRAT memory-table order during boot, so PXM 1 may
become Xen internal node id 3 (etc.). dom0_setup_vnuma() was using
those internal ids as the vnode index, which then leaked into the SRAT
proximity_domain and the APIC ID encoding ((vnode << pkg_shift) | ...).
The result: vCPU 16, pinned to pCPU 16 (host PXM 1), was reported inside
dom0 as "socket 3" because Xen had renumbered PXM 1 to node 3. Tools
running in dom0 (numactl, lscpu, /proc/cpuinfo) disagreed with the host
about proximity numbering, and any cross-layer correlation (e.g.
"numactl -N 1" expecting PXM 1's CPUs) ended up on the wrong socket.
Index pnodes[] by host PXM instead of by a dense allocator-local id, so
vcpu_to_vnode[] and vmemrange[].nid store PXMs directly. The rest of
the SRAT/SLIT/CPUID emission path consumes those values verbatim --
proximity_domain in SRAT entries, the high bits of the synthesised
APIC ID, the SLIT matrix index -- and now agrees with the host.
vnode_to_pnode[] still maps PXM -> Xen internal node id (used for
__node_distance and for MEMF_node in dom0_gpa_to_pnode), so internal
allocation logic is unchanged. Firmware-numbering gaps (e.g. host PXM
set {0, 2, 5}) become empty vnode slots that Linux treats as offline
proximity domains; the SLIT fills the corresponding rows/columns with
the standard {10 on diagonal, 20 off-diagonal} placeholder rather than
zeros.
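How gaps in the PXM numbering end up as placeholder rows and columns can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes distances for present nodes have already been narrowed to u8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill an n x n SLIT matrix indexed by PXM (illustrative only).
 * present[i] : whether PXM i is backed by a real host node
 * dist       : n x n distances, meaningful only where both PXMs are present
 * Empty slots get the standard placeholder: 10 on the diagonal, 20 elsewhere.
 */
static void fill_slit_sketch(uint8_t *slit, size_t n, const bool *present,
                             const uint8_t *dist)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            if (present[i] && present[j])
                slit[i * n + j] = dist[i * n + j];
            else
                slit[i * n + j] = (i == j) ? 10 : 20;
        }
}
```

With a host PXM set of {0, 2}, slot 1 is a gap, so row 1 and column 1 carry only placeholder values while the (0, 2) and (2, 0) entries keep the real distances.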
Guest vNUMA path (toolstack-driven XEN_DOMCTL_setvnumainfo) is
untouched -- guests continue to use dense vnode numbering set by the
toolstack, since they have no host PXM to align with.
Signed-off-by: Steven Noonan <steven@edera.dev>
…ions
pvh_setup_e820() previously trimmed dom0's e820 by walking the host e820 in address order, keeping RAM until cur_pages == nr_pages, then marking the remainder UNUSABLE. On multi-socket hosts the host's RAM regions are typically grouped by NUMA node in address order, so this trim piled all of dom0's memory onto whichever nodes own the lowest physical addresses. With dom0_mem=35% on an 8-node host that meant dom0 got 100% of nodes 0-2's RAM and 0 bytes on the other 5 nodes.
When dom0 vNUMA is enabled, the per-node MEMF_node bindings in pvh_populate_memory_range() then had no high-address RAM regions to populate -- the topology was correct, the bindings were correct, they just had nothing to allocate against because pvh_setup_e820() had already turned the relevant address ranges into UNUSABLE.
Replace the first-fit trim with a two-pass proportional trim: pass 1 sums total host E820_RAM, pass 2 gives each RAM region floor(region_pages * nr_pages / total_ram_pages) RAM pages and marks the rest UNUSABLE, with Bresenham-style remainder accumulation guaranteeing the final RAM total equals nr_pages exactly.
The result on the 8-node host above: each host RAM region keeps ~35% of its size as RAM, so dom0_mem is naturally spread across all nodes the host actually has. Non-vNUMA PVH dom0s benefit too -- memory bandwidth is spread across all host nodes even without explicit per-node binding, instead of being concentrated on whichever nodes happen to live at low addresses.
Signed-off-by: Steven Noonan <steven@edera.dev>
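The two-pass proportional trim can be sketched on plain arrays of page counts. This is an illustration under stated assumptions, not the patched pvh_setup_e820(): the function name is hypothetical, regions are reduced to page counts, and region_pages * nr_pages is assumed not to overflow 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pass 1 sums the RAM regions; pass 2 keeps floor(region * nr_pages / total)
 * pages per region and carries the division remainders forward, adding one
 * page whenever the accumulated remainder reaches a whole page's worth.
 */
static void proportional_trim_sketch(const uint64_t *region_pages,
                                     uint64_t *keep_pages, size_t nr_regions,
                                     uint64_t nr_pages)
{
    uint64_t total = 0, rem = 0;
    size_t i;

    for (i = 0; i < nr_regions; i++)
        total += region_pages[i];

    assert(total > 0 && nr_pages <= total);

    for (i = 0; i < nr_regions; i++) {
        uint64_t prod = region_pages[i] * nr_pages;

        keep_pages[i] = prod / total;
        rem += prod % total;
        if (rem >= total) {          /* remainder carry: one extra page */
            keep_pages[i]++;
            rem -= total;
        }
    }
}
```

For regions of 100, 200 and 300 pages with nr_pages = 100, the kept counts are 16, 34 and 50: the second region absorbs the carried remainder, and the sum is exactly 100.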
dom0 backends (netback, blkback, gntdev) map grant pages with no visibility into which host NUMA node the underlying frame lives on. In PVH dom0 the struct page covering a grant-mapped foreign MFN reports NUMA_NO_NODE, so kthreads, IRQs, and per-request allocations land wherever the scheduler happened to put them -- typically all on one node, regardless of the guest's vNUMA placement.
Expose the answer Xen already has internally. XENMEM_get_mfn_pxms takes a batch of host MFNs and returns the firmware proximity-domain identifier (host PXM on x86 ACPI) of the node each MFN lives on. PXM is the value space dom0's SRAT already uses, so the dom0-side caller can map results to its own Linux node ids with the standard pxm_to_node() lookup. Returning Xen's internal nid would force every caller to maintain its own Xen-nid -> dom0-node table.
Three NUMA identifier namespaces exist around this code path: host PXM (firmware), Xen-internal nid (assigned in SRAT scan order), and dom0 Linux node id. Standardising the public ABI on PXM means the translation chain at any consumer is a single pxm_to_node() call and matches what dom0's own SRAT contains.
The op is hardware-domain-only. No other domain has a legitimate need for host MFN -> PXM mapping, and the answer would leak information about the physical topology to untrusted guests.
Each call is bounded at 1024 MFNs to keep the per-call buffer under 16 KiB; callers needing more should batch. Invalid MFNs (out of range, no frame-table entry) get XEN_INVALID_NUMA_ID in the output slot rather than failing the whole batch.
The op lives in arch_memory_op because numa_node_to_arch_nid is x86-specific. Building Xen without CONFIG_NUMA compiles the case out; the dispatcher then returns -ENOSYS, which the dom0-side caller handles by falling back to NUMA-oblivious behaviour.
The subop number is 40. XENMEM subops occupy six bits in the cmd word (see MEMOP_EXTENT_SHIFT in xen/hypercall.h), so the 2xxx convention Edera uses for downstream DOMCTL/SYSCTL ops can't apply; 40 leaves 28..39 free for upstream additions and 40..63 for further Edera customs.
Signed-off-by: Steven Noonan <steven@edera.dev>
The non-power-of-two vCPU count CPUID patching is controlled via an opt-in domain creation flag, XEN_DOMCTL_CDF_vnuma_apic_topology. Because that patching modifies APIC IDs to match the topology, the toolstack must populate the MADT with the Xen-assigned APIC IDs retrieved via XEN_DOMCTL_get_vcpu_apicids. This change is a dependency for NPOT support in https://github.com/edera-dev/protect/pull/2569
Additionally, we now give PVH dom0s a real vNUMA configuration when booted with dom0_vcpus_pin=1.