vNUMA: implement CPUID patch for NPOT domU vCPU counts, implement PVH dom0 vNUMA #12

Draft
tycho wants to merge 13 commits into edera/4.21 from steven/vnuma-cpuid-topology-v2

tycho commented May 18, 2026

The non-power-of-two (NPOT) vCPU count CPUID patching is controlled via an opt-in domain creation flag, XEN_DOMCTL_CDF_vnuma_apic_topology. Because the patching changes APIC IDs to match the topology, the toolstack must populate the MADT with the Xen-assigned APIC IDs retrieved via XEN_DOMCTL_get_vcpu_apicids. This change is a dependency for NPOT support in https://github.com/edera-dev/protect/pull/2569

Additionally, we now give PVH dom0s a real vNUMA configuration when booted with dom0_vcpus_pin=1.

@tycho tycho force-pushed the steven/vnuma-cpuid-topology-v2 branch from ea043ff to 3357a12 Compare May 19, 2026 01:07
@tycho tycho changed the title from "vNUMA: implement support for generated CPUID topology with NPOT vCPU counts" to "vNUMA: implement CPUID patch for NPOT domU vCPU counts, implement PVH dom0 vNUMA" May 19, 2026
@tycho tycho force-pushed the steven/vnuma-cpuid-topology-v2 branch from 3357a12 to 78a599b Compare May 19, 2026 23:54
tycho added 13 commits May 20, 2026 13:36
Expose the per-vCPU x2APIC identifiers computed by Xen via a new
domctl so toolstacks can populate ACPI MADT processor entries with
the same APIC IDs Xen reports via CPUID 0xB and the vlapic state.

This matters when vNUMA encoding produces non-trivial APIC IDs --
i.e. multi-vnode layouts with non-power-of-two or unbalanced
per-vnode vCPU counts, where guest_vcpu_x2apic_id() returns
(vnode_index << pkg_shift) | (intra_pkg_offset * 2) rather than
the legacy vcpu_id * 2.  For POT-balanced and single-vnode layouts
the returned values are bit-identical to the legacy encoding, so
callers can transparently use this domctl for all layouts without
special-casing.

Toolstacks that hardcode vcpu_id * 2 in MADT (libxl's
acpi_lapic_id() in tools/libs/light/libxl_x86_acpi.c, and the
equivalent in Protect's PVH ACPI generator) produce APIC IDs that
disagree with Xen's vlapic state for NPOT vNUMA layouts -- INIT/SIPI
fails to reach the intended vCPU because Xen's APIC delivery does
not find a vCPU matching the MADT-advertised ID, and secondary CPU
bringup hangs.  This domctl is the fix: toolstacks call it after
XEN_DOMCTL_setvnumainfo and use the returned values to populate
MADT entries.

Buffer sizing follows the XEN_DOMCTL_get_vcpu_msrs convention: a
NULL handle (or nr_vcpus == 0) is a capacity query; an insufficient
buffer returns -ENOBUFS with nr_filled set to the required size.
Older Xen builds return -ENOSYS, which callers may use as a
capability probe.
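
The sizing convention above can be modelled in isolation. This is a minimal sketch, not the actual hypercall handler: the function name `copy_apicids`, its signature, and the assumption that a capacity query returns 0 are all illustrative and do not come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the sizing convention (not the real handler).
 * src/nr_src: the APIC IDs held for the domain.
 * buf/nr_buf: the caller's buffer; buf == NULL or nr_buf == 0 is a query.
 * nr_filled:  out: entries written, or the required capacity.
 */
static int copy_apicids(const uint32_t *src, uint32_t nr_src,
                        uint32_t *buf, uint32_t nr_buf,
                        uint32_t *nr_filled)
{
    assert(src != NULL && nr_filled != NULL);

    /* Capacity query: report the required size, copy nothing. */
    if (buf == NULL || nr_buf == 0) {
        *nr_filled = nr_src;
        return 0; /* assumed success code for a query */
    }

    /* Buffer too small: report the required size and fail. */
    if (nr_buf < nr_src) {
        *nr_filled = nr_src;
        return -ENOBUFS;
    }

    for (uint32_t i = 0; i < nr_src; i++)
        buf[i] = src[i];
    *nr_filled = nr_src;
    return 0;
}
```

A caller would typically query first, allocate `nr_filled` entries, then call again with the real buffer.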

Adding a new subop is additive and does not change any existing
struct layout, so XEN_DOMCTL_INTERFACE_VERSION is not bumped.
Callers identify support via the -ENOSYS pattern rather than the
version number, avoiding gratuitous compatibility breakage for
clients (xl, libxl, libxc) built against earlier Xen headers.

FLASK uses DOMAIN__GETVCPUCONTEXT for permission, matching the
read-only "get vcpu state" pattern of get_vcpu_msrs and
getvcpucontext.

Signed-off-by: Steven Noonan <steven@edera.dev>
recalculate_vnuma_topo() previously bailed out silently for any vNUMA
layout where d->max_vcpus was not divisible by nr_vnodes or where the
resulting vcpus_per_node was not a power of two.  The guest then saw
default (zeroed) topology leaves while CPUID 0x80000008 ECX (AMD) and
leaf 4 LLC (Intel) leaked through the host's package size.  Common
toolstack-produced layouts -- e.g. 48 vCPUs split 24/24, or 49 split
25/24 -- triggered this fallback.

Add XEN_DOMCTL_CDF_vnuma_apic_topology, a domain-creation opt-in
that enables a vNUMA-derived APIC ID encoding capable of expressing
NPOT and unbalanced layouts.  When set:

  - recalculate_vnuma_topo() walks vcpu_to_vnode[] to determine
    max_per_node (the largest vnode's vCPU count), no longer
    requiring uniform distribution.
  - pkg_shift = fls(2 * max_per_node - 1) reserves a power-of-2 APIC
    ID window per package big enough for the largest vnode; smaller
    vnodes leave the tail of their window unused.
  - x2APIC IDs become (vnode_index << pkg_shift) | (intra_pkg_offset
    * 2) rather than the legacy vcpu_id * 2, so the package boundary
    falls on a clean bit even when consecutive vCPU IDs would
    otherwise span packages.
  - Advertised counts (leaf 0xB EBX, leaf 4 LLC, AMD 0x80000008 NC)
    use max_per_node; unbalanced packages just look underfilled.
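
The encoding in the list above can be worked through with a small standalone sketch. The helper names (`fls32`, `pkg_shift_for`, `vnuma_x2apic_id`) are hypothetical and only mirror the formulas quoted in this message; they are not the functions the patch adds.

```c
#include <assert.h>
#include <stdint.h>

/* Find last set bit, 1-based; returns 0 for x == 0. */
static unsigned int fls32(uint32_t x)
{
    unsigned int r = 0;

    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

/* Width of the per-package APIC ID window sized for the largest vnode. */
static unsigned int pkg_shift_for(unsigned int max_per_node)
{
    assert(max_per_node > 0);
    return fls32(2 * max_per_node - 1);
}

/* x2APIC ID of the vCPU at intra_pkg_offset within vnode_index. */
static uint32_t vnuma_x2apic_id(unsigned int vnode_index,
                                unsigned int intra_pkg_offset,
                                unsigned int pkg_shift)
{
    return ((uint32_t)vnode_index << pkg_shift) | (intra_pkg_offset * 2);
}
```

For 48 vCPUs split 24/24, max_per_node is 24, so pkg_shift = fls(47) = 6 and the first vCPU of vnode 1 (vCPU 24) gets APIC ID 64 where the legacy formula gives 48. For 32 vCPUs split 16/16, pkg_shift = fls(31) = 5 and every ID equals vcpu_id * 2.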

When the flag is unset (default, matches upstream behavior):

  - recalculate_vnuma_topo() restores the legacy gates and silently
    falls back to default CPUID topology for any NPOT or unbalanced
    layout.
  - guest_vcpu_x2apic_id() returns vcpu_id * 2 unconditionally.
  - arch_domain_update_vnuma() skips the vlapic APIC ID refresh.
  - XEN_DOMCTL_get_vcpu_apicids returns vcpu_id * 2 for each vCPU --
    still callable, just not interesting.

The opt-in is mandatory because the new encoding can produce APIC
IDs that disagree with the `vcpu_id * 2` formula hardcoded in
existing toolstacks (libxl's acpi_lapic_id() in
tools/libs/light/libxl_x86_acpi.c) for NPOT layouts.  Setting the
flag without sourcing MADT APIC IDs from XEN_DOMCTL_get_vcpu_apicids
produces a MADT inconsistent with Xen's vlapic state, hanging
secondary CPU bringup -- a deliberate contract: setting the flag
asserts the toolstack reads APIC IDs back from Xen.

The flag is rejected for non-HVM (PV) at createdomain time.  PV does
not have CPUID 0xB or vlapic emulation, so the opt-in is meaningless
there.

For POT-balanced layouts the new encoding is bit-identical to
vcpu_id * 2 even when the flag is set, so opted-in toolstacks see a
change only for the layouts the legacy code couldn't represent at
all.

Because the APIC ID encoding is now derived from vNUMA (when opted
in), every guest-visible APIC ID interface must agree.  Add
guest_vcpu_x2apic_id() as a single source of truth and route the
existing call sites through it (cpuid.c leaves 0x1 and 0xB, vlapic
set_x2apic_id() and vlapic_init()).  Add vlapic_reinit_apic_id() and
call it from arch_domain_update_vnuma() (gated on the flag) so the
vlapic register state is refreshed once the policy is patched.

Live migration: opted-in domains can only be created on Xen builds
that know the flag, and migration to a build that does not know it
fails cleanly at createdomain (unknown CDF bit returns -EINVAL).
Non-opted-in domains use vcpu_id * 2 throughout and migrate as
today.  Both cases avoid the silent topology mismatch the earlier
unconditional encoding would have produced cross-version.

Leaf 0x1F remains unsupported and is still deferred to the broader
topology rework described in the file-level TODOs.

Signed-off-by: Steven Noonan <steven@edera.dev>
Add a detection gate for the dom0 vNUMA topology work that follows.
Enables the gate when all of:

  - dom0 is PVH (the only mode where we can expose SRAT/SLIT/MADT
    we generate ourselves; PV dom0 reads the host ACPI tables
    directly, without the pvh_acpi_xsdt_table_allowed() filtering)
  - dom0 vCPUs are hard-pinned 1:1 to pCPUs (dom0_vcpus_pin=1)
  - dom0's vCPU count equals num_present_cpus(), so every pCPU has a
    corresponding dom0 vCPU
  - The host has more than one NUMA node
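
As a rough illustration, the four conditions reduce to a single predicate. The struct and function below are hypothetical stand-ins; in Xen the inputs come from boot-time state rather than a struct like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs to the gate. */
struct dom0_gate_inputs {
    bool is_pvh;              /* dom0 is built as PVH */
    bool vcpus_pinned;        /* dom0_vcpus_pin=1 */
    unsigned int nr_vcpus;    /* dom0 vCPU count */
    unsigned int nr_present;  /* value of num_present_cpus() */
    unsigned int nr_nodes;    /* host NUMA node count */
};

/* True only when every condition in the list holds. */
static bool dom0_vnuma_gate(const struct dom0_gate_inputs *in)
{
    assert(in != NULL);
    return in->is_pvh &&
           in->vcpus_pinned &&
           in->nr_vcpus == in->nr_present &&
           in->nr_nodes > 1;
}
```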

Under these conditions the vNUMA layout follows directly from the
host's cpu_to_node() map -- no layout decisions to make.  Relaxing
the constraints (partial pCPU coverage, unpinned dom0, PV dom0 via
host-table passthrough) is tracked separately.

When the gate passes, set XEN_DOMCTL_CDF_vnuma_apic_topology in
dom0_cfg so the existing per-domain opt-in mechanism applies to
dom0 too.  No behavior change yet: d->vnuma is empty for dom0, so
the gated code paths in recalculate_vnuma_topo(),
guest_vcpu_x2apic_id() and arch_domain_update_vnuma() still
short-circuit.  Subsequent commits populate the vNUMA layout, emit
SRAT/SLIT, fix MADT APIC IDs, and bind dom0 memory per-node.

Signed-off-by: Steven Noonan <steven@edera.dev>
Build a vNUMA layout for dom0 derived from the host's physical NUMA
topology, and install it via the existing per-domain vnuma_info
infrastructure.  Runs only when the detection gate from the previous
commit caused XEN_DOMCTL_CDF_vnuma_apic_topology to be set on dom0;
otherwise the helper short-circuits.

Under the first-pass constraint (PVH dom0, hard 1:1 vCPU pinning,
dom0 vCPU count == pCPU count), every dom0 vCPU N is pinned to pCPU
N, so cpu_to_node(N) gives the host node hosting that vCPU.  The
set of vnodes is exactly the set of physical nodes the host has;
the vnode-to-pnode mapping is the identity over that set; the
distance matrix is sliced directly from the host SLIT via
__node_distance().  No layout decisions to make.

vmemrange entries are emitted one per (E820_RAM region, vnode)
pair, splitting each RAM region equally across vnodes.  With the
constraint above each vnode has the same proportional vCPU share,
so equal memory share is the natural default.

Installed under d->vnuma_rwlock and followed by
arch_domain_update_vnuma() to recalculate the CPUID topology
policy.  init_dom0_cpuid_policy() ran earlier (setup.c) with empty
d->vnuma; this recalc overwrites those values with the vNUMA-aware
ones.

Call site is between pvh_init_p2m() (which builds d->arch.e820) and
pvh_populate_p2m() (which will use the layout in a later phase to
drive per-node memory allocation).  Also before pvh_setup_acpi() so
MADT/SRAT/SLIT generation in subsequent phases can read from
d->vnuma.

vnuma_alloc() is exported (was static in common/domctl.c) so the
dom0 builder can construct vnuma_info directly without going
through the domctl path.  arch_domain_update_vnuma() gets a real
header declaration (was previously only __weak-defined with no
header).

By itself this commit makes CPUID 0xB topology correct for dom0
(and the AMD 0x80000008 / Intel leaf 4 LLC patches that
recalculate_vnuma_topo() applies) but does not yet emit SRAT/SLIT
or bind dom0 memory to the right physical nodes.

Signed-off-by: Steven Noonan <steven@edera.dev>
pvh_setup_acpi_madt() hardcoded each x2APIC processor entry's
local_apic_id as `i * 2`, the legacy encoding that predates vNUMA
APIC ID rewriting.  Once dom0 has a multi-vnode topology installed
(previous commit) and is opted into the vNUMA-derived APIC ID
encoding via XEN_DOMCTL_CDF_vnuma_apic_topology, the values from
guest_vcpu_x2apic_id() can diverge from `i * 2` -- and the MADT
must agree with what Xen's vlapic emulation and CPUID 0xB return,
or secondary CPU bringup hits the same MADT-vs-vlapic inconsistency
that the corresponding fix for guest MADT generation already
addressed in the toolstack.

Replace the hardcoded formula with guest_vcpu_x2apic_id(d, i).
When dom0 has no vNUMA topology installed (or the CDF flag is
unset), the helper returns `vcpu_id * 2`, preserving the previous
behavior bit-for-bit.

For the first-pass dom0 vNUMA constraint (PVH dom0, 1:1 vCPU
pinning, dom0 vCPU count == pCPU count) on hosts with
power-of-two per-node pCPU counts -- e.g. typical EPYC/Intel
NPS configurations -- the new encoding is also bit-identical to
`i * 2`.  The change becomes observable when the host has
non-power-of-two per-node counts (uncommon, but supported by the
guest_vcpu_x2apic_id encoding).

Signed-off-by: Steven Noonan <steven@edera.dev>
When dom0 has a vNUMA topology installed (previous commit), generate
an ACPI System Resource Affinity Table that describes it: one
Processor Local x2APIC Affinity entry per vCPU (proximity_domain =
vcpu_to_vnode[i], apic_id from guest_vcpu_x2apic_id()) and one
Memory Affinity entry per vmemrange (proximity_domain = nid,
base/length from the range).
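
For reference, a minimal sketch of the two SRAT entry types described above, laid out per the ACPI specification. The struct and helper names are hypothetical (Xen has its own ACPI table definitions), the `packed` attribute assumes GCC or Clang, and passing the range as start/end is an assumption about how the vmemrange is stored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SRAT Processor Local x2APIC Affinity structure (type 2, 24 bytes). */
struct srat_x2apic_affinity {
    uint8_t  type;
    uint8_t  length;
    uint16_t reserved0;
    uint32_t proximity_domain;
    uint32_t apic_id;
    uint32_t flags;            /* bit 0: enabled */
    uint32_t clock_domain;
    uint32_t reserved1;
} __attribute__((packed));

/* SRAT Memory Affinity structure (type 1, 40 bytes). */
struct srat_mem_affinity {
    uint8_t  type;
    uint8_t  length;
    uint32_t proximity_domain;
    uint16_t reserved0;
    uint64_t base_address;
    uint64_t size;
    uint32_t reserved1;
    uint32_t flags;            /* bit 0: enabled */
    uint64_t reserved2;
} __attribute__((packed));

/* One entry per vCPU: proximity domain is the vCPU's vnode. */
static void srat_fill_cpu(struct srat_x2apic_affinity *e,
                          uint32_t vnode, uint32_t apic_id)
{
    memset(e, 0, sizeof(*e));
    e->type = 2;
    e->length = (uint8_t)sizeof(*e);
    e->proximity_domain = vnode;
    e->apic_id = apic_id;
    e->flags = 1;
}

/* One entry per vmemrange: [start, end) owned by node nid. */
static void srat_fill_mem(struct srat_mem_affinity *e, uint32_t nid,
                          uint64_t start, uint64_t end)
{
    assert(end > start);
    memset(e, 0, sizeof(*e));
    e->type = 1;
    e->length = (uint8_t)sizeof(*e);
    e->proximity_domain = nid;
    e->base_address = start;
    e->size = end - start;
    e->flags = 1;
}
```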

When dom0 has no vNUMA (single-node host, detection gate unmet,
etc.) pvh_setup_acpi_srat() returns with *addr = 0 and the SRAT
is omitted from the XSDT.

Extend pvh_setup_acpi_xsdt() to take an additional optional SRAT
table address and include it in the table list when non-zero.  Size
accounting accommodates the extra slot when SRAT is present.

The native host SRAT (if any) is intentionally filtered out --
pvh_acpi_table_allowed() does not include ACPI_SIG_SRAT in its
allowlist, so dom0 only sees our generated table.  This is correct:
the host SRAT describes the host's NUMA topology, but dom0's
vNUMA layout (vnode indexing, memory ranges in dom0's GPA space)
is a different namespace that the host SRAT cannot represent.

By itself this commit gives dom0 a SRAT but no SLIT, so Linux will
use default distance values (10 local / 20 remote regardless of
actual host topology).  The matching SLIT generation comes in the
next commit.

Signed-off-by: Steven Noonan <steven@edera.dev>
When dom0 has a vNUMA topology installed, generate an ACPI System
Locality Distance Information Table from d->vnuma->vdistance.  The
matrix was sliced from the host's __node_distance() in an earlier
commit, so the values exposed to dom0 reflect actual host inter-node
latency characteristics rather than the default 10/20 fallback Linux
substitutes when SLIT is absent.

When dom0 has no vNUMA (single-node host, detection gate unmet, etc.)
pvh_setup_acpi_slit() returns *addr = 0 and the SLIT is omitted from the
XSDT.

SLIT entries are u8; vdistance is unsigned int.  Clamp at 254 because
255 (0xFF) is the SLIT "unreachable" sentinel.  In practice host SLIT distances
are always well within u8 range (typical values are 10-32), so the clamp
is defensive rather than expected to trigger.
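
A minimal sketch of the clamp, assuming a row-major nr x nr matrix. `slit_clamp` and `slit_fill` are illustrative names, not the functions in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest distance emitted; 255 is a SLIT sentinel and is never written. */
#define SLIT_MAX_DISTANCE 254u

/* Narrow an unsigned int distance to a SLIT u8 entry. */
static uint8_t slit_clamp(unsigned int distance)
{
    return (uint8_t)(distance > SLIT_MAX_DISTANCE ? SLIT_MAX_DISTANCE
                                                  : distance);
}

/* Convert an nr x nr row-major distance matrix into SLIT entries. */
static void slit_fill(uint8_t *out, const unsigned int *vdistance, size_t nr)
{
    assert(out != NULL && vdistance != NULL);
    for (size_t i = 0; i < nr * nr; i++)
        out[i] = slit_clamp(vdistance[i]);
}
```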

Extend pvh_setup_acpi_xsdt() to take an additional optional SLIT table
address, matching the pattern just added for SRAT.

After this commit dom0 will see both SRAT (per-node CPU and memory
affinity) and SLIT (inter-node distance matrix).  Linux's NUMA scheduler
can now make distance-aware decisions.  Memory placement on the physical
nodes still needs memory allocator integration -- until then, the SRAT
describes a topology that's only partially honored by the underlying
allocator.

Signed-off-by: Steven Noonan <steven@edera.dev>
After dom0_setup_vnuma() installs a vNUMA topology and pvh_setup_acpi_srat()
publishes it to dom0, the actual page allocations still went through the
generic dom0_memflags path, which had no awareness of which vmemrange each
GPA belonged to.  Result: SRAT advertised one layout, real memory landed
wherever the heap allocator felt like.

Per-allocation, look up the vmemrange covering the target GPA, translate
its virtual node id through vnode_to_pnode[], and OR MEMF_node(pnode) into
the allocation flags.  Combined with the MEMF_exact_node already in
dom0_memflags this is a strict bind.
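
The lookup described above might look roughly like this. The struct, the `NO_NODE` marker, and the function name are hypothetical; the real code would OR MEMF_node(pnode) into the allocation flags rather than return the node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE 0xFFu  /* hypothetical "no covering range" marker */

/* Hypothetical vmemrange: GPA range [start, end) owned by vnode nid. */
struct vmemrange_sketch {
    uint64_t start;
    uint64_t end;
    unsigned int nid;
};

/* Find the range covering gpa and translate its vnode to a pnode. */
static unsigned int gpa_to_pnode(const struct vmemrange_sketch *r, size_t nr,
                                 const unsigned int *vnode_to_pnode,
                                 uint64_t gpa)
{
    assert(r != NULL && vnode_to_pnode != NULL);
    for (size_t i = 0; i < nr; i++)
        if (gpa >= r[i].start && gpa < r[i].end)
            return vnode_to_pnode[r[i].nid];
    return NO_NODE;
}
```

A linear scan is enough for a sketch because the number of vmemranges is small; a caller that sees `NO_NODE` would allocate without a node binding.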

If a strict per-node allocation fails at order 0, fall back by disabling
node binding for the rest of dom0 construction (with a warning) before
falling further back to dropping dom0_memflags entirely.  The SRAT will
then diverge from physical placement, which is a real degradation, but
booting dom0 at all wins over a clean topology.

Signed-off-by: Steven Noonan <steven@edera.dev>
Add numa_get_nr_memblks() and numa_get_memblk() to read out the
(start, end, nid) triples Xen built from SRAT or device tree at boot.
The data is already there in node_memblk_range[] / memblk_nodeid[] but
those have been static to common/numa.c; expose them so callers that
need to synthesise per-node memory information for a guest -- starting
with dom0 vNUMA SRAT generation -- can iterate the canonical layout
rather than re-deriving node ownership from less reliable sources
(e.g. mfn_to_nid page walks, or intersections with the guest E820).

Signed-off-by: Steven Noonan <steven@edera.dev>
The original dom0_setup_vnuma() emitted one vmemrange per (E820 RAM
region, vnode) pair by splitting each region equally across vnodes.
That was wrong on two counts:

1. It lied about node ownership.  Every host RAM region physically lives
   on exactly one NUMA node; chopping it across vnodes claimed memory
   sat on nodes that couldn't host it.  Per-page MEMF_node allocation
   (driven by dom0_gpa_to_pnode -> vmemrange[]) then disagreed with the
   SRAT, defeating the topology guarantee.

2. It under-covered Linux's RAM view.  PVH dom0's guest E820 is host-
   shaped (XENMEM_memory_map returns d->arch.e820, which mirrors the
   full host BIOS E820 even when dom0_mem trims actual ownership), and
   Linux's numa_register_memblks() rejects an SRAT whose memory affinity
   entries don't fully cover its memblock.memory.  Rejection falls back
   to a faked single-node layout -- exactly what we observed in dom0
   dmesg ("NUMA: no nodes coverage for 170175MB of 261441MB RAM").

Rewrite the builder to emit one vmemrange per physical NUMA memblk
(numa_get_memblk()), filtered to nodes dom0's vCPUs actually span.
This guarantees:
 - full coverage of the host physical RAM layout, so the SRAT/Linux
   coverage check passes;
 - one vmemrange per (node, contiguous physical range) tuple, so
   dom0_gpa_to_pnode hands MEMF_node the same node the SRAT advertises;
 - no more equal-split lie about ownership.

No other functional change.

Signed-off-by: Steven Noonan <steven@edera.dev>
Xen renumbers PXMs in SRAT memory-table order during boot, so PXM 1 may
become Xen internal node id 3 (etc.).  dom0_setup_vnuma() was using
those internal ids as the vnode index, which then leaked into the SRAT
proximity_domain and the APIC ID encoding ((vnode << pkg_shift) | ...).
The result: vCPU 16, pinned to pCPU 16 (host PXM 1), was reported inside
dom0 as "socket 3" because Xen had renumbered PXM 1 to node 3.  Tools
running in dom0 (numactl, lscpu, /proc/cpuinfo) disagreed with the host
about proximity numbering, and any cross-layer correlation (e.g.
"numactl -N 1" expecting PXM 1's CPUs) ended up on the wrong socket.

Index pnodes[] by host PXM instead of by a dense allocator-local id, so
vcpu_to_vnode[] and vmemrange[].nid store PXMs directly.  The rest of
the SRAT/SLIT/CPUID emission path consumes those values verbatim --
proximity_domain in SRAT entries, the high bits of the synthesised
APIC ID, the SLIT matrix index -- and now agrees with the host.

vnode_to_pnode[] still maps PXM -> Xen internal node id (used for
__node_distance and for MEMF_node in dom0_gpa_to_pnode), so internal
allocation logic is unchanged.  Firmware-numbering gaps (e.g. host PXM
set {0, 2, 5}) become empty vnode slots that Linux treats as offline
proximity domains; the SLIT fills the corresponding rows/columns with
the standard {10 on diagonal, 20 off-diagonal} placeholder rather than
zeros.
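
A small sketch of the gap handling, assuming a per-PXM presence array and a row-major distance matrix. Both inputs and the function name are hypothetical and only illustrate the placeholder rule stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLIT_LOCAL  10u  /* placeholder on the diagonal */
#define SLIT_REMOTE 20u  /* placeholder off the diagonal */

/*
 * SLIT entry for (i, j) in a PXM-indexed nr x nr matrix. Rows and
 * columns belonging to an absent PXM get the placeholder values.
 */
static uint8_t slit_entry(const bool *present, const uint8_t *dist,
                          size_t nr, size_t i, size_t j)
{
    assert(present != NULL && dist != NULL && i < nr && j < nr);
    if (!present[i] || !present[j])
        return (uint8_t)(i == j ? SLIT_LOCAL : SLIT_REMOTE);
    return dist[i * nr + j];
}
```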

Guest vNUMA path (toolstack-driven XEN_DOMCTL_setvnumainfo) is
untouched -- guests continue to use dense vnode numbering set by the
toolstack, since they have no host PXM to align with.

Signed-off-by: Steven Noonan <steven@edera.dev>
…ions

pvh_setup_e820() previously trimmed dom0's e820 by walking the host
e820 in address order, keeping RAM until cur_pages == nr_pages, then
marking the remainder UNUSABLE.  On multi-socket hosts the host's RAM
regions are typically grouped by NUMA node in address order, so this
trim piled all of dom0's memory onto whichever nodes own the lowest
physical addresses.  With dom0_mem=35% on an 8-node host that meant
dom0 got 100% of nodes 0-2's RAM and 0 bytes on the other 5 nodes.

When dom0 vNUMA is enabled, the per-node MEMF_node bindings in
pvh_populate_memory_range() then had no high-address RAM regions to
populate -- the topology was correct, the bindings were correct,
they just had nothing to allocate against because pvh_setup_e820()
had already turned the relevant address ranges into UNUSABLE.

Replace the first-fit trim with a two-pass proportional trim: pass 1
sums total host E820_RAM, pass 2 gives each RAM region
floor(region_pages * nr_pages / total_ram_pages) RAM pages and marks
the rest UNUSABLE, with Bresenham-style remainder accumulation
guaranteeing the final RAM total equals nr_pages exactly.  The result
on the 8-node host above: each host RAM region keeps ~35% of its
size as RAM, so dom0_mem is naturally spread across all nodes the
host actually has.
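
The two-pass trim can be sketched as follows. This is an illustration of the arithmetic only: the function name and array-based interface are hypothetical, and the sketch assumes region_pages[i] * nr_pages fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * region_pages[i]: size of host RAM region i, in pages.
 * keep_pages[i]:   out: pages of region i left as RAM for dom0.
 * nr_pages:        dom0's page budget (must not exceed total host RAM).
 */
static void proportional_trim(const uint64_t *region_pages,
                              uint64_t *keep_pages, size_t nr_regions,
                              uint64_t nr_pages)
{
    uint64_t total = 0, carry = 0;

    /* Pass 1: sum total host RAM. */
    for (size_t i = 0; i < nr_regions; i++)
        total += region_pages[i];
    assert(total > 0 && nr_pages <= total);

    /* Pass 2: floor share per region, carrying the remainder forward. */
    for (size_t i = 0; i < nr_regions; i++) {
        uint64_t num = region_pages[i] * nr_pages + carry;

        keep_pages[i] = num / total;
        carry = num % total;
    }

    /* The numerators sum to total * nr_pages, so nothing is left over. */
    assert(carry == 0);
}
```

With three 1000-page regions and a 1000-page budget, the regions keep 333, 333, and 334 pages: the carried remainder rounds the last region up so the total is exactly 1000.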

Non-vNUMA PVH dom0s benefit too -- memory bandwidth is spread across
all host nodes even without explicit per-node binding, instead of
being concentrated on whichever nodes happen to live at low
addresses.

Signed-off-by: Steven Noonan <steven@edera.dev>
dom0 backends (netback, blkback, gntdev) map grant pages with no
visibility into which host NUMA node the underlying frame lives on.
In PVH dom0 the struct page covering a grant-mapped foreign MFN
reports NUMA_NO_NODE, so kthreads, IRQs, and per-request allocations
land wherever the scheduler happened to put them -- typically all on
one node, regardless of the guest's vNUMA placement.

Expose the answer Xen already has internally.  XENMEM_get_mfn_pxms
takes a batch of host MFNs and returns the firmware proximity-domain
identifier (host PXM on x86 ACPI) of the node each MFN lives on.  PXM
is the value space dom0's SRAT already uses, so the dom0-side caller
can map results to its own Linux node ids with the standard
pxm_to_node() lookup.  Returning Xen's internal nid would force every
caller to maintain its own Xen-nid -> dom0-node table.

Three NUMA identifier namespaces exist around this code path: host
PXM (firmware), Xen-internal nid (assigned in SRAT scan order), and
dom0 Linux node id.  Standardising the public ABI on PXM means the
translation chain at any consumer is a single pxm_to_node() call and
matches what dom0's own SRAT contains.

The op is hardware-domain-only.  No other domain has a legitimate
need for host MFN -> PXM mapping, and the answer would leak
information about the physical topology to untrusted guests.  Each
call is bounded at 1024 MFNs to keep the per-call buffer under 16
KiB; callers needing more should batch.  Invalid MFNs (out of range,
no frame-table entry) get XEN_INVALID_NUMA_ID in the output slot
rather than failing the whole batch.
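
Caller-side batching against the 1024-MFN bound could be sketched like this. The callback type stands in for one hypercall invocation; its signature, the uint32_t result type, and every name other than the 1024 limit are assumptions. `record_call` is only a stub so the sketch can be exercised without a hypervisor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MFN_PXM_BATCH_MAX 1024u  /* per-call bound stated in the commit */

/* Hypothetical stand-in for one hypercall invocation. */
typedef int (*mfn_pxm_call_fn)(const uint64_t *mfns, uint32_t *pxms,
                               size_t nr, void *ctx);

/* Split nr MFNs into calls of at most MFN_PXM_BATCH_MAX entries. */
static int query_pxms_batched(const uint64_t *mfns, uint32_t *pxms,
                              size_t nr, mfn_pxm_call_fn call, void *ctx)
{
    assert(call != NULL);
    for (size_t done = 0; done < nr; ) {
        size_t chunk = nr - done;
        int rc;

        if (chunk > MFN_PXM_BATCH_MAX)
            chunk = MFN_PXM_BATCH_MAX;
        rc = call(mfns + done, pxms + done, chunk, ctx);
        if (rc)
            return rc;  /* stop at the first failing batch */
        done += chunk;
    }
    return 0;
}

/* Stub backend: records call count and largest batch, fills zeros. */
struct call_stats {
    size_t calls;
    size_t max_chunk;
};

static int record_call(const uint64_t *mfns, uint32_t *pxms,
                       size_t nr, void *ctx)
{
    struct call_stats *st = ctx;

    (void)mfns;
    st->calls++;
    if (nr > st->max_chunk)
        st->max_chunk = nr;
    for (size_t i = 0; i < nr; i++)
        pxms[i] = 0;
    return 0;
}
```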

The op lives in arch_memory_op because numa_node_to_arch_nid is
x86-specific.  Building Xen without CONFIG_NUMA compiles the case
out; the dispatcher then returns -ENOSYS, which the dom0-side caller
handles by falling back to NUMA-oblivious behaviour.

The subop number is 40.  XENMEM subops occupy six bits in the cmd
word (see MEMOP_EXTENT_SHIFT in xen/hypercall.h), so the 2xxx
convention Edera uses for downstream DOMCTL/SYSCTL ops can't apply;
40 leaves 28..39 free for upstream additions and reserves 41..63 for
further Edera customs.
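
The six-bit constraint can be checked with a short sketch. The macro values follow the six-bit layout described above; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Low six bits of the cmd word carry the subop; the rest is the extent. */
#define MEMOP_EXTENT_SHIFT 6
#define MEMOP_CMD_MASK     ((1u << MEMOP_EXTENT_SHIFT) - 1)

/* True when the subop number survives masking unchanged. */
static bool subop_fits(unsigned int subop)
{
    return (subop & MEMOP_CMD_MASK) == subop;
}

/* Pack a subop and a start extent into one cmd word. */
static unsigned long memop_cmd(unsigned int subop, unsigned long start_extent)
{
    assert(subop_fits(subop));
    return (start_extent << MEMOP_EXTENT_SHIFT) | subop;
}
```

Subop 40 fits in the mask, while a 2xxx-style number such as 2000 would be truncated to 16 and alias another subop.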

Signed-off-by: Steven Noonan <steven@edera.dev>
@tycho tycho force-pushed the steven/vnuma-cpuid-topology-v2 branch from 82136fb to 41536e4 Compare May 20, 2026 20:40