sys_mmap previously pread-snapshotted file contents into private guest
pages and emulated write-back via msync's pwrite-the-diff path. The
guest never saw concurrent host writes through its mapping, and its own
writes only landed on the file at msync time. Database WAL, lock files,
and any cross-process file-sharing protocol were silently broken.
Install a real host mmap(MAP_FIXED|MAP_SHARED, fd) overlay on top of the
guest slab so the kernel page cache keeps the mapping coherent with the
file (and with peer overlays of the same inode). HVF requires hv_vm_unmap
to target an exactly-previously-mapped range and rejects sub-range unmap
with HV_BAD_ARGUMENT, so the slab is now tracked as a sorted list of
2 MiB-aligned hvf_segment_t entries (guest.h). Each overlay request
splits the containing segment, hv_vm_unmaps it, lays down the host file
mmap at the exact host VA, and hv_vm_maps the segment back so HVF
re-walks the host page tables. Sibling vCPUs are quiesced via
thread_quiesce_siblings during the brief stage-2 window so concurrent
guest accesses cannot fault on the temporarily-unmapped IPA. HVF
resolves stage-2 at sub-2 MiB granularity within a mapped segment, so
a 4 KiB lock-file overlay inside a 2 MiB segment is honored without
per-page hv_vm_map calls (empirically verified during implementation).
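
The split step can be pictured as plain interval arithmetic. The sketch below is an illustration under stated assumptions, not the code in guest.h: seg_t, SEG_ALIGN and split_for_overlay are hypothetical names, the real hvf_segment_t layout is not shown in this message, and the hv_vm_unmap / hv_vm_map calls are left out. It cuts one segment into at most three pieces so that the middle piece is the smallest 2 MiB-aligned range containing the overlay.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_ALIGN (2ull << 20) /* 2 MiB, the segment alignment described above */

/* Hypothetical stand-in for hvf_segment_t: a half-open IPA range [start, end). */
typedef struct {
    uint64_t start;
    uint64_t end;
} seg_t;

/*
 * Split `seg` around the overlay [ov_start, ov_end).
 * out[] receives up to three pieces in ascending order and *mid is the index
 * of the piece that contains the overlay. Returns the number of pieces.
 */
size_t split_for_overlay(seg_t seg, uint64_t ov_start, uint64_t ov_end,
                         seg_t out[3], size_t *mid)
{
    assert(seg.start % SEG_ALIGN == 0 && seg.end % SEG_ALIGN == 0);
    assert(seg.start <= ov_start && ov_start < ov_end && ov_end <= seg.end);

    uint64_t lo = ov_start & ~(SEG_ALIGN - 1);                 /* round down */
    uint64_t hi = (ov_end + SEG_ALIGN - 1) & ~(SEG_ALIGN - 1); /* round up */

    size_t n = 0;
    if (seg.start < lo)
        out[n++] = (seg_t){ seg.start, lo };  /* untouched head */
    *mid = n;
    out[n++] = (seg_t){ lo, hi };             /* piece to unmap and re-map */
    if (hi < seg.end)
        out[n++] = (seg_t){ hi, seg.end };    /* untouched tail */
    return n;
}
```

Only the middle piece would need the unmap / overlay / re-map cycle; the head and tail pieces keep their existing mapping.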
Apple Silicon enforces 16 KiB host pages: mmap MAP_FIXED requires the
addr and offset to be 16 KiB-aligned. Two paths handle the gap: the
gap-finder hint advances to the next host-page boundary after each
allocation so sequential mmaps stay overlay-eligible, and find_free_gap_inner
aligns gap_start (and every advance past a walked region) to
host_page_size_cached() rather than the guest 4 KiB page, so an unaligned
addr-hint cannot return a result that lands inside a host page already
covered by another region's overlay tail. Misaligned MAP_FIXED requests
fall back to the snapshot pread path, so guest-supplied addresses that
the host kernel cannot honor still produce correct behaviour through the
legacy emulation.
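
The alignment rules above reduce to two small checks. This is a minimal sketch, assuming the host page size is passed in explicitly (the real code reads it through host_page_size_cached()); align_up, overlay_eligible and next_gap_start are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round v up to the next multiple of align (align must be a power of two). */
uint64_t align_up(uint64_t v, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (v + align - 1) & ~(align - 1);
}

/*
 * A MAP_FIXED file overlay is only possible when both the target address and
 * the file offset sit on a host-page boundary; anything else would take the
 * snapshot fallback.
 */
bool overlay_eligible(uint64_t addr, uint64_t file_off, uint64_t host_page)
{
    assert(host_page != 0 && (host_page & (host_page - 1)) == 0);
    return ((addr | file_off) & (host_page - 1)) == 0;
}

/* Next candidate gap start after a region ending at region_end. */
uint64_t next_gap_start(uint64_t region_end, uint64_t host_page)
{
    return align_up(region_end, host_page);
}
```

With a 16 KiB host page, a region ending at 0x5000 pushes the next gap to 0x8000, so the new mapping cannot share a host page with the previous region's overlay tail.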
guest_region_t carries new overlay_active / overlay_start / overlay_end
fields so msync collapses to a plain fsync for overlay regions (the
kernel page cache already keeps them coherent), the snapshot-style
refresh-from-file alias pass is skipped for overlay peers, and MADV_DONTNEED
is a no-op for overlay regions (the existing memset+pread reset would
have written zeros straight into the file via the overlay). The metadata
is clipped through every region split / trim site in src/core/guest.c so
the overlay bounds always match the host-page-aligned region bounds.
cleanup_overlays_in_range now returns int and propagates -EIO; metadata
is cleared only after per-overlay host-VA tear-down succeeds so a
partial failure does not leave the runtime believing an overlay is gone
while the host mmap is still live (which would otherwise let a later
memset write zeros into the user file).
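
The ordering constraint (tear down first, clear metadata second) can be sketched as follows. region_t is a hypothetical subset of guest_region_t, the tear-down hook stands in for hvf_remove_file_overlay, and continuing past a failed overlay rather than stopping at the first one is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of guest_region_t. */
typedef struct {
    bool     overlay_active;
    uint64_t overlay_start;
    uint64_t overlay_end;
} region_t;

/* Hypothetical tear-down hook: returns 0 on success, nonzero on failure. */
typedef int (*remove_overlay_fn)(uint64_t start, uint64_t end);

/*
 * Tear down every active overlay in regions[0..n).
 * Metadata is cleared only after the host-VA tear-down succeeded, so a failed
 * overlay stays marked active. Returns 0, or -EIO if any tear-down failed.
 */
int cleanup_overlays(region_t *regions, size_t n, remove_overlay_fn remove_overlay)
{
    int rc = 0;
    for (size_t i = 0; i < n; i++) {
        region_t *r = &regions[i];
        if (!r->overlay_active)
            continue;
        if (remove_overlay(r->overlay_start, r->overlay_end) != 0) {
            rc = -EIO; /* host mmap is still live: keep the flag set */
            continue;
        }
        r->overlay_active = false;
        r->overlay_start = r->overlay_end = 0;
    }
    return rc;
}

/* Test double for the tear-down hook: fails for ranges starting at FAIL_AT. */
#define FAIL_AT 0x8000u
int fake_remove_overlay(uint64_t start, uint64_t end)
{
    (void)end;
    return start == FAIL_AT ? -1 : 0;
}
```

Because a failed overlay keeps overlay_active set, later code that checks the flag still knows the range is backed by the file and avoids the memset path.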
The CoW fork path syncs each live overlay region back into shm_fd before
sending shm_fd over SCM_RIGHTS so the child's MAP_PRIVATE snapshot
reflects the parent's view at fork time, and the child's fork-state
restore demotes every inherited overlay flag to the snapshot path. The
sync-back loop now treats a pwrite that returns 0 with bytes still
pending as -EIO instead of spinning.
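
A write-all loop with that zero-progress guard might look like the sketch below. It is not the fork path's actual loop: pwrite_all and the injectable pwrite_fn are hypothetical, and the writer is passed in only so the no-progress case can be exercised without a real device.

```c
#define _DEFAULT_SOURCE 1 /* glibc: expose pwrite under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Same signature as pwrite(2). */
typedef ssize_t (*pwrite_fn)(int fd, const void *buf, size_t len, off_t off);

/* Write exactly len bytes at off. Returns 0 on success, negative errno on failure. */
int pwrite_all(pwrite_fn do_pwrite, int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = do_pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before writing anything: retry */
            return -errno;
        }
        if (n == 0)
            return -EIO;  /* no progress with bytes pending: fail, do not spin */
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Test double: a writer that never makes progress. */
ssize_t stuck_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    (void)fd; (void)buf; (void)len; (void)off;
    return 0;
}
```

Calling pwrite_all(pwrite, fd, buf, len, off) gives the normal behaviour; passing stuck_pwrite shows the loop terminating with -EIO.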
Snapshot buffers used by the FIXED replacement / mremap rollback paths
are now heap-allocated. An array of GUEST_MAX_REGIONS region_snapshot_t
entries is on the order of half a megabyte, and macOS secondary thread
stacks default to ~512 KiB, so the stack-allocated original would crash
any worker pthread that hit the path. A new dispose_region_snapshots
helper closes
any dup'd backing fds, frees the heap buffer, and zeros the caller's
pointer so a follow-on call is a no-op.
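
The dispose contract (close, free, null the caller's pointer) can be sketched like this. snapshot_t is a hypothetical cut-down of region_snapshot_t that keeps only a dup'd fd, and dispose_snapshots is an illustrative name, not the helper's real signature.

```c
#define _DEFAULT_SOURCE 1 /* glibc: expose POSIX calls under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical subset of region_snapshot_t. */
typedef struct {
    int backing_fd; /* dup'd backing fd, or -1 when the region has none */
} snapshot_t;

/* Close any dup'd fds, free the heap buffer, and NULL the caller's pointer. */
void dispose_snapshots(snapshot_t **snaps, size_t count)
{
    if (snaps == NULL || *snaps == NULL)
        return; /* already disposed: a follow-on call is a no-op */
    for (size_t i = 0; i < count; i++) {
        if ((*snaps)[i].backing_fd >= 0)
            close((*snaps)[i].backing_fd);
    }
    free(*snaps);
    *snaps = NULL;
}
```

Taking a pointer to the caller's pointer is what makes the second call harmless: every rollback path can call the helper unconditionally without risking a double free.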
Notes:
1. sys_mmap rollback when guest_region_add_ex_owned failed after the
host overlay succeeded left the file mmap'd at host_base+ipa with no
region tracking, so a later operation in that range would memset
zeros directly into the user's file; the failure path now calls
hvf_remove_file_overlay before returning -ENOMEM.
2. The final hv_vm_map failure path in hvf_apply_file_overlay restored
slab backing but never re-issued hv_vm_map for the segment, so
sibling vCPUs would page-fault on that IPA after thread_resume_siblings;
the rollback now re-establishes the segment before returning.
3. cleanup_overlays_in_range cleared overlay_active before calling
hvf_remove_file_overlay and ignored its failure; the helper is now
fallible and clears metadata only on per-overlay tear-down success.
4. fork_ipc_recv_memory_regions inherited overlay_active=true on every
shared region but the child never re-established the host-VA overlay;
the child now demotes every inherited overlay to the snapshot path.
5. find_free_gap_inner advanced gap_start with PAGE_ALIGN_UP (4 KiB),
not host_page_size_cached() (16 KiB on Apple Silicon), and the
initial gap_start = min_addr did not round up either; both now align
to the host page so a new mapping cannot land in a host page already
covered by an existing overlay.
6. The CoW fork sync-back loop never treated pwrite returning 0 with
len > 0 as failure and could spin forever; it now returns -EIO.
7. hvf_segment_split partial-failure recovery re-issued hv_vm_map for
the original segment while pieces[0..i-1] were still mapped, which
HVF rejects with HV_BAD_ARGUMENT; the recovery now unmaps those
pieces first, and a hard failure of the recovery hv_vm_map is logged
so post-mortem points at the right culprit.
8. sc_sync was wrapped in SC_LOCKED so the fsync loop ran while holding
mmap_lock, blocking concurrent guest mmap on every other thread for
the duration of the global flush; the locks are now taken inline only
for the brief snapshot phase and released before fsync. Under malloc
failure the bulk-dup path falls back to inline per-fd fsync instead
of silently no-opping, since Linux sync(2) is best-effort but is
still expected to initiate writeback.
9. sys_mremap MAYMOVE-grow path: read_file_range_to_guest failure after
cleanup_overlays_in_range tore down the source overlay used to leave
the source on slab backing (silent demotion of MAP_SHARED) and the
dest with phantom PTEs from the just-completed mremap_extend_range;
the failure path now restores the source overlay and invalidates the
dest PTEs.
10. sys_mremap MREMAP_FIXED path: read_file_range_to_guest failure
restored dest region metadata but never restored the destination
page tables; the rollback now also calls restore_snapshot_page_tables.
Sync handling: sc_sync was forwarding to host sync(2), which flushes
every dirty page system-wide. The slab is mmap'd MAP_SHARED to an
internal tempfile (g->shm_fd) for the CoW fork fast path, so a global
flush had to walk multi-GB of demand-paged dirty pages from that
tempfile plus the same from any other elfuse process on the host. In
practice this stalled make check for hundreds of seconds when prior
killed test runs had left stuck-in-uninterruptible busybox/elfuse
processes holding mmap'd tempfiles. sc_sync now iterates the guest
fd_table plus the live overlay regions, dups each target host fd under
the matching lock, and fsyncs them outside both locks so a slow disk
does not stall concurrent FD or memory operations on other threads. The
slab tempfile is an implementation detail and is no longer touched. make
check now completes in ~13 seconds.
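
The dup-under-lock, fsync-outside-lock shape can be sketched with a single table and mutex. This is a simplified model, not sc_sync itself: fd_table_t and sync_open_files are hypothetical, the overlay-region walk and its second lock are omitted, and the fixed 16-slot table is an assumption made for brevity.

```c
#define _DEFAULT_SOURCE 1 /* glibc: expose POSIX calls under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical fd table: host fds guarded by a mutex; -1 marks an empty slot. */
typedef struct {
    pthread_mutex_t lock;
    int             host_fd[16];
} fd_table_t;

/* Flush every open host fd. Returns the number of fds fsync'd successfully. */
int sync_open_files(fd_table_t *t)
{
    const size_t cap = sizeof t->host_fd / sizeof t->host_fd[0];
    int flushed = 0;
    int *dups = malloc(cap * sizeof *dups);

    if (dups == NULL) {
        /* Fallback: flush inline under the lock rather than doing nothing. */
        pthread_mutex_lock(&t->lock);
        for (size_t i = 0; i < cap; i++)
            if (t->host_fd[i] >= 0 && fsync(t->host_fd[i]) == 0)
                flushed++;
        pthread_mutex_unlock(&t->lock);
        return flushed;
    }

    size_t n = 0;
    pthread_mutex_lock(&t->lock); /* brief snapshot phase */
    for (size_t i = 0; i < cap; i++) {
        if (t->host_fd[i] < 0)
            continue;
        int d = dup(t->host_fd[i]);
        if (d >= 0)
            dups[n++] = d;
    }
    pthread_mutex_unlock(&t->lock);

    for (size_t i = 0; i < n; i++) { /* slow part runs without the lock */
        if (fsync(dups[i]) == 0)
            flushed++;
        close(dups[i]);
    }
    free(dups);
    return flushed;
}
```

The dup keeps the underlying file open even if another thread closes the original fd while the flush is in progress, which is what makes it safe to drop the lock before calling fsync.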
Locked in by tests/test-msync.c with three new cases on top of the
existing four: host pwrite is visible through the mapping without
msync, guest writes through the mapping reach the file without msync,
and an adjacent MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS allocation does
not inherit the shared overlay through host-page sharing (regression
lock for the gap-finder host-page alignment fix).
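
The first two properties can be reproduced on the host alone, without HVF or a guest. The sketch below is not taken from tests/test-msync.c: it uses an anonymous mapping as a stand-in for the guest slab, lays a MAP_FIXED|MAP_SHARED file mapping over one host page of it, and checks both directions of coherence plus the untouched neighbouring pages. check_overlay_coherence and the tempfile path are hypothetical.

```c
#define _DEFAULT_SOURCE 1 /* glibc: expose pread/pwrite/MAP_ANON under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if coherent, -1 on setup failure, >0 naming the failed check. */
int check_overlay_coherence(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t slab_len = (size_t)page * 4;
    char path[] = "/tmp/overlay_sketch_XXXXXX";
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) { close(fd); return -1; }

    /* Anonymous, zero-filled slab standing in for guest memory. */
    char *slab = mmap(NULL, slab_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (slab == MAP_FAILED) { close(fd); return -1; }

    /* Overlay: replace the second host page of the slab with the file. */
    char *ov = mmap(slab + page, (size_t)page, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_SHARED, fd, 0);
    if (ov == MAP_FAILED)
        goto out;
    rc = 0;

    /* 1. A write through the fd is visible through the mapping, no msync. */
    if (pwrite(fd, "host", 4, 0) != 4 || memcmp(ov, "host", 4) != 0)
        rc = 1;

    /* 2. A store through the mapping is visible to pread, no msync. */
    char buf[5];
    memcpy(ov + 8, "guest", 5);
    if (pread(fd, buf, 5, 8) != 5 || memcmp(buf, "guest", 5) != 0)
        rc = 2;

    /* 3. Neighbouring slab pages were not replaced by the overlay. */
    if (slab[0] != 0 || slab[2 * page] != 0)
        rc = 3;
out:
    munmap(slab, slab_len);
    close(fd);
    return rc;
}
```

The third check is only a loose analogue of the regression case above, since here the neighbouring pages are host-page-aligned by construction.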
Apple Silicon enforces 16 KiB host pages: mmap MAP_FIXED requires the addr and offset to be 16 KiB-aligned. Two paths handle the gap: the gap-finder hint advances to the next host-page boundary after each allocation so sequential mmaps stay overlay-eligible, and find_free_gap now aligns to host_page_size_cached() rather than the guest 4 KiB page when stepping past a region. Misaligned MAP_FIXED requests fall back to the snapshot pread path, so mremap-style guest-supplied addresses that the host kernel cannot honor still produce correct behaviour through the legacy emulation.
The CoW fork path syncs each live overlay region back into shm_fd before sending shm_fd over SCM_RIGHTS so the child's MAP_PRIVATE snapshot reflects the parent's view at fork time, and the child's fork-state restore demotes every inherited overlay flag to the snapshot path (live cross-fork MAP_SHARED coherence is the next P1 TODO item, deliberately deferred). The sync-back loop now treats a short pwrite of zero as -EIO instead of spinning.
Multi-model review (Gemini, Codex) closed six issues in the same change:
1. sys_mmap rollback when guest_region_add_ex_owned failed after the host overlay succeeded left the file mmap'd at host_base+ipa with no region tracking, so a later operation in that range would memset zeros directly into the user's file; the failure path now calls hvf_remove_file_overlay before returning -ENOMEM.
2. The final hv_vm_map failure path in hvf_apply_file_overlay restored slab backing but never re-issued hv_vm_map for the segment, so sibling vCPUs would page-fault on that IPA after thread_resume_siblings; the rollback now re-establishes the segment before returning.
3. cleanup_overlays_in_range cleared overlay_active before calling hvf_remove_file_overlay and ignored its failure (described above).
4. fork_ipc_recv_memory_regions inherited overlay_active=true on every shared region but the child never re-established the host-VA overlay, so guest writes went to private CoW slab while msync silently skipped writeback; the child now demotes every inherited overlay to the snapshot path.
5. find_free_gap_inner advanced gap_start with PAGE_ALIGN_UP (4 KiB), not host_page_size_cached() (16 KiB on Apple Silicon); after hint rewind, a new mapping could start mid-host-page and silently share an already-overlay-mapped 16 KiB host page, exposing live file content (or causing zero writes back into the file) through the wrong VMA.
6. The CoW fork sync-back pwrite=0-with-len>0 spin described above.
Sync handling needed a parallel fix: sc_sync was forwarding to host sync(2), which flushes every dirty page system-wide. The slab is mmap'd MAP_SHARED to an internal tempfile (g->shm_fd) for the CoW fork fast path, so a global flush had to walk multi-GB of demand-paged dirty pages from that tempfile plus the same from any other elfuse process on the host. In practice this stalled make check for hundreds of seconds when prior killed test runs had left stuck-in-uninterruptible busybox/elfuse processes holding mmap'd tempfiles. sc_sync now iterates the guest fd_table plus the live overlay regions, dups each target host fd under the matching lock, and fsyncs them outside the lock so a slow disk does not stall concurrent FD operations on other threads. The slab tempfile is an implementation detail and is no longer touched. Full make check went from timing out to ~13 seconds end to end.
Summary by cubic
Implements real MAP_SHARED by overlaying host mmap(MAP_FIXED|MAP_SHARED) on the guest slab so file-backed mappings stay coherent with the file and peers. Fixes broken WAL/lock-file behavior and makes sync fast.
New Features
- MAP_SHARED files with HVF segment splitting (2 MiB) and vCPU quiesce; HVF is re-mapped so stage-2 walks the updated host PTEs.
- guest_region_t tracks overlay bounds; msync collapses to fsync for overlays, and MADV_DONTNEED skips zero+reload; CoW fork copies overlay bytes into shm_fd and the child demotes overlays to snapshot mode.
- sync(2) now fsyncs open files and live overlay fds instead of calling host sync(), cutting test runtime to ~13s.
Bug Fixes
- sys_mmap/HVF failures and always re-establish HVF segments; overlay teardown returns errors and clears metadata only after successful host-VA cleanup.
- mremap is overlay-aware: reads from the file when the source is overlaid; error paths restore the source overlay, invalidate dest PTEs, and rebuild page tables for MREMAP_FIXED. Snapshot/rollback buffers moved to heap to avoid macOS thread stack limits.
- pwrite visible via mapping without msync, guest writes reach the file without msync, and adjacent MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS does not inherit a shared overlay.
Written for commit 92c13c1. Summary will update on new commits.