Honor MAP_SHARED coherence across fork #16

Merged
jserv merged 1 commit into main from cross-fork on May 7, 2026
Contributor

@jserv jserv commented May 7, 2026

Both fork paths (CoW shm and legacy IPC byte-copy) silently broke MAP_SHARED visibility across fork: the child mapped the slab MAP_PRIVATE or got a fresh byte copy, so writes from either side stayed local and never reached the kernel page cache the parent shared with the file. MAP_SHARED|MAP_ANONYMOUS, the standard parent-child IPC primitive used by Postgres and other multi-process daemons, was equally broken.
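For reference, the Linux behavior this PR restores can be checked with a few lines of plain POSIX C. This is a minimal sketch, not code from the PR: the function name `shared_anon_visible_across_fork` is made up for illustration, and it only exercises the child-to-parent direction.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the PR).
 * Returns 0 when a write made by the child through a
 * MAP_SHARED|MAP_ANONYMOUS mapping is visible to the parent,
 * 1 when the parent still sees its pre-fork value,
 * and -1 on a setup error or when the child misses the seed. */
int shared_anon_visible_across_fork(void)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;

    *cell = 1; /* pre-fork seed */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {
        /* Child: must observe the seed, then publish a new value. */
        int seed_ok = (*cell == 1);
        *cell = 2;
        _exit(seed_ok ? 0 : 1);
    }

    int status = 0;
    int waited = (waitpid(pid, &status, 0) == pid);
    int child_ok = waited && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int seen = *cell;
    munmap(cell, sizeof *cell);

    if (!child_ok)
        return -1;
    return seen == 2 ? 0 : 1;
}
```

A caller can check it with `assert(shared_anon_visible_across_fork() == 0);`. With the broken fork paths described above, the parent would keep seeing `1`, so the helper would return 1.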

Three pieces close the gap:

  1. Parent-side conversion (mmap_fork_prepare_anon_shared, with commit/abort wrappers). While siblings are quiesced the fork thread walks live regions, promotes each MAP_SHARED|MAP_ANONYMOUS region without a backing fd into a memfd-style overlay (mkstemp+unlink+ftruncate, pwrite-seed from host_base, host MAP_FIXED|MAP_SHARED via the new hvf_apply_file_overlay_quiesced helper, mark_overlay_metadata_range), and pre-stages per-region dup() fds so a transient EMFILE rolls back cleanly. The candidate filter skips regions whose host-page-rounded tail would alias a neighbor mapping. The transactional commit/abort wrappers let the fork-IPC failure path roll back the in-place conversion (overlay teardown plus region metadata restore) before resuming siblings; abort validates every captured snapshot before tearing down so a sibling-drift past the quiesce timeout does not leave host VA out of sync with semantic state. forkipc.c logs a warning when abort returns a partial failure so the parent's stale state is visible in post-mortem.
  2. Child-side restoration (mmap_fork_restore_overlays). The recv path now snapshots parent overlay_active/start/end (and a new parent_had_fd[] mirror) before clearing inherited state, then re-runs hvf_apply_file_overlay against the saved overlay span once SCM_RIGHTS delivers the backing fds. The inner quiesce is a no-op since no worker vCPUs exist yet.
  3. Pre-existing fork-IPC alignment bug. The old recv_backing_fds filter (!MAP_ANONYMOUS && offset != -1) matched the shim region (LINUX_MAP_PRIVATE, offset 0) and ELF text segments and silently stole incoming SCM_RIGHTS fds, leaving the actual file-backed regions with backing_fd=-1. The receiver now uses parent_had_fd[] as the filter so its iteration order matches the sender's "backing_fd >= 0" filter exactly. Unassigned fds are closed instead of leaked.
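The parent-side conversion in item 1 can be sketched with plain POSIX calls. This is a rough illustration under stated assumptions, not the PR's implementation: `promote_to_file_overlay` and `overlay_demo` are hypothetical names, a private anonymous mapping stands in for the region's host memory, and quiescing, metadata updates, dup() pre-staging, and rollback are all left out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: back [base, base+len) with an unlinked temp file.
 * base must be page-aligned and len a multiple of the page size.
 * Returns the backing fd on success, -1 on failure. */
int promote_to_file_overlay(void *base, size_t len)
{
    char path[] = "/tmp/overlay-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* keep the inode alive through the fd, drop the name */

    if (ftruncate(fd, (off_t)len) != 0)
        goto fail;

    /* Seed the file with the bytes currently in memory. */
    const char *src = base;
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, src + done, len - done, (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            goto fail;
        done += (size_t)n;
    }

    /* Replace the pages in place with a shared mapping of the file. */
    if (mmap(base, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}

/* Returns 0 when the seed survives promotion and a child write is
 * visible to the parent afterwards, 1 when it is not, -1 on error. */
int overlay_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    strcpy(base, "seed");

    int fd = promote_to_file_overlay(base, len);
    if (fd < 0) {
        munmap(base, len);
        return -1;
    }

    int result = -1;
    if (strcmp(base, "seed") == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            strcpy(base, "child");
            _exit(0);
        }
        int status = 0;
        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = (strcmp(base, "child") == 0) ? 0 : 1;
    }
    munmap(base, len);
    close(fd);
    return result;
}
```

The fd returned by the helper is what would later travel to the child over SCM_RIGHTS; the demo relies on fork inheritance instead, which is enough to show that the promoted mapping behaves as shared memory.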

hvf_apply_file_overlay and hvf_remove_file_overlay are split into a public variant that handles thread_quiesce_siblings and a _quiesced inner that the parent fork-prep / abort paths call without a nested barrier.
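The public/`_quiesced` split follows a common locking pattern, shown here as a toy model. Every name below (`sk_quiesce_siblings`, `sk_apply_overlay`, `sk_fork_prepare`, and so on) is a hypothetical stand-in, and the barrier is reduced to a single flag; the real helpers and their signatures are not shown in this PR description.

```c
#include <assert.h>

static int sk_barrier_held;     /* stand-in for the sibling-quiesce state */
static int sk_overlays_applied; /* counts inner calls, for illustration */

/* A nested quiesce is modeled as an error rather than a deadlock. */
static int sk_quiesce_siblings(void)
{
    if (sk_barrier_held)
        return -1;
    sk_barrier_held = 1;
    return 0;
}

static void sk_resume_siblings(void)
{
    sk_barrier_held = 0;
}

/* Inner variant: the caller must already hold the barrier. */
int sk_apply_overlay_quiesced(int region)
{
    (void)region;
    assert(sk_barrier_held);
    sk_overlays_applied++;
    return 0;
}

/* Public variant: takes the barrier, delegates, releases. */
int sk_apply_overlay(int region)
{
    if (sk_quiesce_siblings() != 0)
        return -1;
    int rc = sk_apply_overlay_quiesced(region);
    sk_resume_siblings();
    return rc;
}

/* Fork prep holds one barrier across many regions, so it must call
 * the inner variant; the public one would try to nest the barrier. */
int sk_fork_prepare(int nregions)
{
    if (sk_quiesce_siblings() != 0)
        return -1;
    int rc = 0;
    for (int i = 0; i < nregions && rc == 0; i++)
        rc = sk_apply_overlay_quiesced(i);
    sk_resume_siblings();
    return rc;
}
```

In this model, calling `sk_apply_overlay` from inside `sk_fork_prepare` would fail at the nested `sk_quiesce_siblings`, which is the situation the `_quiesced` inner variants avoid.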

Locked in by tests/test-cross-fork-mapshared.c (3 cases: file-backed mkstemp, MAP_SHARED|MAP_ANONYMOUS, /dev/shm via shm_open). Each case verifies pre-fork seed visibility, child-write-visible-to-parent, parent-write-visible-to-child, and on-disk reconciliation. All three pass against Linux ground truth via tests/qemu-runner.sh.
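The four checks can be sketched for the file-backed case as follows. This is an illustrative reconstruction from the description above, not the contents of tests/test-cross-fork-mapshared.c; the function name `file_backed_case`, the pipe handshake, and the marker bytes are assumptions.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test body. Returns 0 when all four checks pass, the
 * number of the first failing check (1..4) otherwise, -1 on setup error. */
int file_backed_case(void)
{
    char path[] = "/tmp/mapshared-XXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int c2p[2], p2c[2]; /* child-to-parent and parent-to-child pipes */
    char token = 0, disk[3];

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0 || pipe(c2p) != 0 || pipe(p2c) != 0)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 'S'; /* pre-fork seed */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(c2p[0]);
        close(p2c[1]);
        int seed_ok = (p[0] == 'S');            /* check 1 */
        p[1] = 'C';                             /* child write */
        if (write(c2p[1], "x", 1) != 1)
            _exit(9);
        if (read(p2c[0], &token, 1) != 1)       /* wait for parent write */
            _exit(9);
        _exit(!seed_ok ? 1 : (p[2] == 'P' ? 0 : 3)); /* check 3 */
    }

    close(c2p[1]);
    close(p2c[0]);
    int got = (read(c2p[0], &token, 1) == 1);
    int child_write_seen = got && p[1] == 'C';  /* check 2 */
    p[2] = 'P';                                 /* parent write */
    if (got && write(p2c[1], "x", 1) != 1)
        got = 0;
    close(p2c[1]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    int child_code = WEXITSTATUS(status);

    /* check 4: the file itself holds all three bytes */
    int disk_ok = msync(p, len, MS_SYNC) == 0 &&
                  pread(fd, disk, 3, 0) == 3 &&
                  memcmp(disk, "SCP", 3) == 0;
    munmap(p, len);
    close(fd);
    close(c2p[0]);

    if (child_code == 1) return 1;
    if (!child_write_seen) return 2;
    if (child_code == 3) return 3;
    if (child_code != 0) return -1;
    return disk_ok ? 0 : 4;
}
```

The pipes only order the two writes so each side reads after the other has written; the data itself travels exclusively through the mapping.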


Summary by cubic

Preserves MAP_SHARED coherence across fork for file-backed and anonymous shared mappings. Converts anonymous shared regions to memfd-backed overlays in the parent and re-applies them in the child so both processes see each other’s writes and on-disk state stays correct.

  • Bug Fixes

    • Parent: convert MAP_SHARED|MAP_ANONYMOUS regions without a backing fd into memfd overlays; seed bytes; install MAP_SHARED|MAP_FIXED via new _quiesced helper; pre-stage per-region dup() fds; keep siblings quiesced through SCM_RIGHTS send; transactional commit/abort with rollback validation; skip host-page-tail alias cases.
    • Child: snapshot parent overlay metadata and parent_had_fd[], receive fds in the same order, then re-install overlays before worker vCPUs; per-region failures fall back to snapshot semantics.
    • IPC: fix fd handoff ordering and leaks by matching sender with parent_had_fd[]; close unassigned fds; add strict checks for truncated/missing SCM_RIGHTS payloads.
    • Tests: add test-cross-fork-mapshared covering file-backed, MAP_SHARED|MAP_ANONYMOUS, and /dev/shm; verifies parent↔child visibility and on-disk reconciliation.
  • Refactors

    • Split hvf_apply_file_overlay/hvf_remove_file_overlay into public and _quiesced variants for safe use during fork.
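The fd handoff ordering fix listed under IPC above can be illustrated with a small model. Everything here is a simplified assumption: `struct sk_region`, `SK_MAP_ANONYMOUS`, `sk_assign_fds`, and `sk_handoff_demo` are hypothetical names, and the integers standing in for fds are tags rather than real descriptors. The sender is assumed to emit one fd per region with `backing_fd >= 0`, in region order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAP_ANONYMOUS 0x20 /* hypothetical flag value for this sketch */

struct sk_region {
    int  flags;
    long offset;
    int  backing_fd;    /* -1 when the region has no backing fd */
    bool parent_had_fd; /* mirror of the parent's "backing_fd >= 0" */
};

/* Hands received fds, in order, to the regions that pass the filter.
 * use_mirror selects the new parent_had_fd filter; otherwise the old
 * (!MAP_ANONYMOUS && offset != -1) filter is used.
 * Returns how many fds were left unassigned (the caller closes those). */
size_t sk_assign_fds(struct sk_region *r, size_t n,
                     const int *fds, size_t nfds, bool use_mirror)
{
    size_t next = 0;
    for (size_t i = 0; i < n; i++) {
        bool want = use_mirror
            ? r[i].parent_had_fd
            : (!(r[i].flags & SK_MAP_ANONYMOUS) && r[i].offset != -1);
        r[i].backing_fd = (want && next < nfds) ? fds[next++] : -1;
    }
    return nfds - next;
}

/* Three regions: shim and ELF text (no fd in the parent), then one
 * file-backed region. Returns the fd the file-backed region ends with. */
int sk_handoff_demo(bool use_mirror)
{
    struct sk_region child[3] = {
        { 0, 0, -1, false }, /* shim region: private, offset 0 */
        { 0, 0, -1, false }, /* ELF text segment */
        { 0, 0, -1, true  }, /* file-backed region, parent had an fd */
    };
    const int received[1] = { 7 }; /* the single fd the sender shipped */
    (void)sk_assign_fds(child, 3, received, 1, use_mirror);
    return child[2].backing_fd;
}
```

With the old filter the shim region consumes the only incoming fd and the file-backed region ends up with -1; with the mirror filter the fd lands on the region the sender meant it for.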

Written for commit 1140b13. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

@jserv jserv merged commit 992292d into main May 7, 2026
4 checks passed
@jserv jserv deleted the cross-fork branch May 7, 2026 11:20