Skip to content

feat(uffd): add REMOVE-event handling, pageTracker removed state, and deterministic race tests#2513

Closed
ValentaTomas wants to merge 3 commits intomainfrom
feat/uffd-page-tracker-and-remove-events
Closed

feat(uffd): add REMOVE-event handling, pageTracker removed state, and deterministic race tests#2513
ValentaTomas wants to merge 3 commits intomainfrom
feat/uffd-page-tracker-and-remove-events

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

This PR is the test side of a stacked pair targeting main:

  • PR A (this one) — lifts the minimum production-side UFFD subset from
    feat/free-page-reporting (uffd: add support for UFFD_EVENT_REMOVE events #1896) needed to compile and exercise the
    REMOVE/MISSING/COPY interleavings, plus the new net/rpc/jsonrpc test
    harness over a unix domain socket and three deterministic race tests.
    The race tests are intentionally configured against the buggy
    pre-fix worker dispatch (state read in the parent loop instead of under
    settleRequests.RLock), so TestStaleSourceRaceMissingAndRemove fails
    on this PR. That failure is the bug demonstration.
  • PR B (stacked)fix/uffd-stale-source-race-on-main — applies the
    one-file production fix that flips PR A's failing test green.

Why a separate stack instead of just landing in #1896: the user wanted
the stale-source race to be reviewable as a tests-then-fix pair against
main, independent of the rest of the free-page-reporting feature work
(Firecracker v1.14, FPR feature flag, template-manager plumbing, proto
regen, etc.). When this stack merges, #1896 will rebase on top and drop
the lifted commits.


What's in this PR

1. Production code lifted from #1896 (commit c50322b1d)

Files: userfaultfd.go, deferred.go, page_tracker.go, prefault.go,
plus pre-existing scaffolding in cross_process_helpers_test.go,
fd_helpers_test.go, helpers_test.go, and the new remove_test.go
(scaffolding form — replaced by the harness in commit 2).

The lift adds:

  • pageTracker.pageState gains removed (distinct from missing /
    faulted).
  • Serve() drains the UFFD fd in a per-poll-cycle batch and processes
    UFFD_EVENT_REMOVE: REMOVE batches take settleRequests.Lock() and
    call pageTracker.setState(..., removed) for every page in the
    removed range, before any pagefault dispatch in that iteration.
  • For each pagefault in the batch, Serve() reads pageTracker state
    in the parent loop (NOT under settleRequests.RLock) and captures
    source = u.src there before dispatching the worker goroutine.
    This is the buggy form that PR B fixes. A REMOVE event arriving
    between the parent-loop state read and the worker actually acquiring
    RLock leaves the worker with a stale source = u.src snapshot, which
    it then UFFDIO_COPYs into a page the kernel just MADV_DONTNEED'd.
  • deferred.go batches up pagefaults that returned EAGAIN/short-circuit
    so the next poll cycle picks them up; a self-pipe wakeupPipe wakes
    poll immediately when a deferred fault is enqueued.
  • prefault.go grows the same batched-event handling.

Out of scope (intentionally NOT lifted from #1896): Firecracker v1.14
bump, free-page-reporting feature flag, template-manager / API plumbing,
proto regen, fcversion sandbox feature, anything outside
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/.

Lift size: 8 files, +883 / -193 lines (production + base scaffolding).

2. net/rpc/jsonrpc test harness over unix socket (commit fe8d2df75)

Replaces the previous pile of single-purpose pipes (offsets, ready,
gate-cmd, gate-sync) plus SIGUSR1/SIGUSR2 signals with one bidirectional
unix domain socket carrying stdlib net/rpc + net/rpc/jsonrpc. Only
the kernel userfaultfd is still handed off out-of-band (via
ExtraFiles); source data goes through a temp file.

RPC methods exposed by the child's Service:

  • Service.WaitReady — replaces the ready pipe + ReadAll handshake.
  • Service.PageStates — replaces SIGUSR2 + offsets pipe + binary.Write protocol; returns {missing, faulted, removed} offset slices.
  • Service.ServePause — replaces gate-cmd byte protocol; pauses the serve loop at the next iteration.
  • Service.ServeResume — replaces gate-sync byte protocol; resumes a paused serve loop.
  • Service.InstallFaultBarrier — NEW: arms a deterministic barrier in the child's worker goroutine at one of two hook points (barrierBeforeRLock or barrierBeforeFaultPage); returns a token.
  • Service.WaitFaultHeld — NEW: blocks until the worker reaches the installed barrier — the RPC reply IS the rendezvous.
  • Service.ReleaseFault — NEW: lets the parked worker proceed.
  • Service.Shutdown — replaces SIGUSR1 graceful exit.

Two new test-only fields land on Userfaultfd:
beforeWorkerRLockHook and beforeFaultPageHook, both
func(addr uintptr). Both default to nil and are nil-checked on the
hot path, so production sees zero behavioural change. They are only
assigned in the child's crossProcessServe wiring.

testConfig gains a sourcePatcher hook so race tests can plant a
deterministic sentinel byte into the random source data BEFORE the
content file is written, instead of depending on the happenstance value
of any randomly-generated byte.

The same commit removes t.Parallel() from the gated cross-process
tests (TestRemoveThenWriteGated, TestWriteThenRemoveGated,
TestFaultedShortCircuitOrdering). While the handler is in the gated
paused state, the user thread that triggered the queued WRITE fault
is suspended in the kernel's pagefault path. From the Go runtime's
perspective that goroutine is "running" (not in syscall) and cannot be
preempted. If a CONCURRENT cross-process test triggers a stop-the-world
GC pause, STW will wait forever for the suspended goroutine to reach a
safe point. Running the gated tests sequentially avoids that
interleaving while leaving every other test on t.Parallel().

3. Deterministic race tests (commit 151736b81)

Three tests, all built on the harness + barrier hooks. None of them use
sleeps, retries, or soak loops; each one installs an explicit barrier,
drives the racing kernel operation from the parent, and asserts on a
concrete post-state.

  • TestStaleSourceRaceMissingAndRemove — regression test for the
    fix landing in PR B. Plants a non-zero sentinel (0xc3) into page 1,
    parks the worker via barrierBeforeRLock, fires MADV_DONTNEED on
    the same page, waits for the REMOVE batch to commit
    (state == removed), releases the worker, then asserts the page is
    zero-filled.
    • Pre-fix (this PR): FAILS with:
      page 1 first byte: want 0 (post-fix zero-fault for `removed` state),
      got 0xc3 — if this equals the sentinel 0xc3, the worker used a
      stale `source = u.src` snapshot (regression)
      
      0xc3 is exactly the planted sentinel, proving the worker
      UFFDIO_COPY'd stale u.src bytes after the page was removed.
    • Post-fix (PR B): passes in ~50 ms / variant.
  • TestNoMadviseDeadlockWithInflightCopy — liveness regression test
    for the user-visible symptom that originally surfaced the bug. Parks
    the worker via barrierBeforeFaultPage (i.e. holding RLock,
    mid-handler), fires MADV_DONTNEED from the parent, asserts madvise
    returns within 2 s. Pre-fix this currently passes (the lifted
    readEvents() does not take settleRequests, so madvise is not
    blocked). Future-proofs against any change that accidentally couples
    readEvents() to settleRequests — would surface as a fast 2 s
    assertion failure rather than a 30 m CI timeout.
  • TestFaultedShortCircuitOrdering — smoke test on the
    REMOVE-then-pagefault batch ordering using the gated harness. Pins
    the invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration. Passes both pre- and post-fix — guards
    against future regressions in batch ordering.

Local sudo run on this branch (Linux, Go 1.25.9):

=== RUN   TestStaleSourceRaceMissingAndRemove
--- FAIL: TestStaleSourceRaceMissingAndRemove (0.05s)
    --- FAIL: TestStaleSourceRaceMissingAndRemove/4k       (assertion above, got 0xc3)
    --- FAIL: TestStaleSourceRaceMissingAndRemove/hugepage (assertion above, got 0xc3)
--- PASS: TestNoMadviseDeadlockWithInflightCopy (0.04s)
--- PASS: TestFaultedShortCircuitOrdering       (0.24s)
FAIL

These race tests intentionally fail on this PR. The fix lands in stacked
PR fix/uffd-stale-source-race-on-main. Merge this stack as a unit;
do not merge this PR alone or CI on main will go red.


Stacking

  • This PR's base: main.
  • Stacked child PR's base: this branch (feat/uffd-page-tracker-and-remove-events).

Related

Test plan

  • go build ./... clean against main + this PR.
  • go vet ./... clean.
  • sudo go test -timeout 2m -count=1 -run 'Race|Deadlock|Ordering' ./pkg/sandbox/uffd/userfaultfd/...TestStaleSourceRaceMissingAndRemove fails with the documented assertion (bug demo); other two pass.
  • sudo go test -timeout 5m -count=1 ./pkg/sandbox/uffd/userfaultfd/... — all pre-existing tests (TestRemove*, TestRemoveThenFault, TestRemoveThenWriteGated, TestWriteThenRemoveGated, TestMissing*, TestMissingWrite*, TestAsyncWriteProtection, TestSerial*, TestParallel*) pass.

…g from #1896

This commit takes the production-side UFFD code on PR #1896 (feat/free-page-reporting)
at the chore-cleanup tip f310273 and lifts it onto main as a self-contained subset,
WITHOUT the production fix from 24cb1e1 that closes the stale-source race the next
commits will demonstrate. Specifically:

- pageTracker gains a `removed` pageState distinct from `missing`/`faulted`
  (page_tracker.go).
- Serve() now drains the UFFD fd with a per-poll-cycle batch and processes
  UFFD_EVENT_REMOVE events: REMOVE batches take settleRequests.Lock() and
  call pageTracker.setState(..., removed) for every page in the removed range,
  before any pagefault dispatch in that iteration.
- For each pagefault in the batch, Serve() reads pageTracker state in the
  PARENT loop (NOT under settleRequests.RLock) and captures `source = u.src`
  there before dispatching the worker goroutine. **This is the buggy form
  that the next stacked PR fixes.** A REMOVE event arriving between the
  parent-loop state read and the worker actually acquiring RLock leaves the
  worker with a stale `source = u.src` snapshot, which it then UFFDIO_COPYs
  into a page the kernel just MADV_DONTNEED'd.
- New deferred.go batches up pagefaults that returned EAGAIN/short-circuit
  so the next poll cycle picks them up; a self-pipe wakeupPipe wakes poll
  immediately when a deferred fault is enqueued.
- prefault.go grows the same batched-event handling.
- Test scaffolding to support cross-process REMOVE testing is lifted as-is
  from f310273 (cross_process_helpers_test.go, helpers_test.go,
  remove_test.go, fd_helpers_test.go). The harness in those files still
  uses the older signals/pipes wiring; the next commit replaces it with
  the unix-socket RPC harness.

Out of scope (intentionally NOT lifted from #1896): Firecracker v1.14 bump,
free-page-reporting feature flag, template-manager / API plumbing, proto
regen, fcversion sandbox feature, anything outside packages/orchestrator
/pkg/sandbox/uffd/userfaultfd/.

The next two commits add the deterministic race tests; the stacked PR on
top of this branch ports the fix.
Replace the cross-process userfaultfd test harness's pile of single-purpose
pipes (offsets, ready, gate-cmd, gate-sync) plus SIGUSR1/SIGUSR2 signals
with one bidirectional Unix domain socket carrying stdlib `net/rpc` +
`net/rpc/jsonrpc`. The kernel userfaultfd is the only fd still handed off
out-of-band (via ExtraFiles); the source data is written to a temp file.
Concurrent in-flight calls and request-id correlation are handled by
the standard library, so the harness only needs to register one
Service struct and dial.

Replaced surface, exposed as RPC methods on the child:

  - Service.WaitReady          (replaces ready pipe + ReadAll handshake)
  - Service.PageStates         (replaces SIGUSR2 + offsets pipe + binary.Write protocol)
  - Service.ServePause         (replaces gate-cmd/gate-sync byte protocol)
  - Service.ServeResume        (replaces gate-cmd/gate-sync byte protocol)
  - Service.InstallFaultBarrier (NEW: arms a deterministic barrier in the
                                 child's worker goroutine at one of two
                                 hook points — beforeWorkerRLockHook or
                                 beforeFaultPageHook; returns a token)
  - Service.WaitFaultHeld      (NEW: blocks until the worker reaches the
                                 barrier — the RPC reply IS the rendezvous)
  - Service.ReleaseFault       (NEW: lets the parked worker proceed)
  - Service.Shutdown           (replaces SIGUSR1 graceful exit)

Add two test-only fields on Userfaultfd: `beforeWorkerRLockHook` and
`beforeFaultPageHook`, both `func(addr uintptr)`, both default to nil
and nil-checked on the hot path so production sees zero behavioral
change. They are only assigned in the child's crossProcessServe wiring
(via the test helper that stands up the subprocess). The hooks let the
parent install deterministic "park here, fire racing op, release"
handshakes — necessary for the race tests in the next commit.

testConfig gains a `sourcePatcher` hook so race tests can plant a
deterministic sentinel byte into the random source data BEFORE the
content file is written, without depending on the happenstance value
of any randomly-generated byte.

Also serialise the gated cross-process tests
(`TestRemoveThenWriteGated`, `TestWriteThenRemoveGated`,
`TestFaultedShortCircuitOrdering`) by removing `t.Parallel()`. While
the handler is in the gated `paused` state, any user thread that
triggers a queued pagefault on the registered region is suspended in
the kernel's pagefault path. From the Go runtime's perspective that
goroutine is "running" (not in syscall, since it's a plain memory
store) and cannot be preempted until the fault is served. If a
CONCURRENT cross-process test in the same binary triggers a
stop-the-world GC pause during this window, STW will wait forever
for the suspended goroutine to reach a safe point — the kernel
cannot deliver the SIGURG preempt signal until the pagefault is
served, and the handler is paused. Running the gated tests
sequentially avoids that interleaving while leaving every other
test (including the rest of the race suite) on `t.Parallel()`.
…ed-short-circuit race tests

Three new race tests built on the unix-socket RPC harness and the
test-only fault-barrier hooks. None of them use sleeps, retries, or
soak loops — each test installs explicit barriers on the child's
worker goroutine, drives the racing kernel operation from the parent,
and asserts on a concrete post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    production fix in the next stacked PR. Plants a non-zero sentinel
    into the source page, parks the worker via barrierBeforeRLock,
    fires madvise on the same page, waits for the REMOVE batch to
    commit (state == removed), releases the worker, then asserts the
    page is zero-filled. PRE-FIX (this PR) the worker UFFDIO_COPYs the
    planted sentinel because it captured `source = u.src` in the
    parent loop before the REMOVE landed and never re-reads state
    inside the goroutine; the assertion fires. POST-FIX (next PR) the
    worker re-reads state under RLock, observes `removed`, and
    zero-faults; assertion passes. Wallclock < 50ms / variant.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test
    for the user-visible symptom that originally surfaced the race —
    parent madvise deadlocking while the worker holds RLock. Parks
    the worker via barrierBeforeFaultPage (i.e. holding RLock, mid-
    handler), fires MADV_DONTNEED, asserts madvise returns within 2s.
    Catches any future change that accidentally couples readEvents()
    to settleRequests as a fast assertion failure rather than a 30m
    CI timeout. Wallclock < 50ms / variant.

  - TestFaultedShortCircuitOrdering: smoke test on the
    REMOVE-then-pagefault batch ordering using the gated harness.
    Pins the invariant that REMOVE batches drain before pagefault
    dispatch in a single Serve iteration. Wallclock ~120ms / variant.

These tests are EXPECTED TO FAIL on this PR — they are the bug
demonstration. The production fix lands in the stacked PR
fix/uffd-stale-source-race and flips them green.

Once the fix lands, all three pass `-count=20 -timeout=30s`
deterministically.
@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 28, 2026

PR Summary

High Risk
High risk because it changes core userfaultfd event-loop behavior (REMOVE batching, deferred fault retries, wakeup pipe) and page-state tracking, which can affect memory correctness and liveness under concurrency and kernel error conditions.

Overview
Adds removed page tracking and UFFD REMOVE-event batching to the userfaultfd serve loop, including deferring EAGAIN faults for retry (with a self-pipe wakeup) and updating Prefault/fault handling to short-circuit on faulted/removed states.

Reworks the cross-process UFFD tests to use a single unix-socket net/rpc/jsonrpc control channel (readiness, page-state queries, pause/resume, and per-fault barriers) and adds new REMOVE-focused and deterministic race/liveness tests that exercise REMOVE/MISSING/COPY interleavings.

Reviewed by Cursor Bugbot for commit 151736b. Bugbot is set up for automated code reviews on this repo. Configure here.

@ValentaTomas ValentaTomas changed the title feat(uffd): add REMOVE-event handling, pageTracker state, and deterministic race tests feat(uffd): add REMOVE-event handling, pageTracker removed state, and deterministic race tests Apr 28, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 151736b819

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

})
var source block.Slicer

switch state := u.pageTracker.get(addr); state {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Move page-state lookup under settleRequests read lock

Serve() snapshots pageTracker state in the parent loop before starting the worker goroutine, so the worker can run with a stale source decision if a UFFD_EVENT_REMOVE is processed in between. In that interleaving, the worker may still UFFDIO_COPY bytes from u.src into a page that was just marked removed, which is the stale-source race exercised by TestStaleSourceRaceMissingAndRemove. Read the state/source inside the goroutine after acquiring settleRequests.RLock() so REMOVE updates are observed before fault handling.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 151736b. Configure here.

if _, err := unix.FcntlInt(uintptr(dup), unix.F_SETFD, unix.FD_CLOEXEC); err != nil {
childForkMu.Unlock()
require.NoError(t, err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutex leak on Dup failure deadlocks parallel tests

Low Severity

After childForkMu.Lock() runs, the very next require.NoError checking the syscall.Dup result will call t.FailNow() (which uses runtime.Goexit) without releasing the mutex if Dup fails. Since the mutex serializes every configureCrossProcessTest call across parallel tests in the binary, all subsequent tests would block indefinitely on childForkMu.Lock(). The FcntlInt failure branch unlocks before failing, but the Dup branch does not.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 151736b. Configure here.

defer s.mu.Unlock()
s.startServe()

return nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServeResume can spawn duplicate serve goroutines

Low Severity

startServe unconditionally creates a new fdexit, a new done channel, and a new serve goroutine, then overwrites s.stop. If ServeResume is ever called when serving is already active (no preceding ServePause), two Serve goroutines end up polling the same uffd fd concurrently and the previous goroutine's stop closure is leaked. Adding a guard such that startServe only runs when s.stop == nil would make the API self-consistent.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 151736b. Configure here.

@ValentaTomas
Copy link
Copy Markdown
Member Author

Closing in favor of #2511. We're keeping the Stack A form (#2511#2475 → main, #2512 stacked on #2511). #2513/#2514 were a duplicate Option-3 stack opened in parallel; superseded.

@ValentaTomas ValentaTomas deleted the feat/uffd-page-tracker-and-remove-events branch April 28, 2026 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants