feat(uffd): add REMOVE-event handling, pageTracker removed state, and deterministic race tests#2513
feat(uffd): add REMOVE-event handling, pageTracker removed state, and deterministic race tests#2513ValentaTomas wants to merge 3 commits intomainfrom
removed state, and deterministic race tests#2513Conversation
…g from #1896 This commit takes the production-side UFFD code on PR #1896 (feat/free-page-reporting) at the chore-cleanup tip f310273 and lifts it onto main as a self-contained subset, WITHOUT the production fix from 24cb1e1 that closes the stale-source race the next commits will demonstrate. Specifically: - pageTracker gains a `removed` pageState distinct from `missing`/`faulted` (page_tracker.go). - Serve() now drains the UFFD fd with a per-poll-cycle batch and processes UFFD_EVENT_REMOVE events: REMOVE batches take settleRequests.Lock() and call pageTracker.setState(..., removed) for every page in the removed range, before any pagefault dispatch in that iteration. - For each pagefault in the batch, Serve() reads pageTracker state in the PARENT loop (NOT under settleRequests.RLock) and captures `source = u.src` there before dispatching the worker goroutine. **This is the buggy form that the next stacked PR fixes.** A REMOVE event arriving between the parent-loop state read and the worker actually acquiring RLock leaves the worker with a stale `source = u.src` snapshot, which it then UFFDIO_COPYs into a page the kernel just MADV_DONTNEED'd. - New deferred.go batches up pagefaults that returned EAGAIN/short-circuit so the next poll cycle picks them up; a self-pipe wakeupPipe wakes poll immediately when a deferred fault is enqueued. - prefault.go grows the same batched-event handling. - Test scaffolding to support cross-process REMOVE testing is lifted as-is from f310273 (cross_process_helpers_test.go, helpers_test.go, remove_test.go, fd_helpers_test.go). The harness in those files still uses the older signals/pipes wiring; the next commit replaces it with the unix-socket RPC harness. Out of scope (intentionally NOT lifted from #1896): Firecracker v1.14 bump, free-page-reporting feature flag, template-manager / API plumbing, proto regen, fcversion sandbox feature, anything outside packages/orchestrator /pkg/sandbox/uffd/userfaultfd/. The next two commits add the deterministic race tests; the stacked PR on top of this branch ports the fix.
Replace the cross-process userfaultfd test harness's pile of single-purpose
pipes (offsets, ready, gate-cmd, gate-sync) plus SIGUSR1/SIGUSR2 signals
with one bidirectional Unix domain socket carrying stdlib `net/rpc` +
`net/rpc/jsonrpc`. The kernel userfaultfd is the only fd still handed off
out-of-band (via ExtraFiles); the source data is written to a temp file.
Concurrent in-flight calls and request-id correlation are handled by
the standard library, so the harness only needs to register one
Service struct and dial.
Replaced surface, exposed as RPC methods on the child:
- Service.WaitReady (replaces ready pipe + ReadAll handshake)
- Service.PageStates (replaces SIGUSR2 + offsets pipe + binary.Write protocol)
- Service.ServePause (replaces gate-cmd/gate-sync byte protocol)
- Service.ServeResume (replaces gate-cmd/gate-sync byte protocol)
- Service.InstallFaultBarrier (NEW: arms a deterministic barrier in the
child's worker goroutine at one of two
hook points — beforeWorkerRLockHook or
beforeFaultPageHook; returns a token)
- Service.WaitFaultHeld (NEW: blocks until the worker reaches the
barrier — the RPC reply IS the rendezvous)
- Service.ReleaseFault (NEW: lets the parked worker proceed)
- Service.Shutdown (replaces SIGUSR1 graceful exit)
Add two test-only fields on Userfaultfd: `beforeWorkerRLockHook` and
`beforeFaultPageHook`, both `func(addr uintptr)`, both default to nil
and nil-checked on the hot path so production sees zero behavioral
change. They are only assigned in the child's crossProcessServe wiring
(via the test helper that stands up the subprocess). The hooks let the
parent install deterministic "park here, fire racing op, release"
handshakes — necessary for the race tests in the next commit.
testConfig gains a `sourcePatcher` hook so race tests can plant a
deterministic sentinel byte into the random source data BEFORE the
content file is written, without depending on the happenstance value
of any randomly-generated byte.
Also serialise the gated cross-process tests
(`TestRemoveThenWriteGated`, `TestWriteThenRemoveGated`,
`TestFaultedShortCircuitOrdering`) by removing `t.Parallel()`. While
the handler is in the gated `paused` state, any user thread that
triggers a queued pagefault on the registered region is suspended in
the kernel's pagefault path. From the Go runtime's perspective that
goroutine is "running" (not in syscall, since it's a plain memory
store) and cannot be preempted until the fault is served. If a
CONCURRENT cross-process test in the same binary triggers a
stop-the-world GC pause during this window, STW will wait forever
for the suspended goroutine to reach a safe point — the kernel
cannot deliver the SIGURG preempt signal until the pagefault is
served, and the handler is paused. Running the gated tests
sequentially avoids that interleaving while leaving every other
test (including the rest of the race suite) on `t.Parallel()`.
…ed-short-circuit race tests
Three new race tests built on the unix-socket RPC harness and the
test-only fault-barrier hooks. None of them use sleeps, retries, or
soak loops — each test installs explicit barriers on the child's
worker goroutine, drives the racing kernel operation from the parent,
and asserts on a concrete post-state.
- TestStaleSourceRaceMissingAndRemove: regression test for the
production fix in the next stacked PR. Plants a non-zero sentinel
into the source page, parks the worker via barrierBeforeRLock,
fires madvise on the same page, waits for the REMOVE batch to
commit (state == removed), releases the worker, then asserts the
page is zero-filled. PRE-FIX (this PR) the worker UFFDIO_COPYs the
planted sentinel because it captured `source = u.src` in the
parent loop before the REMOVE landed and never re-reads state
inside the goroutine; the assertion fires. POST-FIX (next PR) the
worker re-reads state under RLock, observes `removed`, and
zero-faults; assertion passes. Wallclock < 50ms / variant.
- TestNoMadviseDeadlockWithInflightCopy: liveness regression test
for the user-visible symptom that originally surfaced the race —
parent madvise deadlocking while the worker holds RLock. Parks
the worker via barrierBeforeFaultPage (i.e. holding RLock, mid-
handler), fires MADV_DONTNEED, asserts madvise returns within 2s.
Catches any future change that accidentally couples readEvents()
to settleRequests as a fast assertion failure rather than a 30m
CI timeout. Wallclock < 50ms / variant.
- TestFaultedShortCircuitOrdering: smoke test on the
REMOVE-then-pagefault batch ordering using the gated harness.
Pins the invariant that REMOVE batches drain before pagefault
dispatch in a single Serve iteration. Wallclock ~120ms / variant.
These tests are EXPECTED TO FAIL on this PR — they are the bug
demonstration. The production fix lands in the stacked PR
fix/uffd-stale-source-race and flips them green.
Once the fix lands, all three pass `-count=20 -timeout=30s`
deterministically.
PR SummaryHigh Risk Overview Reworks the cross-process UFFD tests to use a single unix-socket Reviewed by Cursor Bugbot for commit 151736b. Bugbot is set up for automated code reviews on this repo. Configure here. |
removed state, and deterministic race tests
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 151736b819
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }) | ||
| var source block.Slicer | ||
|
|
||
| switch state := u.pageTracker.get(addr); state { |
There was a problem hiding this comment.
Move page-state lookup under settleRequests read lock
Serve() snapshots pageTracker state in the parent loop before starting the worker goroutine, so the worker can run with a stale source decision if a UFFD_EVENT_REMOVE is processed in between. In that interleaving, the worker may still UFFDIO_COPY bytes from u.src into a page that was just marked removed, which is the stale-source race exercised by TestStaleSourceRaceMissingAndRemove. Read the state/source inside the goroutine after acquiring settleRequests.RLock() so REMOVE updates are observed before fault handling.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 151736b. Configure here.
| if _, err := unix.FcntlInt(uintptr(dup), unix.F_SETFD, unix.FD_CLOEXEC); err != nil { | ||
| childForkMu.Unlock() | ||
| require.NoError(t, err) | ||
| } |
There was a problem hiding this comment.
Mutex leak on Dup failure deadlocks parallel tests
Low Severity
After childForkMu.Lock() runs, the very next require.NoError checking the syscall.Dup result will call t.FailNow() (which uses runtime.Goexit) without releasing the mutex if Dup fails. Since the mutex serializes every configureCrossProcessTest call across parallel tests in the binary, all subsequent tests would block indefinitely on childForkMu.Lock(). The FcntlInt failure branch unlocks before failing, but the Dup branch does not.
Reviewed by Cursor Bugbot for commit 151736b. Configure here.
| defer s.mu.Unlock() | ||
| s.startServe() | ||
|
|
||
| return nil |
There was a problem hiding this comment.
ServeResume can spawn duplicate serve goroutines
Low Severity
startServe unconditionally creates a new fdexit, a new done channel, and a new serve goroutine, then overwrites s.stop. If ServeResume is ever called when serving is already active (no preceding ServePause), two Serve goroutines end up polling the same uffd fd concurrently and the previous goroutine's stop closure is leaked. Adding a guard such that startServe only runs when s.stop == nil would make the API self-consistent.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 151736b. Configure here.


This PR is the test side of a stacked pair targeting
main:feat/free-page-reporting(uffd: add support for UFFD_EVENT_REMOVE events #1896) needed to compile and exercise theREMOVE/MISSING/COPY interleavings, plus the new
net/rpc/jsonrpctestharness over a unix domain socket and three deterministic race tests.
The race tests are intentionally configured against the buggy
pre-fix worker dispatch (state read in the parent loop instead of under
settleRequests.RLock), soTestStaleSourceRaceMissingAndRemovefailson this PR. That failure is the bug demonstration.
fix/uffd-stale-source-race-on-main— applies theone-file production fix that flips PR A's failing test green.
Why a separate stack instead of just landing in #1896: the user wanted
the stale-source race to be reviewable as a tests-then-fix pair against
main, independent of the rest of the free-page-reporting feature work(Firecracker v1.14, FPR feature flag, template-manager plumbing, proto
regen, etc.). When this stack merges, #1896 will rebase on top and drop
the lifted commits.
What's in this PR
1. Production code lifted from #1896 (commit
c50322b1d)Files:
userfaultfd.go,deferred.go,page_tracker.go,prefault.go,plus pre-existing scaffolding in
cross_process_helpers_test.go,fd_helpers_test.go,helpers_test.go, and the newremove_test.go(scaffolding form — replaced by the harness in commit 2).
The lift adds:
pageTracker.pageStategainsremoved(distinct frommissing/faulted).Serve()drains the UFFD fd in a per-poll-cycle batch and processesUFFD_EVENT_REMOVE: REMOVE batches takesettleRequests.Lock()andcall
pageTracker.setState(..., removed)for every page in theremoved range, before any pagefault dispatch in that iteration.
Serve()readspageTrackerstatein the parent loop (NOT under
settleRequests.RLock) and capturessource = u.srcthere before dispatching the worker goroutine.This is the buggy form that PR B fixes. A
REMOVEevent arrivingbetween the parent-loop state read and the worker actually acquiring
RLock leaves the worker with a stale
source = u.srcsnapshot, whichit then
UFFDIO_COPYs into a page the kernel justMADV_DONTNEED'd.deferred.gobatches up pagefaults that returned EAGAIN/short-circuitso the next poll cycle picks them up; a self-pipe
wakeupPipewakespoll immediately when a deferred fault is enqueued.
prefault.gogrows the same batched-event handling.Out of scope (intentionally NOT lifted from #1896): Firecracker v1.14
bump, free-page-reporting feature flag, template-manager / API plumbing,
proto regen, fcversion sandbox feature, anything outside
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/.Lift size: 8 files, +883 / -193 lines (production + base scaffolding).
2.
net/rpc/jsonrpctest harness over unix socket (commitfe8d2df75)Replaces the previous pile of single-purpose pipes (offsets, ready,
gate-cmd, gate-sync) plus SIGUSR1/SIGUSR2 signals with one bidirectional
unix domain socket carrying stdlib
net/rpc+net/rpc/jsonrpc. Onlythe kernel userfaultfd is still handed off out-of-band (via
ExtraFiles); source data goes through a temp file.RPC methods exposed by the child's
Service:Service.WaitReady— replaces the ready pipe + ReadAll handshake.Service.PageStates— replaces SIGUSR2 + offsets pipe + binary.Write protocol; returns{missing, faulted, removed}offset slices.Service.ServePause— replaces gate-cmd byte protocol; pauses the serve loop at the next iteration.Service.ServeResume— replaces gate-sync byte protocol; resumes a paused serve loop.Service.InstallFaultBarrier— NEW: arms a deterministic barrier in the child's worker goroutine at one of two hook points (barrierBeforeRLockorbarrierBeforeFaultPage); returns a token.Service.WaitFaultHeld— NEW: blocks until the worker reaches the installed barrier — the RPC reply IS the rendezvous.Service.ReleaseFault— NEW: lets the parked worker proceed.Service.Shutdown— replaces SIGUSR1 graceful exit.Two new test-only fields land on
Userfaultfd:beforeWorkerRLockHookandbeforeFaultPageHook, bothfunc(addr uintptr). Both default toniland are nil-checked on thehot path, so production sees zero behavioural change. They are only
assigned in the child's
crossProcessServewiring.testConfiggains asourcePatcherhook so race tests can plant adeterministic sentinel byte into the random source data BEFORE the
content file is written, instead of depending on the happenstance value
of any randomly-generated byte.
The same commit removes
t.Parallel()from the gated cross-processtests (
TestRemoveThenWriteGated,TestWriteThenRemoveGated,TestFaultedShortCircuitOrdering). While the handler is in the gatedpausedstate, the user thread that triggered the queued WRITE faultis suspended in the kernel's pagefault path. From the Go runtime's
perspective that goroutine is "running" (not in syscall) and cannot be
preempted. If a CONCURRENT cross-process test triggers a stop-the-world
GC pause, STW will wait forever for the suspended goroutine to reach a
safe point. Running the gated tests sequentially avoids that
interleaving while leaving every other test on
t.Parallel().3. Deterministic race tests (commit
151736b81)Three tests, all built on the harness + barrier hooks. None of them use
sleeps, retries, or soak loops; each one installs an explicit barrier,
drives the racing kernel operation from the parent, and asserts on a
concrete post-state.
TestStaleSourceRaceMissingAndRemove— regression test for thefix landing in PR B. Plants a non-zero sentinel (
0xc3) into page 1,parks the worker via
barrierBeforeRLock, firesMADV_DONTNEEDonthe same page, waits for the REMOVE batch to commit
(
state == removed), releases the worker, then asserts the page iszero-filled.
0xc3is exactly the planted sentinel, proving the workerUFFDIO_COPY'd staleu.srcbytes after the page was removed.TestNoMadviseDeadlockWithInflightCopy— liveness regression testfor the user-visible symptom that originally surfaced the bug. Parks
the worker via
barrierBeforeFaultPage(i.e. holding RLock,mid-handler), fires
MADV_DONTNEEDfrom the parent, asserts madvisereturns within 2 s. Pre-fix this currently passes (the lifted
readEvents()does not takesettleRequests, so madvise is notblocked). Future-proofs against any change that accidentally couples
readEvents()tosettleRequests— would surface as a fast 2 sassertion failure rather than a 30 m CI timeout.
TestFaultedShortCircuitOrdering— smoke test on theREMOVE-then-pagefault batch ordering using the gated harness. Pins
the invariant that REMOVE batches drain before pagefault dispatch in
a single Serve iteration. Passes both pre- and post-fix — guards
against future regressions in batch ordering.
Local sudo run on this branch (Linux, Go 1.25.9):
Stacking
main.feat/uffd-page-tracker-and-remove-events).Related
feat/free-page-reporting). When thisstack merges, uffd: add support for UFFD_EVENT_REMOVE events #1896 will rebase on top and drop the lifted commits.
race against commit
c8b3d63— that report is what this stackaddresses.
Test plan
go build ./...clean againstmain+ this PR.go vet ./...clean.sudo go test -timeout 2m -count=1 -run 'Race|Deadlock|Ordering' ./pkg/sandbox/uffd/userfaultfd/...—TestStaleSourceRaceMissingAndRemovefails with the documented assertion (bug demo); other two pass.sudo go test -timeout 5m -count=1 ./pkg/sandbox/uffd/userfaultfd/...— all pre-existing tests (TestRemove*,TestRemoveThenFault,TestRemoveThenWriteGated,TestWriteThenRemoveGated,TestMissing*,TestMissingWrite*,TestAsyncWriteProtection,TestSerial*,TestParallel*) pass.