experiment(ext4): noatime+lazytime mounts and lazy mkfs to shrink diffs #2549
Closed
ValentaTomas wants to merge 9 commits into main
A small two-state-plus-default tracker backed by roaring bitmaps. Used by upcoming UFFD work to track page states (Missing/Faulted/Removed) and by NBD to track zero pages, replacing ad-hoc map-based trackers with O(1) range ops and cheap snapshot exports.
…state

Replace the map-based pageTracker with block.StateTracker[pageState], a roaring-bitmap-backed tracker with O(1) range ops. pageState gains a third value, removed, which is wired at the type level but not yet written anywhere; #2520 adds the REMOVE-event handler that produces it. Page indices are computed at the call site via header.BlockIdx. pageStateEntries is updated to iterate the exported bitmaps so the cross-process test harness keeps working.

Inline the 3-line pageState enum into userfaultfd.go and drop the dedicated page_tracker.go now that pageTracker is gone.

Convert block.StateTracker's NewStateTracker / SetRange API from panics to errors. Distinct-state validation and unsupported-state checks now return fmt.Errorf descriptors; the userfaultfd-side init propagates the constructor error through NewUserfaultfdFromFd, and the SetRange call in the worker path logs and continues since these errors only fire on programming bugs.
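For readers who have not seen the tracker, here is a minimal sketch of the "two states plus a default" shape with an error-returning API. It is not the PR's code: the real block.StateTracker is generic and backed by roaring bitmaps, while this version substitutes plain []bool slices to stay inside the standard library (so range ops here are O(n)). The `State` type, its constants, and every method body are assumptions made for illustration.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the PR's pageState.
type State uint8

const (
	Missing State = iota // default state: stored in neither set
	Faulted
	Removed
)

// StateTracker tracks one default state and two explicit states.
// Assumption: []bool sets replace the roaring bitmaps used in the PR.
type StateTracker struct {
	def, a, b State
	inA, inB  []bool
}

// NewStateTracker returns an error (instead of panicking) when the three
// states are not pairwise distinct.
func NewStateTracker(size int, def, a, b State) (*StateTracker, error) {
	if def == a || def == b || a == b {
		return nil, fmt.Errorf("states must be distinct: default=%d a=%d b=%d", def, a, b)
	}
	return &StateTracker{
		def: def, a: a, b: b,
		inA: make([]bool, size),
		inB: make([]bool, size),
	}, nil
}

// SetRange moves every index in [start, end) to state s.
func (t *StateTracker) SetRange(start, end int, s State) error {
	if s != t.def && s != t.a && s != t.b {
		return fmt.Errorf("unsupported state %d", s)
	}
	if start < 0 || start > end || end > len(t.inA) {
		return fmt.Errorf("range [%d, %d) out of bounds for size %d", start, end, len(t.inA))
	}
	for i := start; i < end; i++ {
		t.inA[i] = s == t.a
		t.inB[i] = s == t.b
	}
	return nil
}

// Get returns the state of index i.
func (t *StateTracker) Get(i int) State {
	switch {
	case t.inA[i]:
		return t.a
	case t.inB[i]:
		return t.b
	default:
		return t.def
	}
}

// HasRange reports whether every index in [start, end) is in one of the
// two non-default states (the union-coverage query).
func (t *StateTracker) HasRange(start, end int) bool {
	for i := start; i < end; i++ {
		if !t.inA[i] && !t.inB[i] {
			return false
		}
	}
	return true
}
```

A caller on a worker path could log the SetRange error and continue, as the commit message describes, since both error cases indicate a programming bug rather than a runtime condition.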
ext4 in the guest hands the orchestrator three classes of
"this region reads as zero" hints. Until now we serialized all
of them as actual zero bytes in the snapshot diff:
1. NBD_CMD_TRIM (advertised → never sent; ext4 silently skips it).
2. NBD_CMD_WRITE_ZEROES (same).
3. Plain NBD_CMD_WRITE of an all-zero buffer (e.g. dd if=/dev/zero,
scratch wipes, qcow2-style preallocation by user code).
This commit adds the host-side machinery to recognize all three and
record them as Empty in the diff (mapped to uuid.Nil — already handled
by the read path) instead of copying zero payload.
Pieces:
- block.Cache: replace the single dirty bitset with StateTracker tracking
Untouched / Dirty / Zero, and add WriteZeroesAt that clears the mmap
for fully-covered blocks and marks them Empty. Sub-block tails stay
Dirty so partial overwrites of a zeroed block are preserved.
ExportToDiff now populates DiffMetadata.Empty alongside Dirty.
- block.IsZero: 3-byte sample short-circuit + bytes.Equal self-shift
trick (mirrors qemu's buffer_is_zero). Hot-path on every guest write,
so it dispatches to the runtime's SIMD memequal on amd64/arm64.
- nbd.Dispatch: implement NBD_CMD_WRITE_ZEROES (opcode 6), wire TRIM to
the same handler (TRIM is advisory — guaranteeing zero-on-read lets
the diff drop those blocks), and short-circuit zero-buffer cmdWrite
through the same path.
- nbd.DirectPathMount: advertise FlagSendTrim and FlagSendWriteZeroes
so the guest kernel actually emits the optimized opcodes.
- fc/process: rootflags=discard so ext4 emits TRIM on freed blocks.
- StateTracker.HasRange: cheap union-coverage query needed by Cache to
preserve isCached semantics across the two non-default bitmaps.
Tests cover the new cache primitive (aligned vs unaligned WriteZeroesAt)
and the dispatch routing (zero-write detection, WRITE_ZEROES, TRIM,
backend-error response).
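The aligned versus unaligned WriteZeroesAt behaviour comes down to integer arithmetic on block indices. The sketch below shows one way to find the fully covered blocks of a zeroing request; anything outside that core is a sub-block head or tail that stays Dirty. The helpers `blockIdx` and `blockCeilIdx` are hypothetical stand-ins for the repository's header helpers, and `alignedCore` is an illustrative name, not a function from the PR.

```go
package main

// blockIdx and blockCeilIdx are hypothetical stand-ins for the header
// package helpers: floor and ceiling division of a byte offset by blockSize.
func blockIdx(off, blockSize int64) int64     { return off / blockSize }
func blockCeilIdx(off, blockSize int64) int64 { return (off + blockSize - 1) / blockSize }

// alignedCore returns the byte range [coreStart, coreEnd) of blocks that are
// fully covered by the request [off, off+length). ok is false when the
// request covers no complete block, in which case the whole range is
// sub-block and would be cleared in place and kept Dirty.
func alignedCore(off, length, blockSize int64) (coreStart, coreEnd int64, ok bool) {
	first := blockCeilIdx(off, blockSize) // first block starting at or after off
	end := blockIdx(off+length, blockSize) // first block not fully covered
	if first >= end {
		return 0, 0, false
	}
	return first * blockSize, end * blockSize, true
}
```

For a request at offset 1000 of length 10000 with 4096-byte blocks, the core is [4096, 8192): the head [1000, 4096) and tail [8192, 11000) are partial blocks.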
…card

Two follow-ups uncovered while reviewing the diff path:

- header.IsEmptyBlock did a full bytes.Equal against a pre-allocated zero buffer: fine for 4 KiB rootfs blocks, wasteful for 2 MiB hugepages where the qemu-style 3-byte sample would reject most non-zero pages from a single cache line. Move IsZero to the shared header package and have IsEmptyBlock delegate. The caller (uffd diff materialization via DiffMetadataBuilder) now skips the memcmp on the common non-zero path. Drop the orchestrator-side duplicate and the now-unused EmptyBlock global; EmptyHugePage stays because build.go still needs the literal zero buffer for nil-build reads.
- ext4 host loop mount (template build) had no discard. Adding loop,discard makes the loop driver translate BLKDISCARD into fallocate(PUNCH_HOLE) on the backing rootfs.ext4, so any deletions during the build phase keep the template file sparse on disk.
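The sample-then-self-compare idea is compact enough to sketch. This is an illustration of the technique, not the repository's IsZero: which three bytes are sampled (first, middle, last) is an assumption. The self-shift step relies on a simple fact: if the first byte is zero and every byte equals its successor, then every byte is zero.

```go
package main

import "bytes"

// IsZero reports whether b contains only zero bytes.
// Assumption: the sampled positions are first, middle, and last.
func IsZero(b []byte) bool {
	n := len(b)
	if n == 0 {
		return true
	}
	// Cheap rejection: most non-zero buffers fail on one of three samples.
	if b[0] != 0 || b[n/2] != 0 || b[n-1] != 0 {
		return false
	}
	// Self-shift: compare b[0:n-1] with b[1:n]. Together with b[0] == 0,
	// equality implies b[i] == 0 for every i. bytes.Equal is implemented
	// with the runtime's optimized memory comparison.
	return bytes.Equal(b[:n-1], b[1:])
}
```

No pre-allocated zero buffer is needed, which is what lets IsEmptyBlock drop the EmptyBlock global described above.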
Two related fixes triggered by the snapshot-shrink work.

cache.WriteZeroesAt now fallocate(PUNCH_HOLE | KEEP_SIZE)s the aligned core instead of memset-zeroing the mmap. Reads still return zero (the mmap fault serves zero from the hole), but the underlying cache file releases the previously-allocated pages immediately. Important because the cache lives on tmpfs in production: punched pages free RAM, not just disk. Sub-block head/tail still go through the clear-and-mark-dirty path because punching a partial block would corrupt the neighbouring half. The cache now keeps the file open for the lifetime of the cache so we don't pay an open/close on every WriteZeroesAt.

The NBD wire layout puts a uint16 flags and a uint16 type in the four bytes after magic, but the dispatch was reading them together as one uint32. The switch only matched while the kernel chose flags=0; the moment it set NBD_CMD_FLAG_FUA on a sync write or NBD_CMD_FLAG_NO_HOLE on a WRITE_ZEROES the dispatch fell through to the default error path. Split Request.Flags from Request.Type and document that we ignore every command flag for now (NO_HOLE is moot, since shrinking the diff is the point).

Tests cover the NO_HOLE-flagged WRITE_ZEROES route (would have been a silent regression with the old parser).
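To make the flags/type split concrete, here is a sketch of parsing the 28-byte NBD request header with the two uint16 fields read separately. The field layout, magic, opcode 6, and flag bits follow the NBD protocol; the `Request` field names other than Flags and Type, and the `parseRequest` function itself, are assumptions rather than the repository's parser.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	nbdRequestMagic = 0x25609513
	requestSize     = 28

	cmdWriteZeroes = 6 // NBD_CMD_WRITE_ZEROES

	cmdFlagFUA    = 1 << 0 // NBD_CMD_FLAG_FUA
	cmdFlagNoHole = 1 << 1 // NBD_CMD_FLAG_NO_HOLE
)

// Request mirrors the wire header. Handle, From, and Length are
// illustrative names for the cookie, offset, and length fields.
type Request struct {
	Flags  uint16
	Type   uint16
	Handle uint64
	From   uint64
	Length uint32
}

// parseRequest decodes a big-endian NBD request header.
func parseRequest(b []byte) (Request, error) {
	if len(b) < requestSize {
		return Request{}, fmt.Errorf("short request: %d bytes", len(b))
	}
	if m := binary.BigEndian.Uint32(b[0:4]); m != nbdRequestMagic {
		return Request{}, fmt.Errorf("bad request magic 0x%08x", m)
	}
	return Request{
		// Reading b[4:8] as a single uint32 would yield Flags<<16 | Type,
		// which equals Type only while Flags is zero.
		Flags:  binary.BigEndian.Uint16(b[4:6]),
		Type:   binary.BigEndian.Uint16(b[6:8]),
		Handle: binary.BigEndian.Uint64(b[8:16]),
		From:   binary.BigEndian.Uint64(b[16:24]),
		Length: binary.BigEndian.Uint32(b[24:28]),
	}, nil
}
```

With this split, a WRITE_ZEROES request carrying NO_HOLE still dispatches on Type == 6, and the flag can be inspected or ignored independently.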
…ch short-circuit

Move the IsZero fast-path from the NBD dispatcher into block.Cache so it can split sub-block runs:

- Cache.WriteAtWithoutLock now scans aligned full blocks per-block. All-zero blocks are punched (FALLOC_FL_PUNCH_HOLE) and marked Empty; non-zero blocks copy + mark Dirty. Contiguous zero runs coalesce into one fallocate call. A non-zero buffer with zero padding inside (the common shape for qcow2 preallocation, scratch wipes, format-then-write patterns) now contributes only the non-zero sub-blocks to the diff.
- WriteZeroesAt and WriteAt share a single punchHole helper and use header.BlockCeilIdx/BlockIdx for alignment instead of local helpers.
- nbd/dispatch cmdWrite drops the buffer-wide IsZero check (and its header import); zero-detection is now the cache's responsibility for every writer (NBD, copyProcessMemory if it ever wants it, etc.).
- Tests updated: the three "zero block written as Dirty" tests added in #2386 documented the OLD behaviour and now flip: a zero block written through WriteAt routes to Empty and maps to uuid.Nil. New mixed-buffer test covers the sub-block split case explicitly.
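The per-block scan with run coalescing can be sketched as a pure function over a block-aligned buffer. This is an illustration under assumptions, not the cache code: `run`, `splitRuns`, and the local `isZero` are hypothetical names, the buffer is assumed to start on a block boundary, and coalescing the non-zero runs as well is a choice made here for symmetry.

```go
package main

import "bytes"

// run is a half-open range of block indices [Start, End), relative to the
// start of the buffer, whose blocks are either all zero or all non-zero.
type run struct {
	Zero       bool
	Start, End int
}

// isZero is a simplified zero check (no sampling) for this sketch.
func isZero(b []byte) bool {
	return len(b) == 0 || (b[0] == 0 && bytes.Equal(b[:len(b)-1], b[1:]))
}

// splitRuns classifies each full block in buf and merges neighbours of the
// same kind. A caller would issue one hole punch per zero run and one
// copy-and-mark-Dirty per non-zero run. Trailing bytes that do not fill a
// block are ignored here and would take the sub-block path.
func splitRuns(buf []byte, blockSize int) []run {
	var runs []run
	for i := 0; i < len(buf)/blockSize; i++ {
		z := isZero(buf[i*blockSize : (i+1)*blockSize])
		if k := len(runs); k > 0 && runs[k-1].Zero == z {
			runs[k-1].End = i + 1 // extend the current run
			continue
		}
		runs = append(runs, run{Zero: z, Start: i, End: i + 1})
	}
	return runs
}
```

A buffer shaped as zero, zero, data, zero, data yields four runs, so only the two data blocks would reach the diff.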
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 7a99339.
Strip verbose doc comments and inline narration across the NBD/cache diff-shrinking change. Drop redundant dispatch tests (kept the two that actually exercise WRITE_ZEROES routing) and simplify mockProvider to the surface the remaining tests use. Sync diff.go/diff_test.go/metadata.go with #2547 so the duplicated content matches once that PR rebases in. No behaviour changes.
Drops the *os.File the cache was holding for the lifetime of the sandbox. MADV_REMOVE is the in-kernel equivalent of fallocate(PUNCH_HOLE | KEEP_SIZE) issued through the mmap, so it frees both the file extent and the tmpfs pages, and reads after it fault back as zero — same semantics, one fewer fd. Also lets NewCache go back to a plain `defer f.Close()` and removes the file-close branch from Close().
Three follow-ups on the discard work, all aimed at reducing snapshot diff size by suppressing metadata churn that doesn't carry user-visible state. Bundled in one PR for discussion; happy to split or drop individually.

- Guest rootfs mount: extend rootflags to discard,noatime,lazytime. noatime stops ext4's relatime default from touching the inode-table block ~once per day per accessed inode. lazytime keeps mtime/ctime updates in memory and persists them at sync (which the snapshot path already triggers), so repeated touches collapse into one inode write.
- Build-time loop mount: add noatime,lazytime alongside the existing loop,discard so the build phase doesn't dirty inode-table blocks just because we walked or read a directory tree.
- mkfs.ext4: pass -E lazy_itable_init=1,lazy_journal_init=1,discard. The first two skip eager zero-fill of the inode table and journal at format time, so the freshly-created rootfs.ext4 file stays maximally sparse on the host. The kernel's ext4lazyinit thread fills them on first mount, where our cache's IsZero path now routes the writes to fallocate(PUNCH_HOLE) instead of growing the diff. `discard` falls back gracefully on regular files.
Comment (Member, Author): Closing for now; will reopen after measuring.
Three small ext4 knobs aimed at the same goal as #2546 (keeping snapshot diff size small), but separated here so we can decide each on its own. The research summary rates all three as "expected positive value, no real downside"; none are slam-dunks until measured on a real workload.
Changes

1. Guest rootfs mount: `rootflags = discard,noatime,lazytime`

ext4's `relatime` default still touches the inode-table block roughly once per day per accessed inode. `noatime` removes that. `lazytime` keeps mtime/ctime updates in memory and only persists them at sync, which the snapshot path already triggers, so repeated touches of the same inode collapse into one inode-table write at snapshot time instead of N.

2. Build-time host loop mount: `loop,discard,noatime,lazytime`

Same reasoning as (1), applied during template construction so directory walks in the build phase don't dirty inode-table blocks in the template.

3. `mkfs.ext4 -E lazy_itable_init=1,lazy_journal_init=1,discard`

The two `lazy_*` options skip eager zero-fill of the inode table and journal at format time, so the freshly-created `rootfs.ext4` file stays maximally sparse on the host. The kernel's `ext4lazyinit` thread fills them on first mount, and with the cache changes from #2546, those zero writes now route through `fallocate(PUNCH_HOLE)` instead of growing the diff. `discard` is included because mke2fs handles it gracefully on regular files (no-op or `PUNCH_HOLE` fallback).

What's not in here

`inline_data`, `data=writeback`, `bigalloc`, `commit=N`, `journal_async_commit`: all "measure-first" or risky for ephemeral sandboxes. Happy to add separately if anything looks promising in benchmarks.

Stacked on #2546 because the changes share the `rootflags` line and the `mkfs` lazy-init story only pays off once the cache layer punches zero writes.
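As a rough picture of where the three knobs end up, the sketch below assembles the option strings in Go. The function names and the idea of building them from slices are assumptions for illustration; only the option values themselves come from this PR description.

```go
package main

import "strings"

// guestRootFlags returns the kernel command-line fragment for the guest
// rootfs mount (hypothetical helper).
func guestRootFlags() string {
	return "rootflags=" + strings.Join([]string{"discard", "noatime", "lazytime"}, ",")
}

// buildMountOpts returns the -o value for the build-time host loop mount
// (hypothetical helper).
func buildMountOpts() string {
	return strings.Join([]string{"loop", "discard", "noatime", "lazytime"}, ",")
}

// mkfsArgs returns the mkfs.ext4 arguments that enable lazy inode-table and
// journal initialization plus discard for the given image path
// (hypothetical helper; a real invocation may pass further arguments).
func mkfsArgs(image string) []string {
	ext := []string{"lazy_itable_init=1", "lazy_journal_init=1", "discard"}
	return []string{"-E", strings.Join(ext, ","), image}
}
```

Keeping each option list in one place would make it easy to drop a single knob if measurement shows it does not help, which matches the intent to decide each change on its own.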