experiment(ext4): noatime+lazytime mounts and lazy mkfs to shrink diffs#2549

Closed
ValentaTomas wants to merge 9 commits into main from experiment/ext4-diff-shrink-options

@ValentaTomas
Member

Three small ext4 knobs aimed at the same goal as #2546 (keeping snapshot diff size small), but separated here so we can decide each on its own. The research summary rates all three as "expected positive value, no real downside", but none is a slam-dunk until measured on a real workload.

Changes

1. Guest rootfs mount — rootflags = discard,noatime,lazytime
ext4's relatime default still touches the inode-table block roughly once per day per accessed inode. noatime removes that. lazytime keeps mtime/ctime updates in memory and only persists them at sync, which the snapshot path already triggers — so repeated touches of the same inode collapse into one inode-table write at snapshot time instead of N.

2. Build-time host loop mount — loop,discard,noatime,lazytime
Same reasoning as (1), applied during template construction so directory walks in the build phase don't dirty inode-table blocks in the template.

3. mkfs.ext4 -E lazy_itable_init=1,lazy_journal_init=1,discard
The two lazy_* options skip eager zero-fill of the inode table and journal at format time, so the freshly-created rootfs.ext4 file stays maximally sparse on the host. The kernel's ext4lazyinit thread fills them on first mount — and with the cache changes from #2546, those zero writes now route through fallocate(PUNCH_HOLE) instead of growing the diff. discard is included because mke2fs handles it gracefully on regular files (no-op or PUNCH_HOLE fallback).

What's not in here

inline_data, data=writeback, bigalloc, commit=N, journal_async_commit — all "measure-first" or risky for ephemeral sandboxes. Happy to add separately if anything looks promising in benchmarks.

Stacked on #2546 because the changes share the rootflags line and the mkfs lazy-init story only pays off once the cache layer punches zero writes.

A small two-state-plus-default tracker backed by roaring bitmaps. Used by
upcoming UFFD work to track page states (Missing/Faulted/Removed) and by
NBD to track zero pages, replacing ad-hoc map-based trackers with O(1)
range ops and cheap snapshot exports.
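
To make the "two-state-plus-default" shape concrete, here is a minimal sketch of such a tracker. It is an assumption-laden stand-in: it uses a plain `[]uint64` bitset from the standard library instead of roaring bitmaps, so range updates here cost time proportional to the range length, and apart from the names `NewStateTracker` and `SetRange` (which appear in the commit messages) every method and field is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// bitset is a simple stand-in for a roaring bitmap.
type bitset []uint64

func (b *bitset) set(i uint64, on bool) {
	w := i / 64
	if !on {
		if w < uint64(len(*b)) {
			(*b)[w] &^= uint64(1) << (i % 64)
		}
		return
	}
	for uint64(len(*b)) <= w {
		*b = append(*b, 0)
	}
	(*b)[w] |= uint64(1) << (i % 64)
}

func (b bitset) has(i uint64) bool {
	w := i / 64
	return w < uint64(len(b)) && b[w]&(uint64(1)<<(i%64)) != 0
}

// StateTracker stores two explicit states in two bitsets; an index present
// in neither bitset is in the default state.
type StateTracker[S comparable] struct {
	def, a, b    S
	bitsA, bitsB bitset
}

// NewStateTracker returns an error (rather than panicking) when the three
// states are not distinct.
func NewStateTracker[S comparable](def, a, b S) (*StateTracker[S], error) {
	if def == a || def == b || a == b {
		return nil, errors.New("states must be distinct")
	}
	return &StateTracker[S]{def: def, a: a, b: b}, nil
}

// SetRange moves every index in [start, end) to state s.
func (t *StateTracker[S]) SetRange(start, end uint64, s S) error {
	if s != t.def && s != t.a && s != t.b {
		return fmt.Errorf("unsupported state %v", s)
	}
	for i := start; i < end; i++ {
		t.bitsA.set(i, s == t.a)
		t.bitsB.set(i, s == t.b)
	}
	return nil
}

// Get reports the state of index i.
func (t *StateTracker[S]) Get(i uint64) S {
	switch {
	case t.bitsA.has(i):
		return t.a
	case t.bitsB.has(i):
		return t.b
	default:
		return t.def
	}
}

func main() {
	tr, err := NewStateTracker("missing", "faulted", "removed")
	if err != nil {
		panic(err)
	}
	if err := tr.SetRange(2, 5, "faulted"); err != nil {
		panic(err)
	}
	fmt.Println(tr.Get(1), tr.Get(3))
}
```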
…state

Replace the map-based pageTracker with block.StateTracker[pageState], a
roaring-bitmap-backed tracker with O(1) range ops. pageState gains a
third value, removed, which is wired at the type level but not yet
written anywhere -- #2520 adds the REMOVE-event handler that produces
it. Page indices are computed at the call site via header.BlockIdx.
pageStateEntries is updated to iterate the exported bitmaps so the
cross-process test harness keeps working.

Inline the 3-line pageState enum into userfaultfd.go and drop the
dedicated page_tracker.go now that pageTracker is gone.

Convert block.StateTracker's NewStateTracker / SetRange API from panics
to errors. Distinct-state validation and unsupported-state checks now
return fmt.Errorf descriptors; the userfaultfd-side init propagates the
constructor error through NewUserfaultfdFromFd, and the SetRange call
in the worker path logs and continues since these errors only fire on
programming bugs.
ext4 in the guest hands the orchestrator three classes of
"this region reads as zero" hints. Until now we serialized all
of them as actual zero bytes in the snapshot diff:

  1. NBD_CMD_TRIM (not advertised → never sent; ext4 silently skips it).
  2. NBD_CMD_WRITE_ZEROES (same).
  3. Plain NBD_CMD_WRITE of an all-zero buffer (e.g. dd if=/dev/zero,
     scratch wipes, qcow2-style preallocation by user code).

This commit adds the host-side machinery to recognize all three and
record them as Empty in the diff (mapped to uuid.Nil — already handled
by the read path) instead of copying zero payload.

Pieces:

- block.Cache: replace the single dirty bitset with StateTracker tracking
  Untouched / Dirty / Zero, and add WriteZeroesAt that clears the mmap
  for fully-covered blocks and marks them Empty. Sub-block tails stay
  Dirty so partial overwrites of a zeroed block are preserved.
  ExportToDiff now populates DiffMetadata.Empty alongside Dirty.

- block.IsZero: 3-byte sample short-circuit + bytes.Equal self-shift
  trick (mirrors qemu's buffer_is_zero). Hot-path on every guest write,
  so it dispatches to the runtime's SIMD memequal on amd64/arm64.

- nbd.Dispatch: implement NBD_CMD_WRITE_ZEROES (opcode 6), wire TRIM to
  the same handler (TRIM is advisory — guaranteeing zero-on-read lets
  the diff drop those blocks), and short-circuit zero-buffer cmdWrite
  through the same path.

- nbd.DirectPathMount: advertise FlagSendTrim and FlagSendWriteZeroes
  so the guest kernel actually emits the optimized opcodes.

- fc/process: rootflags=discard so ext4 emits TRIM on freed blocks.

- StateTracker.HasRange: cheap union-coverage query needed by Cache to
  preserve isCached semantics across the two non-default bitmaps.

Tests cover the new cache primitive (aligned vs unaligned WriteZeroesAt)
and the dispatch routing (zero-write detection, WRITE_ZEROES, TRIM,
backend-error response).
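
The zero check described in the block.IsZero bullet can be sketched as follows. This is a minimal illustration rather than the PR's code: the function name `isZero` is hypothetical, and sampling the first, middle, and last bytes is an assumption about which three bytes are probed.

```go
package main

import (
	"bytes"
	"fmt"
)

// isZero reports whether b contains only zero bytes.
func isZero(b []byte) bool {
	n := len(b)
	if n == 0 {
		return true
	}
	// Cheap sample: most non-zero buffers are rejected here without a scan.
	// The choice of first/middle/last is an assumption of this sketch.
	if b[0] != 0 || b[n/2] != 0 || b[n-1] != 0 {
		return false
	}
	// Self-shift trick: b[0] == 0 and b[i] == b[i+1] for every i implies
	// that all bytes are zero. bytes.Equal does the comparison in one call.
	return bytes.Equal(b[:n-1], b[1:])
}

func main() {
	zero := make([]byte, 4096)
	dirty := make([]byte, 4096)
	dirty[17] = 1 // missed by the three samples, caught by the full compare

	fmt.Println(isZero(zero), isZero(dirty))
}
```

The self-shift comparison needs no preallocated zero buffer, which matters when the same helper is used for both 4 KiB blocks and 2 MiB hugepages.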
…card

Two follow-ups uncovered while reviewing the diff path:

- header.IsEmptyBlock did a full bytes.Equal against a pre-allocated
  zero buffer — fine for 4 KiB rootfs blocks, wasteful for 2 MiB
  hugepages where the qemu-style 3-byte sample would reject most
  non-zero pages from a single cache line. Move IsZero to the shared
  header package and have IsEmptyBlock delegate. The caller (uffd
  diff materialization via DiffMetadataBuilder) now skips the memcmp
  on the common non-zero path. Drop the orchestrator-side duplicate
  and the now-unused EmptyBlock global; EmptyHugePage stays because
  build.go still needs the literal zero buffer for nil-build reads.

- ext4 host loop mount (template build) had no discard. Adding
  loop,discard makes the loop driver translate BLKDISCARD into
  fallocate(PUNCH_HOLE) on the backing rootfs.ext4, so any deletions
  during the build phase keep the template file sparse on disk.
Two related fixes triggered by the snapshot-shrink work.

cache.WriteZeroesAt now fallocate(PUNCH_HOLE | KEEP_SIZE)s the aligned
core instead of memset-zeroing the mmap. Reads still return zero (the
mmap fault serves zero from the hole), but the underlying cache file
releases the previously-allocated pages immediately. Important because
the cache lives on tmpfs in production — punched pages free RAM, not
just disk. Sub-block head/tail still go through the clear-and-mark-dirty
path because punching a partial block would corrupt the neighbouring
half. The cache now keeps the file open for the lifetime of the cache
so we don't pay an open/close on every WriteZeroesAt.
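
The head/core/tail split described above comes down to rounding the start of the range up and its end down to a block boundary. A minimal sketch of that arithmetic, with a hypothetical helper name (`alignedCore`) and assuming non-negative offsets:

```go
package main

import "fmt"

// alignedCore returns the block-aligned sub-range [coreStart, coreEnd) that
// is fully covered by [off, off+length). ok is false when the range does not
// cover any whole block.
func alignedCore(off, length, blockSize int64) (coreStart, coreEnd int64, ok bool) {
	end := off + length
	coreStart = (off + blockSize - 1) / blockSize * blockSize // round up
	coreEnd = end / blockSize * blockSize                     // round down
	if coreStart >= coreEnd {
		return 0, 0, false
	}
	return coreStart, coreEnd, true
}

func main() {
	const blockSize = 4096
	off, length := int64(1000), int64(10000)

	start, end, ok := alignedCore(off, length, blockSize)
	if !ok {
		fmt.Println("no full block covered: clear and mark Dirty only")
		return
	}
	fmt.Printf("head [%d, %d): clear and mark Dirty\n", off, start)
	fmt.Printf("core [%d, %d): punch hole and mark Empty\n", start, end)
	fmt.Printf("tail [%d, %d): clear and mark Dirty\n", end, off+length)
}
```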

The NBD wire layout puts a uint16 flags and a uint16 type in the four
bytes after magic, but the dispatch was reading them together as one
uint32. The switch only matched while the kernel chose flags=0; the
moment it set NBD_CMD_FLAG_FUA on a sync write or NBD_CMD_FLAG_NO_HOLE
on a WRITE_ZEROES the dispatch fell through to the default error path.
Split Request.Flags from Request.Type and document that we ignore every
command flag for now (NO_HOLE is moot — shrinking the diff is the
point).

Tests cover the NO_HOLE-flagged WRITE_ZEROES route (would have been a
silent regression with the old parser).
…ch short-circuit

Move the IsZero fast-path from the NBD dispatcher into block.Cache so it
can split sub-block runs:

- Cache.WriteAtWithoutLock now scans aligned full blocks per-block. All-
  zero blocks are punched (FALLOC_FL_PUNCH_HOLE) and marked Empty;
  non-zero blocks copy + mark Dirty. Contiguous zero runs coalesce into
  one fallocate call. A non-zero buffer with zero padding inside (the
  common shape for qcow2 preallocation, scratch wipes, format-then-write
  patterns) now contributes only the non-zero sub-blocks to the diff.
- WriteZeroesAt and WriteAt share a single punchHole helper and use
  header.BlockCeilIdx/BlockIdx for alignment instead of local helpers.
- nbd/dispatch cmdWrite drops the buffer-wide IsZero check (and its
  header import); zero-detection is now the cache's responsibility for
  every writer (NBD, copyProcessMemory if it ever wants it, etc.).
- Tests updated: the three "zero block written as Dirty" tests added in
  #2386 documented the OLD behaviour and now flip — a zero block written
  through WriteAt routes to Empty and maps to uuid.Nil. New mixed-buffer
  test covers the sub-block split case explicitly.
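
The per-block scan with run coalescing can be sketched as below. This is an illustration under stated assumptions, not the cache implementation: `splitRuns`, `run`, and `allZero` are hypothetical names, the buffer is assumed to start on a block boundary, any trailing partial block is left to the caller, and non-zero runs are coalesced too for symmetry.

```go
package main

import "fmt"

// run is a contiguous span of whole blocks that are either all zero or not.
type run struct {
	Off, Len int
	Zero     bool
}

// allZero is a plain byte scan; a real hot path would use a faster check.
func allZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

// splitRuns classifies each full block of buf and merges neighbouring blocks
// of the same kind, so a zero run maps to one punch-hole call and a non-zero
// run to one copy.
func splitRuns(buf []byte, blockSize int) []run {
	var runs []run
	for off := 0; off+blockSize <= len(buf); off += blockSize {
		zero := allZero(buf[off : off+blockSize])
		if n := len(runs); n > 0 && runs[n-1].Zero == zero {
			runs[n-1].Len += blockSize
			continue
		}
		runs = append(runs, run{Off: off, Len: blockSize, Zero: zero})
	}
	return runs
}

func main() {
	const blockSize = 4
	buf := make([]byte, 4*blockSize)
	buf[blockSize] = 1 // only the second block carries data

	for _, r := range splitRuns(buf, blockSize) {
		action := "copy and mark Dirty"
		if r.Zero {
			action = "punch hole and mark Empty"
		}
		fmt.Printf("[%d, %d): %s\n", r.Off, r.Off+r.Len, action)
	}
}
```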
@cursor

cursor Bot commented May 3, 2026

PR Summary

Medium Risk
Changes ext4 formatting and mount flags for both the guest root filesystem and host loop mounts, which can affect filesystem writeback behavior and snapshot consistency/performance. While intended to reduce diff size, it needs validation under real workloads and kernel versions.

Overview
Adjusts ext4 handling to reduce snapshot diff growth by enabling lazy zero-initialization during mkfs.ext4 and mounting the root filesystem with noatime and lazytime (in addition to discard) both in the Firecracker guest kernel rootflags and during template build loop mounts.

Reviewed by Cursor Bugbot for commit 7a99339.

Strip verbose doc comments and inline narration across the NBD/cache
diff-shrinking change. Drop redundant dispatch tests (kept the two that
actually exercise WRITE_ZEROES routing) and simplify mockProvider to
the surface the remaining tests use. Sync diff.go/diff_test.go/metadata.go
with #2547 so the duplicated content matches once that PR rebases in.

No behaviour changes.
Drops the *os.File the cache was holding for the lifetime of the
sandbox. MADV_REMOVE is the in-kernel equivalent of fallocate(PUNCH_HOLE
| KEEP_SIZE) issued through the mmap, so it frees both the file extent
and the tmpfs pages, and reads after it fault back as zero — same
semantics, one fewer fd.

Also lets NewCache go back to a plain `defer f.Close()` and removes the
file-close branch from Close().
Three follow-ups on the discard work, all aimed at reducing snapshot diff
size by suppressing metadata churn that doesn't carry user-visible state.
Bundled in one PR for discussion — happy to split or drop individually.

- Guest rootfs mount: extend rootflags to discard,noatime,lazytime.
  noatime stops ext4's relatime default from touching the inode-table
  block ~once per day per accessed inode. lazytime keeps mtime/ctime
  updates in memory and persists them at sync (which the snapshot path
  already triggers), so repeated touches collapse into one inode write.

- Build-time loop mount: add noatime,lazytime alongside the existing
  loop,discard so the build phase doesn't dirty inode-table blocks just
  because we walked or read a directory tree.

- mkfs.ext4: pass -E lazy_itable_init=1,lazy_journal_init=1,discard.
  The first two skip eager zero-fill of the inode table and journal at
  format time, so the freshly-created rootfs.ext4 file stays maximally
  sparse on the host. The kernel's ext4lazyinit thread fills them on
  first mount, where our cache's IsZero path now routes the writes to
  fallocate(PUNCH_HOLE) instead of growing the diff. `discard` falls
  back gracefully on regular files.
@ValentaTomas ValentaTomas force-pushed the experiment/ext4-diff-shrink-options branch from 074035b to 7a99339 Compare May 3, 2026 23:00
@ValentaTomas ValentaTomas force-pushed the feat/nbd-zero-discard-detection branch 2 times, most recently from 64f075d to 645e279 Compare May 5, 2026 00:36
@ValentaTomas ValentaTomas force-pushed the feat/nbd-zero-discard-detection branch 14 times, most recently from 4f8d03a to deaf64a Compare May 5, 2026 09:27
@ValentaTomas ValentaTomas force-pushed the feat/nbd-zero-discard-detection branch 8 times, most recently from 11438b0 to 88beca9 Compare May 5, 2026 10:46
Base automatically changed from feat/nbd-zero-discard-detection to main May 5, 2026 22:35
@ValentaTomas
Member Author

Closing for now; will reopen after measuring.

@ValentaTomas ValentaTomas deleted the experiment/ext4-diff-shrink-options branch May 6, 2026 08:09