Skip to content

[1.14] Implement virtio discard and write zeroes#13

Open
kalyazin wants to merge 19 commits intoe2b-dev:firecracker-v1.14-direct-memfrom
kalyazin:feat/virtio-blk-discard-wz-1.14
Open

[1.14] Implement virtio discard and write zeroes#13
kalyazin wants to merge 19 commits intoe2b-dev:firecracker-v1.14-direct-memfrom
kalyazin:feat/virtio-blk-discard-wz-1.14

Conversation

@kalyazin
Copy link
Copy Markdown

@kalyazin kalyazin commented May 8, 2026

Changes

  • implement VIRTIO_BLK_F_DISCARD
  • implement VIRTIO_BLK_F_WRITE_ZEROES

The feature bits are advertised by Firecracker for rw block devices via calling:

  • discard -> fallocate(ALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
  • write zeroes (unmap) -> fallocate(ALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
  • write zeroes (!unmap) -> fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE)

If the underlying host filesystem doesn't support those (EOPNOTSUPP returned), Firecracker caches the knowledge and does not attempt to run them again.

No new Firecracker API is introduced (no opt-in is required to make use of the feature).

Note: snapshot rebuild is required to make use of the feature as the feature bits are negotiated at the time the block device is initialised (usually on boot).

Reason

Reduce host memory consumed by the drive.

kalyazin added 19 commits May 8, 2026 09:03
The virtio-blk spec defines a 56-byte config space with fields for
geometry, topology, writeback, and discard at fixed offsets. The
previous struct only held `capacity` (8 bytes), so the driver
never saw the fields beyond offset 8.

The full layout is required to advertise VIRTIO_BLK_F_DISCARD: the
spec mandates that `max_discard_sectors` (offset 40) and
`max_discard_seg` (offset 44) be set before the feature can be used
by the guest. Without the full struct, there is nowhere to write
those values. All new fields default to zero, so there is no
behavioral change in this commit.

A compile-time size assertion
(`assert!(size_of::<ConfigSpace>() == 56)`) guards against future
accidental padding changes. Tests are updated to use
`..Default::default()` for forward-compatible ConfigSpace init.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
The virtio-blk spec (section 5.2.6.14) defines VIRTIO_BLK_T_DISCARD
(type 11) as a request the guest driver issues to free sectors it no
longer needs. To handle it, the VMM needs three things introduced here:

- `RequestType::Discard` so the dispatcher can route the request type.
- `DiscardWriteZeroes` (16 bytes, ByteValued) matching the layout of a
  single discard segment that the guest places in the data descriptor.
- `DISCARD_SEGMENT_SIZE` constant (16) used in parse() to validate that
  the data buffer length is a non-zero multiple of the segment size.
- `IoErr::InvalidFlags` / `IoErr::InvalidOffset` for per-segment
  validation errors added in the processing step.

Placeholder arms in `finish()` and `process()` keep the build green;
they are replaced with the real implementation in the next commit.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add SyncIoError::Discard, SyncFileEngine::discard() and
AsyncFileEngine::discard() (synchronous fallback — discard is not
latency-critical and IORING_OP_FALLOCATE is not in the io_uring
allowlist). Wire up FileEngine::discard() dispatching to both engines.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Fix 2-space indented getrandom entries (should be 4 spaces like all
other entries) and remove trailing whitespace from madvise comment.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
VIRTIO_BLK_F_DISCARD issues fallocate(FALLOC_FL_PUNCH_HOLE) on the
backing file. Add fallocate to the vmm thread seccomp filter on both
x86_64 and aarch64 so the syscall is not blocked at runtime.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add parse() validation: discard data descriptor must be read-only and
data_len must be a non-zero multiple of DISCARD_SEGMENT_SIZE (16 bytes).

Add process_discard_segments(): validates flags==0, num_sectors>0, and
sector bounds per segment, then calls FileEngine::discard() using
fallocate(FALLOC_FL_PUNCH_HOLE). Returns VIRTIO_BLK_S_IOERR on any
validation or syscall failure; VIRTIO_BLK_S_OK on success.

Add discard_count metric incremented on each successful discard.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Populate max_discard_sectors and max_discard_seg in restore() to match
what new() produces for the expanded ConfigSpace. No version bump needed
— main already carried 9.0.0 → 10.0.0 in this release cycle.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Set VIRTIO_BLK_F_DISCARD in avail_features for all writable block
devices. Populate max_discard_sectors (capped to disk size) and
max_discard_seg=1 in the config space. Read-only devices advertise
neither the feature nor non-zero discard config fields.

This is a behavioral change for existing writable block device
configurations: guests will now see and may negotiate the discard
feature. See docs/api_requests/block-discard.md for host filesystem
requirements and recommendations.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Explain how discard works, host requirements, limitations, and snapshot
compatibility impact.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Test feature flag advertisement, config space discard fields,
and end-to-end discard request processing (success and error paths).
Update test_virtio_features and test_virtio_read_config to account
for the new VIRTIO_BLK_F_DISCARD bit and config space fields.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests:
- test_discard: boot VM with writable ext4 disk, write+delete data,
  run fstrim, verify rc=0 and discard_count metric increments.
- test_discard_not_advertised_for_read_only: verify read-only devices
  report discard_max_bytes=0 in sysfs (feature not advertised).
Both parametrized over Sync and Async IO engines.
Also add discard_count to the fcmetrics block_metrics list.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Replace the 4-byte _unused1 padding at offset 52 with the virtio-blk
write-zeroes fields:
  - max_write_zeroes_sectors  (u32, offset 52)
  - max_write_zeroes_seg      (u32, offset 56)
  - write_zeroes_may_unmap    (u8,  offset 60)
  - _unused1                  (u8;3, offset 61, padding to 64)

Bump the compile-time size assertion to 64. Fields default to zero;
no behavioural change. They will be populated when
VIRTIO_BLK_F_WRITE_ZEROES is wired up.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add the RequestType::WriteZeroes variant and From<u32> mapping for
VIRTIO_BLK_T_WRITE_ZEROES. Add IoErr::WriteZeroesUnsupported and
Status::WriteZeroesUnsupported (parallel to the discard equivalents)
that return VIRTIO_BLK_S_UNSUPP silently when the host filesystem
doesn't support the operation. Add a write_zeroes_unsupported
EOPNOTSUPP cache on DiskProperties (not persisted).

Add placeholder arms in PendingRequest::finish() and Request::process()
so the code compiles. Both currently return WriteZeroesUnsupported;
they will be replaced when the full feature is wired up.

Update the test-only From<RequestType> for u32 and request_type_flags()
helpers to handle the new variant.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add write_zeroes(offset, len, unmap, req) on FileEngine plus the
corresponding SyncFileEngine::write_zeroes() and
AsyncFileEngine::push_write_zeroes() implementations.

The unmap flag selects the fallocate mode at call time:
  - unmap=true:  FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE (mode 3)
  - unmap=false: FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE (mode 17)

The async path uses io_uring's IORING_OP_FALLOCATE (already in the
allowlist from the discard work). The sync path calls libc::fallocate.

Add SyncIoError::WriteZeroes and extend BlockIoError::is_eopnotsupp()
to also match SyncIoError::WriteZeroes so the EOPNOTSUPP fallback
machinery (added later) catches both modes.

No seccomp / io_uring restriction changes are required: fallocate is
already in the seccomp filter and IORING_OP_FALLOCATE is already
allowed via the discard work.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Wire up Request::parse() validation and Request::process() handling
for write-zeroes requests, mirroring the discard path:

  - parse(): reject write-only data descriptors and require data_len
    to be a non-zero multiple of DISCARD_SEGMENT_SIZE (the
    DiscardWriteZeroes struct is shared between both ops).
  - parse_write_zeroes_segment(): validate flags (only bit 0,
    VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, is permitted), reject
    num_sectors=0, check sector range, and return the unmap flag
    so process() can pick the fallocate mode.
  - process(): short-circuit to VIRTIO_BLK_S_UNSUPP if a previous
    EOPNOTSUPP set disk.write_zeroes_unsupported; otherwise dispatch
    to FileEngine::write_zeroes().
  - the (Ok(_), RequestType::WriteZeroes) finish() arm now increments
    the write_zeroes_count metric and returns VIRTIO_BLK_S_OK.
  - the EOPNOTSUPP detection in the error path also handles
    RequestType::WriteZeroes, setting the cache and returning
    Status::WriteZeroesUnsupported (silent UNSUPP).

Add the write_zeroes_count metric and aggregate() line.

The feature flag is not yet advertised — that lands in the next
commit, after the request handling is fully in place.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
…vices

Set VIRTIO_BLK_F_WRITE_ZEROES in avail_features for any non-read-only
block device, alongside VIRTIO_BLK_F_DISCARD. Populate the
max_write_zeroes_sectors, max_write_zeroes_seg, and
write_zeroes_may_unmap config fields in both VirtioBlock::new() and
the persist::restore() path so the values match what the guest sees
on a fresh boot vs after a snapshot restore.

write_zeroes_may_unmap=1 lets the guest set the UNMAP flag on
individual segments, which we then translate to fallocate's
PUNCH_HOLE mode (UNMAP=0 uses ZERO_RANGE).

Update the test_virtio_features and test_virtio_read_config
expectations to account for the new feature bit and config fields.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add three tests parametrised over Sync and Async file engines:

  - test_write_zeroes_feature_and_config: VIRTIO_BLK_F_WRITE_ZEROES
    advertised on writable devices; max_write_zeroes_sectors,
    max_write_zeroes_seg, and write_zeroes_may_unmap populated as
    expected; read-only devices advertise none of these.
  - test_write_zeroes: end-to-end happy path (UNMAP=1 → PUNCH_HOLE,
    metric incremented, VIRTIO_BLK_S_OK) plus four error cases
    (reserved flag bit set, sector range out of bounds, num_sectors=0,
    invalid data length).
  - test_write_zeroes_unsupported_cached: pre-set the cache flag and
    verify subsequent requests return VIRTIO_BLK_S_UNSUPP without
    incrementing the success or invalid-request counters.

The UNMAP=0 (ZERO_RANGE) happy path is not asserted in the unit
tests because ZERO_RANGE support is filesystem-dependent (the host's
tmpfs in some kernels returns EOPNOTSUPP); end-to-end UNMAP=0
correctness is left to the integration tests running on ext4. The
unit tests cover wire-flag handling for UNMAP=0 via the
invalid-flags and zero-sectors cases.

Also extend test_request_type_from to assert the new
VIRTIO_BLK_T_WRITE_ZEROES → RequestType::WriteZeroes mapping.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests, parametrised over Sync and Async IO engines:

  - test_write_zeroes: boot VM with a writable block device, write
    random data to a 1 MiB region of /dev/vdb, run `blkdiscard -z`
    (BLKZEROOUT ioctl), then verify:
      * sysfs /queue/write_zeroes_max_bytes is non-zero (feature
        negotiated)
      * the region reads back as zeros after the operation
      * no requests failed
  - test_write_zeroes_not_advertised_for_read_only: read-only device
    reports write_zeroes_max_bytes=0 in sysfs.

The test deliberately does not assert that the write_zeroes_count
metric incremented. The Linux kernel's blkdev_issue_zeroout() may
legitimately fall back to issuing plain zero-page writes
(REQ_OP_WRITE) instead of REQ_OP_WRITE_ZEROES even when the feature
is advertised, depending on internal heuristics. Both paths leave
the device reading as zeros, so a metric assertion would be flaky in
CI without indicating any actual bug. Direct WRITE_ZEROES
request-handling correctness is covered by the unit tests.

Use cmp -n /dev/vdb /dev/zero for the zero check (od collapses
repeated lines with *, which makes naive byte-comparison fragile).

Add write_zeroes_count to host_tools/fcmetrics.py so the metrics
schema validation passes.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add docs/api_requests/block-write-zeroes.md describing:
  - automatic advertisement on writable devices
  - UNMAP=0 → FALLOC_FL_ZERO_RANGE (zeros in place)
  - UNMAP=1 → FALLOC_FL_PUNCH_HOLE (zeros + deallocate)
  - host filesystem requirements
  - EOPNOTSUPP fallback (silent VIRTIO_BLK_S_UNSUPP, shared cache)
  - known limitations

Remove the "write_zeroes is not supported" line from block-discard.md
now that the feature is implemented.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@cla-bot cla-bot Bot added the cla-signed label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant