Value separation (WiscKey, GlideFS-tuned) + hardening #3
Merged
Large values (>= 128 KiB = one GlideFS block) are written to a
content-addressed, refcounted blob store (values/blob-{blake3-128});
the log record carries only the 16-byte hash. Because compaction then
relocates only the tiny pointer and never the value, write
amplification on GlideFS (where dedup is offset-keyed, so relocation =
re-upload) collapses to ~1x for large values. Small values stay inline.
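A minimal sketch of the routing decision described above, not the engine's actual code. The names `RecordValue`, `route_value`, and `blob_path` are hypothetical, the hex encoding of the hash in the file name is an assumption, and `hash16` stands in for the 128-bit BLAKE3 digest so the sketch needs no external crate:

```rust
/// One GlideFS block, the separation threshold stated above.
const SEPARATION_THRESHOLD: usize = 128 * 1024;

/// Hypothetical record payload: the value itself, or a 16-byte pointer to a blob.
#[derive(Debug, PartialEq)]
enum RecordValue {
    Inline(Vec<u8>),
    BlobPointer([u8; 16]),
}

/// Decide how a value is stored. `hash16` is a stand-in for the truncated
/// BLAKE3 hash; the real engine's signature may differ.
fn route_value(value: &[u8], hash16: impl Fn(&[u8]) -> [u8; 16]) -> RecordValue {
    if value.len() >= SEPARATION_THRESHOLD {
        // Large value: only the content hash goes into the log record.
        RecordValue::BlobPointer(hash16(value))
    } else {
        // Small value: stays inline in the log record.
        RecordValue::Inline(value.to_vec())
    }
}

/// Blob path relative to the store root, following `values/blob-{hash}`.
/// Lowercase hex is an assumption about the on-disk naming.
fn blob_path(hash: &[u8; 16]) -> String {
    let hex: String = hash.iter().map(|b| format!("{b:02x}")).collect();
    format!("values/blob-{hex}")
}
```

Compaction in this model only ever copies a `RecordValue`, so for a large value it moves 16 bytes of pointer rather than the value itself.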
Correctness hardening shipped with it:
- Blob durability ordering: blob written + fsync'd (data + dir) BEFORE
the pointer record; deletes deferred past the next log fsync
(collect_garbage) so a crash can never leave a dangling pointer.
- Per-key stripe locks make CAS atomic (check-before-append, no orphan).
- Per-content file-op locks serialize a blob's create vs GC unlink.
- Orphan reclamation (sweep_orphans) + blob read integrity (re-hash vs
content hash on every read).
- Reclaim drains in-flight writes before sealing so the footer is a
consistent snapshot (no acked write lost on footer recovery); writes
wait on reclaim instead of erroring.
- Revision (tstamp) seeded from max recovered across restart (monotonic).
- file_id widened u16 -> u32 (IndexEntry stays 24 bytes).
- Namespace cap re-checked after the open await (no concurrent overshoot).
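The blob durability ordering from the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: `write_blob_durably`, `put_large`, and `hex` are hypothetical names, the pointer record is reduced to the bare 16-byte hash, refcounting and dedup of an existing blob are omitted, and the directory fsync via `File::open(dir)` assumes a Unix-like platform.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Lowercase-hex helper (the naming scheme is an assumption).
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Write a blob and make both its data and its directory entry durable.
fn write_blob_durably(dir: &Path, name: &str, data: &[u8]) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut file = File::create(dir.join(name))?;
    file.write_all(data)?;
    file.sync_all()?; // fsync the blob's data and metadata
    File::open(dir)?.sync_all()?; // fsync the directory entry (Unix-like only)
    Ok(())
}

/// Store a large value: blob first, pointer record second.
fn put_large(dir: &Path, log: &mut File, hash: [u8; 16], data: &[u8]) -> io::Result<()> {
    let name = format!("blob-{}", hex(&hash));
    // Step 1: the blob must be durable before anything points at it.
    write_blob_durably(dir, &name, data)?;
    // Step 2: only now append the pointer record (here just the hash).
    log.write_all(&hash)?;
    log.sync_all()
}
```

If the process crashes between the two steps, the result is an unreferenced blob (the case an orphan sweep handles), never a pointer to a missing blob.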
Tests (teeth-verified where a regression must fail the test):
- Exhaustive crash-consistency suite (tail truncation at every offset,
bit-rot, torn-footer scan fallback across files).
- GC race, reclaim drain, revision monotonicity, blob integrity,
value-sep watch replay, ENOSPC write-poison, EMFILE graceful
degradation, namespace-cap concurrency.
ARCHITECTURE.md updated to describe the storage format and the why.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- value_store: surface blob-directory fsync failures (warn) instead of swallowing them; a silent fsync error hid a non-durable directory entry.
- log: checked millis->u64 conversion (try_from + saturating warn) replacing an unchecked `as` cast that could truncate/overflow.
- watch: bound the subscription registry (MAX_TOTAL_SUBSCRIPTIONS); subscribe_* now return Result so a client cannot grow per-shard memory without limit.
- tests/writeamp: measure compaction write amplification (value-sep vs inline baseline) via the compaction_bytes counter: 2224x fewer relocated bytes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
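For the checked millis->u64 conversion in the log bullet, a minimal sketch of the pattern. It assumes the source value is the `u128` returned by `Duration::as_millis`; the function name `millis_u64` and the returned saturation flag are hypothetical, and the real code may log the warning differently:

```rust
use std::time::Duration;

/// Convert a duration to whole milliseconds as `u64`.
/// Saturates at `u64::MAX` instead of silently truncating like `as u64` would.
/// The boolean reports whether saturation happened, so the caller can warn.
fn millis_u64(d: Duration) -> (u64, bool) {
    match u64::try_from(d.as_millis()) {
        Ok(ms) => (ms, false),
        Err(_) => (u64::MAX, true),
    }
}
```

By contrast, an unchecked `d.as_millis() as u64` keeps only the low 64 bits, which can turn a very large duration into a small, plausible-looking timestamp.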
What
Large values (≥ 128 KiB = one GlideFS block) go to a content-addressed, refcounted blob store (values/blob-{blake3-128}); the log record carries only the 16-byte hash. Compaction then relocates only the tiny pointer, never the value.
Why
GlideFS dedup is offset-keyed, so relocation = re-upload. Inline, a value is re-uploaded on every compaction; separated, it's uploaded once.
Measured (engine compaction_bytes counter, 60×32 KiB values, 15 compactions):
Verified on real GlideFS code (in-memory/local-fs S3): blob re-uploaded ~13× inline vs ~1× separated; forks share via CoW (fork adds 131 KB, not a 2 MB copy); end-to-end on a real GlideFS ext4 device, 40 blobs stayed byte-identical (inode+mtime) across 5 KV compactions → 0 re-uploads.
Hardening included
mise run format/check:rs (clippy -D warnings) / cargo test --lib all green locally.
🤖 Generated with Claude Code