Value separation (WiscKey, GlideFS-tuned) + hardening #3

Merged: jaredLunde merged 2 commits into main from jared/redesign on Jun 3, 2026
@jaredLunde

What

Large values (≥ 128 KiB = one GlideFS block) go to a content-addressed, refcounted blob store (values/blob-{blake3-128}); the log record carries only the 16-byte hash. Compaction then relocates only the tiny pointer, never the value.
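
The inline-versus-separated decision can be sketched roughly as below. This is an illustrative sketch, not the engine's code: `Payload`, `encode_value`, `blob_name`, the `hash128` and `store_blob` callbacks, and the hex encoding of the blob name are all assumptions. Only the 128 KiB threshold, the 16-byte hash, and the `values/blob-…` prefix come from the description above.

```rust
/// Threshold from the PR description: one GlideFS block.
const SEPARATION_THRESHOLD: usize = 128 * 1024;

/// Hypothetical log-record payload; the real record layout is not shown in this PR.
#[derive(Debug, PartialEq)]
enum Payload {
    /// Small values stay inline in the log.
    Inline(Vec<u8>),
    /// Large values are replaced by the 16-byte content hash of the blob.
    Pointer([u8; 16]),
}

/// Decide where a value lives.
/// `hash128` stands in for BLAKE3 truncated to 128 bits and `store_blob`
/// for the blob-store write; both are assumptions made for this sketch.
fn encode_value<H, S>(value: Vec<u8>, hash128: H, mut store_blob: S) -> std::io::Result<Payload>
where
    H: Fn(&[u8]) -> [u8; 16],
    S: FnMut(&[u8; 16], &[u8]) -> std::io::Result<()>,
{
    if value.len() >= SEPARATION_THRESHOLD {
        let hash = hash128(&value);
        // The blob is stored before the pointer is handed back for logging.
        store_blob(&hash, &value)?;
        Ok(Payload::Pointer(hash))
    } else {
        Ok(Payload::Inline(value))
    }
}

/// Blob path under `values/`, assuming the hash is hex-encoded in the file name.
fn blob_name(hash: &[u8; 16]) -> String {
    let hex: String = hash.iter().map(|b| format!("{b:02x}")).collect();
    format!("values/blob-{hex}")
}
```

Compaction then only ever copies a `Payload::Pointer`, which is 16 bytes regardless of the value's size.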

Why

GlideFS dedup is offset-keyed, so relocation = re-upload. Inline, a value is re-uploaded on every compaction; separated, it's uploaded once.

Measured (engine compaction_bytes counter, 60×32 KiB values, 15 compactions):

  • inline baseline: 37.5 MiB relocated
  • value separation: 0.017 MiB
  • 2224× less write amplification

Verified on real GlideFS code (in-memory/local-fs S3):

  • blob re-uploaded ~13× inline vs ~1× separated
  • forks share via CoW (a fork adds 131 KB, not a 2 MB copy)
  • end-to-end on a real GlideFS ext4 device, 40 blobs stayed byte-identical (inode+mtime) across 5 KV compactions → 0 re-uploads

Hardening included

  • Blob durability ordering, deferred GC, per-key stripe locks (atomic CAS), orphan reclamation, blob read integrity, reclaim drain, file_id u16→u32, namespace-cap concurrency.
  • Surfaced blob-dir fsync errors (no silent swallow), checked millis→u64 cast, bounded watch subscriptions.
  • Exhaustive crash-consistency suite + ENOSPC/EMFILE degradation tests.
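
For readers unfamiliar with the durability-ordering item above, here is a minimal sketch of the write order described in the commit message (blob data fsync, then directory fsync, then the pointer record). `put_separated`, its parameters, and the raw 16-byte log record are hypothetical simplifications; refcounting and the skip-if-already-present path of a content-addressed store are not shown.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Write a blob durably, then (and only then) append its pointer to the log.
/// Names, paths, and the bare 16-byte pointer record are assumptions for this sketch.
fn put_separated(
    values_dir: &Path,
    log_path: &Path,
    name: &str,
    hash: &[u8; 16],
    value: &[u8],
) -> io::Result<()> {
    fs::create_dir_all(values_dir)?;

    // 1. Blob data is written and fsync'd.
    let mut blob = File::create(values_dir.join(name))?;
    blob.write_all(value)?;
    blob.sync_all()?;

    // 2. The directory entry is fsync'd and the error is propagated, not swallowed.
    //    Opening a directory as a File works on Unix; other platforms may differ.
    File::open(values_dir)?.sync_all()?;

    // 3. The pointer record is appended only after the blob is durable,
    //    so a crash between steps leaves an orphan blob, never a dangling pointer.
    let mut log = OpenOptions::new().create(true).append(true).open(log_path)?;
    log.write_all(hash)?;
    log.sync_all()
}
```

A crash after step 2 but before step 3 leaves a blob with no pointer, which is the case orphan reclamation is meant to clean up.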

mise run format, check:rs (clippy -D warnings), and cargo test --lib are all green locally.

🤖 Generated with Claude Code

jaredLunde and others added 2 commits June 2, 2026 18:00
Large values (>= 128 KiB = one GlideFS block) are written to a
content-addressed, refcounted blob store (values/blob-{blake3-128});
the log record carries only the 16-byte hash. Because compaction then
relocates only the tiny pointer and never the value, write
amplification on GlideFS (where dedup is offset-keyed, so relocation =
re-upload) collapses to ~1x for large values. Small values stay inline.

Correctness hardening shipped with it:
- Blob durability ordering: blob written + fsync'd (data + dir) BEFORE
  the pointer record; deletes deferred past the next log fsync
  (collect_garbage) so a crash can never leave a dangling pointer.
- Per-key stripe locks make CAS atomic (check-before-append, no orphan).
- Per-content file-op locks serialize a blob's create vs GC unlink.
- Orphan reclamation (sweep_orphans) + blob read integrity (re-hash vs
  content hash on every read).
- Reclaim drains in-flight writes before sealing so the footer is a
  consistent snapshot (no acked write lost on footer recovery); writes
  wait on reclaim instead of erroring.
- Revision (tstamp) seeded from max recovered across restart (monotonic).
- file_id widened u16 -> u32 (IndexEntry stays 24 bytes).
- Namespace cap re-checked after the open await (no concurrent overshoot).
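
As an illustration of the per-key stripe locks in the list above, the sketch below holds one stripe's lock across both the check and the write of a compare-and-swap. `StripedStore` and its in-memory map are hypothetical stand-ins for the engine's index and log append; the stripe selection via `DefaultHasher` is also an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical in-memory stand-in for the KV index; the real engine appends to a log.
struct StripedStore {
    stripes: Vec<Mutex<()>>,
    data: Mutex<HashMap<String, Vec<u8>>>,
}

impl StripedStore {
    fn new(stripe_count: usize) -> Self {
        Self {
            // At least one stripe, so the modulo below never divides by zero.
            stripes: (0..stripe_count.max(1)).map(|_| Mutex::new(())).collect(),
            data: Mutex::new(HashMap::new()),
        }
    }

    /// Map a key to its stripe; keys that collide simply share a lock.
    fn stripe_for(&self, key: &str) -> &Mutex<()> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.stripes[(h.finish() as usize) % self.stripes.len()]
    }

    /// Compare-and-swap: write `new` only if the current value equals `expected`.
    fn compare_and_swap(&self, key: &str, expected: Option<&[u8]>, new: Vec<u8>) -> bool {
        // Held across the read and the write, so no other CAS on this key interleaves.
        let _guard = self.stripe_for(key).lock().unwrap();

        // Check first: a failed comparison returns before anything is written.
        let current = self.data.lock().unwrap().get(key).cloned();
        if current.as_deref() != expected {
            return false;
        }

        // Stand-in for the log append.
        self.data.lock().unwrap().insert(key.to_string(), new);
        true
    }
}
```

Because the comparison happens before the append, a losing writer never produces a record (or blob) that would later need to be reclaimed.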

Tests (teeth-verified where a regression must fail the test):
- Exhaustive crash-consistency suite (tail truncation at every offset,
  bit-rot, torn-footer scan fallback across files).
- GC race, reclaim drain, revision monotonicity, blob integrity,
  value-sep watch replay, ENOSPC write-poison, EMFILE graceful
  degradation, namespace-cap concurrency.

ARCHITECTURE.md updated to describe the storage format and the why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- value_store: surface blob-directory fsync failures (warn) instead of
  swallowing them — a silent fsync error hid a non-durable directory entry.
- log: checked millis->u64 conversion (try_from + saturating warn) replacing
  an unchecked `as` cast that could truncate/overflow.
- watch: bound the subscription registry (MAX_TOTAL_SUBSCRIPTIONS); subscribe_*
  now return Result so a client cannot grow per-shard memory without limit.
- tests/writeamp: measure compaction write amplification (value-sep vs inline
  baseline) via the compaction_bytes counter — 2224x fewer relocated bytes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
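
The checked millis->u64 conversion mentioned in the second commit might look something like the sketch below. `millis_u64` is a hypothetical helper; it assumes the source value is the `u128` returned by `Duration::as_millis` and that saturating to `u64::MAX` with a warning is the intended fallback.

```rust
use std::time::Duration;

/// Convert a duration to whole milliseconds as u64, saturating instead of truncating.
/// An unchecked `as u64` cast would silently keep only the low 64 bits.
fn millis_u64(d: Duration) -> u64 {
    match u64::try_from(d.as_millis()) {
        Ok(ms) => ms,
        Err(_) => {
            // Stand-in for the engine's warn-level log line.
            eprintln!("warn: millisecond timestamp exceeds u64, saturating");
            u64::MAX
        }
    }
}
```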
@jaredLunde jaredLunde merged commit 8a6c274 into main Jun 3, 2026
4 of 5 checks passed
@jaredLunde jaredLunde deleted the jared/redesign branch June 3, 2026 02:05