Execution order (sequential, focused)
Project goal: structured-zstd is a drop-in replacement for libzstd and the zstd CLI. It exposes the same ABI/CLI/feature surface as a superset of upstream: every documented upstream symbol works, and additional Rust-side higher-level streams are provided. Binary parity is NOT a goal: compressed output bytes do not have to match upstream, and the encoder may make different (sometimes better) choices. Wire-format interop (frames round-trip in both directions) is required; byte-identical reproduction of donor output is not.
Phase 1: Correctness & Core (P0-critical) — DONE ✅
✅ #15 (large-literals panic), ✅ #17 (FSE table reuse + offset history), ✅ #5 (default level dfast/3).
Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅
✅ #24 (bench suite), ✅ #14 (encoder match interleaving), ✅ #12 (sequence execution wildcopy), ✅ #8 (dictionary), ✅ #9 (streaming encoder), ✅ #6 (better level lazy2/7).
Phase 3: Performance Parity (P2-medium) — DONE ✅
✅ #10 (Huffman 4-stream), ✅ #11 (FSE batched refill), ✅ #13 (bitstream reader), ✅ #20 (decoder pre-alloc), ✅ #16 (frame content size), ✅ #7 (best level btlazy2/11), ✅ #21 (numeric levels 1-22), ✅ #25 (FastCOVER), ✅ #56 (packed FSE Entry), ✅ #47 (scratch reuse), ✅ #51 (HC rebase), ✅ #67 (row matcher).
Phase 4: SIMD & Hardware (P1/P2) — DONE ✅
✅ #68 (SIMD wildcopy), ✅ #69 (branchless offset + prefetch + pext), ✅ #66 (SIMD HUF decode BMI2/AVX2/VBMI2/NEON), ✅ #88 (default small-input dfast cliff), ✅ #70 (SIMD match-length), ✅ #97 (early incompressible fast-path), ✅ #71 (ARM CRC32/NEON/SVE2), ✅ #108 (AVX2 unroll-2 wildcopy candidate).
Phase 4B: Dictionary Decode Hot Path — DONE ✅
✅ #86 (pre-parsed dictionary handle).
Phase 4C: Encoder Architecture Rewrite — IN PROGRESS
#111 — split the 9000+ line match_generator.rs monolith, introduce const-generic Strategy dispatch, arena allocator, drop defensive saturating_* from hot path. Internal phases: structural split ✅ → cost_model+arena ✅ → Strategy const generics ✅ → bt/hc/opt clean ✅ → LDM (#18) ✅ → block splitting (#23) 🟡 IN PROGRESS (pre-split borders PR #140 ✅; superblock multi-table + literal-compression decisions still open).
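A minimal sketch of what const-generic Strategy dispatch can look like. The constant names, the search depths, the level mapping, and the toy backward scan are all assumptions made for illustration; they are not the crate's actual API or tuning. Stable Rust does not accept enums as const generic parameters, so the sketch uses a `u8` tag.

```rust
/// Hypothetical strategy tags (illustrative names, not the crate's own).
pub const FAST: u8 = 0;
pub const DFAST: u8 = 1;
pub const LAZY2: u8 = 2;

/// Shortest match worth reporting in this toy model.
const MIN_MATCH: usize = 3;

/// Per-strategy search depth. Because STRATEGY is a const parameter,
/// each instantiation sees a compile-time constant here.
const fn search_depth<const STRATEGY: u8>() -> usize {
    match STRATEGY {
        FAST => 1,
        DFAST => 2,
        _ => 8,
    }
}

/// Toy matcher, monomorphized once per strategy. Returns (offset, length)
/// of the longest match found among the `depth` nearest earlier positions.
pub fn best_match<const STRATEGY: u8>(data: &[u8], pos: usize) -> Option<(usize, usize)> {
    let depth = search_depth::<STRATEGY>();
    let mut best: Option<(usize, usize)> = None;
    for offset in 1..=depth.min(pos) {
        let cand = pos - offset;
        // Count how many bytes agree between the candidate and the cursor.
        let len = data[cand..]
            .iter()
            .zip(&data[pos..])
            .take_while(|(a, b)| a == b)
            .count();
        if len >= MIN_MATCH && best.map_or(true, |(_, l)| len > l) {
            best = Some((offset, len));
        }
    }
    best
}

/// The only runtime branch: pick an instantiation once, outside the hot loop.
/// The level-to-strategy mapping below is hypothetical.
pub fn best_match_for_level(level: i32, data: &[u8], pos: usize) -> Option<(usize, usize)> {
    match level {
        i32::MIN..=2 => best_match::<FAST>(data, pos),
        3..=4 => best_match::<DFAST>(data, pos),
        _ => best_match::<LAZY2>(data, pos),
    }
}
```

The design point is that the strategy is chosen once per call at the dispatch site, while the inner loop of each instantiation carries no strategy branch.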
Phase 5: Fast Strategy Donor Port — DONE ✅
#198 — ported ZSTD_compressBlock_fast_noDict_generic to recover the catastrophic 22× regression introduced by phase 1b. Shipped across PR #215 (phase 1a modules), #217 (phase 1b wiring), and #229 (phase 3 — 4-cursor + per-level cParams + cmov + window correctness).
M1-M8 milestones delivered:
- M1: per-level fast_hash_log / fast_mls dispatch in LevelParams
- M2: full 4-cursor ip0/ip1/ip2/ip3 lookahead body + immediate-rep2 inner loop (donor zstd_fast.c:192-420)
- M3: cmov match-found variant + per-window dispatch surface
- M4: beyond-donor fast_hash_log: 13 → 14 for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)
- M6: per-level fast_step_size from donor's targetLength = -level; restores negative-level acceleration gradient
- M7: added donor's missing current0+2 hash insertion after each match emit (zstd_fast.c:407); raised L1/decodecorpus seq match rate 43.1% → 57.7%
- M8: dropped RESERVED_PREFIX_BYTES dummy byte; history layout now matches donor's, sentinel-0 protection via INITIAL_PREFIX_START_INDEX = 1
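The "+32 KB" figure in M4 follows from the table size if each hash slot holds one 32-bit position. That slot width is an assumption here, as are the function names. The multiplicative hash is a generic sketch of how a hash_log selects index bits, not a claim about the crate's exact hash.

```rust
/// Bytes needed for a hash table with `1 << hash_log` slots.
/// Assumption: one u32 position per slot.
pub const fn hash_table_bytes(hash_log: u32) -> usize {
    (1usize << hash_log) * core::mem::size_of::<u32>()
}

/// Generic multiplicative hash over 4 input bytes: multiply by a large odd
/// constant and keep the top `hash_log` bits as the table index.
/// Requires 1 <= hash_log <= 32.
pub fn hash4(bytes: [u8; 4], hash_log: u32) -> usize {
    const PRIME_4_BYTES: u32 = 2_654_435_761;
    (u32::from_le_bytes(bytes).wrapping_mul(PRIME_4_BYTES) >> (32 - hash_log)) as usize
}

/// Extra memory when moving from hash_log 13 to hash_log 14.
pub const M4_EXTRA_BYTES: usize = hash_table_bytes(14) - hash_table_bytes(13);
```

Under that assumption the table grows from 32 KiB to 64 KiB, so one extra index bit doubles the slot count at a cost of 32 KiB.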
Headline results on large-log-stream (25 MiB dense):
Residual ratio gap: L1 Fast +7.43% on decodecorpus-z000033 → tracked as follow-up #220. All other levels are at parity or better (lazy band −5.7..−6.2%, btopt L17 at parity, btultra2 within ±0.2%).
Tier 1 — critical path (sequential, focused)
Strict sequential ordering. No parallel scope expansion until each item completes. Bench / profile / test gates apply to every step before advancing.
| Order | Issue | What | Est | Status |
| --- | --- | --- | --- | --- |
| T1.1 | #111 | encoder architecture rewrite (incl. #23 block splitting as internal Phase 6) | ~8-13d remaining | IN PROGRESS (Phase 1-5 merged; rewrite Phase 6 #23 in progress) |
| T1.2 | #220 | Fast L1 +7.43% ratio residual on decodecorpus | 2-3d | OPEN (follow-up to #198) |
| T1.3 | #180 | skip_matching_with_hint self-time dominance (~25%) — drop eager block-boundary inserts | 1-2d | OPEN |
| T1.4 | #27 | configurable parameters API (ZSTD_CCtx_setParameter surface — hard prereq for Phase 6.2) | 1d | OPEN, blocked by #111 stable internal API |
Phase 6 — drop-in C ABI / CLI parity (after Tier 1 ships)
Target upstream version: v1.5.7. Vendored headers verbatim + hand-written extern "C" wrappers. Wire-format interop required, byte-identical compressed output NOT required (encoder may differ).
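A hand-written wrapper could look roughly like the sketch below, using ZSTD_isError as the example symbol. Upstream encodes errors as small negative values cast to size_t. The value 120 for the maximum error code is an assumption that should be checked against the vendored zstd_errors.h. `encode_error` and `MAX_ERROR_CODE` are hypothetical helper names. The sketch also assumes `size_t` and `usize` have the same width on the target, and it omits the symbol-export attribute that a real cdylib build would add.

```rust
use std::os::raw::c_uint;

/// Assumed to mirror ZSTD_error_maxCode in the vendored zstd_errors.h.
const MAX_ERROR_CODE: usize = 120;

/// Encode an error kind the way the C API reports it: (size_t)-kind.
/// Hypothetical helper, not an upstream symbol.
pub fn encode_error(kind: usize) -> usize {
    0usize.wrapping_sub(kind)
}

/// C-callable wrapper with the upstream name and shape:
/// returns 1 when `code` is an error value and 0 otherwise.
#[allow(non_snake_case)]
pub extern "C" fn ZSTD_isError(code: usize) -> c_uint {
    // Error values occupy the top of the size_t range, strictly above
    // the encoding of the maximum error code.
    (code > encode_error(MAX_ERROR_CODE)) as c_uint
}
```

Keeping the wrapper this thin means the vendored header remains the only declaration of the C signature, and the Rust side only has to match it.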
| Order | Issue | What | Est | Blocked by |
| --- | --- | --- | --- | --- |
| 6.1 | #126 | C ABI core (cdylib + vendored headers + simple/context/error/frame-inspection wrappers) | 10-12d | #111 |
| 6.2 | #127 | advanced + streaming + dict C FFI surface (incl. ZSTD_c_nbWorkers inline wiring — does NOT depend on #19) | 10-12d | #126, #27 |
| 6.3 | #128 | zstd CLI v1.5.7 parity (argv[0] dispatch, env vars, dict, list/test, -T#) | 12-14d | #126, #127 |
| 6.5 | #130 | legacy frame decoders v0.1-v0.4 + per-version Cargo features | 14-16d | #126 |
| 6.6 | #131 | legacy frame decoders v0.5-v0.7 | 10-12d | #130 |
| 6.7 | #132 | conformance (tests/playTests.sh) + cross-validation + ABI symbol snapshot + reverse-dep smoke | 10-12d | #128, #131 |
Note: Phase 6.4 number retired (originally a separate Cargo features subtask, folded into 6.5).
Phase 6 total: ~66-78 working days (~13-16 weeks).
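The argv[0] dispatch in item 6.3 refers to the CLI changing its defaults based on the name it was invoked under. The sketch below covers only three upstream aliases (unzstd, zstdcat, zstdmt). The `Mode` and `Defaults` types and the `defaults_for` function are hypothetical names, and the real CLI handles more aliases and more settings than shown.

```rust
use std::path::Path;

/// What the binary does when no explicit mode flag is given.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Mode {
    Compress,
    Decompress,
    DecompressToStdout,
}

/// Hypothetical bundle of defaults derived from the invocation name.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Defaults {
    pub mode: Mode,
    pub all_cores: bool,
}

/// Derive defaults from argv[0], ignoring the directory part and
/// any file extension (for example ".exe").
pub fn defaults_for(argv0: &str) -> Defaults {
    let name = Path::new(argv0)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("zstd");
    match name {
        "unzstd" => Defaults { mode: Mode::Decompress, all_cores: false },
        "zstdcat" => Defaults { mode: Mode::DecompressToStdout, all_cores: false },
        "zstdmt" => Defaults { mode: Mode::Compress, all_cores: true },
        // Unknown names fall back to plain `zstd` behaviour.
        _ => Defaults { mode: Mode::Compress, all_cores: false },
    }
}
```

A `main` would call `defaults_for` on the first element of `std::env::args()` before parsing flags, so that explicit flags can still override the alias defaults.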
Tier 3 — additional speed/decode after drop-in ships
| Issue | What | Est |
| --- | --- | --- |
| #72 | parallel block decompression for multi-block frames | 3d |
| #184 | lazy band investigation — ratio + speed regressions | 3-5d |
| #199 | HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + const-generic kernel | 4-6d |
| #205 | HUF burst body x86-64 inline asm experiment (BMI2) — may be subsumed by #199 | 2-3d |
| #206 | block splitter donor-parity vs ZSTD_splitBlock (refinement of #23) | 2-3d |
| #207 | strategy-aware literal compression gates G4+G5 (skipLitCheck + minGain) | 2-3d |
| #208 | match prefetch 4-stage pipeline for long-distance matches — 🟡 PR #227 open | 3-4d |
| #211 | per-alloc-site memory tracker tooling (compare_ffi_memory.rs) | 1-2d |
| #230 | dictionary training + dict-driven compress/decompress benches vs FFI | 2-3d |
Deferred — post-Phase 6 only
| Issue | Why deferred | Note |
| --- | --- | --- |
| #19 | Rust-side rayon MT — expands test matrix + adds bench noise during focus phase | Phase 6.2 (#127) wires ZSTD_c_nbWorkers inline without #19 |
| #171 | typed SkippableFrame Rust API (lsm bilateral PR-A) | Banner in body |
| #172 | skippable-payload visitor callback (lsm bilateral PR-B) | Banner in body |
| #173 | FrameEmitInfo + per-block XXH64 sidecar (lsm bilateral PR-C) | Banner in body |
| #174 | block-precise error position (lsm bilateral PR-D) | Banner in body |
| #175 | partial-decode SPIKE (lsm bilateral PR-E) | Banner in body |
| #176 | magic allocations registry docs (lsm bilateral PR-A bundle) | Banner in body |
| #177 | expected-field validation setters (lsm bilateral PR-F) | Banner in body |
| #46 | CLI aliasing decompress | rideable with #128 |
| #49 | HcMatchGenerator cross-slice unit test | minor test gap |
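For context on the skippable-frame items (#171, #172), the wire layout a typed API would wrap is small: a 4-byte little-endian magic in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian payload size, then the payload. The sketch below parses that layout. The `SkippableFrame` shape and the `parse_skippable` function are hypothetical and say nothing about the API the issues will actually define.

```rust
use std::convert::TryInto;

/// Hypothetical typed view of one skippable frame.
#[derive(Debug, PartialEq, Eq)]
pub struct SkippableFrame<'a> {
    /// Low nibble of the magic number (0..=15).
    pub variant: u8,
    /// User data carried by the frame.
    pub payload: &'a [u8],
}

/// Parse one skippable frame from the start of `input`.
/// Returns the frame and the remaining bytes, or None if the input does
/// not start with a complete skippable frame.
pub fn parse_skippable(input: &[u8]) -> Option<(SkippableFrame<'_>, &[u8])> {
    let magic = u32::from_le_bytes(input.get(0..4)?.try_into().ok()?);
    if (magic & 0xFFFF_FFF0) != 0x184D_2A50 {
        return None;
    }
    let size = u32::from_le_bytes(input.get(4..8)?.try_into().ok()?) as usize;
    // checked_add guards against overflow on 32-bit targets.
    let end = 8usize.checked_add(size)?;
    let payload = input.get(8..end)?;
    let frame = SkippableFrame {
        variant: (magic & 0xF) as u8,
        payload,
    };
    Some((frame, &input[end..]))
}
```

Returning the remaining slice lets a caller walk a stream that mixes skippable frames with regular zstd frames without copying the payload.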
Bilateral commitment status (lsm-tree)
Bilateral commitment with structured-world/coordinode-lsm-tree preserved. Execution defers until drop-in parity (Phase 6) ships. Cross-reference table kept for context:
| zstd | lsm-tree counterpart | LSM-T# | Direction |
| --- | --- | --- | --- |
| #171 | #251 | LSM-T2 | blocks lsm wire-format |
| #176 | #250 | LSM-T1 | spec cross-link |