Roadmap: structured-zstd feature parity with C zstd #28

## Execution order (sequential, focused)

**Project goal:** structured-zstd is a **drop-in replacement** for `libzstd` and the `zstd` CLI: the same ABI, CLI, and feature surface, delivered as a **superset** of upstream (every documented upstream symbol works, plus additional Rust-side higher-level streams). Binary parity is **NOT** a goal: compressed output bytes do not have to match upstream, and the encoder may make different, sometimes better, choices. Wire-format interop (frames round-trip in both directions) is required; byte-identical reproduction of donor output is not.
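
The difference between wire-format interop and byte-identical output can be shown with a toy format. The sketch below is not zstd: it is a hypothetical run-length format with two encoders that make different choices and one decoder that accepts both. All names are illustrative.

```rust
/// Toy frame format (NOT zstd): a sequence of (run_length, byte) pairs.
/// Returns `None` for malformed frames (odd length or zero-length run).
pub fn decode(frame: &[u8]) -> Option<Vec<u8>> {
    if frame.len() % 2 != 0 {
        return None;
    }
    let mut out = Vec::new();
    for pair in frame.chunks_exact(2) {
        let (len, byte) = (pair[0] as usize, pair[1]);
        if len == 0 {
            return None;
        }
        out.extend(std::iter::repeat(byte).take(len));
    }
    Some(out)
}

/// Encoder A: merges equal bytes into the longest possible run (max 255).
pub fn encode_greedy(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut len = 1;
        while i + len < input.len() && input[i + len] == byte && len < 255 {
            len += 1;
        }
        out.push(len as u8);
        out.push(byte);
        i += len;
    }
    out
}

/// Encoder B: never merges runs. Larger output, still a valid frame.
pub fn encode_naive(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for &byte in input {
        out.push(1);
        out.push(byte);
    }
    out
}
```

For the same input the two encoders emit different bytes, yet `decode` recovers the original from either. That is the property required here between structured-zstd and upstream, in both directions.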

### Phase 1: Correctness & Core (P0-critical) — DONE ✅
✅ #15 (large-literals panic), ✅ #17 (FSE table reuse + offset history), ✅ #5 (default level dfast/3).

### Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅
✅ #24 (bench suite), ✅ #14 (encoder match interleaving), ✅ #12 (sequence execution wildcopy), ✅ #8 (dictionary), ✅ #9 (streaming encoder), ✅ #6 (better level lazy2/7).

### Phase 3: Performance Parity (P2-medium) — DONE ✅
✅ #10 (Huffman 4-stream), ✅ #11 (FSE batched refill), ✅ #13 (bitstream reader), ✅ #20 (decoder pre-alloc), ✅ #16 (frame content size), ✅ #7 (best level btlazy2/11), ✅ #21 (numeric levels 1-22), ✅ #25 (FastCOVER), ✅ #56 (packed FSE Entry), ✅ #47 (scratch reuse), ✅ #51 (HC rebase), ✅ #67 (row matcher).

### Phase 4: SIMD & Hardware (P1/P2) — DONE ✅
✅ #68 (SIMD wildcopy), ✅ #69 (branchless offset + prefetch + pext), ✅ #66 (SIMD HUF decode BMI2/AVX2/VBMI2/NEON), ✅ #88 (default small-input dfast cliff), ✅ #70 (SIMD match-length), ✅ #97 (early incompressible fast-path), ✅ #71 (ARM CRC32/NEON/SVE2), ✅ #108 (AVX2 unroll-2 wildcopy candidate).

### Phase 4B: Dictionary Decode Hot Path — DONE ✅
✅ #86 (pre-parsed dictionary handle).

### Phase 4C: Encoder Architecture Rewrite — IN PROGRESS

**#111** — split the 9000+ line `match_generator.rs` monolith, introduce const-generic `Strategy` dispatch, arena allocator, drop defensive `saturating_*` from hot path. Internal phases: structural split ✅ → cost_model+arena ✅ → Strategy const generics ✅ → bt/hc/opt clean ✅ → LDM (#18) ✅ → **block splitting (#23) 🟡 IN PROGRESS** (pre-split borders PR #140 ✅; superblock multi-table + literal-compression decisions still open).
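
As a rough illustration of const-generic `Strategy` dispatch, the sketch below selects a strategy once at runtime and then runs code monomorphised for it. The constant names, the `Matcher` type, the level bands, and the probe counts are all hypothetical and do not reflect the project's actual internals.

```rust
/// Hypothetical strategy ids; the project's real identifiers may differ.
pub const FAST: u8 = 0;
pub const DFAST: u8 = 1;
pub const LAZY: u8 = 2;

/// Matcher specialised at compile time on its strategy.
pub struct Matcher<const STRATEGY: u8>;

impl<const STRATEGY: u8> Matcher<STRATEGY> {
    /// Candidates probed per position (illustrative values only).
    /// `STRATEGY` is a compile-time constant, so each instantiation
    /// keeps only its own arm after optimisation.
    pub const fn probes(&self) -> usize {
        match STRATEGY {
            FAST => 1,
            DFAST => 2,
            _ => 4,
        }
    }
}

/// Runtime entry point: one branch on the level, then specialised code.
/// The level bands here are assumptions made for the example.
pub fn probes_for_level(level: i32) -> usize {
    if level <= 2 {
        Matcher::<{ FAST }>.probes()
    } else if level <= 4 {
        Matcher::<{ DFAST }>.probes()
    } else {
        Matcher::<{ LAZY }>.probes()
    }
}
```

The design point is that the strategy branch is paid once per call rather than once per position inside the hot loop.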

### Phase 5: Fast Strategy Donor Port — DONE ✅

**#198** — ported `ZSTD_compressBlock_fast_noDict_generic` to recover the catastrophic 22× regression introduced by phase 1b. Shipped across PR #215 (phase 1a modules), #217 (phase 1b wiring), and #229 (phase 3 — 4-cursor + per-level cParams + cmov + window correctness).

M1-M8 milestones delivered:

- M1: per-level `fast_hash_log` / `fast_mls` dispatch in `LevelParams`
- M2: full 4-cursor `ip0/ip1/ip2/ip3` lookahead body + immediate-rep2 inner loop (donor `zstd_fast.c:192-420`)
- M3: `cmov` match-found variant + per-window dispatch surface
- M4: beyond-donor `fast_hash_log: 13 → 14` for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)
- M6: per-level `fast_step_size` from donor's `targetLength = -level`; restores negative-level acceleration gradient
- M7: added donor's missing `current0+2` hash insertion after each match emit (`zstd_fast.c:407`); raised L1/decodecorpus seq match rate 43.1% → 57.7%
- M8: dropped RESERVED_PREFIX_BYTES dummy byte; `history` layout now matches donor's, sentinel-0 protection via `INITIAL_PREFIX_START_INDEX = 1`
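
The "+32 KB" figure in M4 follows from the table size if each hash slot is a 4-byte position. That slot width is an assumption of this sketch, not something stated above, and the function names are illustrative.

```rust
use std::mem::size_of;

/// Bytes used by a hash table with `1 << hash_log` slots.
/// Assumes 4-byte (`u32`) position entries; the real layout may differ.
pub fn fast_table_bytes(hash_log: u32) -> usize {
    (1usize << hash_log) * size_of::<u32>()
}

/// Extra memory needed when raising `hash_log` from `from` to `to`.
pub fn fast_table_growth(from: u32, to: u32) -> usize {
    fast_table_bytes(to) - fast_table_bytes(from)
}
```

Under that assumption, `fast_table_bytes(13)` is 32 KiB and `fast_table_bytes(14)` is 64 KiB, so the step from 13 to 14 costs 32 KiB.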

**Headline results on `large-log-stream` (25 MiB dense):**
- Phase 1b state (#217 merged): +122% time vs main (~235 MiB/s)
- Phase 3 (M1-M8): −22% time vs main (~790 MiB/s) — **3.4× faster than phase 1b**

**Residual ratio gap:** L1 Fast +7.43% on decodecorpus-z000033 → tracked as follow-up **#220**. All other levels in parity or better (lazy band −5.7..−6.2%, btopt L17 parity, btultra2 within ±0.2%).

---

## Tier 1 — critical path (sequential, focused)

Strict sequential ordering. No parallel scope expansion until each item completes. Bench / profile / test gates apply to every step before advancing.

| Order | Issue | What | Est | Status |
|-------|-------|------|-----|--------|
| **T1.1** | **#111** | encoder architecture rewrite (incl. **#23** block splitting as internal Phase 6) | ~8-13d remaining | **IN PROGRESS** (Phase 1-5 merged; rewrite Phase 6 #23 in progress) |
| **T1.2** | **#220** | Fast L1 +7.43% ratio residual on decodecorpus | 2-3d | **OPEN** (follow-up to #198) |
| **T1.3** | **#180** | `skip_matching_with_hint` self-time dominance (~25%) — drop eager block-boundary inserts | 1-2d | OPEN |
| **T1.4** | **#27** | configurable parameters API (`ZSTD_CCtx_setParameter` surface — hard prereq for Phase 6.2) | 1d | OPEN, blocked by #111 stable internal API |
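
One possible shape for the Rust side of a `ZSTD_CCtx_setParameter`-style surface (T1.4) is sketched below. The `CParam` and `Params` types are hypothetical. The numeric ids are the values I recall from upstream's `zstd.h` (`ZSTD_c_compressionLevel = 100`, `ZSTD_c_windowLog = 101`, `ZSTD_c_nbWorkers = 400`) and should be checked against the vendored header. Bounds validation is deliberately left out.

```rust
/// Small subset of upstream `ZSTD_cParameter` ids.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CParam {
    CompressionLevel, // assumed id 100
    WindowLog,        // assumed id 101
    NbWorkers,        // assumed id 400
}

impl CParam {
    /// Maps a raw C enum value to the typed parameter, if known.
    pub fn from_raw(id: i32) -> Option<Self> {
        match id {
            100 => Some(Self::CompressionLevel),
            101 => Some(Self::WindowLog),
            400 => Some(Self::NbWorkers),
            _ => None,
        }
    }
}

/// Hypothetical parameter store owned by a compression context.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Params {
    pub level: i32,
    pub window_log: i32,
    pub nb_workers: i32,
}

impl Params {
    /// Stores `value` for the parameter `id`; unknown ids are rejected.
    /// A real implementation would also range-check `value`.
    pub fn set(&mut self, id: i32, value: i32) -> Result<(), String> {
        match CParam::from_raw(id) {
            Some(CParam::CompressionLevel) => self.level = value,
            Some(CParam::WindowLog) => self.window_log = value,
            Some(CParam::NbWorkers) => self.nb_workers = value,
            None => return Err(format!("unsupported parameter id {}", id)),
        }
        Ok(())
    }
}
```

Keeping the raw-id mapping in one place would let the later C FFI layer (Phase 6.2) forward calls without duplicating the table.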

## Phase 6 — drop-in C ABI / CLI parity (after Tier 1 ships)

Target upstream version: **v1.5.7**. Vendored headers verbatim + hand-written `extern "C"` wrappers. Wire-format interop required, byte-identical compressed output NOT required (encoder may differ).
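
A hand-written `extern "C"` wrapper could look roughly like the sketch below, using two simple upstream entry points. It assumes the upstream conventions as I understand them: versions encoded as `MAJOR*10000 + MINOR*100 + RELEASE`, and error results being `size_t` values above `(size_t)-ZSTD_error_maxCode`, with `maxCode` taken as 120. Both should be verified against the vendored v1.5.7 headers. The symbol-export attribute is intentionally omitted so the snippet compiles on any edition.

```rust
use std::os::raw::c_uint;

/// v1.5.7 encoded the way upstream encodes ZSTD_VERSION_NUMBER.
const VERSION_NUMBER: c_uint = 1 * 100 * 100 + 5 * 100 + 7;

/// Assumed value of upstream's `ZSTD_error_maxCode`.
const MAX_ERROR_CODE: usize = 120;

/// Mirrors `unsigned ZSTD_versionNumber(void)`.
/// A real cdylib export also needs `#[no_mangle]`
/// (spelled `#[unsafe(no_mangle)]` on newer editions).
#[allow(non_snake_case)]
pub extern "C" fn ZSTD_versionNumber() -> c_uint {
    VERSION_NUMBER
}

/// Mirrors `unsigned ZSTD_isError(size_t code)`:
/// results in the top `MAX_ERROR_CODE - 1` values of the range are errors.
#[allow(non_snake_case)]
pub extern "C" fn ZSTD_isError(code: usize) -> c_uint {
    (code > 0usize.wrapping_sub(MAX_ERROR_CODE)) as c_uint
}
```

Here `usize` stands in for C's `size_t`, which holds on the usual targets but is still an assumption of the sketch.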

| Order | Issue | What | Est | Blocked by |
|-------|-------|------|-----|------------|
| 6.1 | #126 | C ABI core (cdylib + vendored headers + simple/context/error/frame-inspection wrappers) | 10-12d | #111 |
| 6.2 | #127 | advanced + streaming + dict C FFI surface (incl. `ZSTD_c_nbWorkers` inline wiring — does NOT depend on #19) | 10-12d | #126, #27 |
| 6.3 | #128 | `zstd` CLI v1.5.7 parity (argv[0] dispatch, env vars, dict, list/test, `-T#`) | 12-14d | #126, #127 |
| 6.5 | #130 | legacy frame decoders v0.1-v0.4 + per-version Cargo features | 14-16d | #126 |
| 6.6 | #131 | legacy frame decoders v0.5-v0.7 | 10-12d | #130 |
| 6.7 | #132 | conformance (`tests/playTests.sh`) + cross-validation + ABI symbol snapshot + reverse-dep smoke | 10-12d | #128, #131 |

> Note: Phase 6.4 number retired (originally a separate Cargo features subtask, folded into 6.5).
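
The argv[0] dispatch listed under 6.3 means the CLI changes its default behaviour based on the name it was invoked as. A minimal sketch follows, covering only the `unzstd` and `zstdcat` aliases that upstream documents. The `Mode` enum and function name are hypothetical, and upstream recognises more aliases than are shown here.

```rust
use std::path::Path;

/// Default operating mode implied by the program name.
#[derive(Debug, PartialEq, Eq)]
pub enum Mode {
    Compress,
    Decompress,
    DecompressToStdout,
}

/// Picks the default mode from `argv[0]`, ignoring any directory prefix.
pub fn mode_from_argv0(argv0: &str) -> Mode {
    let name = Path::new(argv0)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or(argv0);
    match name {
        "unzstd" => Mode::Decompress,          // behaves like `zstd -d`
        "zstdcat" => Mode::DecompressToStdout, // behaves like `zstd -dcf`
        _ => Mode::Compress,
    }
}
```

Explicit flags parsed afterwards would still override the default chosen here.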

**Phase 6 total**: ~66-78 working days (~13-16 weeks).

## Tier 3 — additional speed/decode after drop-in ships

| Issue | What | Est |
|-------|------|-----|
| #72 | parallel block decompression for multi-block frames | 3d |
| #184 | lazy band investigation — ratio + speed regressions | 3-5d |
| #199 | HUF 4-stream rewrite — sentinel-bit refill + 5-symbol unroll + const-generic kernel | 4-6d |
| #205 | HUF burst body x86-64 inline asm experiment (BMI2) — may be subsumed by #199 | 2-3d |
| #206 | block splitter donor-parity vs `ZSTD_splitBlock` (refinement of #23) | 2-3d |
| #207 | strategy-aware literal compression gates G4+G5 (`skipLitCheck` + `minGain`) | 2-3d |
| #208 | match prefetch 4-stage pipeline for long-distance matches — 🟡 PR #227 open | 3-4d |
| #211 | per-alloc-site memory tracker tooling (compare_ffi_memory.rs) | 1-2d |
| #230 | dictionary training + dict-driven compress/decompress benches vs FFI | 2-3d |

## Deferred — post-Phase 6 only

| Issue | Why deferred | Note |
|-------|--------------|------|
| **#19** | Rust-side rayon MT — expands test matrix + adds bench noise during focus phase | Phase 6.2 (#127) wires `ZSTD_c_nbWorkers` inline without #19 |
| **#171** | typed `SkippableFrame` Rust API (lsm bilateral PR-A) | Banner in body |
| **#172** | skippable-payload visitor callback (lsm bilateral PR-B) | Banner in body |
| **#173** | `FrameEmitInfo` + per-block XXH64 sidecar (lsm bilateral PR-C) | Banner in body |
| **#174** | block-precise error position (lsm bilateral PR-D) | Banner in body |
| **#175** | partial-decode SPIKE (lsm bilateral PR-E) | Banner in body |
| **#176** | magic allocations registry docs (lsm bilateral PR-A bundle) | Banner in body |
| **#177** | expected-field validation setters (lsm bilateral PR-F) | Banner in body |
| #46 | CLI aliasing decompress | can ride along with #128 |
| #49 | HcMatchGenerator cross-slice unit test | minor test gap |

## Bilateral commitment status (lsm-tree)

The bilateral commitment with `structured-world/coordinode-lsm-tree` is preserved. Execution is deferred until drop-in parity (Phase 6) ships. The cross-reference table is kept for context:

| zstd | lsm-tree counterpart | LSM-T# | Direction |
|------|----------------------|--------|-----------|
| #171 | [#251](https://github.com/structured-world/coordinode-lsm-tree/issues/251) | LSM-T2 | blocks lsm wire-format |
| #176 | [#250](https://github.com/structured-world/coordinode-lsm-tree/issues/250) | LSM-T1 | spec cross-link |

