Roadmap: structured-zstd feature parity with C zstd #28

@polaz

Execution order (sequential, focused)

Project goal: structured-zstd is a drop-in replacement for libzstd and the zstd CLI. It exposes the same ABI, CLI, and feature surface as a superset of upstream: every documented upstream symbol works, and additional Rust-side higher-level streams are provided. Binary parity is NOT a goal: compressed output bytes do not have to match upstream, because the encoder may make different, sometimes better, choices. Wire-format interop (frames round-trip in both directions) is required; byte-identical reproduction of donor output is not.
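
The interop-versus-byte-parity distinction can be sketched with a toy format. The run-length codec below is only a stand-in for the zstd frame format, and every name in it is hypothetical: two encoders make different choices and emit different bytes, yet both outputs decode to the same input, which is the property the goal requires.

```rust
/// Toy run-length format used only for illustration: a sequence of
/// (count, byte) pairs. It stands in for the real zstd frame format.
fn decode(frame: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in frame.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

/// Encoder that caps each run at `max_run` bytes. Two encoders with
/// different caps emit different bytes for the same input.
fn encode(input: &[u8], max_run: u8) -> Vec<u8> {
    assert!(max_run > 0, "max_run must be at least 1");
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        while i + run < input.len() && input[i + run] == byte && run < max_run as usize {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Interop holds when both frames decode to the original input,
/// whether or not the frames themselves are byte-identical.
fn interop_holds(input: &[u8], frame_a: &[u8], frame_b: &[u8]) -> bool {
    decode(frame_a) == input && decode(frame_b) == input
}
```

Calling `encode` with caps of 255 and 4 on the same input gives two different byte sequences for which `interop_holds` returns true.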

Phase 1: Correctness & Core (P0-critical) — DONE ✅

✅ #15 (large-literals panic), ✅ #17 (FSE table reuse + offset history), ✅ #5 (default level dfast/3).

Phase 2: CoordiNode Critical Path (P1-high) — DONE ✅

✅ #24 (bench suite), ✅ #14 (encoder match interleaving), ✅ #12 (sequence execution wildcopy), ✅ #8 (dictionary), ✅ #9 (streaming encoder), ✅ #6 (better level lazy2/7).

Phase 3: Performance Parity (P2-medium) — DONE ✅

✅ #10 (Huffman 4-stream), ✅ #11 (FSE batched refill), ✅ #13 (bitstream reader), ✅ #20 (decoder pre-alloc), ✅ #16 (frame content size), ✅ #7 (best level btlazy2/11), ✅ #21 (numeric levels 1-22), ✅ #25 (FastCOVER), ✅ #56 (packed FSE Entry), ✅ #47 (scratch reuse), ✅ #51 (HC rebase), ✅ #67 (row matcher).

Phase 4: SIMD & Hardware (P1/P2) — DONE ✅

✅ #68 (SIMD wildcopy), ✅ #69 (branchless offset + prefetch + pext), ✅ #66 (SIMD HUF decode BMI2/AVX2/VBMI2/NEON), ✅ #88 (default small-input dfast cliff), ✅ #70 (SIMD match-length), ✅ #97 (early incompressible fast-path), ✅ #71 (ARM CRC32/NEON/SVE2), ✅ #108 (AVX2 unroll-2 wildcopy candidate).

Phase 4B: Dictionary Decode Hot Path — DONE ✅

✅ #86 (pre-parsed dictionary handle).

Phase 4C: Encoder Architecture Rewrite — IN PROGRESS

#111: split the 9000+ line match_generator.rs monolith, introduce const-generic Strategy dispatch and an arena allocator, and drop defensive saturating_* arithmetic from the hot path. Internal phases: structural split ✅ → cost_model+arena ✅ → Strategy const generics ✅ → bt/hc/opt clean ✅ → LDM (#18) ✅ → block splitting (#23) 🟡 IN PROGRESS (pre-split borders PR #140 ✅; superblock multi-table + literal-compression decisions still open).
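
A minimal sketch of what const-generic strategy dispatch can look like in Rust. The constants, function names, search depths, and level threshold are assumptions for illustration and are not taken from the crate. The strategy is a compile-time parameter, so each instantiation is compiled separately and the per-strategy `match` folds to a constant inside the hot loop.

```rust
/// Hypothetical strategy identifiers; the real crate's names may differ.
pub const FAST: u8 = 0;
pub const LAZY: u8 = 1;

/// Search depth chosen per strategy (illustrative values). Because `S` is a
/// const generic, this resolves at compile time for each instantiation.
const fn search_depth<const S: u8>() -> usize {
    match S {
        FAST => 1,
        LAZY => 4,
        _ => 8,
    }
}

/// Longest match between `data[pos..]` and up to `depth` earlier positions.
/// Requires `pos <= data.len()`.
fn best_match<const S: u8>(data: &[u8], pos: usize) -> usize {
    let depth = search_depth::<S>();
    let start = pos - depth.min(pos);
    let mut best = 0usize;
    for cand in start..pos {
        let len = data[cand..]
            .iter()
            .zip(&data[pos..])
            .take_while(|(a, b)| a == b)
            .count();
        best = best.max(len);
    }
    best
}

/// Runtime entry point: the strategy is selected once, outside the hot loop.
/// The level threshold of 5 is an assumption made for this sketch.
pub fn best_match_for_level(level: i32, data: &[u8], pos: usize) -> usize {
    if level < 5 {
        best_match::<FAST>(data, pos)
    } else {
        best_match::<LAZY>(data, pos)
    }
}
```

On `b"abcabcabc"` at position 3, the shallow instantiation only checks the previous byte and finds nothing, while the deeper one reaches position 0 and finds a 6-byte match.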

Phase 5: Fast Strategy Donor Port — DONE ✅

#198 — ported ZSTD_compressBlock_fast_noDict_generic to recover the catastrophic 22× regression introduced by phase 1b. Shipped across PR #215 (phase 1a modules), #217 (phase 1b wiring), and #229 (phase 3 — 4-cursor + per-level cParams + cmov + window correctness).

M1-M8 milestones delivered:

  • M1: per-level fast_hash_log / fast_mls dispatch in LevelParams
  • M2: full 4-cursor ip0/ip1/ip2/ip3 lookahead body + immediate-rep2 inner loop (donor zstd_fast.c:192-420)
  • M3: cmov match-found variant + per-window dispatch surface
  • M4: beyond-donor fast_hash_log: 13 → 14 for negative levels (+32 KB memory, 2× fewer collisions on structured corpora)
  • M6: per-level fast_step_size from donor's targetLength = -level; restores negative-level acceleration gradient
  • M7: added donor's missing current0+2 hash insertion after each match emit (zstd_fast.c:407); raised L1/decodecorpus seq match rate 43.1% → 57.7%
  • M8: dropped RESERVED_PREFIX_BYTES dummy byte; history layout now matches donor's, sentinel-0 protection via INITIAL_PREFIX_START_INDEX = 1
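
The +32 KB figure in M4 can be checked with a short calculation. This sketch assumes 4-byte (u32) position entries in the hash table, which is what makes the numbers line up; the crate's actual entry type is not stated above, and the function names are hypothetical.

```rust
/// Bytes used by a hash table with `1 << hash_log` slots.
/// Assumes 4-byte (u32) entries; the crate's actual entry type may differ.
const fn hash_table_bytes(hash_log: u32) -> usize {
    (1usize << hash_log) * std::mem::size_of::<u32>()
}

/// Extra memory, in KiB, from raising the hash log from `old_log` to `new_log`.
const fn extra_kib(old_log: u32, new_log: u32) -> usize {
    (hash_table_bytes(new_log) - hash_table_bytes(old_log)) / 1024
}
```

Under that assumption, `hash_table_bytes(13)` is 32 KiB and `hash_table_bytes(14)` is 64 KiB, so `extra_kib(13, 14)` gives 32. The slot count doubles from 8192 to 16384.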

Headline results on large-log-stream (25 MiB dense):

Residual ratio gap: L1 Fast is +7.43% on decodecorpus-z000033, tracked as follow-up #220. All other levels are at parity or better (lazy band −5.7..−6.2%, btopt L17 at parity, btultra2 within ±0.2%).


Tier 1 — critical path (sequential, focused)

Strict sequential ordering. No parallel scope expansion until each item completes. Bench / profile / test gates apply to every step before advancing.

| Order | Issue | What | Est | Status |
|---|---|---|---|---|
| T1.1 | #111 | encoder architecture rewrite (incl. #23 block splitting as internal Phase 6) | ~8-13d remaining | IN PROGRESS (Phase 1-5 merged; rewrite Phase 6 #23 in progress) |
| T1.2 | #220 | Fast L1 +7.43% ratio residual on decodecorpus | 2-3d | OPEN (follow-up to #198) |
| T1.3 | #180 | skip_matching_with_hint self-time dominance (~25%): drop eager block-boundary inserts | 1-2d | OPEN |
| T1.4 | #27 | configurable parameters API (ZSTD_CCtx_setParameter surface; hard prereq for Phase 6.2) | 1d | OPEN, blocked by #111 stable internal API |

Phase 6 — drop-in C ABI / CLI parity (after Tier 1 ships)

Target upstream version: v1.5.7. Vendored headers verbatim + hand-written extern "C" wrappers. Wire-format interop required, byte-identical compressed output NOT required (encoder may differ).
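
As a rough sketch of the hand-written wrapper style, the function below mirrors `unsigned ZSTD_versionNumber(void)` from upstream zstd.h for the v1.5.7 target. How the crate actually structures its wrappers is not described here, so treat the layout as an assumption. A real cdylib export would also carry the `no_mangle` attribute so the symbol name is preserved; it is left out to keep the snippet standalone.

```rust
use std::os::raw::c_uint;

// Version components of the targeted upstream release (v1.5.7).
const ZSTD_VERSION_MAJOR: c_uint = 1;
const ZSTD_VERSION_MINOR: c_uint = 5;
const ZSTD_VERSION_RELEASE: c_uint = 7;

/// Mirrors `unsigned ZSTD_versionNumber(void)` using the same formula as the
/// upstream header: MAJOR * 100 * 100 + MINOR * 100 + RELEASE.
/// An exported build would additionally need the `no_mangle` attribute.
#[allow(non_snake_case)]
pub extern "C" fn ZSTD_versionNumber() -> c_uint {
    ZSTD_VERSION_MAJOR * 100 * 100 + ZSTD_VERSION_MINOR * 100 + ZSTD_VERSION_RELEASE
}
```

For v1.5.7 this evaluates to 10507, the value a C caller would compare against `ZSTD_VERSION_NUMBER`.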

| Order | Issue | What | Est | Blocked by |
|---|---|---|---|---|
| 6.1 | #126 | C ABI core (cdylib + vendored headers + simple/context/error/frame-inspection wrappers) | 10-12d | #111 |
| 6.2 | #127 | advanced + streaming + dict C FFI surface (incl. ZSTD_c_nbWorkers inline wiring; does NOT depend on #19) | 10-12d | #126, #27 |
| 6.3 | #128 | zstd CLI v1.5.7 parity (argv[0] dispatch, env vars, dict, list/test, -T#) | 12-14d | #126, #127 |
| 6.5 | #130 | legacy frame decoders v0.1-v0.4 + per-version Cargo features | 14-16d | #126 |
| 6.6 | #131 | legacy frame decoders v0.5-v0.7 | 10-12d | #130 |
| 6.7 | #132 | conformance (tests/playTests.sh) + cross-validation + ABI symbol snapshot + reverse-dep smoke | 10-12d | #128, #131 |

Note: the Phase 6.4 number is retired; it was originally a separate Cargo features subtask and is now folded into 6.5.

Phase 6 total: ~66-78 working days (~13-16 weeks).
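
The total can be reproduced from the per-item estimates in the table above, assuming the items run strictly in sequence. The constant and function names are illustrative only.

```rust
/// (phase, low estimate, high estimate) in working days, copied from the
/// Phase 6 table.
const PHASE6_ESTIMATES: [(&str, u32, u32); 6] = [
    ("6.1", 10, 12),
    ("6.2", 10, 12),
    ("6.3", 12, 14),
    ("6.5", 14, 16),
    ("6.6", 10, 12),
    ("6.7", 10, 12),
];

/// Sums the low and high bounds, assuming strictly sequential execution.
fn phase6_total() -> (u32, u32) {
    PHASE6_ESTIMATES
        .iter()
        .fold((0, 0), |(lo, hi), &(_, l, h)| (lo + l, hi + h))
}
```

`phase6_total()` returns `(66, 78)`. At five working days per week that is 13.2 to 15.6 weeks, which rounds to the quoted ~13-16 weeks.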

Tier 3 — additional speed/decode after drop-in ships

| Issue | What | Est |
|---|---|---|
| #72 | parallel block decompression for multi-block frames | 3d |
| #184 | lazy band investigation: ratio + speed regressions | 3-5d |
| #199 | HUF 4-stream rewrite: sentinel-bit refill + 5-symbol unroll + const-generic kernel | 4-6d |
| #205 | HUF burst body x86-64 inline asm experiment (BMI2); may be subsumed by #199 | 2-3d |
| #206 | block splitter donor-parity vs ZSTD_splitBlock (refinement of #23) | 2-3d |
| #207 | strategy-aware literal compression gates G4+G5 (skipLitCheck + minGain) | 2-3d |
| #208 | match prefetch 4-stage pipeline for long-distance matches; 🟡 PR #227 open | 3-4d |
| #211 | per-alloc-site memory tracker tooling (compare_ffi_memory.rs) | 1-2d |
| #230 | dictionary training + dict-driven compress/decompress benches vs FFI | 2-3d |

Deferred — post-Phase 6 only

| Issue | Why deferred | Note |
|---|---|---|
| #19 | Rust-side rayon MT: expands test matrix + adds bench noise during focus phase | Phase 6.2 (#127) wires ZSTD_c_nbWorkers inline without #19 |
| #171 | typed SkippableFrame Rust API (lsm bilateral PR-A) | Banner in body |
| #172 | skippable-payload visitor callback (lsm bilateral PR-B) | Banner in body |
| #173 | FrameEmitInfo + per-block XXH64 sidecar (lsm bilateral PR-C) | Banner in body |
| #174 | block-precise error position (lsm bilateral PR-D) | Banner in body |
| #175 | partial-decode SPIKE (lsm bilateral PR-E) | Banner in body |
| #176 | magic allocations registry docs (lsm bilateral PR-A bundle) | Banner in body |
| #177 | expected-field validation setters (lsm bilateral PR-F) | Banner in body |
| #46 | CLI aliasing decompress | rideable with #128 |
| #49 | HcMatchGenerator cross-slice unit test | minor test gap |
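
For context on the skippable-frame items (#171, #172), here is a minimal sketch of parsing a skippable frame as laid out in the Zstandard format: a 4-byte little-endian magic in the range 0x184D2A50..=0x184D2A5F, a 4-byte little-endian payload size, then the payload. The struct and function names are hypothetical and do not describe the API planned in #171.

```rust
/// Hypothetical typed view of a skippable frame; the name and fields are
/// illustrative and not the crate's API.
#[derive(Debug, PartialEq)]
pub struct SkippableFrame<'a> {
    /// Low nibble of the magic number (0..=15).
    pub variant: u8,
    /// User data carried by the frame.
    pub payload: &'a [u8],
}

/// Reads a little-endian u32 at offset `at`, or `None` if out of bounds.
fn read_u32_le(input: &[u8], at: usize) -> Option<u32> {
    let b = input.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses one skippable frame from the front of `input` and returns it with
/// the remaining bytes, or `None` if the magic or length does not fit.
pub fn parse_skippable(input: &[u8]) -> Option<(SkippableFrame<'_>, &[u8])> {
    let magic = read_u32_le(input, 0)?;
    if magic & 0xFFFF_FFF0 != 0x184D_2A50 {
        return None;
    }
    let size = read_u32_le(input, 4)? as usize;
    let end = 8usize.checked_add(size)?;
    let payload = input.get(8..end)?;
    let frame = SkippableFrame {
        variant: (magic & 0xF) as u8,
        payload,
    };
    Some((frame, &input[end..]))
}
```

A regular zstd frame (magic 0xFD2FB528) is rejected with `None`, so a caller can fall through to the normal decoder.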

Bilateral commitment status (lsm-tree)

The bilateral commitment with structured-world/coordinode-lsm-tree is preserved. Execution is deferred until drop-in parity (Phase 6) ships. The cross-reference table is kept for context:

| zstd | lsm-tree counterpart | LSM-T# | Direction |
|---|---|---|---|
| #171 | #251 | LSM-T2 | blocks lsm wire-format |
| #176 | #250 | LSM-T1 | spec cross-link |

Labels: P1-high, P2-medium, documentation, enhancement, performance
