Problem
Lazy strategy band (CompressionLevel::Level(5..=15)) is 5-9× slower
than C FFI on mid-compressibility inputs. Ratio is fine across the
band; the regression is purely on speed.
Triggered by user observation during G4+G5 (#182) review: L8 lazy
showed −8.4% vs main (pre-G3) speed on decodecorpus-z000033. Full
matrix below.
Phase 1 — diagnostic matrix (DONE)
Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)
- Zero ratio losses anywhere in the band.
- Compressible scenarios: rust strictly beats ffi.
decodecorpus-z000033: e.g. L8 rust 480589 vs ffi 510499 (−5.9%)
large-log-stream: rust 8947 vs ffi 18582 (−51.8%) on L8+
high-entropy-1m: tie within +3 bytes (frame overhead, irrelevant)
small-1k-random, small-10k-random: incompressible, zero delta
Ratio is not the problem. No action needed on ratio axis.
Speed (criterion 10 samples, single run)
| level |
scenario |
rust thrpt |
ffi thrpt |
rust/ffi |
| L5 lazy |
decodecorpus-z000033 |
15.3 MiB/s |
137.0 MiB/s |
0.11× |
| L8 lazy |
decodecorpus-z000033 |
10.3 MiB/s |
95.0 MiB/s |
0.11× |
| L12 lazy |
decodecorpus-z000033 |
9.0 MiB/s |
70.0 MiB/s |
0.13× |
| L15 lazy |
decodecorpus-z000033 |
9.1 MiB/s |
34.8 MiB/s |
0.26× |
| L5 lazy |
small-4k-log-lines |
85.4 MiB/s |
535.6 MiB/s |
0.16× |
| L8 lazy |
small-4k-log-lines |
79.7 MiB/s |
511.3 MiB/s |
0.16× |
| L12 lazy |
small-4k-log-lines |
83.6 MiB/s |
262.5 MiB/s |
0.32× |
| L15 lazy |
small-4k-log-lines |
83.3 MiB/s |
109.9 MiB/s |
0.76× |
| L15 lazy |
large-log-stream |
282.2 MiB/s |
27.1 MiB/s |
10.4× faster |
Worst-offender cell: L8 lazy × decodecorpus-z000033 (0.11× of FFI).
Phase 2 — root-cause analysis (PENDING)
Lazy uses BackendTag::HashChain per strategy.rs:313. Entry point
candidates to read:
zstd/src/encoding/hc/mod.rs (939 lines) — HashChain matcher,
start_matching for lazy/btopt/btultra/btultra2
zstd/src/encoding/match_generator.rs — driver/dispatcher
- Donor reference:
lib/compress/zstd_lazy.c ZSTD_compressBlock_lazy_*
Hypotheses to validate via reading + flamegraph:
- Hash-chain walk depth not capped per donor's
searchLog —
donor uses cParams.searchLog to bound chain walks; if ours walks
unbounded or with wrong cap, per-position cost balloons.
- Missing
nice_match early-exit — donor breaks the lazy search
as soon as a match ≥ nice_match length is found; we may walk the
full chain on long matches.
- No
prefetch on hash slot loads — donor prefetches the next
hash slot while comparing the current; absence costs L1 misses in
the chain walk.
- Estimator path runs more for lazy — block splitter (
SplitEstimator)
probes scale with sequence count; lazy emits many small sequences
→ more probes than fast/dfast. Confirm via cargo flamegraph on
compare_ffi --bench level_8_lazy/decodecorpus.
- No row matcher for levels 5-9 — donor uses
ZSTD_RowFindBestMatch
for lazy at levels 5..15 with SIMD row hashing; if we route those
through generic HashChain instead, we pay quadratic search cost.
Phase 3 — fix subtasks (after Phase 2)
Subtasks will be carved into G-numbered items mirroring block-split G3-G7
once Phase 2 identifies which of the hypotheses above (and any newly
discovered ones) account for the bulk of the gap. Estimated 3-6 PRs.
Acceptance criteria
- L8 lazy × decodecorpus-z000033 within 0.5× of FFI (vs current 0.11×).
- No ratio regression anywhere in the band.
- L15 lazy × large-log-stream fast-skip preserved (current 10.4× FFI).
Related
Branch
feat/#184-lazy-investigation — Phase 1 done in this branch, no code
changes yet. Subsequent fix branches will be carved per subtask.
Phase 2 — root-cause findings (2026-05-19)
Hot path
MatchGeneratorDriver::start_matching → compress_block::<Lazy> →
HcMatchGenerator::start_matching_strategy::<Lazy> →
start_matching_lazy (match_generator.rs:2973).
Per source position, the body calls:
hc.find_best_match(abs_pos, lit_len) (hc/mod.rs:331) — always
hc.pick_lazy_match(abs_pos, lit_len, best) (hc/mod.rs:351) — if a candidate exists
pick_lazy_match recursively calls find_best_match at:
abs_pos + 1, lit_len + 1 (always when lazy_depth >= 1)
abs_pos + 2, lit_len + 2 (when lazy_depth >= 2, i.e. L7+)
So a committed position at lazy_depth=2 triggers up to 3 chain
walks.
L1 — chain_candidates materializes a 4 KB stack array per call
hc/mod.rs:105-164 — chain_candidates signature:
pub(crate) fn chain_candidates(
&self,
table: &MatchTable,
abs_pos: usize,
) -> [usize; MAX_HC_SEARCH_DEPTH] { /* MAX_HC_SEARCH_DEPTH = 512 */
let mut buf = [usize::MAX; MAX_HC_SEARCH_DEPTH];
...
buf
}
That's a 512 × 8 = 4096 byte stack allocation that gets zero-filled
on entry and returned by value (ABI: memcpy or hidden out-ptr). The
sole production caller (hash_chain_candidate at hc/mod.rs:280)
then iterates the array until usize::MAX. So we do:
- Producer: walk chain, fill array up to
search_depth entries
- Return: 4 KB copy
- Consumer: iterate the materialized array
For L5 lazy (search_depth=4) we still zero-fill 4 KB just to use 4
slots. For L8 lazy (search_depth=24) we use 24 slots of 512. The
zero-fill happens on EVERY position.
Donor (zstd_lazy.c ZSTD_HcFindBestMatch) does NOT materialize a
candidate array — it has one for(; matchIndex >= lowLimit && nbAttempts > 0; nbAttempts--)
loop that compares match length and tracks best (ml in donor)
inline. There is no producer-consumer split.
Multiplied across the lazy_depth=2 lookahead (3× chain walks per
committed position), this is roughly 12 KB of stack zero-fill +
return-copy traffic per accepted match at L7+. That dwarfs the
useful work on inputs like decodecorpus-z000033 where lit-len:match-len
ratios produce many short sequences (= many committed positions).
L2 — producer/consumer split prevents fusion with the speculative gate
Donor's loop fuses three operations per chain link:
- Read next chain pointer
- Compute candidate's
currentMl
- Update
best and the early-exit ip+currentMl == iLimit
Our chain_candidates produces the abs-positions and STOPS; the
consumer in hash_chain_candidate does steps 2-3 in a separate loop.
Fusing them lets the optimizer hoist common reads, schedule the chain
prefetch (donor uses PREFETCH_L1 on chain_table[next & chainMask]
inside the loop), and short-circuit the chain walk as soon as
ip+currentMl == iLimit.
We have early-out at best.match_len >= self.target_len in
hash_chain_candidate — good — but the chain walk still completes the
full search_depth iterations whenever target_len isn't reached,
because the producer's stop condition is filled >= max_chain_steps,
not "consumer is done".
L3 — find_best_match re-runs the whole rep+chain pipeline at +1/+2
pick_lazy_match calls self.find_best_match(table, abs_pos+1, ...)
which re-builds the rep-code candidate from scratch and re-walks the
chain at the new position. Donor's lazy loop reuses some of this: the
chain step at +1 overlaps with the chain step at +0, and the rep
probe only requires reading 4 bytes at known offsets — no chain
re-walk.
We don't share work between the three chain walks in lazy. Possible
optimization: cache the chain-walk result for (abs_pos, search_depth)
across the inner pick_lazy_match calls (only the latest 1-2 results
are needed).
This is more invasive than L1 — depends on confirmed gain from L1
profiling.
Subtasks (carved from Phase 2)
| ID |
Title |
Estimate |
Files |
| L1 |
Inline chain_candidates into hash_chain_candidate — eliminate 4 KB array materialization |
1d |
hc/mod.rs |
| L2 |
Fuse chain-walk and match-length compare; add donor-equivalent PREFETCH_L1 on chain links |
1-2d |
hc/mod.rs |
| L3 |
Share rep + chain results across lazy lookahead (pos, pos+1, pos+2) |
2-3d |
hc/mod.rs, match_generator.rs |
| L4 |
Validate nice_match early-exit parity vs donor (target_len) |
0.5d |
hc/mod.rs pick_lazy_match |
Start with L1 — single most concentrated win, smallest risk, gates
L2 (which inherits its loop structure).
Bench target after L1: L8 lazy × decodecorpus-z000033 from 0.11× FFI to
at least 0.3× FFI. Full matrix re-measure before claiming done.
Problem
Lazy strategy band (
CompressionLevel::Level(5..=15)) is 5-9× slowerthan C FFI on mid-compressibility inputs. Ratio is fine across the
band; the regression is purely on speed.
Triggered by user observation during G4+G5 (#182) review: L8 lazy
showed −8.4% vs main (pre-G3) speed on decodecorpus-z000033. Full
matrix below.
Phase 1 — diagnostic matrix (DONE)
Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)
decodecorpus-z000033: e.g. L8 rust 480589 vs ffi 510499 (−5.9%)large-log-stream: rust 8947 vs ffi 18582 (−51.8%) on L8+high-entropy-1m: tie within +3 bytes (frame overhead, irrelevant)small-1k-random,small-10k-random: incompressible, zero deltaRatio is not the problem. No action needed on ratio axis.
Speed (criterion 10 samples, single run)
Worst-offender cell: L8 lazy × decodecorpus-z000033 (0.11× of FFI).
Phase 2 — root-cause analysis (PENDING)
Lazy uses
BackendTag::HashChainperstrategy.rs:313. Entry pointcandidates to read:
zstd/src/encoding/hc/mod.rs(939 lines) — HashChain matcher,start_matchingfor lazy/btopt/btultra/btultra2zstd/src/encoding/match_generator.rs— driver/dispatcherlib/compress/zstd_lazy.cZSTD_compressBlock_lazy_*Hypotheses to validate via reading + flamegraph:
searchLog—donor uses
cParams.searchLogto bound chain walks; if ours walksunbounded or with wrong cap, per-position cost balloons.
nice_matchearly-exit — donor breaks the lazy searchas soon as a match ≥
nice_matchlength is found; we may walk thefull chain on long matches.
prefetchon hash slot loads — donor prefetches the nexthash slot while comparing the current; absence costs L1 misses in
the chain walk.
SplitEstimator)probes scale with sequence count; lazy emits many small sequences
→ more probes than fast/dfast. Confirm via
cargo flamegraphoncompare_ffi --bench level_8_lazy/decodecorpus.ZSTD_RowFindBestMatchfor lazy at levels 5..15 with SIMD row hashing; if we route those
through generic HashChain instead, we pay quadratic search cost.
Phase 3 — fix subtasks (after Phase 2)
Subtasks will be carved into G-numbered items mirroring block-split G3-G7
once Phase 2 identifies which of the hypotheses above (and any newly
discovered ones) account for the bulk of the gap. Estimated 3-6 PRs.
Acceptance criteria
Related
Branch
feat/#184-lazy-investigation— Phase 1 done in this branch, no codechanges yet. Subsequent fix branches will be carved per subtask.
Phase 2 — root-cause findings (2026-05-19)
Hot path
MatchGeneratorDriver::start_matching→compress_block::<Lazy>→HcMatchGenerator::start_matching_strategy::<Lazy>→start_matching_lazy(match_generator.rs:2973).Per source position, the body calls:
hc.find_best_match(abs_pos, lit_len)(hc/mod.rs:331) — alwayshc.pick_lazy_match(abs_pos, lit_len, best)(hc/mod.rs:351) — if a candidate existspick_lazy_matchrecursively callsfind_best_matchat:abs_pos + 1, lit_len + 1(always whenlazy_depth >= 1)abs_pos + 2, lit_len + 2(whenlazy_depth >= 2, i.e. L7+)So a committed position at
lazy_depth=2triggers up to 3 chainwalks.
L1 —
chain_candidatesmaterializes a 4 KB stack array per callhc/mod.rs:105-164—chain_candidatessignature:That's a
512 × 8 = 4096 bytestack allocation that gets zero-filledon entry and returned by value (ABI: memcpy or hidden out-ptr). The
sole production caller (
hash_chain_candidateathc/mod.rs:280)then iterates the array until
usize::MAX. So we do:search_depthentriesFor L5 lazy (
search_depth=4) we still zero-fill 4 KB just to use 4slots. For L8 lazy (
search_depth=24) we use 24 slots of 512. Thezero-fill happens on EVERY position.
Donor (
zstd_lazy.cZSTD_HcFindBestMatch) does NOT materialize acandidate array — it has one
for(; matchIndex >= lowLimit && nbAttempts > 0; nbAttempts--)loop that compares match length and tracks
best(mlin donor)inline. There is no producer-consumer split.
Multiplied across the
lazy_depth=2lookahead (3× chain walks percommitted position), this is roughly 12 KB of stack zero-fill +
return-copy traffic per accepted match at L7+. That dwarfs the
useful work on inputs like
decodecorpus-z000033where lit-len:match-lenratios produce many short sequences (= many committed positions).
L2 — producer/consumer split prevents fusion with the speculative gate
Donor's loop fuses three operations per chain link:
currentMlbestand the early-exitip+currentMl == iLimitOur
chain_candidatesproduces the abs-positions and STOPS; theconsumer in
hash_chain_candidatedoes steps 2-3 in a separate loop.Fusing them lets the optimizer hoist common reads, schedule the chain
prefetch (donor uses
PREFETCH_L1onchain_table[next & chainMask]inside the loop), and short-circuit the chain walk as soon as
ip+currentMl == iLimit.We have early-out at
best.match_len >= self.target_leninhash_chain_candidate— good — but the chain walk still completes thefull
search_depthiterations whenever target_len isn't reached,because the producer's stop condition is
filled >= max_chain_steps,not "consumer is done".
L3 —
find_best_matchre-runs the whole rep+chain pipeline at +1/+2pick_lazy_matchcallsself.find_best_match(table, abs_pos+1, ...)which re-builds the rep-code candidate from scratch and re-walks the
chain at the new position. Donor's lazy loop reuses some of this: the
chain step at
+1overlaps with the chain step at+0, and the repprobe only requires reading 4 bytes at known offsets — no chain
re-walk.
We don't share work between the three chain walks in lazy. Possible
optimization: cache the chain-walk result for
(abs_pos, search_depth)across the inner pick_lazy_match calls (only the latest 1-2 results
are needed).
This is more invasive than L1 — depends on confirmed gain from L1
profiling.
Subtasks (carved from Phase 2)
chain_candidatesintohash_chain_candidate— eliminate 4 KB array materializationhc/mod.rsPREFETCH_L1on chain linkshc/mod.rspos,pos+1,pos+2)hc/mod.rs,match_generator.rsnice_matchearly-exit parity vs donor (target_len)hc/mod.rspick_lazy_matchStart with L1 — single most concentrated win, smallest risk, gates
L2 (which inherits its loop structure).
Bench target after L1: L8 lazy × decodecorpus-z000033 from 0.11× FFI to
at least 0.3× FFI. Full matrix re-measure before claiming done.