Skip to content

perf+ratio(encoder): lazy band investigation — ratio and speed regressions vs FFI #184

@polaz

Description

@polaz

Problem

Lazy strategy band (CompressionLevel::Level(5..=15)) is 5-9× slower
than C FFI
on mid-compressibility inputs. Ratio is fine across the
band; the regression is purely on speed.

Triggered by user observation during G4+G5 (#182) review: L8 lazy
showed −8.4% vs main (pre-G3) speed on decodecorpus-z000033. Full
matrix below.

Phase 1 — diagnostic matrix (DONE)

Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)

  • Zero ratio losses anywhere in the band.
  • Compressible scenarios: rust strictly beats ffi.
    • decodecorpus-z000033: e.g. L8 rust 480589 vs ffi 510499 (−5.9%)
    • large-log-stream: rust 8947 vs ffi 18582 (−51.8%) on L8+
  • high-entropy-1m: tie within +3 bytes (frame overhead, irrelevant)
  • small-1k-random, small-10k-random: incompressible, zero delta

Ratio is not the problem. No action needed on ratio axis.

Speed (criterion 10 samples, single run)

level scenario rust thrpt ffi thrpt rust/ffi
L5 lazy decodecorpus-z000033 15.3 MiB/s 137.0 MiB/s 0.11×
L8 lazy decodecorpus-z000033 10.3 MiB/s 95.0 MiB/s 0.11×
L12 lazy decodecorpus-z000033 9.0 MiB/s 70.0 MiB/s 0.13×
L15 lazy decodecorpus-z000033 9.1 MiB/s 34.8 MiB/s 0.26×
L5 lazy small-4k-log-lines 85.4 MiB/s 535.6 MiB/s 0.16×
L8 lazy small-4k-log-lines 79.7 MiB/s 511.3 MiB/s 0.16×
L12 lazy small-4k-log-lines 83.6 MiB/s 262.5 MiB/s 0.32×
L15 lazy small-4k-log-lines 83.3 MiB/s 109.9 MiB/s 0.76×
L15 lazy large-log-stream 282.2 MiB/s 27.1 MiB/s 10.4× faster

Worst-offender cell: L8 lazy × decodecorpus-z000033 (0.11× of FFI).

Phase 2 — root-cause analysis (PENDING)

Lazy uses BackendTag::HashChain per strategy.rs:313. Entry point
candidates to read:

  • zstd/src/encoding/hc/mod.rs (939 lines) — HashChain matcher,
    start_matching for lazy/btopt/btultra/btultra2
  • zstd/src/encoding/match_generator.rs — driver/dispatcher
  • Donor reference: lib/compress/zstd_lazy.c ZSTD_compressBlock_lazy_*

Hypotheses to validate via reading + flamegraph:

  1. Hash-chain walk depth not capped per donor's searchLog
    donor uses cParams.searchLog to bound chain walks; if ours walks
    unbounded or with wrong cap, per-position cost balloons.
  2. Missing nice_match early-exit — donor breaks the lazy search
    as soon as a match ≥ nice_match length is found; we may walk the
    full chain on long matches.
  3. No prefetch on hash slot loads — donor prefetches the next
    hash slot while comparing the current; absence costs L1 misses in
    the chain walk.
  4. Estimator path runs more for lazy — block splitter (SplitEstimator)
    probes scale with sequence count; lazy emits many small sequences
    → more probes than fast/dfast. Confirm via cargo flamegraph on
    compare_ffi --bench level_8_lazy/decodecorpus.
  5. No row matcher for levels 5-9 — donor uses ZSTD_RowFindBestMatch
    for lazy at levels 5..15 with SIMD row hashing; if we route those
    through generic HashChain instead, we pay quadratic search cost.

Phase 3 — fix subtasks (after Phase 2)

Subtasks will be carved into G-numbered items mirroring block-split G3-G7
once Phase 2 identifies which of the hypotheses above (and any newly
discovered ones) account for the bulk of the gap. Estimated 3-6 PRs.

Acceptance criteria

  • L8 lazy × decodecorpus-z000033 within 0.5× of FFI (vs current 0.11×).
  • No ratio regression anywhere in the band.
  • L15 lazy × large-log-stream fast-skip preserved (current 10.4× FFI).

Related

Branch

feat/#184-lazy-investigation — Phase 1 done in this branch, no code
changes yet. Subsequent fix branches will be carved per subtask.


Phase 2 — root-cause findings (2026-05-19)

Hot path

MatchGeneratorDriver::start_matchingcompress_block::<Lazy>
HcMatchGenerator::start_matching_strategy::<Lazy>
start_matching_lazy (match_generator.rs:2973).

Per source position, the body calls:

  • hc.find_best_match(abs_pos, lit_len) (hc/mod.rs:331) — always
  • hc.pick_lazy_match(abs_pos, lit_len, best) (hc/mod.rs:351) — if a candidate exists

pick_lazy_match recursively calls find_best_match at:

  • abs_pos + 1, lit_len + 1 (always when lazy_depth >= 1)
  • abs_pos + 2, lit_len + 2 (when lazy_depth >= 2, i.e. L7+)

So a committed position at lazy_depth=2 triggers up to 3 chain
walks
.

L1 — chain_candidates materializes a 4 KB stack array per call

hc/mod.rs:105-164chain_candidates signature:

pub(crate) fn chain_candidates(
    &self,
    table: &MatchTable,
    abs_pos: usize,
) -> [usize; MAX_HC_SEARCH_DEPTH] { /* MAX_HC_SEARCH_DEPTH = 512 */
    let mut buf = [usize::MAX; MAX_HC_SEARCH_DEPTH];
    ...
    buf
}

That's a 512 × 8 = 4096 byte stack allocation that gets zero-filled
on entry and returned by value (ABI: memcpy or hidden out-ptr). The
sole production caller (hash_chain_candidate at hc/mod.rs:280)
then iterates the array until usize::MAX. So we do:

  1. Producer: walk chain, fill array up to search_depth entries
  2. Return: 4 KB copy
  3. Consumer: iterate the materialized array

For L5 lazy (search_depth=4) we still zero-fill 4 KB just to use 4
slots. For L8 lazy (search_depth=24) we use 24 slots of 512. The
zero-fill happens on EVERY position.

Donor (zstd_lazy.c ZSTD_HcFindBestMatch) does NOT materialize a
candidate array — it has one for(; matchIndex >= lowLimit && nbAttempts > 0; nbAttempts--)
loop that compares match length and tracks best (ml in donor)
inline. There is no producer-consumer split.

Multiplied across the lazy_depth=2 lookahead (3× chain walks per
committed position), this is roughly 12 KB of stack zero-fill +
return-copy traffic per accepted match
at L7+. That dwarfs the
useful work on inputs like decodecorpus-z000033 where lit-len:match-len
ratios produce many short sequences (= many committed positions).

L2 — producer/consumer split prevents fusion with the speculative gate

Donor's loop fuses three operations per chain link:

  1. Read next chain pointer
  2. Compute candidate's currentMl
  3. Update best and the early-exit ip+currentMl == iLimit

Our chain_candidates produces the abs-positions and STOPS; the
consumer in hash_chain_candidate does steps 2-3 in a separate loop.
Fusing them lets the optimizer hoist common reads, schedule the chain
prefetch (donor uses PREFETCH_L1 on chain_table[next & chainMask]
inside the loop), and short-circuit the chain walk as soon as
ip+currentMl == iLimit.

We have early-out at best.match_len >= self.target_len in
hash_chain_candidate — good — but the chain walk still completes the
full search_depth iterations whenever target_len isn't reached,
because the producer's stop condition is filled >= max_chain_steps,
not "consumer is done".

L3 — find_best_match re-runs the whole rep+chain pipeline at +1/+2

pick_lazy_match calls self.find_best_match(table, abs_pos+1, ...)
which re-builds the rep-code candidate from scratch and re-walks the
chain at the new position. Donor's lazy loop reuses some of this: the
chain step at +1 overlaps with the chain step at +0, and the rep
probe only requires reading 4 bytes at known offsets — no chain
re-walk.

We don't share work between the three chain walks in lazy. Possible
optimization: cache the chain-walk result for (abs_pos, search_depth)
across the inner pick_lazy_match calls (only the latest 1-2 results
are needed).

This is more invasive than L1 — depends on confirmed gain from L1
profiling.

Subtasks (carved from Phase 2)

ID Title Estimate Files
L1 Inline chain_candidates into hash_chain_candidate — eliminate 4 KB array materialization 1d hc/mod.rs
L2 Fuse chain-walk and match-length compare; add donor-equivalent PREFETCH_L1 on chain links 1-2d hc/mod.rs
L3 Share rep + chain results across lazy lookahead (pos, pos+1, pos+2) 2-3d hc/mod.rs, match_generator.rs
L4 Validate nice_match early-exit parity vs donor (target_len) 0.5d hc/mod.rs pick_lazy_match

Start with L1 — single most concentrated win, smallest risk, gates
L2 (which inherits its loop structure).

Bench target after L1: L8 lazy × decodecorpus-z000033 from 0.11× FFI to
at least 0.3× FFI. Full matrix re-measure before claiming done.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1-highHigh priority — core functionalityenhancementNew feature or requestperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions