perf+ratio(encoder): lazy band investigation — ratio and speed regressions vs FFI

## Problem

Lazy strategy band (`CompressionLevel::Level(5..=15)`) is **5-9× slower
than C FFI** on mid-compressibility inputs. Ratio is fine across the
band; the regression is purely on speed.

Triggered by user observation during G4+G5 (#182) review: L8 lazy
showed −8.4% vs main (pre-G3) speed on decodecorpus-z000033. Full
matrix below.

## Phase 1 — diagnostic matrix (DONE)

### Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)

- **Zero ratio losses** anywhere in the band.
- Compressible scenarios: rust strictly beats ffi.
  - `decodecorpus-z000033`: e.g. L8 rust 480589 vs ffi 510499 (−5.9%)
  - `large-log-stream`: rust 8947 vs ffi 18582 (−51.8%) on L8+
- `high-entropy-1m`: tie within +3 bytes (frame overhead, irrelevant)
- `small-1k-random`, `small-10k-random`: incompressible, zero delta

Ratio is **not the problem**. No action needed on ratio axis.

### Speed (criterion 10 samples, single run)

| level | scenario | rust thrpt | ffi thrpt | rust/ffi |
|---|---|---:|---:|---:|
| L5 lazy | decodecorpus-z000033 | 15.3 MiB/s | 137.0 MiB/s | **0.11×** |
| L8 lazy | decodecorpus-z000033 | 10.3 MiB/s | 95.0 MiB/s | **0.11×** |
| L12 lazy | decodecorpus-z000033 | 9.0 MiB/s | 70.0 MiB/s | 0.13× |
| L15 lazy | decodecorpus-z000033 | 9.1 MiB/s | 34.8 MiB/s | 0.26× |
| L5 lazy | small-4k-log-lines | 85.4 MiB/s | 535.6 MiB/s | 0.16× |
| L8 lazy | small-4k-log-lines | 79.7 MiB/s | 511.3 MiB/s | 0.16× |
| L12 lazy | small-4k-log-lines | 83.6 MiB/s | 262.5 MiB/s | 0.32× |
| L15 lazy | small-4k-log-lines | 83.3 MiB/s | 109.9 MiB/s | 0.76× |
| L15 lazy | large-log-stream | 282.2 MiB/s | 27.1 MiB/s | **10.4× faster** |

Worst-offender cell: **L8 lazy × decodecorpus-z000033** (0.11× of FFI).

## Phase 2 — root-cause analysis (PENDING)

Lazy uses `BackendTag::HashChain` per `strategy.rs:313`. Entry point
candidates to read:
- `zstd/src/encoding/hc/mod.rs` (939 lines) — HashChain matcher,
  `start_matching` for lazy/btopt/btultra/btultra2
- `zstd/src/encoding/match_generator.rs` — driver/dispatcher
- Donor reference: `lib/compress/zstd_lazy.c` `ZSTD_compressBlock_lazy_*`

Hypotheses to validate via reading + flamegraph:
1. **Hash-chain walk depth not capped per donor's `searchLog`** —
   donor uses `cParams.searchLog` to bound chain walks; if ours walks
   unbounded or with wrong cap, per-position cost balloons.
2. **Missing `nice_match` early-exit** — donor breaks the lazy search
   as soon as a match ≥ `nice_match` length is found; we may walk the
   full chain on long matches.
3. **No `prefetch` on hash slot loads** — donor prefetches the next
   hash slot while comparing the current; absence costs L1 misses in
   the chain walk.
4. **Estimator path runs more for lazy** — block splitter (`SplitEstimator`)
   probes scale with sequence count; lazy emits many small sequences
   → more probes than fast/dfast. Confirm via `cargo flamegraph` on
   `compare_ffi --bench level_8_lazy/decodecorpus`.
5. **No row matcher for levels 5-9** — donor uses `ZSTD_RowFindBestMatch`
   for lazy at levels 5..15 with SIMD row hashing; if we route those
   through generic HashChain instead, we pay quadratic search cost.

## Phase 3 — fix subtasks (after Phase 2)

Subtasks will be carved into G-numbered items mirroring block-split G3-G7
once Phase 2 identifies which of the hypotheses above (and any newly
discovered ones) account for the bulk of the gap. Estimated 3-6 PRs.

## Acceptance criteria

- L8 lazy × decodecorpus-z000033 within 0.5× of FFI (vs current 0.11×).
- No ratio regression anywhere in the band.
- L15 lazy × large-log-stream fast-skip preserved (current 10.4× FFI).

## Related

- PR #182 G4+G5 documented the −8% finding pre-merge
- #23 (block-split track)

## Branch

`feat/#184-lazy-investigation` — Phase 1 done in this branch, no code
changes yet. Subsequent fix branches will be carved per subtask.


---

## Phase 2 — root-cause findings (2026-05-19)

### Hot path

`MatchGeneratorDriver::start_matching` → `compress_block::<Lazy>` →
`HcMatchGenerator::start_matching_strategy::<Lazy>` →
`start_matching_lazy` (`match_generator.rs:2973`).

Per source position, the body calls:
- `hc.find_best_match(abs_pos, lit_len)` (`hc/mod.rs:331`) — always
- `hc.pick_lazy_match(abs_pos, lit_len, best)` (`hc/mod.rs:351`) — if a candidate exists

`pick_lazy_match` recursively calls `find_best_match` at:
- `abs_pos + 1, lit_len + 1` (always when `lazy_depth >= 1`)
- `abs_pos + 2, lit_len + 2` (when `lazy_depth >= 2`, i.e. L7+)

So a committed position at `lazy_depth=2` triggers up to **3 chain
walks**.

### L1 — `chain_candidates` materializes a 4 KB stack array per call

`hc/mod.rs:105-164` — `chain_candidates` signature:
```rust
pub(crate) fn chain_candidates(
    &self,
    table: &MatchTable,
    abs_pos: usize,
) -> [usize; MAX_HC_SEARCH_DEPTH] { /* MAX_HC_SEARCH_DEPTH = 512 */
    let mut buf = [usize::MAX; MAX_HC_SEARCH_DEPTH];
    ...
    buf
}
```

That's a `512 × 8 = 4096 byte` stack allocation that gets zero-filled
on entry and returned by value (ABI: memcpy or hidden out-ptr). The
sole production caller (`hash_chain_candidate` at `hc/mod.rs:280`)
then iterates the array until `usize::MAX`. So we do:

1. Producer: walk chain, fill array up to `search_depth` entries
2. Return: 4 KB copy
3. Consumer: iterate the materialized array

For L5 lazy (`search_depth=4`) we still zero-fill 4 KB just to use 4
slots. For L8 lazy (`search_depth=24`) we use 24 slots of 512. The
zero-fill happens on EVERY position.

**Donor** (`zstd_lazy.c` `ZSTD_HcFindBestMatch`) does NOT materialize a
candidate array — it has one `for(; matchIndex >= lowLimit && nbAttempts > 0; nbAttempts--)`
loop that compares match length and tracks `best` (`ml` in donor)
inline. There is no producer-consumer split.

Multiplied across the `lazy_depth=2` lookahead (3× chain walks per
committed position), this is roughly **12 KB of stack zero-fill +
return-copy traffic per accepted match** at L7+. That dwarfs the
useful work on inputs like `decodecorpus-z000033` where lit-len:match-len
ratios produce many short sequences (= many committed positions).

### L2 — producer/consumer split prevents fusion with the speculative gate

Donor's loop fuses three operations per chain link:
1. Read next chain pointer
2. Compute candidate's `currentMl`
3. Update `best` and the early-exit `ip+currentMl == iLimit`

Our `chain_candidates` produces the abs-positions and STOPS; the
consumer in `hash_chain_candidate` does steps 2-3 in a separate loop.
Fusing them lets the optimizer hoist common reads, schedule the chain
prefetch (donor uses `PREFETCH_L1` on `chain_table[next & chainMask]`
inside the loop), and short-circuit the chain walk as soon as
`ip+currentMl == iLimit`.

We have early-out at `best.match_len >= self.target_len` in
`hash_chain_candidate` — good — but the chain walk still completes the
full `search_depth` iterations whenever target_len isn't reached,
because the producer's stop condition is `filled >= max_chain_steps`,
not "consumer is done".

### L3 — `find_best_match` re-runs the whole rep+chain pipeline at +1/+2

`pick_lazy_match` calls `self.find_best_match(table, abs_pos+1, ...)`
which re-builds the rep-code candidate from scratch and re-walks the
chain at the new position. Donor's lazy loop reuses some of this: the
chain step at `+1` overlaps with the chain step at `+0`, and the rep
probe only requires reading 4 bytes at known offsets — no chain
re-walk.

We don't share work between the three chain walks in lazy. Possible
optimization: cache the chain-walk result for `(abs_pos, search_depth)`
across the inner pick_lazy_match calls (only the latest 1-2 results
are needed).

This is more invasive than L1 — depends on confirmed gain from L1
profiling.

## Subtasks (carved from Phase 2)

| ID | Title | Estimate | Files |
|---|---|---|---|
| **L1** | Inline `chain_candidates` into `hash_chain_candidate` — eliminate 4 KB array materialization | 1d | `hc/mod.rs` |
| L2 | Fuse chain-walk and match-length compare; add donor-equivalent `PREFETCH_L1` on chain links | 1-2d | `hc/mod.rs` |
| L3 | Share rep + chain results across lazy lookahead (`pos`, `pos+1`, `pos+2`) | 2-3d | `hc/mod.rs`, `match_generator.rs` |
| L4 | Validate `nice_match` early-exit parity vs donor (`target_len`) | 0.5d | `hc/mod.rs` `pick_lazy_match` |

Start with **L1** — single most concentrated win, smallest risk, gates
L2 (which inherits its loop structure).

Bench target after L1: L8 lazy × decodecorpus-z000033 from 0.11× FFI to
at least 0.3× FFI. Full matrix re-measure before claiming done.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf+ratio(encoder): lazy band investigation — ratio and speed regressions vs FFI #184

Problem

Phase 1 — diagnostic matrix (DONE)

Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)

Speed (criterion 10 samples, single run)

Phase 2 — root-cause analysis (PENDING)

Phase 3 — fix subtasks (after Phase 2)

Acceptance criteria

Related

Branch

Phase 2 — root-cause findings (2026-05-19)

Hot path

L1 — `chain_candidates` materializes a 4 KB stack array per call

L2 — producer/consumer split prevents fusion with the speculative gate

L3 — `find_best_match` re-runs the whole rep+chain pipeline at +1/+2

Subtasks (carved from Phase 2)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

level	scenario	rust thrpt	ffi thrpt	rust/ffi
L5 lazy	decodecorpus-z000033	15.3 MiB/s	137.0 MiB/s	0.11×
L8 lazy	decodecorpus-z000033	10.3 MiB/s	95.0 MiB/s	0.11×
L12 lazy	decodecorpus-z000033	9.0 MiB/s	70.0 MiB/s	0.13×
L15 lazy	decodecorpus-z000033	9.1 MiB/s	34.8 MiB/s	0.26×
L5 lazy	small-4k-log-lines	85.4 MiB/s	535.6 MiB/s	0.16×
L8 lazy	small-4k-log-lines	79.7 MiB/s	511.3 MiB/s	0.16×
L12 lazy	small-4k-log-lines	83.6 MiB/s	262.5 MiB/s	0.32×
L15 lazy	small-4k-log-lines	83.3 MiB/s	109.9 MiB/s	0.76×
L15 lazy	large-log-stream	282.2 MiB/s	27.1 MiB/s	10.4× faster

ID	Title	Estimate	Files
L1	Inline `chain_candidates` into `hash_chain_candidate` — eliminate 4 KB array materialization	1d	`hc/mod.rs`
L2	Fuse chain-walk and match-length compare; add donor-equivalent `PREFETCH_L1` on chain links	1-2d	`hc/mod.rs`
L3	Share rep + chain results across lazy lookahead (`pos`, `pos+1`, `pos+2`)	2-3d	`hc/mod.rs`, `match_generator.rs`
L4	Validate `nice_match` early-exit parity vs donor (`target_len`)	0.5d	`hc/mod.rs` `pick_lazy_match`

perf+ratio(encoder): lazy band investigation — ratio and speed regressions vs FFI #184

Description

Problem

Phase 1 — diagnostic matrix (DONE)

Ratio (REPORT lines, 77 cells across 11 lazy levels × 7 scenarios)

Speed (criterion 10 samples, single run)

Phase 2 — root-cause analysis (PENDING)

Phase 3 — fix subtasks (after Phase 2)

Acceptance criteria

Related

Branch

Phase 2 — root-cause findings (2026-05-19)

Hot path

L1 — chain_candidates materializes a 4 KB stack array per call

L2 — producer/consumer split prevents fusion with the speculative gate

L3 — find_best_match re-runs the whole rep+chain pipeline at +1/+2

Subtasks (carved from Phase 2)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

L1 — `chain_candidates` materializes a 4 KB stack array per call

L3 — `find_best_match` re-runs the whole rep+chain pipeline at +1/+2