perf(encoder): skip_matching_with_hint dominates self-time (~25%) — drop eager block-boundary inserts

## Status

Investigation + speed track. While benchmarking the new donor-parity greedy parse at L4 (PR #179), profiled `compress/level_4_greedy/large-log-stream/matrix/pure_rust` with `samply` and discovered that **`RowMatchGenerator::skip_matching_with_hint` is the dominant Rust self-time consumer (~25%)** on streams with many block boundaries. The 102M `insert_position` calls (104 MB stream / 128 KiB block × ~128K inserts per block) overshadow any per-iteration delta in the parse loop.

This is independent of PR #179 — `start_matching` (the inner parse loop) and `skip_matching_with_hint` (called once per block at boundary setup) are separate hot paths. The greedy PR's +9% regression on large-log-stream is a downstream symptom of this structural bottleneck.

## Profile evidence

`samply record` on L4 large-log-stream (8 s profile-time on macOS aarch64 release build, 26 210 samples). Top self-time, symbolicated via `atos`:

| % self time | Function |
|---|---|
| 22% | `ZSTD_ldm_fillHashTable` (donor C — c_ffi side of the bench, not us) |
| **13%** | `RowMatchGenerator::skip_matching_with_hint` (+0x40) |
| 5.4% | `RowMatchGenerator::skip_matching_with_hint` (+0x4C) |
| 3.0% | `RowMatchGenerator::skip_matching_with_hint` (+0x5C) |
| ~25% | **(sum across `skip_matching_with_hint` offsets — dominant Rust self-time)** |
| 2.9% | `MatchTable::insert_hash3_only_no_rebase` (HC matcher, other scenarios in profile) |
| 1.5% | `ZSTD_decodeLiteralsBlock` (donor C decode) |

## What `skip_matching_with_hint` actually does

`zstd/src/encoding/row/mod.rs:139`:

```rust
pub(crate) fn skip_matching_with_hint(&mut self, incompressible_hint: Option<bool>) {
    self.ensure_tables();
    let current_len = self.window.back().unwrap().len();
    let current_abs_start = ...;
    let current_abs_end = ...;
    let backfill_start = self.backfill_start(current_abs_start);
    if backfill_start < current_abs_start {
        self.insert_positions(backfill_start, current_abs_start);
    }
    if incompressible_hint == Some(true) {
        // sparse step (only every 8th position)
    } else {
        // DENSE: insert every position in the block
        self.insert_positions(current_abs_start, current_abs_end);
    }
}
```

For a 128 KiB block with `incompressible_hint = None | Some(false)`, this inserts **131 072 positions** into the row-hash table. On a 104 MB stream that's ~800 blocks × 131 072 = **~104.8 M `insert_position` calls** spent on hash-table maintenance alone, before the parse loop even starts.

Each `insert_position` does `hash_and_row` (u32 read + multiply + masking) + 3 indexed writes to `row_heads` / `row_tags` / `row_positions`. Cheap individually but the volume swamps everything else.

## Hypothesis: donor's `ZSTD_row_fillHashCache` is sparser

Donor `zstd_lazy.c:ZSTD_row_fillHashCache` only fills the hash cache for `ZSTD_ROW_HASH_CACHE_SIZE = 8` positions (it pre-computes the next 8 positions' hashes to feed the SIMD prefetch in `ZSTD_RowFindBestMatch`). It does NOT insert all block-byte positions into the row table at block boundary — those inserts happen lazily during the parse loop via `ZSTD_row_update` per emitted match.

Our `skip_matching_with_hint` is doing what donor does *inside the parse loop on demand*, but *eagerly at block setup*. That's why our 100 MB stream pays 100 M inserts instead of inserts only for the few positions that actually get searched.

If confirmed, the fix is to drop the eager `insert_positions(current_abs_start, current_abs_end)` and let `start_matching` / `start_matching_greedy` insert positions as they walk the block. Position-insert volume drops from O(block_size) to O(emitted_sequences * MIN_MATCH).

## Expected impact

- **L4 large-log-stream**: the +9% regression in #179 is dominated by `skip_matching_with_hint`. Removing it should swing that cell back to a speed win (large-log-stream parse loop hits perfectly-periodic matches → very few `insert_position` calls during the loop itself).
- **L5-L15 (Row backend, lazy)**: similar gain expected on any scenario with many block boundaries (large inputs). Smaller scenarios (≤ block size) see no change.
- **L1-L3 fast/dfast**: not affected (different backend, different skip-matching path).

## Acceptance criteria

- [ ] Verify hypothesis: drop the dense `insert_positions(current_abs_start, current_abs_end)` from `skip_matching_with_hint`; let `start_matching` / `start_matching_greedy` do per-iteration inserts.
- [ ] Confirm correctness: existing test suite must stay green (519/519 in PR #179 — same gates).
- [ ] Confirm ratio preserved: `compare_ffi` REPORT lines `rust_bytes <= ffi_bytes` constraint per CLAUDE.md.
- [ ] Measure: L4 large-log-stream time delta vs main; expect the +9% to flip to negative.
- [ ] Re-profile to confirm `skip_matching_with_hint` no longer dominates self-time.
- [ ] If hypothesis wrong, fall back to: instrument the call to find which call site (dictionary priming / block boundary / etc.) triggers it most, then optimize that site.

## Related

- PR #179 (greedy parse at L4 — introduces measurement of this issue)
- #178 (parent: decode/encode shared primitives + greedy)
- #28 (project roadmap)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(encoder): skip_matching_with_hint dominates self-time (~25%) — drop eager block-boundary inserts #180

Status

Profile evidence

What `skip_matching_with_hint` actually does

Hypothesis: donor's `ZSTD_row_fillHashCache` is sparser

Expected impact

Acceptance criteria

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

% self time	Function
22%	`ZSTD_ldm_fillHashTable` (donor C — c_ffi side of the bench, not us)
13%	`RowMatchGenerator::skip_matching_with_hint` (+0x40)
5.4%	`RowMatchGenerator::skip_matching_with_hint` (+0x4C)
3.0%	`RowMatchGenerator::skip_matching_with_hint` (+0x5C)
~25%	(sum across `skip_matching_with_hint` offsets — dominant Rust self-time)
2.9%	`MatchTable::insert_hash3_only_no_rebase` (HC matcher, other scenarios in profile)
1.5%	`ZSTD_decodeLiteralsBlock` (donor C decode)

perf(encoder): skip_matching_with_hint dominates self-time (~25%) — drop eager block-boundary inserts #180

Description

Status

Profile evidence

What skip_matching_with_hint actually does

Hypothesis: donor's ZSTD_row_fillHashCache is sparser

Expected impact

Acceptance criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

What `skip_matching_with_hint` actually does

Hypothesis: donor's `ZSTD_row_fillHashCache` is sparser