Skip to content

perf(encoder): skip_matching_with_hint dominates self-time (~25%) — drop eager block-boundary inserts #180

@polaz

Description

@polaz

Status

Investigation + speed track. While benchmarking the new donor-parity greedy parse at L4 (PR #179), profiled compress/level_4_greedy/large-log-stream/matrix/pure_rust with samply and discovered that RowMatchGenerator::skip_matching_with_hint is the dominant Rust self-time consumer (~25%) on streams with many block boundaries. The 102M insert_position calls (104 MB stream / 128 KiB block × ~128K inserts per block) overshadow any per-iteration delta in the parse loop.

This is independent of PR #179start_matching (the inner parse loop) and skip_matching_with_hint (called once per block at boundary setup) are separate hot paths. The greedy PR's +9% regression on large-log-stream is a downstream symptom of this structural bottleneck.

Profile evidence

samply record on L4 large-log-stream (8 s profile-time on macOS aarch64 release build, 26 210 samples). Top self-time, symbolicated via atos:

% self time Function
22% ZSTD_ldm_fillHashTable (donor C — c_ffi side of the bench, not us)
13% RowMatchGenerator::skip_matching_with_hint (+0x40)
5.4% RowMatchGenerator::skip_matching_with_hint (+0x4C)
3.0% RowMatchGenerator::skip_matching_with_hint (+0x5C)
~25% (sum across skip_matching_with_hint offsets — dominant Rust self-time)
2.9% MatchTable::insert_hash3_only_no_rebase (HC matcher, other scenarios in profile)
1.5% ZSTD_decodeLiteralsBlock (donor C decode)

What skip_matching_with_hint actually does

zstd/src/encoding/row/mod.rs:139:

pub(crate) fn skip_matching_with_hint(&mut self, incompressible_hint: Option<bool>) {
    self.ensure_tables();
    let current_len = self.window.back().unwrap().len();
    let current_abs_start = ...;
    let current_abs_end = ...;
    let backfill_start = self.backfill_start(current_abs_start);
    if backfill_start < current_abs_start {
        self.insert_positions(backfill_start, current_abs_start);
    }
    if incompressible_hint == Some(true) {
        // sparse step (only every 8th position)
    } else {
        // DENSE: insert every position in the block
        self.insert_positions(current_abs_start, current_abs_end);
    }
}

For a 128 KiB block with incompressible_hint = None | Some(false), this inserts 131 072 positions into the row-hash table. On a 104 MB stream that's ~800 blocks × 131 072 = ~104.8 M insert_position calls spent on hash-table maintenance alone, before the parse loop even starts.

Each insert_position does hash_and_row (u32 read + multiply + masking) + 3 indexed writes to row_heads / row_tags / row_positions. Cheap individually but the volume swamps everything else.

Hypothesis: donor's ZSTD_row_fillHashCache is sparser

Donor zstd_lazy.c:ZSTD_row_fillHashCache only fills the hash cache for ZSTD_ROW_HASH_CACHE_SIZE = 8 positions (it pre-computes the next 8 positions' hashes to feed the SIMD prefetch in ZSTD_RowFindBestMatch). It does NOT insert all block-byte positions into the row table at block boundary — those inserts happen lazily during the parse loop via ZSTD_row_update per emitted match.

Our skip_matching_with_hint is doing what donor does inside the parse loop on demand, but eagerly at block setup. That's why our 100 MB stream pays 100 M inserts instead of inserts only for the few positions that actually get searched.

If confirmed, the fix is to drop the eager insert_positions(current_abs_start, current_abs_end) and let start_matching / start_matching_greedy insert positions as they walk the block. Position-insert volume drops from O(block_size) to O(emitted_sequences * MIN_MATCH).

Expected impact

  • L4 large-log-stream: the +9% regression in perf(encoder): donor-parity greedy parse at L4 — ratio + speed win #179 is dominated by skip_matching_with_hint. Removing it should swing that cell back to a speed win (large-log-stream parse loop hits perfectly-periodic matches → very few insert_position calls during the loop itself).
  • L5-L15 (Row backend, lazy): similar gain expected on any scenario with many block boundaries (large inputs). Smaller scenarios (≤ block size) see no change.
  • L1-L3 fast/dfast: not affected (different backend, different skip-matching path).

Acceptance criteria

  • Verify hypothesis: drop the dense insert_positions(current_abs_start, current_abs_end) from skip_matching_with_hint; let start_matching / start_matching_greedy do per-iteration inserts.
  • Confirm correctness: existing test suite must stay green (519/519 in PR perf(encoder): donor-parity greedy parse at L4 — ratio + speed win #179 — same gates).
  • Confirm ratio preserved: compare_ffi REPORT lines rust_bytes <= ffi_bytes constraint per CLAUDE.md.
  • Measure: L4 large-log-stream time delta vs main; expect the +9% to flip to negative.
  • Re-profile to confirm skip_matching_with_hint no longer dominates self-time.
  • If hypothesis wrong, fall back to: instrument the call to find which call site (dictionary priming / block boundary / etc.) triggers it most, then optimize that site.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1-highHigh priority — core functionalityenhancementNew feature or requestperformancePerformance optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions