You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Investigation + speed track. While benchmarking the new donor-parity greedy parse at L4 (PR #179), profiled compress/level_4_greedy/large-log-stream/matrix/pure_rust with samply and discovered that RowMatchGenerator::skip_matching_with_hint is the dominant Rust self-time consumer (~25%) on streams with many block boundaries. The 102M insert_position calls (104 MB stream / 128 KiB block × ~128K inserts per block) overshadow any per-iteration delta in the parse loop.
This is independent of PR #179 — start_matching (the inner parse loop) and skip_matching_with_hint (called once per block at boundary setup) are separate hot paths. The greedy PR's +9% regression on large-log-stream is a downstream symptom of this structural bottleneck.
Profile evidence
samply record on L4 large-log-stream (8 s profile-time on macOS aarch64 release build, 26 210 samples). Top self-time, symbolicated via atos:
% self time
Function
22%
ZSTD_ldm_fillHashTable (donor C — c_ffi side of the bench, not us)
(sum across skip_matching_with_hint offsets — dominant Rust self-time)
2.9%
MatchTable::insert_hash3_only_no_rebase (HC matcher, other scenarios in profile)
1.5%
ZSTD_decodeLiteralsBlock (donor C decode)
What skip_matching_with_hint actually does
zstd/src/encoding/row/mod.rs:139:
pub(crate)fnskip_matching_with_hint(&mutself,incompressible_hint:Option<bool>){self.ensure_tables();let current_len = self.window.back().unwrap().len();let current_abs_start = ...;let current_abs_end = ...;let backfill_start = self.backfill_start(current_abs_start);if backfill_start < current_abs_start {self.insert_positions(backfill_start, current_abs_start);}if incompressible_hint == Some(true){// sparse step (only every 8th position)}else{// DENSE: insert every position in the blockself.insert_positions(current_abs_start, current_abs_end);}}
For a 128 KiB block with incompressible_hint = None | Some(false), this inserts 131 072 positions into the row-hash table. On a 104 MB stream that's ~800 blocks × 131 072 = ~104.8 M insert_position calls spent on hash-table maintenance alone, before the parse loop even starts.
Each insert_position does hash_and_row (u32 read + multiply + masking) + 3 indexed writes to row_heads / row_tags / row_positions. Cheap individually but the volume swamps everything else.
Hypothesis: donor's ZSTD_row_fillHashCache is sparser
Donor zstd_lazy.c:ZSTD_row_fillHashCache only fills the hash cache for ZSTD_ROW_HASH_CACHE_SIZE = 8 positions (it pre-computes the next 8 positions' hashes to feed the SIMD prefetch in ZSTD_RowFindBestMatch). It does NOT insert all block-byte positions into the row table at block boundary — those inserts happen lazily during the parse loop via ZSTD_row_update per emitted match.
Our skip_matching_with_hint is doing what donor does inside the parse loop on demand, but eagerly at block setup. That's why our 100 MB stream pays 100 M inserts instead of inserts only for the few positions that actually get searched.
If confirmed, the fix is to drop the eager insert_positions(current_abs_start, current_abs_end) and let start_matching / start_matching_greedy insert positions as they walk the block. Position-insert volume drops from O(block_size) to O(emitted_sequences * MIN_MATCH).
Expected impact
L4 large-log-stream: the +9% regression in perf(encoder): donor-parity greedy parse at L4 — ratio + speed win #179 is dominated by skip_matching_with_hint. Removing it should swing that cell back to a speed win (large-log-stream parse loop hits perfectly-periodic matches → very few insert_position calls during the loop itself).
L5-L15 (Row backend, lazy): similar gain expected on any scenario with many block boundaries (large inputs). Smaller scenarios (≤ block size) see no change.
L1-L3 fast/dfast: not affected (different backend, different skip-matching path).
Acceptance criteria
Verify hypothesis: drop the dense insert_positions(current_abs_start, current_abs_end) from skip_matching_with_hint; let start_matching / start_matching_greedy do per-iteration inserts.
Confirm ratio preserved: compare_ffi REPORT lines rust_bytes <= ffi_bytes constraint per CLAUDE.md.
Measure: L4 large-log-stream time delta vs main; expect the +9% to flip to negative.
Re-profile to confirm skip_matching_with_hint no longer dominates self-time.
If hypothesis wrong, fall back to: instrument the call to find which call site (dictionary priming / block boundary / etc.) triggers it most, then optimize that site.
Status
Investigation + speed track. While benchmarking the new donor-parity greedy parse at L4 (PR #179), profiled
compress/level_4_greedy/large-log-stream/matrix/pure_rustwithsamplyand discovered thatRowMatchGenerator::skip_matching_with_hintis the dominant Rust self-time consumer (~25%) on streams with many block boundaries. The 102Minsert_positioncalls (104 MB stream / 128 KiB block × ~128K inserts per block) overshadow any per-iteration delta in the parse loop.This is independent of PR #179 —
start_matching(the inner parse loop) andskip_matching_with_hint(called once per block at boundary setup) are separate hot paths. The greedy PR's +9% regression on large-log-stream is a downstream symptom of this structural bottleneck.Profile evidence
samply recordon L4 large-log-stream (8 s profile-time on macOS aarch64 release build, 26 210 samples). Top self-time, symbolicated viaatos:ZSTD_ldm_fillHashTable(donor C — c_ffi side of the bench, not us)RowMatchGenerator::skip_matching_with_hint(+0x40)RowMatchGenerator::skip_matching_with_hint(+0x4C)RowMatchGenerator::skip_matching_with_hint(+0x5C)skip_matching_with_hintoffsets — dominant Rust self-time)MatchTable::insert_hash3_only_no_rebase(HC matcher, other scenarios in profile)ZSTD_decodeLiteralsBlock(donor C decode)What
skip_matching_with_hintactually doeszstd/src/encoding/row/mod.rs:139:For a 128 KiB block with
incompressible_hint = None | Some(false), this inserts 131 072 positions into the row-hash table. On a 104 MB stream that's ~800 blocks × 131 072 = ~104.8 Minsert_positioncalls spent on hash-table maintenance alone, before the parse loop even starts.Each
insert_positiondoeshash_and_row(u32 read + multiply + masking) + 3 indexed writes torow_heads/row_tags/row_positions. Cheap individually but the volume swamps everything else.Hypothesis: donor's
ZSTD_row_fillHashCacheis sparserDonor
zstd_lazy.c:ZSTD_row_fillHashCacheonly fills the hash cache forZSTD_ROW_HASH_CACHE_SIZE = 8positions (it pre-computes the next 8 positions' hashes to feed the SIMD prefetch inZSTD_RowFindBestMatch). It does NOT insert all block-byte positions into the row table at block boundary — those inserts happen lazily during the parse loop viaZSTD_row_updateper emitted match.Our
skip_matching_with_hintis doing what donor does inside the parse loop on demand, but eagerly at block setup. That's why our 100 MB stream pays 100 M inserts instead of inserts only for the few positions that actually get searched.If confirmed, the fix is to drop the eager
insert_positions(current_abs_start, current_abs_end)and letstart_matching/start_matching_greedyinsert positions as they walk the block. Position-insert volume drops from O(block_size) to O(emitted_sequences * MIN_MATCH).Expected impact
skip_matching_with_hint. Removing it should swing that cell back to a speed win (large-log-stream parse loop hits perfectly-periodic matches → very fewinsert_positioncalls during the loop itself).Acceptance criteria
insert_positions(current_abs_start, current_abs_end)fromskip_matching_with_hint; letstart_matching/start_matching_greedydo per-iteration inserts.compare_ffiREPORT linesrust_bytes <= ffi_bytesconstraint per CLAUDE.md.skip_matching_with_hintno longer dominates self-time.Related