perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up) by polaz · Pull Request #231 · structured-world/structured-zstd

polaz · 2026-05-22T15:59:47Z

Summary

Donor-parity cleanups in the Fast strategy hot path identified while investigating the L1 ratio residual (still ~+7.43 % on decodecorpus-z000033). Neither commit closes the ratio gap on its own, but both move our encoder closer to donor shape and remove measurable overhead the previous code carried unnecessarily.

Part of #220.

Changes

`zstd/src/encoding/simple/fast_kernel/kernel.rs`

match_found carried two defensive bounds checks the donor does not have:

if match_pos + 4 > data_len { return false; }
if match_pos >= ip_pos     { return false; }

Both are redundant under the kernel's own invariants:

Hash table entries are only written for positions visited by the scan, which by construction stay strictly below ilimit = data_len - HASH_READ_SIZE = data_len - 8. The prefix_start_index >= 1 rule at the matcher boundary rejects the stale-zero initial entry.
Hash writes happen BEFORE probes inside the do-while body, and current0_prev < ip0_now by induction across shift steps — so matchIdx is always < ip_pos for entries written by this scan.

Donor ZSTD_match4Found_branch (zstd_fast.c:128-141) relies on the same invariants and emits exactly one prefix filter + one 4-byte equality compare. match_found is invoked twice per inner-loop iteration, so the savings compound: ~4 branches per iter dropped on the hot path.

A debug_assert!(match_pos < new_ip) at the explicit-match emit site surfaces invariant violations in test/fuzz builds without paying for them in release.

Bonus: pre-load rval unconditionally — donor's MEM_read32(ip2 - rep_offset1) always reads (with rep_offset1 = 0 the read at ip2 - 0 = ip2 is safe and the rep branch is correctly suppressed by the rep_offset1 > 0 guard inside the if).

`zstd/src/encoding/blocks/compressed.rs`

New append_literals helper routes literal-section appends through simd_copy::copy_bytes_overshooting for slices ≤ 32 bytes. That function has a byte-by-byte / overlapping-u64 exact-length tail path that LLVM inlines for short runtime sizes — no libc memmove call. Longer runs (> 32 bytes) fall through to extend_from_slice where libc memmove amortises across the copy.

Investigation context: flamegraph on compress/level_1_fast/decodecorpus.*pure_rust showed __memmove_avx_unaligned_erms at 60 % of bench CPU. After bench, this inline change accounts for ~0.5 % of measured time on the corpus — the memmove dominance turned out to be per-iteration allocation churn (hash table init memset 24 %, output Vec growth memcpy / realloc) rather than the literal-append path. Keeping the change anyway: it removes a real libc call per emitted sequence regardless of total impact, and the donor-parity shape is cleaner.

`zstd/src/decoding/mod.rs`

simd_copy module visibility bumped from mod to pub(crate) mod so the encode side can reach the helper.

Test plan

full nextest suite — 581 passed, 3 skipped
lint+typecheck clean
Bench on Linux server (i9-9900K, BMI2/AVX2): compress/level_1_fast/decodecorpus-z000033 pure_rust throughput 153.87 → 154.50 MiB/s (in noise). FFI side 351 MiB/s.
Ratio gap still ~+7.43 % — root cause continues under investigation via sequence comparator. Tracked in perf(encoder): Fast L1 +7.43% ratio gap vs donor on decodecorpus-z000033 #220 follow-up.

coderabbitai · 2026-05-22T15:59:54Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cafe2c6d-5391-4008-8352-d2da25b80445

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/#220-fast-l1-ratio

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-22T16:02:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

… path (donor parity) L1 Fast on decodecorpus-z000033 throughput was ~153 MiB/s vs donor's ~351 MiB/s (2.3× slower) — far beyond cParams/algorithm differences could explain. Root cause: `match_found` carried TWO defensive bounds checks the donor doesn't have: if match_pos + 4 > data_len { return false; } if match_pos >= ip_pos { return false; } Both are redundant under the kernel's own invariants: 1. Hash table entries are only written for positions VISITED by the scan, which by construction stay strictly below `ilimit = data_len - HASH_READ_SIZE = data_len - 8`. So any in-window `match_idx >= prefix_start_index` satisfies `match_pos + 4 < data_len` automatically. The `prefix_start_index >= 1` rule at the matcher boundary handles the stale-zero initial entry. 2. Hash writes happen BEFORE probes inside the do-while body (line `hashTable[hash0] = current0` precedes `matchFound(...)`), and `current0 < ip0_now` by induction across shift steps. So `matchIdx` is always < `ip_pos` for entries written by this scan. `match_found` is invoked TWICE per inner-loop iteration (probe at ip0, probe at shifted ip0), so the savings compound: ~4 branches per iter dropped on a 1MB scan = millions of mispredict-prone branches eliminated. Bonus side change: pre-load `rval` unconditionally — donor's `MEM_read32(ip2 - rep_offset1)` always reads (with rep_offset1=0 `ip2 - 0 = ip2` is safe), no `rep_offset1 > 0` guard around the load. Removes one more branch from the inner loop. Added a `debug_assert!(match_pos < new_ip)` at the explicit-match emit site to surface invariant violations in test/fuzz builds without paying for them in release. The previous engineered test `match_found_rejects_stale_forward_entry` manually constructed a forward-pointing stale entry that the defensive check rejected; that test exercised behaviour we're now defining as caller responsibility. Replaced with `match_found_rejects_stale_entry_below_prefix_floor` covering the remaining prefix-floor filter, which IS still active and IS the donor-parity defensive contract.

…ass libc memmove) Flamegraph on `compress/level_1_fast/decodecorpus.*pure_rust` showed `__memmove_avx_unaligned_erms` at 60 % of bench CPU, with the actual `compress_block_fast` body at only 0.07 %. The chain: start_matching → handle_sequence (collect_block_parts closure) → parts.literals.extend_from_slice(literals) → ptr::copy_nonoverlapping (runtime size) → libc memmove Each emitted Sequence::Triple from the Fast kernel triggered one libc memcpy of 1-10 bytes (typical Fast L1 literal run). With thousands of sequences per block × 8 blocks per bench iter, the per-call libc overhead dominated the hot path. Donor C zstd uses inline `ZSTD_copy16` (single SIMD store, ~5 cycles) for the same operation in `ZSTD_storeSeq → ZSTD_safecopyLiterals`. Route literal appends through a new `append_literals` helper that forwards to `simd_copy::copy_bytes_overshooting` for slices ≤ 32 bytes. That function has a byte-by-byte / overlapping-u64 exact- length tail path that LLVM fully inlines for short runtime sizes — no libc call. Longer runs (> 32 bytes, rare on Fast L1) keep `extend_from_slice`; libc memmove amortises across the longer copy. `simd_copy` module visibility bumped from `mod` to `pub(crate) mod` so the encode side can reach it (same crate, no external surface change).

…prefix=0 ablation Tested lowering the sentinel to 0 (donor-parity move — donor uses prefixStartIndex=0 plus the kernel's ip0+=(ip0==prefixStart) bump to skip position 0). The test `skip_matching_with_none_hint_skips_hash_population` enforces a cross-block-isolation contract that depends on the sentinel-0 filter — lowering it to 0 lets stale empty-slot lookups (match_idx = 0 in the all-zero hash table) probe position 0 and emit spurious cross-block matches. The position-0 emit rate is too small to be worth breaking that contract — the L1 ratio gap on decodecorpus is dominated by something else (5716 ffi_only sequences in block 0 alone — far more than the few position-0 emits a donor-parity prefix would add). Document the tested-and-rejected lower bound so future hands don't repeat the ablation without checking the skip_matching contract test first.

…_low Donor `ZSTD_compressBlock_fast_noDict_generic` computes the prologue `maxRep` against `windowLow` (the absolute floor of in-window positions, = 0 at block 0). Our prologue used `prefix_start_index` (the sentinel-1 floor for FastHashTable's all-zero empty-slot filter), giving `max_rep = ip0 - 1 = 0` at the first iteration with ip0=1 — so the default `rep_offset1 = 1` was stashed into `offset_saved1` and the rep-at-ip2 probe was DISABLED for the entire first block. Cascade on real corpora: every iteration in block 0 missed the donor's 1-byte rep probe (`read32(ip2) == read32(ip2 - 1)`), which fires on any local 5-byte run. The kernel's cursor advanced via explicit-match shifts only, populating the hash table at different positions than donor; later explicit matches in the same block then ALSO diverged. On decodecorpus-z000033 Level(1) `compare_ffi_sequences` reported donor finding 36 short-distance matches (offsets 50-1800 bytes) in the first ~5 KB of block 0 that we missed entirely. Code reading on all other hot-path divergence candidates (hash primes, step doubling, probe order, cmov vs branch, min_gain) showed bit-for-bit donor parity. The prologue divergence was the root cause. Fix: thread `window_low` (= `history.len().saturating_sub(advertised_ window)`) into `compress_block_fast` alongside `prefix_start_index`. The two coincide for all blocks where `window_low >= 1`; they differ only at block 0 / pre-eviction blocks where the sentinel-1 floor inflates the hash-filter bound without donor justification. - `prefix_start_index` continues to drive `match_found`'s hash-filter predicate (`match_idx >= prefix_start_index`), sentinel-aware so uninitialized FastHashTable slots holding 0 are rejected. - `window_low` drives the prologue `max_rep` AND the backward- extension `match_pos > X` bound for both rep and explicit paths (donor expresses these against `prefixStart` directly). Includes regression test `block_zero_prologue_preserves_default_rep_ offset_one` documenting the contract via a uniform-byte fixture (the emit's offset must be 1, matching donor's rep-at-ip2 hit at iter 1). The first-emit literal-prefix length is NOT asserted — slot collisions on uniform-byte data let the explicit-match path catch up to the same offset=1 emit, so the unit test only documents the direction; the ratio-regression discriminator is the `compare_ffi` run on decodecorpus. All 589 structured-zstd tests pass.

…igation Adds the `kernel_trace` Cargo feature plus two example binaries to diagnose the Level(1) Fast strategy ratio divergence on decodecorpus-z000033 (issue #220, +7.43% vs C zstd). Feature is compile-time gated (production builds carry ZERO cost — the ktrace! macro expands to a no-op when feature is off). Runtime activation via `STRUCTURED_ZSTD_KERNEL_TRACE=1`. Traces emitted at each inner-loop decision point in `compress_block_fast`: - OUTER state on outer-loop entry (ip0/ip1/ip2/ip3/step/hashes/match_idx/rep) - PUT events for every hash-table writeback with (slot, position, site tag) - PROBE1 / PROBE2 events with match_idx at each explicit-match probe - MATCH events on every found match (path, ip0, match_idx, offset) Example `trace_fast_kernel` runs Level(1) on z000033 and dumps trace to stderr. Example `donor_cparams_check` queries donor's `ZSTD_getCParams(L1, 1MB, 0)` to confirm cParams parity. Findings from this instrumentation: - Our cursor visits positions {1, 3, ..., 33, 36, 39, ..., 1174, 1175, 1212, 1213, 1251, 1252, 1291, 1292} in block 0 before the first emit - Python simulator using donor's exact `ZSTD_compressBlock_fast_noDict_ generic` formula reproduces the same set bit-for-bit - Donor cParams query confirms identical `{windowLog=19, mls=7, TL=0}` - Yet the FFI comparator reports donor's first block-0 emit at literal_length=1280 — requiring a probe at ip0 ∈ [1280, 1284] which neither our cursor nor the simulator reaches This is an open mystery — donor's published source + cParams + step formula all predict identical cursor traversal to ours, but the actual compressed output diverges. Resolution requires physically instrumenting donor's `zstd_fast.c` (printf in the inner loop) and re-running with identical input. Captured as a follow-up; the trace machinery here will remain in place to speed up that next session. `compare_ffi_sequences` output stays unchanged (sequence-level diff identical to pre-instrumentation): rust_seqs=23783 ffi_seqs=27763 match=19461 (66.0%) — the ratio gap is unresolved.

Donor `kSearchStrength` is defined in `zstd_compress_internal.h:32` as `#define kSearchStrength 8`, yielding `kStepIncr = 1 << 7 = 128`. Our kernel had `SEARCH_STRENGTH = 6` (K_STEP_INCR = 32), making step doubling fire 4× more often than donor. Test fixture `start_matching_caps_offsets_at_window_log_not_inflated_max` switched to period-4 dict pattern. The previous period-64 pattern's match positions {0, 64, 128, 192} all fall below the sliding floor or in step-skip gaps under the donor-parity K_STEP_INCR=128, producing zero emittable matches. Adds `donor_compress_z000033` example as a one-shot diagnostic that calls the FFI ZSTD_compress on the same corpus at Level(1), used in conjunction with kernel_trace to diff cursor traversal.

polaz added 3 commits May 22, 2026 21:05

polaz force-pushed the perf/#220-fast-l1-ratio branch from fc90393 to 94f1251 Compare May 22, 2026 18:10

polaz added 3 commits May 22, 2026 22:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231

perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231
polaz wants to merge 6 commits into
mainfrom
perf/#220-fast-l1-ratio

polaz commented May 22, 2026

Uh oh!

coderabbitai Bot commented May 22, 2026 •

edited

Loading

Review skipped

Uh oh!

codecov Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

polaz commented May 22, 2026

Summary

Changes

zstd/src/encoding/simple/fast_kernel/kernel.rs

zstd/src/encoding/blocks/compressed.rs

zstd/src/decoding/mod.rs

Test plan

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codecov Bot commented May 22, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`zstd/src/encoding/simple/fast_kernel/kernel.rs`

`zstd/src/encoding/blocks/compressed.rs`

`zstd/src/decoding/mod.rs`

coderabbitai Bot commented May 22, 2026 •

edited

Loading