Skip to content

perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231

Draft
polaz wants to merge 6 commits into
mainfrom
perf/#220-fast-l1-ratio
Draft

perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231
polaz wants to merge 6 commits into
mainfrom
perf/#220-fast-l1-ratio

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 22, 2026

Summary

Donor-parity cleanups in the Fast strategy hot path identified while investigating the L1 ratio residual (still ~+7.43 % on decodecorpus-z000033). Neither commit closes the ratio gap on its own, but both move our encoder closer to donor shape and remove measurable overhead the previous code carried unnecessarily.

Part of #220.

Changes

zstd/src/encoding/simple/fast_kernel/kernel.rs

match_found carried two defensive bounds checks the donor does not have:

if match_pos + 4 > data_len { return false; }
if match_pos >= ip_pos     { return false; }

Both are redundant under the kernel's own invariants:

  1. Hash table entries are only written for positions visited by the scan, which by construction stay strictly below ilimit = data_len - HASH_READ_SIZE = data_len - 8. The prefix_start_index >= 1 rule at the matcher boundary rejects the stale-zero initial entry.
  2. Hash writes happen BEFORE probes inside the do-while body, and current0_prev < ip0_now by induction across shift steps — so matchIdx is always < ip_pos for entries written by this scan.

Donor ZSTD_match4Found_branch (zstd_fast.c:128-141) relies on the same invariants and emits exactly one prefix filter + one 4-byte equality compare. match_found is invoked twice per inner-loop iteration, so the savings compound: ~4 branches per iter dropped on the hot path.

A debug_assert!(match_pos < new_ip) at the explicit-match emit site surfaces invariant violations in test/fuzz builds without paying for them in release.

Bonus: pre-load rval unconditionally — donor's MEM_read32(ip2 - rep_offset1) always reads (with rep_offset1 = 0 the read at ip2 - 0 = ip2 is safe and the rep branch is correctly suppressed by the rep_offset1 > 0 guard inside the if).

zstd/src/encoding/blocks/compressed.rs

New append_literals helper routes literal-section appends through simd_copy::copy_bytes_overshooting for slices ≤ 32 bytes. That function has a byte-by-byte / overlapping-u64 exact-length tail path that LLVM inlines for short runtime sizes — no libc memmove call. Longer runs (> 32 bytes) fall through to extend_from_slice where libc memmove amortises across the copy.

Investigation context: flamegraph on compress/level_1_fast/decodecorpus.*pure_rust showed __memmove_avx_unaligned_erms at 60 % of bench CPU. After bench, this inline change accounts for ~0.5 % of measured time on the corpus — the memmove dominance turned out to be per-iteration allocation churn (hash table init memset 24 %, output Vec growth memcpy / realloc) rather than the literal-append path. Keeping the change anyway: it removes a real libc call per emitted sequence regardless of total impact, and the donor-parity shape is cleaner.

zstd/src/decoding/mod.rs

simd_copy module visibility bumped from mod to pub(crate) mod so the encode side can reach the helper.

Test plan

  • full nextest suite — 581 passed, 3 skipped
  • lint+typecheck clean
  • Bench on Linux server (i9-9900K, BMI2/AVX2): compress/level_1_fast/decodecorpus-z000033 pure_rust throughput 153.87 → 154.50 MiB/s (in noise). FFI side 351 MiB/s.
  • Ratio gap still ~+7.43 % — root cause continues under investigation via sequence comparator. Tracked in perf(encoder): Fast L1 +7.43% ratio gap vs donor on decodecorpus-z000033 #220 follow-up.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cafe2c6d-5391-4008-8352-d2da25b80445

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#220-fast-l1-ratio

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

polaz added 3 commits May 22, 2026 21:05
… path (donor parity)

L1 Fast on decodecorpus-z000033 throughput was ~153 MiB/s vs donor's
~351 MiB/s (2.3× slower) — far beyond cParams/algorithm differences
could explain. Root cause: `match_found` carried TWO defensive
bounds checks the donor doesn't have:

  if match_pos + 4 > data_len { return false; }
  if match_pos >= ip_pos     { return false; }

Both are redundant under the kernel's own invariants:

1. Hash table entries are only written for positions VISITED by the
   scan, which by construction stay strictly below `ilimit =
   data_len - HASH_READ_SIZE = data_len - 8`. So any in-window
   `match_idx >= prefix_start_index` satisfies `match_pos + 4 <
   data_len` automatically. The `prefix_start_index >= 1` rule at
   the matcher boundary handles the stale-zero initial entry.

2. Hash writes happen BEFORE probes inside the do-while body (line
   `hashTable[hash0] = current0` precedes `matchFound(...)`), and
   `current0 < ip0_now` by induction across shift steps. So
   `matchIdx` is always < `ip_pos` for entries written by this
   scan.

`match_found` is invoked TWICE per inner-loop iteration (probe at
ip0, probe at shifted ip0), so the savings compound: ~4 branches
per iter dropped on a 1MB scan = millions of mispredict-prone
branches eliminated.

Bonus side change: pre-load `rval` unconditionally — donor's
`MEM_read32(ip2 - rep_offset1)` always reads (with rep_offset1=0
`ip2 - 0 = ip2` is safe), no `rep_offset1 > 0` guard around the
load. Removes one more branch from the inner loop.

Added a `debug_assert!(match_pos < new_ip)` at the explicit-match
emit site to surface invariant violations in test/fuzz builds
without paying for them in release.

The previous engineered test `match_found_rejects_stale_forward_entry`
manually constructed a forward-pointing stale entry that the
defensive check rejected; that test exercised behaviour we're now
defining as caller responsibility. Replaced with
`match_found_rejects_stale_entry_below_prefix_floor` covering the
remaining prefix-floor filter, which IS still active and IS the
donor-parity defensive contract.
…ass libc memmove)

Flamegraph on `compress/level_1_fast/decodecorpus.*pure_rust`
showed `__memmove_avx_unaligned_erms` at 60 % of bench CPU, with
the actual `compress_block_fast` body at only 0.07 %. The chain:

  start_matching → handle_sequence (collect_block_parts closure)
    → parts.literals.extend_from_slice(literals)
    → ptr::copy_nonoverlapping (runtime size)
    → libc memmove

Each emitted Sequence::Triple from the Fast kernel triggered one
libc memcpy of 1-10 bytes (typical Fast L1 literal run). With
thousands of sequences per block × 8 blocks per bench iter, the
per-call libc overhead dominated the hot path. Donor C zstd uses
inline `ZSTD_copy16` (single SIMD store, ~5 cycles) for the same
operation in `ZSTD_storeSeq → ZSTD_safecopyLiterals`.

Route literal appends through a new `append_literals` helper that
forwards to `simd_copy::copy_bytes_overshooting` for slices ≤ 32
bytes. That function has a byte-by-byte / overlapping-u64 exact-
length tail path that LLVM fully inlines for short runtime sizes —
no libc call. Longer runs (> 32 bytes, rare on Fast L1) keep
`extend_from_slice`; libc memmove amortises across the longer
copy.

`simd_copy` module visibility bumped from `mod` to `pub(crate) mod`
so the encode side can reach it (same crate, no external surface
change).
…prefix=0 ablation

Tested lowering the sentinel to 0 (donor-parity move — donor uses
prefixStartIndex=0 plus the kernel's ip0+=(ip0==prefixStart) bump
to skip position 0). The test
`skip_matching_with_none_hint_skips_hash_population` enforces a
cross-block-isolation contract that depends on the sentinel-0
filter — lowering it to 0 lets stale empty-slot lookups (match_idx
= 0 in the all-zero hash table) probe position 0 and emit
spurious cross-block matches.

The position-0 emit rate is too small to be worth breaking that
contract — the L1 ratio gap on decodecorpus is dominated by
something else (5716 ffi_only sequences in block 0 alone — far
more than the few position-0 emits a donor-parity prefix would
add).

Document the tested-and-rejected lower bound so future hands don't
repeat the ablation without checking the skip_matching contract
test first.
@polaz polaz force-pushed the perf/#220-fast-l1-ratio branch from fc90393 to 94f1251 Compare May 22, 2026 18:10
polaz added 3 commits May 22, 2026 22:09
…_low

Donor `ZSTD_compressBlock_fast_noDict_generic` computes the prologue
`maxRep` against `windowLow` (the absolute floor of in-window positions,
= 0 at block 0). Our prologue used `prefix_start_index` (the sentinel-1
floor for FastHashTable's all-zero empty-slot filter), giving `max_rep
= ip0 - 1 = 0` at the first iteration with ip0=1 — so the default
`rep_offset1 = 1` was stashed into `offset_saved1` and the rep-at-ip2
probe was DISABLED for the entire first block.

Cascade on real corpora: every iteration in block 0 missed the donor's
1-byte rep probe (`read32(ip2) == read32(ip2 - 1)`), which fires on any
local 5-byte run. The kernel's cursor advanced via explicit-match
shifts only, populating the hash table at different positions than
donor; later explicit matches in the same block then ALSO diverged.

On decodecorpus-z000033 Level(1) `compare_ffi_sequences` reported
donor finding 36 short-distance matches (offsets 50-1800 bytes) in
the first ~5 KB of block 0 that we missed entirely. Code reading on
all other hot-path divergence candidates (hash primes, step doubling,
probe order, cmov vs branch, min_gain) showed bit-for-bit donor
parity. The prologue divergence was the root cause.

Fix: thread `window_low` (= `history.len().saturating_sub(advertised_
window)`) into `compress_block_fast` alongside `prefix_start_index`.
The two coincide for all blocks where `window_low >= 1`; they differ
only at block 0 / pre-eviction blocks where the sentinel-1 floor
inflates the hash-filter bound without donor justification.

- `prefix_start_index` continues to drive `match_found`'s hash-filter
  predicate (`match_idx >= prefix_start_index`), sentinel-aware so
  uninitialized FastHashTable slots holding 0 are rejected.
- `window_low` drives the prologue `max_rep` AND the backward-
  extension `match_pos > X` bound for both rep and explicit paths
  (donor expresses these against `prefixStart` directly).

Includes regression test `block_zero_prologue_preserves_default_rep_
offset_one` documenting the contract via a uniform-byte fixture (the
emit's offset must be 1, matching donor's rep-at-ip2 hit at iter 1).
The first-emit literal-prefix length is NOT asserted — slot
collisions on uniform-byte data let the explicit-match path catch up
to the same offset=1 emit, so the unit test only documents the
direction; the ratio-regression discriminator is the `compare_ffi`
run on decodecorpus.

All 589 structured-zstd tests pass.
…igation

Adds the `kernel_trace` Cargo feature plus two example binaries to
diagnose the Level(1) Fast strategy ratio divergence on
decodecorpus-z000033 (issue #220, +7.43% vs C zstd).

Feature is compile-time gated (production builds carry ZERO cost — the
ktrace! macro expands to a no-op when feature is off). Runtime activation
via `STRUCTURED_ZSTD_KERNEL_TRACE=1`.

Traces emitted at each inner-loop decision point in
`compress_block_fast`:
- OUTER state on outer-loop entry (ip0/ip1/ip2/ip3/step/hashes/match_idx/rep)
- PUT events for every hash-table writeback with (slot, position, site tag)
- PROBE1 / PROBE2 events with match_idx at each explicit-match probe
- MATCH events on every found match (path, ip0, match_idx, offset)

Example `trace_fast_kernel` runs Level(1) on z000033 and dumps trace to
stderr. Example `donor_cparams_check` queries donor's
`ZSTD_getCParams(L1, 1MB, 0)` to confirm cParams parity.

Findings from this instrumentation:
- Our cursor visits positions {1, 3, ..., 33, 36, 39, ..., 1174, 1175,
  1212, 1213, 1251, 1252, 1291, 1292} in block 0 before the first emit
- Python simulator using donor's exact `ZSTD_compressBlock_fast_noDict_
  generic` formula reproduces the same set bit-for-bit
- Donor cParams query confirms identical `{windowLog=19, mls=7, TL=0}`
- Yet the FFI comparator reports donor's first block-0 emit at
  literal_length=1280 — requiring a probe at ip0 ∈ [1280, 1284] which
  neither our cursor nor the simulator reaches

This is an open mystery — donor's published source + cParams + step
formula all predict identical cursor traversal to ours, but the actual
compressed output diverges. Resolution requires physically instrumenting
donor's `zstd_fast.c` (printf in the inner loop) and re-running with
identical input. Captured as a follow-up; the trace machinery here will
remain in place to speed up that next session.

`compare_ffi_sequences` output stays unchanged (sequence-level diff
identical to pre-instrumentation): rust_seqs=23783 ffi_seqs=27763
match=19461 (66.0%) — the ratio gap is unresolved.
Donor `kSearchStrength` is defined in `zstd_compress_internal.h:32` as
`#define kSearchStrength 8`, yielding `kStepIncr = 1 << 7 = 128`. Our
kernel had `SEARCH_STRENGTH = 6` (K_STEP_INCR = 32), making step
doubling fire 4× more often than donor.

Test fixture `start_matching_caps_offsets_at_window_log_not_inflated_max`
switched to period-4 dict pattern. The previous period-64 pattern's
match positions {0, 64, 128, 192} all fall below the sliding floor or
in step-skip gaps under the donor-parity K_STEP_INCR=128, producing
zero emittable matches.

Adds `donor_compress_z000033` example as a one-shot diagnostic that
calls the FFI ZSTD_compress on the same corpus at Level(1), used in
conjunction with kernel_trace to diff cursor traversal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant