perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231
Draft
polaz wants to merge 6 commits into
Draft
perf(fast): donor-parity hot-path cleanups in Fast kernel + inline short-literal append (#220 follow-up)#231polaz wants to merge 6 commits into
polaz wants to merge 6 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
… path (donor parity)
L1 Fast on decodecorpus-z000033 throughput was ~153 MiB/s vs donor's
~351 MiB/s (2.3× slower) — far beyond cParams/algorithm differences
could explain. Root cause: `match_found` carried TWO defensive
bounds checks the donor doesn't have:
if match_pos + 4 > data_len { return false; }
if match_pos >= ip_pos { return false; }
Both are redundant under the kernel's own invariants:
1. Hash table entries are only written for positions VISITED by the
scan, which by construction stay strictly below `ilimit =
data_len - HASH_READ_SIZE = data_len - 8`. So any in-window
`match_idx >= prefix_start_index` satisfies `match_pos + 4 <
data_len` automatically. The `prefix_start_index >= 1` rule at
the matcher boundary handles the stale-zero initial entry.
2. Hash writes happen BEFORE probes inside the do-while body (line
`hashTable[hash0] = current0` precedes `matchFound(...)`), and
`current0 < ip0_now` by induction across shift steps. So
`matchIdx` is always < `ip_pos` for entries written by this
scan.
`match_found` is invoked TWICE per inner-loop iteration (probe at
ip0, probe at shifted ip0), so the savings compound: ~4 branches
per iter dropped on a 1MB scan = millions of mispredict-prone
branches eliminated.
Bonus side change: pre-load `rval` unconditionally — donor's
`MEM_read32(ip2 - rep_offset1)` always reads (with rep_offset1=0
`ip2 - 0 = ip2` is safe), no `rep_offset1 > 0` guard around the
load. Removes one more branch from the inner loop.
Added a `debug_assert!(match_pos < new_ip)` at the explicit-match
emit site to surface invariant violations in test/fuzz builds
without paying for them in release.
The previous engineered test `match_found_rejects_stale_forward_entry`
manually constructed a forward-pointing stale entry that the
defensive check rejected; that test exercised behaviour we're now
defining as caller responsibility. Replaced with
`match_found_rejects_stale_entry_below_prefix_floor` covering the
remaining prefix-floor filter, which IS still active and IS the
donor-parity defensive contract.
…ass libc memmove)
Flamegraph on `compress/level_1_fast/decodecorpus.*pure_rust`
showed `__memmove_avx_unaligned_erms` at 60 % of bench CPU, with
the actual `compress_block_fast` body at only 0.07 %. The chain:
start_matching → handle_sequence (collect_block_parts closure)
→ parts.literals.extend_from_slice(literals)
→ ptr::copy_nonoverlapping (runtime size)
→ libc memmove
Each emitted Sequence::Triple from the Fast kernel triggered one
libc memcpy of 1-10 bytes (typical Fast L1 literal run). With
thousands of sequences per block × 8 blocks per bench iter, the
per-call libc overhead dominated the hot path. Donor C zstd uses
inline `ZSTD_copy16` (single SIMD store, ~5 cycles) for the same
operation in `ZSTD_storeSeq → ZSTD_safecopyLiterals`.
Route literal appends through a new `append_literals` helper that
forwards to `simd_copy::copy_bytes_overshooting` for slices ≤ 32
bytes. That function has a byte-by-byte / overlapping-u64 exact-
length tail path that LLVM fully inlines for short runtime sizes —
no libc call. Longer runs (> 32 bytes, rare on Fast L1) keep
`extend_from_slice`; libc memmove amortises across the longer
copy.
`simd_copy` module visibility bumped from `mod` to `pub(crate) mod`
so the encode side can reach it (same crate, no external surface
change).
…prefix=0 ablation Tested lowering the sentinel to 0 (donor-parity move — donor uses prefixStartIndex=0 plus the kernel's ip0+=(ip0==prefixStart) bump to skip position 0). The test `skip_matching_with_none_hint_skips_hash_population` enforces a cross-block-isolation contract that depends on the sentinel-0 filter — lowering it to 0 lets stale empty-slot lookups (match_idx = 0 in the all-zero hash table) probe position 0 and emit spurious cross-block matches. The position-0 emit rate is too small to be worth breaking that contract — the L1 ratio gap on decodecorpus is dominated by something else (5716 ffi_only sequences in block 0 alone — far more than the few position-0 emits a donor-parity prefix would add). Document the tested-and-rejected lower bound so future hands don't repeat the ablation without checking the skip_matching contract test first.
fc90393 to
94f1251
Compare
…_low Donor `ZSTD_compressBlock_fast_noDict_generic` computes the prologue `maxRep` against `windowLow` (the absolute floor of in-window positions, = 0 at block 0). Our prologue used `prefix_start_index` (the sentinel-1 floor for FastHashTable's all-zero empty-slot filter), giving `max_rep = ip0 - 1 = 0` at the first iteration with ip0=1 — so the default `rep_offset1 = 1` was stashed into `offset_saved1` and the rep-at-ip2 probe was DISABLED for the entire first block. Cascade on real corpora: every iteration in block 0 missed the donor's 1-byte rep probe (`read32(ip2) == read32(ip2 - 1)`), which fires on any local 5-byte run. The kernel's cursor advanced via explicit-match shifts only, populating the hash table at different positions than donor; later explicit matches in the same block then ALSO diverged. On decodecorpus-z000033 Level(1) `compare_ffi_sequences` reported donor finding 36 short-distance matches (offsets 50-1800 bytes) in the first ~5 KB of block 0 that we missed entirely. Code reading on all other hot-path divergence candidates (hash primes, step doubling, probe order, cmov vs branch, min_gain) showed bit-for-bit donor parity. The prologue divergence was the root cause. Fix: thread `window_low` (= `history.len().saturating_sub(advertised_ window)`) into `compress_block_fast` alongside `prefix_start_index`. The two coincide for all blocks where `window_low >= 1`; they differ only at block 0 / pre-eviction blocks where the sentinel-1 floor inflates the hash-filter bound without donor justification. - `prefix_start_index` continues to drive `match_found`'s hash-filter predicate (`match_idx >= prefix_start_index`), sentinel-aware so uninitialized FastHashTable slots holding 0 are rejected. - `window_low` drives the prologue `max_rep` AND the backward- extension `match_pos > X` bound for both rep and explicit paths (donor expresses these against `prefixStart` directly). Includes regression test `block_zero_prologue_preserves_default_rep_ offset_one` documenting the contract via a uniform-byte fixture (the emit's offset must be 1, matching donor's rep-at-ip2 hit at iter 1). The first-emit literal-prefix length is NOT asserted — slot collisions on uniform-byte data let the explicit-match path catch up to the same offset=1 emit, so the unit test only documents the direction; the ratio-regression discriminator is the `compare_ffi` run on decodecorpus. All 589 structured-zstd tests pass.
…igation Adds the `kernel_trace` Cargo feature plus two example binaries to diagnose the Level(1) Fast strategy ratio divergence on decodecorpus-z000033 (issue #220, +7.43% vs C zstd). Feature is compile-time gated (production builds carry ZERO cost — the ktrace! macro expands to a no-op when feature is off). Runtime activation via `STRUCTURED_ZSTD_KERNEL_TRACE=1`. Traces emitted at each inner-loop decision point in `compress_block_fast`: - OUTER state on outer-loop entry (ip0/ip1/ip2/ip3/step/hashes/match_idx/rep) - PUT events for every hash-table writeback with (slot, position, site tag) - PROBE1 / PROBE2 events with match_idx at each explicit-match probe - MATCH events on every found match (path, ip0, match_idx, offset) Example `trace_fast_kernel` runs Level(1) on z000033 and dumps trace to stderr. Example `donor_cparams_check` queries donor's `ZSTD_getCParams(L1, 1MB, 0)` to confirm cParams parity. Findings from this instrumentation: - Our cursor visits positions {1, 3, ..., 33, 36, 39, ..., 1174, 1175, 1212, 1213, 1251, 1252, 1291, 1292} in block 0 before the first emit - Python simulator using donor's exact `ZSTD_compressBlock_fast_noDict_ generic` formula reproduces the same set bit-for-bit - Donor cParams query confirms identical `{windowLog=19, mls=7, TL=0}` - Yet the FFI comparator reports donor's first block-0 emit at literal_length=1280 — requiring a probe at ip0 ∈ [1280, 1284] which neither our cursor nor the simulator reaches This is an open mystery — donor's published source + cParams + step formula all predict identical cursor traversal to ours, but the actual compressed output diverges. Resolution requires physically instrumenting donor's `zstd_fast.c` (printf in the inner loop) and re-running with identical input. Captured as a follow-up; the trace machinery here will remain in place to speed up that next session. `compare_ffi_sequences` output stays unchanged (sequence-level diff identical to pre-instrumentation): rust_seqs=23783 ffi_seqs=27763 match=19461 (66.0%) — the ratio gap is unresolved.
Donor `kSearchStrength` is defined in `zstd_compress_internal.h:32` as
`#define kSearchStrength 8`, yielding `kStepIncr = 1 << 7 = 128`. Our
kernel had `SEARCH_STRENGTH = 6` (K_STEP_INCR = 32), making step
doubling fire 4× more often than donor.
Test fixture `start_matching_caps_offsets_at_window_log_not_inflated_max`
switched to period-4 dict pattern. The previous period-64 pattern's
match positions {0, 64, 128, 192} all fall below the sliding floor or
in step-skip gaps under the donor-parity K_STEP_INCR=128, producing
zero emittable matches.
Adds `donor_compress_z000033` example as a one-shot diagnostic that
calls the FFI ZSTD_compress on the same corpus at Level(1), used in
conjunction with kernel_trace to diff cursor traversal.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Donor-parity cleanups in the Fast strategy hot path identified while investigating the L1 ratio residual (still ~+7.43 % on decodecorpus-z000033). Neither commit closes the ratio gap on its own, but both move our encoder closer to donor shape and remove measurable overhead the previous code carried unnecessarily.
Part of #220.
Changes
zstd/src/encoding/simple/fast_kernel/kernel.rsmatch_foundcarried two defensive bounds checks the donor does not have:Both are redundant under the kernel's own invariants:
ilimit = data_len - HASH_READ_SIZE = data_len - 8. Theprefix_start_index >= 1rule at the matcher boundary rejects the stale-zero initial entry.current0_prev < ip0_nowby induction across shift steps — somatchIdxis always< ip_posfor entries written by this scan.Donor
ZSTD_match4Found_branch(zstd_fast.c:128-141) relies on the same invariants and emits exactly one prefix filter + one 4-byte equality compare.match_foundis invoked twice per inner-loop iteration, so the savings compound: ~4 branches per iter dropped on the hot path.A
debug_assert!(match_pos < new_ip)at the explicit-match emit site surfaces invariant violations in test/fuzz builds without paying for them in release.Bonus: pre-load
rvalunconditionally — donor'sMEM_read32(ip2 - rep_offset1)always reads (with rep_offset1 = 0 the read atip2 - 0 = ip2is safe and the rep branch is correctly suppressed by therep_offset1 > 0guard inside theif).zstd/src/encoding/blocks/compressed.rsNew
append_literalshelper routes literal-section appends throughsimd_copy::copy_bytes_overshootingfor slices ≤ 32 bytes. That function has a byte-by-byte / overlapping-u64 exact-length tail path that LLVM inlines for short runtime sizes — no libc memmove call. Longer runs (> 32 bytes) fall through toextend_from_slicewhere libc memmove amortises across the copy.Investigation context: flamegraph on
compress/level_1_fast/decodecorpus.*pure_rustshowed__memmove_avx_unaligned_ermsat 60 % of bench CPU. After bench, this inline change accounts for ~0.5 % of measured time on the corpus — the memmove dominance turned out to be per-iteration allocation churn (hash table init memset 24 %, output Vec growth memcpy / realloc) rather than the literal-append path. Keeping the change anyway: it removes a real libc call per emitted sequence regardless of total impact, and the donor-parity shape is cleaner.zstd/src/decoding/mod.rssimd_copymodule visibility bumped frommodtopub(crate) modso the encode side can reach the helper.Test plan
compress/level_1_fast/decodecorpus-z000033 pure_rustthroughput 153.87 → 154.50 MiB/s (in noise). FFI side 351 MiB/s.