perf(decode): 8-slot software prefetch pipeline for sequence execution (#208)#227
Conversation
Port the donor ZSTD_decompressSequencesLong_body shape to the fused decode+execute path: 4-slot software pipeline that decodes 4 sequences ahead of execute, prefetching each match source line at decode time. By the time buffer.repeat() reaches the actual load for seq[i], the prefetch issued 4 iterations earlier has had time to pull the line into L1/L2 — hiding DRAM latency on long-distance matches whose source is beyond cache residency. decode_buffer.rs: - New prefetch_lookahead_match_source(start_idx): issues the prefetch hint at a logical buffer position, bound-checks against buffer.len() so out-of-range indices (intra-block self-overlap, wrapping_sub underflow on a malformed sequence, dictionary-sourced matches predating the current frame) silently drop the hint. Prefetches up to two 64 B cache lines per call (donor PREFETCH parity), bounded by PREFETCH_EXTENT = 128. sequence_section_decoder.rs: - 4-slot software pipeline: prefill 4 → steady (decode N, prefetch using a read-only offset estimate, execute N−4) → drain 4. - Ring stores raw [Sequence; 4]. do_offset_history stays in execute_one_sequence so a mid-loop error preserves the legacy 'partial output, no history rewind' semantics — the in-flight ring slots haven't mutated offset_hist yet. - For prefetch addressing only, a separate estimate_actual_offset resolves the seq's actual_offset against the current offset_hist WITHOUT mutating it. The estimate is exact for seq.of >= 4 (every fresh non-repcode offset — the dominant case on long-distance workloads); for repcode 1..=3 it mirrors do_offset_history_repcode's selection table read-only, lagging by at most ADVANCE sequences on close-range matches whose source is already L1-warm. - Opt-in by num_sequences >= 2 * ADVANCE. Short blocks fall back to the fused single-pass loop (existing in-loop prefetch retained). - Compile-time guard on ADVANCE.is_power_of_two() so the i & ADVANCE_MASK ring indexing stays equivalent to i % ADVANCE. Donor uses STORED_SEQS = 8; this PR starts at 4 to keep ring on the stack within register pressure. Future iterations can dial up after profiling confirms headroom. Closes #208.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a 4-stage lookahead predecode/prefetch pipeline for long-distance match reads: new ChangesMatch-source prefetch pipeline for long-distance matches
Sequence DiagramsequenceDiagram
participant Decoder as decode_and_execute_sequences
participant Ring as SequenceRing[8]
participant Shadow as shadow_hist
participant Prefetch as prefetch_lookahead_match_source
participant Executor as execute_one_sequence_pipelined / repeat
rect rgba(100, 150, 200, 0.5)
Note over Decoder,Ring: Prefill phase (i = 0 .. ADVANCE-1)
loop i = 0 to ADVANCE-1
Decoder->>Ring: decode_sequence[i] -> ring_slot[i]
Decoder->>Shadow: resolve offset for sequence[i] -> resolved_offset[i]
alt resolved_offset > threshold
Decoder->>Prefetch: prefetch match source for resolved_offset[i]
end
end
end
rect rgba(100, 150, 200, 0.5)
Note over Decoder,Executor: Steady-state pipeline
loop j = ADVANCE to num_sequences - 1
Decoder->>Ring: decode_sequence[j] -> ring_slot[j % ADVANCE]
Decoder->>Shadow: resolve offset for sequence[j]
alt resolved_offset > threshold
Decoder->>Prefetch: prefetch match source for resolved_offset[j]
end
Decoder->>Executor: execute sequence[j - ADVANCE] using resolved_offset (no history mutation per-iteration)
end
end
rect rgba(100, 150, 200, 0.5)
Note over Decoder,Executor: Drain remaining ring slots
loop drain remaining slots
Decoder->>Executor: execute remaining sequence[k]
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Two doc consistency findings from the pre-squash review: - execute_one_sequence comment said 'the pipeline above intentionally defers...' but the pipelined decode loop is BELOW this helper (execute_one_sequence is a nested fn defined at the top of the outer function; the loop using it comes after). Reword as 'pipelined decode loop below' so the directional reference matches the file layout. - prefetch_lookahead_match_source's body comment claimed 'donor PREFETCH_L1 parity' but the implementation calls prefetch_slice_t1, which uses _MM_HINT_T1 (L2 destination) on x86. Document the deviation explicitly: ADVANCE iterations of decode sit between prefetch issue and the matching repeat() load, so an L1-bound line risks eviction before consumption — L2 has the capacity to keep the line resident across that window. The pre-existing in-loop prefetch_match_source makes the same T1 choice for the same reason; surfacing it here prevents the same review comment next iteration.
There was a problem hiding this comment.
Pull request overview
This PR ports the donor zstd “long sequences” shape into the fused decode+execute path by introducing a 4-slot software pipeline that decodes sequences ahead of execution and issues prefetch hints early to better hide DRAM latency on long-distance match copies.
Changes:
- Adds a 4-slot decode-ahead / execute-behind ring pipeline in
decode_and_execute_sequences, with prefill/steady-state/drain phases. - Introduces a new
DecodeBuffer::prefetch_lookahead_match_source(start_idx)helper that bounds-checks a logical buffer index and prefetches up to 128 bytes (two cache lines). - Keeps offset-history mutation in the execution step and adds a read-only repcode offset estimator used only for prefetch addressing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| zstd/src/decoding/sequence_section_decoder.rs | Implements the 4-slot software pipeline and offset-estimation logic for lookahead prefetching. |
| zstd/src/decoding/decode_buffer.rs | Adds a bounded lookahead prefetch helper for match-source addresses computed ahead of execution. |
…uard Two findings from the post-squash review: - Issue #208 explicitly calls for the prefetch pipeline to be triggered only when the match offset exceeds an L1/L2 residency proxy (~32 KiB), so short-offset matches whose source is already warm don't burn issue-port pressure. The pipeline was prefetching every sequence unconditionally — add a PREFETCH_OFFSET_THRESHOLD (32 KiB) gate to both the prefill and steady-state loops, using the existing read-only est_offset (exact for seq.of >= 4 which covers every fresh long-distance match the gate targets). - The prefill loop's "if k + 1 < num_sequences" FSE-state-update guard was dead under the outer "num_sequences >= ADVANCE * 2" precondition (k < ADVANCE implies k + 1 <= ADVANCE < num_sequences), so the update fires every iteration regardless. Drop the guard and update the comment so the redundancy doesn't read as a pending edge case to a future maintainer; the steady-state loop keeps its conditional because i + 1 == num_sequences is reachable there.
…-stable prefetch_pos / match_start derived `seq.ll`, `seq.ml` straight from the bitstream. On a malformed/corrupt frame those values can be huge enough to overflow usize when added, panicking debug builds (fuzz, CI). The result feeds only the prefetch hint — and prefetch_lookahead_match_source bound-checks the logical position against buffer.len() so a wrap-derived garbage index is silently dropped. Switch the two cursor increments (prefill loop + steady-state loop, four sites in total) to wrapping_add so a malformed input becomes a bogus-but-dropped prefetch instead of a panic.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 230-342: The pipelined branch is entered solely on num_sequences
>= ADVANCE * 2 even when no sequence actually needs a prefetch (all est_offset <
PREFETCH_OFFSET_THRESHOLD), causing unnecessary ring/prefill/drain overhead;
change the logic around the ADVANCE pipeline (the prefill loop that calls
decode_one_sequence_inline and estimate_actual_offset, the steady-state loop
using ring and execute_one_sequence, and the final drain) to detect whether any
estimated offset crosses PREFETCH_OFFSET_THRESHOLD (track a bool like
saw_prefetch_needed while pre-filling/decoding the first ADVANCE sequences), and
only take the pipelined path (using ring, prefetch_pos,
buffer.prefetch_lookahead_match_source, etc.) when that flag is true—otherwise
fall back to the single-pass loop that decodes and immediately calls
execute_one_sequence for each sequence without prefill/ring overhead. Ensure the
same FSE state updates (ll_dec/ml_dec/of_dec.update_state_fast and
br.ensure_bits) remain consistent when choosing either path.
- Around line 205-227: The decode-ahead path uses the live offset_hist in
estimate_actual_offset (and the adjacent logic around the
PREFETCH_OFFSET_THRESHOLD check) but never advances the decode-side shadow for
ADVANCE-queued sequences, so repcodes after a large new offset see stale history
and can suppress long-distance prefetches; fix by introducing and advancing a
decode-side shadow history copy (the same structure used by the live resolver)
during decode-ahead/ADVANCE handling and use that shadow both inside
estimate_actual_offset and for the PREFETCH_OFFSET_THRESHOLD decision (and the
equivalent checks at the other nearby blocks referenced) so the prefetch
estimate and threshold check reflect queued updates consistently with
do_offset_history_repcode behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3ee2d816-89fb-4437-ad1e-af51584b3dfa
📒 Files selected for processing (2)
zstd/src/decoding/decode_buffer.rszstd/src/decoding/sequence_section_decoder.rs
… path Two CR Major findings on the prefetch pipeline: 1. Stale repcode history could suppress the long-distance prefetch. The previous estimate read the live offset_hist but never advanced it for the ADVANCE-queued sequences, so a fresh huge offset immediately followed by a repcode (1..=3) would see pre-update history and resolve to the OLD value — exactly the scenario the threshold gate then refused to prefetch, killing the gain on the workloads this pipeline targets. Replace estimate_actual_offset() with a local shadow_hist:[u32;3] advanced by do_offset_history() per decode-ahead. The real offset_hist still only mutates inside execute_one_sequence (the single committing site, preserving rollback semantics), but the shadow gives the prefetch path the exact post-resolution value for every repcode case. estimate_actual_offset is now removed. 2. Pipelined path ran on large all-short-offset blocks too. num_sequences >= ADVANCE * 2 was the only gate, so a long block of close-range matches paid ring/prefill/drain overhead while issuing zero prefetches. Track saw_prefetch_needed during prefill — flips true on the first actual_offset >= PREFETCH_OFFSET_THRESHOLD. If it stays false through all ADVANCE prefills (the block is dominated by warm short-offset matches), drain the queued ring sequences in order and continue with the single-pass loop for the rest, skipping the ring overhead entirely. If even one prefill sequence crosses the threshold, stay on the pipelined steady-state path.
…ail-slice guard Three Copilot consistency findings on the previous shadow_hist patch: - The PREFETCH_OFFSET_THRESHOLD comment still described the gate as reading the `est_offset` variable that no longer exists — the current code computes `actual_offset` via do_offset_history on the shadow hist. Reword to match. - The shadow_hist allocation note called it a "usize-triple" but it is actually `[u32; 3]` (12 bytes, not 24). Fix the size claim so cost reasoning is accurate. - prefetch_lookahead_match_source's else branch had an explicit `if idx < s2.len()` guard around the s2 prefetch path. With the top-of-function `start_idx >= self.buffer.len()` early-return and `buffer.len() == s1.len() + s2.len()`, the else branch (entered only when start_idx >= s1.len()) guarantees idx < s2.len() by construction — the explicit guard was a dead branch in a hot helper. Drop it, add a comment explaining the invariant.
…donor parity) Bench against main showed the pipeline as it stood (gated only by num_sequences >= 8 + per-seq actual_offset >= 32 KiB) regressed high-entropy / random-data decode by 2-5% while the target long-window workload (decodecorpus) was essentially zero. Reading donor `ZSTD_decompressBlock_internal` revealed we'd missed the front-of-block selector entirely: donor decides Long-vs-Short pipeline ONCE per block from the offsets FSE table shape, not per-sequence. Implement the donor `ZSTD_getOffsetInfo` equivalent: - Walk fse.offsets.decode up-front; count entries whose offset code is > 22 (raw offset >= 2^23 = 8 MiB). - Scale the count to donor's OffFSELog = 8 (256-entry reference) via `raw << (OFFSET_FSE_LOG - table_log)` so a finely-tuned table still registers meaningful share. - Engage the Long pipeline only when `num_sequences >= ADVANCE * 2` AND `scaled_share >= 7` (~2.73 % of the table — donor's 64-bit `minShare`). Otherwise drop to the single-pass loop. Side cleanups the new gate makes redundant: - The mid-block `saw_prefetch_needed` opt-out is gone — the FSE-table decision is taken before any sequence decode and applies to the whole block, addressing the CR Major (the previous opt-out couldn't engage the pipeline for blocks whose long-distance offsets only appeared past the first ADVANCE prefills). - The per-sequence `PREFETCH_OFFSET_THRESHOLD` constant is gone — donor `ZSTD_decompressSequencesLong_body` prefetches every queued slot unconditionally inside the Long pipeline. Per-seq gating risked skipping the prefetch for the slot whose source was the actual cache miss; gone for donor parity + simpler control flow.
…he ring wrap Previously `prefetch_lookahead_match_source` issued the prefetch hint from exactly one of the wrap-aware ring buffer's two slices. When the match source landed near the end of `s1` and the tail in `s1` was shorter than `PREFETCH_EXTENT` (128 B), the helper prefetched only what fit in s1 and the rest of the line silently went unfetched — the donor "up to two cache lines per match" intent collapsed to one or even zero useful prefetch hints at exactly the wrap boundary. Top up from `s2[0..]` when the s1 tail is shorter than the target extent, so the wrap case ends up with the same total prefetch coverage as a non-wrap case.
…on table rebuild The per-block walk of the offsets FSE decode table (counting entries whose symbol > 22) was paid even on Repeat-mode blocks where the table didn't change between blocks. On a corpus where REPEAT dominates after the first non-trivial block (typical for compressed streams that lock onto an FSE distribution and stick with it), this was 32–256 entry comparisons per block for zero new information. Move the share computation to a `compute_offsets_long_share` helper called from `maybe_update_fse_tables` ONLY when the FSE / Predefined branches actually rebuild the offsets table. Cache the result in a new `FSEScratch::offsets_long_share` field; Repeat-mode blocks read the cached value as-is — donor parity (donor's ddict-cold path also caches the share signal) plus removal of the per-block O(table_size) overhead from the pipeline-gate decision. The pipeline gate becomes `num_sequences >= ADVANCE*2 && fse.offsets_long_share >= MIN_LONG_OFFSET_SHARE` — one branch + one load, no table walk.
…ch_source coverage Three doc / scope updates plus one new test module section in response to CR notes on the prefetch pipeline: - Reword the 'offset_hist mutates here and only here' comment to acknowledge that the post-loop rollback path also assigns to it (recovery, not the normal mutation site that the comment was about). - Reword the 'AT MOST two prefetch issues per match' comment to acknowledge the wrap-boundary case where the 128 B budget is split across s1_tail + s2[0..], emitting up to four cache-line prefetches total (still bounded, still L1, still under the helper's MAX_LINES = 4 ceiling). - Add three unit tests for prefetch_lookahead_match_source covering in-range probes, out-of-range / empty-buffer probes (early-return bound check), and the wrap-boundary layout where as_slices() returns two non-empty halves (exercises the prefetch_first_line_l1 short-tail fallback path).
…ring (A+B+C) Three donor-vs-ours overhead trims in the pipelined decode hot path, each gated by the existing FSE-long-share pipeline-engage decision so the short-block fused path keeps its existing repeat() semantics unchanged. A. New `DecodeBuffer::repeat_lookahead_prefetched` variant. The regular `repeat` issues an in-loop `prefetch_match_source` (T1 hint) at the match source — sensible for the no-pipeline path that has no other prefetch source. When the long pipeline is active, the lookahead PREFETCH_L1 was already issued ADVANCE iterations earlier; the second in-loop prefetch is pure issue-port pressure on top of a now-warm L1 line. The pipelined variant drops that second prefetch. B. Pre-resolved offset stored in the ring slot. Ring goes from `[Sequence; 8]` to `[(Sequence, u32); 8]`. The decode-ahead phase already walks `shadow_hist` via `do_offset_history` to resolve the prefetch's source offset exactly; we now store that resolved offset alongside the raw sequence. The execute phase consumes the pre-resolved value via a new `execute_one_sequence_pipelined` helper and skips `do_offset_history` entirely — saves one function call and one cache write on the real `offset_hist` per sequence (real hist was the only mutable shared state in the execute path; pipelined now leaves it strictly local until the single end-of-block sync). The committing point becomes `*offset_hist = shadow_hist` ONCE after a successful drain. Rollback on post-loop bitstream failure still restores from `saved_offset_hist` — the new shape just makes that restore a no-op on the pipelined path (real hist was never mutated mid-loop) while preserving correctness on the non-pipelined fallback. C. `repeat_lookahead_prefetched` also drops the per-call `self.buffer.reserve(match_length)`. The pipelined caller already reserved `MAX_BLOCK_SIZE` upfront before entering the pipeline, so each per-sequence reserve was a redundant capacity-check branch. The variant relies on that upfront reserve exclusively — saves one branch per executed sequence.
…of offsets_long_share Three CR findings on the A+B+C pipelined-repeat patch: - repeat_lookahead_prefetched skipped buffer.reserve(match_length). On malformed input where match_length expands past the upfront reserve(MAX_BLOCK_SIZE), extend_from_within_unchecked* assumes the destination capacity and only debug_assert checks it — release builds would have written out of bounds. The reserve is amortised by the caller's upfront reservation anyway, so this is just a cheap capacity-check branch on the hot path, not an allocation. Restore it; only the in-loop prefetch_match_source stays skipped (that's the actual win). - Rename the SKIP_PREFETCH const param to match what it actually controls now that reserve is unconditional. - offsets_long_share doc said the cached value counts codes '≥ LONG_OFFSET_CODE_THRESHOLD' but the helper uses strictly '>' (donor parity — '> 22' means codes 23..MAX, i.e. raw offsets ≥ 2^23). Fix the doc string. - DecoderScratch::reset() reset every FSE-related field except offsets_long_share, leaving a stale gate signal across scratch reuse. A new frame opening with Repeat-mode tables would inherit the previous frame's pipeline-engage decision. Reset alongside ll_rle / ml_rle / of_rle.
…hape + offsets_long_share tests Three doc + test additions in response to CR notes on the A+B+C pipelined-repeat patch: - The execute_one_sequence_pipelined doc claimed it skipped both prefetch and reserve, but repeat_lookahead_prefetched keeps the reserve (memory-safety against malformed match_length). Rewrite the doc so only the in-loop prefetch is described as skipped. - The on-stack ring comment still said '[Sequence; 8] = 96 bytes' after the A+B+C refactor stored '(Sequence, u32)' pairs (128 B on a 64-bit target). Update the size + element type description. - Add four unit tests for compute_offsets_long_share covering: pure short-offset tables resolve to zero share; symbols at the threshold (=22) don't count (strict-greater predicate, donor parity); raw count scales by (OFFSET_FSE_LOG - table_log) so a single long-code entry in a 32-entry table lands at share 8; raw count passes through unscaled when table is already at OFFSET_FSE_LOG = 8. Uses a synthetic_offsets_table helper that bypasses build_from_probabilities — the share helper only reads decode[*].symbol + accuracy_log, so we can construct the table by hand without a full bitstream.
Dictionary-init flow (`Dictionary::decode_dict` → `DecoderScratch::init_from_dict` → `FSEScratch::reinit_from`) builds the offsets FSE table from the dictionary's entropy section without ever calling the sequence-decoder path that updates the cache, so `other.offsets_long_share` was the default 0. Verbatim-copying that into `self.offsets_long_share` left Repeat-mode blocks (which reuse the dictionary's offsets table) gated out of the long-pipeline path regardless of the table's actual offset distribution. Recompute the share inside `reinit_from` from the just-copied offsets table via the existing `compute_offsets_long_share` helper (bumped from private to `pub(crate)`). The walk is bounded by `OF_MAX_LOG = 8` (≤ 256 entries) and runs once per scratch reinit — negligible vs the rest of the per-frame setup.
The pipelined branch deferred the real-`offset_hist` commit to a single `*offset_hist = shadow_hist` at the end of a successful drain. On an `execute_one_sequence_pipelined` Err propagated via `?` (NotEnoughBytesForSequence, ZeroOffset, OOB match copy) that commit never ran — the function exited with `*offset_hist` still at its pre-block value while the buffer had N-1 partial writes. That diverged from the non-pipelined path: `execute_one_sequence` mutates real hist in lockstep per executed sequence, so on mid-loop Err the caller sees N-1 hist updates + N-1 buffer writes (in lockstep, internally consistent). The pipelined path left the two out of sync, breaking scratch reuse after any Err. Wrap the entire pipelined work in an IIFE that returns Result, so a single `if let Err(e) = pipeline_result` site at the end catches mid-loop Errs uniformly and routes them through the same buffer-checkpoint-rollback path the post-loop bitstream validation already used. `*offset_hist = saved_offset_hist` after the rollback is a no-op on the pipelined path (real hist was never touched mid-loop) but stays explicit so future refactors that move the commit earlier remain safe.
| if use_long_pipeline { | ||
| // The pipelined branch must roll `offset_hist` back to | ||
| // `saved_offset_hist` on ANY mid-loop error, not just the | ||
| // post-loop bitstream-validation path. Without this, an | ||
| // `execute_one_sequence_pipelined` Err (NotEnoughBytesForSequence | ||
| // / ZeroOffset / OOB match) propagated via `?` would exit with | ||
| // `*offset_hist` still at its pre-block value while the buffer | ||
| // had N-1 partial writes — diverging from the non-pipelined | ||
| // path (which mutates hist in lockstep per executed sequence) | ||
| // and leaving scratch internally inconsistent for any | ||
| // post-Err reuse. Wrap the entire pipelined work in an IIFE so | ||
| // a single rollback site catches all mid-loop Errs uniformly. | ||
| let pipeline_result: Result<(), DecompressBlockError> = (|| { | ||
| // `prefetch_pos` is the logical buffer index (same frame as | ||
| // `buffer.len()`) at which the NEXT not-yet-decoded sequence | ||
| // will start pushing literals. We pre-decode `ADVANCE` ahead, so we | ||
| // accumulate (ll + ml) per decoded seq to keep this position | ||
| // synchronised with where execute will eventually be. | ||
| let mut prefetch_pos: usize = old_buffer_size; | ||
| // Shadow copy of `offset_hist`, advanced by | ||
| // `do_offset_history` for every decoded-ahead sequence. The | ||
| // REAL `offset_hist` is only mutated inside | ||
| // `execute_one_sequence` (preserving the legacy 'partial | ||
| // output, no rewound history' rollback contract), but the | ||
| // prefetch needs the exact post-resolution offset for repcode | ||
| // 1..=3 cases that read history — a stale read would skip | ||
| // the long-distance prefetch precisely when a fresh huge | ||
| // offset is followed by a repcode that aliases it. The | ||
| // shadow is a local `[u32; 3]` (12 bytes) so the simulation cost | ||
| // is negligible. | ||
| let mut shadow_hist: [u32; 3] = *offset_hist; | ||
| // Stack ring of `(decoded_seq, resolved_offset)` pairs. The | ||
| // decode-ahead phase resolves repcodes against `shadow_hist` | ||
| // and stores the resolved offset alongside the raw sequence, | ||
| // so the execute phase consumes a pre-resolved offset and | ||
| // skips `do_offset_history` entirely — saves one function | ||
| // call + one cache write on real `offset_hist` per sequence. | ||
| // The real `offset_hist` is updated ONCE from `shadow_hist` | ||
| // after a successful drain (below); on a malformed-block | ||
| // rollback the saved snapshot is restored, so real hist is | ||
| // never observed in a partial mid-pipeline state. | ||
| let mut ring: [(Sequence, u32); ADVANCE] = [( | ||
| Sequence { | ||
| ll: 0, | ||
| ml: 0, | ||
| of: 0, | ||
| }, | ||
| 0u32, | ||
| ); ADVANCE]; | ||
|
|
||
| // Pre-fill the ring. The outer `num_sequences >= ADVANCE * 2` | ||
| // gate guarantees `num_sequences > ADVANCE`, so the FSE | ||
| // state update is needed after every prefill decode — no | ||
| // `isLastSeq` guard required here, only in the steady-state | ||
| // loop where `i + 1 == num_sequences` is reachable. | ||
| for slot in ring.iter_mut() { | ||
| let seq = | ||
| decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br); | ||
| // EXACT actual_offset via shadow history. | ||
| let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist); | ||
| // wrapping_add: prefetch_pos / seq.ll / seq.ml are | ||
| // derived from the bitstream, so a malformed frame can | ||
| // present values that would overflow usize and panic | ||
| // under debug. The result feeds only the prefetch | ||
| // hint — `prefetch_lookahead_match_source` bound-checks | ||
| // the logical position against `buffer.len()` and drops | ||
| // wrap-derived garbage indices, so the wrap is harmless | ||
| // here while keeping the decoder fuzz-stable. | ||
| let match_start = prefetch_pos.wrapping_add(seq.ll as usize); | ||
| let source_idx = match_start.wrapping_sub(actual_offset as usize); | ||
| buffer.prefetch_lookahead_match_source(source_idx); | ||
| prefetch_pos = match_start.wrapping_add(seq.ml as usize); | ||
| *slot = (seq, actual_offset); | ||
| br.ensure_bits(max_update_bits); | ||
| ll_dec.update_state_fast(&mut br); | ||
| ml_dec.update_state_fast(&mut br); | ||
| of_dec.update_state_fast(&mut br); | ||
| } | ||
|
|
||
| // Steady state: decode next, prefetch its source, execute | ||
| // the oldest slot in the ring (with its pre-resolved offset). | ||
| for i in ADVANCE..num_sequences { | ||
| let seq = | ||
| decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br); | ||
| let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist); | ||
| let match_start = prefetch_pos.wrapping_add(seq.ll as usize); | ||
| let source_idx = match_start.wrapping_sub(actual_offset as usize); | ||
| buffer.prefetch_lookahead_match_source(source_idx); | ||
| prefetch_pos = match_start.wrapping_add(seq.ml as usize); | ||
|
|
||
| let slot = i & ADVANCE_MASK; | ||
| let (exec_seq, exec_offset) = ring[slot]; | ||
| ring[slot] = (seq, actual_offset); | ||
|
|
||
| execute_one_sequence_pipelined( | ||
| buffer, | ||
| literals_buffer, | ||
| &mut lit_cur, | ||
| literals_buffer_len, | ||
| exec_seq, | ||
| exec_offset, | ||
| )?; | ||
| seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml); | ||
|
|
||
| if i + 1 < num_sequences { | ||
| br.ensure_bits(max_update_bits); | ||
| ll_dec.update_state_fast(&mut br); | ||
| ml_dec.update_state_fast(&mut br); | ||
| of_dec.update_state_fast(&mut br); | ||
| } | ||
| } | ||
|
|
||
| // Drain: execute remaining ADVANCE sequences with their | ||
| // pre-resolved offsets. Iteration order matches the ring | ||
| // slot they occupy from the steady-state loop's final write. | ||
| for k in 0..ADVANCE { | ||
| let slot = (num_sequences + k) & ADVANCE_MASK; | ||
| let (exec_seq, exec_offset) = ring[slot]; | ||
| execute_one_sequence_pipelined( | ||
| buffer, | ||
| literals_buffer, | ||
| &mut lit_cur, | ||
| literals_buffer_len, | ||
| exec_seq, | ||
| exec_offset, | ||
| )?; | ||
| seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml); | ||
| } | ||
|
|
||
| // Single committing point for real offset history on the | ||
| // pipelined success path. Shadow walked every queued | ||
| // sequence already; copy that state back so the next | ||
| // block sees the post-block repcodes. Rollback on a later | ||
| // bitstream-failure overwrites this with | ||
| // `saved_offset_hist`, undoing the commit. | ||
| *offset_hist = shadow_hist; | ||
| Ok(()) | ||
| })(); | ||
| if let Err(e) = pipeline_result { | ||
| // Mid-loop execute Err: rollback buffer + hist so post-Err | ||
| // scratch reuse stays consistent. `*offset_hist` is still | ||
| // at its pre-block value (the success-only commit above | ||
| // never ran), so restoring from `saved_offset_hist` is | ||
| // effectively a no-op on the hist side — the explicit | ||
| // assignment makes the intent unambiguous and protects | ||
| // against any future refactor that moves the commit | ||
| // earlier in the pipelined flow. | ||
| if buffer.try_restore_checkpoint(buffer_checkpoint) { | ||
| *offset_hist = saved_offset_hist; | ||
| } | ||
| return Err(e); |
There was a problem hiding this comment.
The rollback-on-Err shape here is intentional donor parity, not a divergence.
Donor ZSTD_decompressSequencesLong_body (lib/decompress/zstd_decompress_block.c:1832-1990) commits repcodes ONCE at end of block, AFTER RETURN_ERROR_IF(!BIT_endOfDStream(...)):
/* save reps for next block */
{ U32 i; for (i=0; i<ZSTD_REP_NUM; i++) dctx->entropy.rep[i] = (U32)(seqState.prevOffset[i]); }If any mid-loop ZSTD_execSequence returns an error (oneSeqSize is error → if (ZSTD_isError(oneSeqSize)) return oneSeqSize;), or the post-loop bitstream-end check fails, that save never runs and dctx->entropy.rep[i] stays at its pre-block value. The same applies to ZSTD_decompressSequences_body (zstd_decompress_block.c:1717-1797) — same end-of-block save pattern, same skip on early return.
Our non-pipelined execute_one_sequence currently mutates real offset_hist per sequence (an older pattern from the pre-fused-executor era). That's the path that diverges from donor — not the pipelined path I just landed. Bringing the non-pipelined path to donor parity is in scope of a separate cleanup, not this PR.
On the try_restore_checkpoint cap-changed edge case: the conditional if buffer.try_restore_checkpoint(...) { *offset_hist = saved_offset_hist; } is the same shape used by the existing post-loop bitstream-validation rollback (lines 426-428, pre-existing). Both branches share the "only restore hist when buffer rollback actually succeeded" rule so the hist+buffer pair stays in lockstep — restoring hist while leaving buffer partial would be the actual inconsistency. The cap-changed case implies the frame is already malformed past MAX_BLOCK_SIZE; the partial data left in the buffer is discarded by the propagated Err anyway.
Summary
Port the donor
ZSTD_decompressSequencesLong_bodyshape to the fused decode+execute path: 8-slot software pipeline (donorSTORED_SEQS = 8) that decodes ADVANCE sequences ahead of execute, prefetching each match source line at decode time. By the timebuffer.repeat()reaches the actual load for seq[i], the prefetch issued ADVANCE iterations earlier has had time to pull the line into L1 — hiding DRAM latency on long-distance matches whose source is beyond cache residency.Closes #208.
Changes
zstd/src/decoding/decode_buffer.rsprefetch_lookahead_match_source(start_idx)— issues the prefetch hint at a logical buffer position, bound-checks againstbuffer.len()so out-of-range indices (intra-block self-overlap,wrapping_subunderflow on a malformed sequence, dictionary-sourced matches predating the current frame) silently drop the hint.as_slices()boundary, prefetches the s1 tail and tops up from s2[0..] so the donor's "up to two cache lines per match" intent doesn't collapse to one (or zero) at the wrap. Short-tail (< 64 B) cases fall back toprefetch_first_line_l1so the cache line containingstart_idxis always hinted.prefetch_slice(_MM_HINT_T0/pldl1keep→ L1) matching donorPREFETCH_L1.zstd/src/decoding/sequence_section_decoder.rsshadow_histmirror ofoffset_hist, execute N−ADVANCE) → drain ADVANCE.[Sequence; 8].do_offset_historyruns both on the localshadow_histduring decode-ahead (for exact, not estimated, repcode resolution in the prefetch path) AND on the realoffset_histinexecute_one_sequence. The realoffset_histis rolled back fromsaved_offset_histif the post-loop bitstream check rejects the block, preserving the legacy "partial output, no rewound history" rollback contract on errors.ZSTD_getOffsetInfoparity gate: pipeline only engages whennum_sequences >= 2 * ADVANCEAND the FSE offsets table shape shows enough long-offset codes — donor'sminShare(7/256 on 64-bit, 20/256 on 32-bit). The share is computed once when the offsets table is rebuilt (FSE / Predefined modes inmaybe_update_fse_tables) and cached inFSEScratch::offsets_long_share; Repeat-mode blocks reuse the cached value without re-walking the table.ADVANCE.is_power_of_two()so thei & ADVANCE_MASKring indexing stays equivalent toi % ADVANCEifADVANCEis ever changed.zstd/src/decoding/prefetch.rsprefetch_first_line_l1helper that issues exactly one L1 prefetch at the first byte of a slice regardless of slice length.prefetch_sliceno-ops on slices < CACHE_LINE; that's right for bulk prefetch but wrong for the wrap-boundary case where the cache line containingstart_idxis still the line we need warmed. x86_64 / x86 / aarch64 implementations.Test plan
Summary by CodeRabbit