perf(decode): 8-slot software prefetch pipeline for sequence execution (#208) by polaz · Pull Request #227 · structured-world/structured-zstd

polaz · 2026-05-22T09:19:01Z

Summary

Port the donor ZSTD_decompressSequencesLong_body shape to the fused decode+execute path: 8-slot software pipeline (donor STORED_SEQS = 8) that decodes ADVANCE sequences ahead of execute, prefetching each match source line at decode time. By the time buffer.repeat() reaches the actual load for seq[i], the prefetch issued ADVANCE iterations earlier has had time to pull the line into L1 — hiding DRAM latency on long-distance matches whose source is beyond cache residency.

Closes #208.

Changes

`zstd/src/decoding/decode_buffer.rs`

New prefetch_lookahead_match_source(start_idx) — issues the prefetch hint at a logical buffer position, bound-checks against buffer.len() so out-of-range indices (intra-block self-overlap, wrapping_sub underflow on a malformed sequence, dictionary-sourced matches predating the current frame) silently drop the hint.
Wrap-boundary fallback: when the match source straddles the ring's as_slices() boundary, prefetches the s1 tail and tops up from s2[0..] so the donor's "up to two cache lines per match" intent doesn't collapse to one (or zero) at the wrap. Short-tail (< 64 B) cases fall back to prefetch_first_line_l1 so the cache line containing start_idx is always hinted.
Routed through prefetch_slice (_MM_HINT_T0 / pldl1keep → L1) matching donor PREFETCH_L1.

`zstd/src/decoding/sequence_section_decoder.rs`

8-slot software pipeline: prefill ADVANCE → steady (decode N, prefetch using a shadow_hist mirror of offset_hist, execute N−ADVANCE) → drain ADVANCE.
Ring stores raw [Sequence; 8]. do_offset_history runs both on the local shadow_hist during decode-ahead (for exact, not estimated, repcode resolution in the prefetch path) AND on the real offset_hist in execute_one_sequence. The real offset_hist is rolled back from saved_offset_hist if the post-loop bitstream check rejects the block, preserving the legacy "partial output, no rewound history" rollback contract on errors.
Donor ZSTD_getOffsetInfo parity gate: pipeline only engages when num_sequences >= 2 * ADVANCE AND the FSE offsets table shape shows enough long-offset codes — donor's minShare (7/256 on 64-bit, 20/256 on 32-bit). The share is computed once when the offsets table is rebuilt (FSE / Predefined modes in maybe_update_fse_tables) and cached in FSEScratch::offsets_long_share; Repeat-mode blocks reuse the cached value without re-walking the table.
Compile-time guard on ADVANCE.is_power_of_two() so the i & ADVANCE_MASK ring indexing stays equivalent to i % ADVANCE if ADVANCE is ever changed.

`zstd/src/decoding/prefetch.rs`

New prefetch_first_line_l1 helper that issues exactly one L1 prefetch at the first byte of a slice regardless of slice length. prefetch_slice no-ops on slices < CACHE_LINE; that's right for bulk prefetch but wrong for the wrap-boundary case where the cache line containing start_idx is still the line we need warmed. x86_64 / x86 / aarch64 implementations.

Test plan

full nextest suite — 576 passed, 3 skipped
lint+typecheck clean
Bench on Linux server (i9-9900K, BMI2/AVX2) on level_3 + level_19 corpora — mixed results, bench variance dominates over the small expected gains on this hardware
Flamegraph on large-window corpus to confirm DRAM stall reduction

Summary by CodeRabbit

Performance
- Pipelined sequence execution for large blocks to boost decompression throughput
- Lookahead prefetching and L1-first-line prefetching to reduce memory latency and improve cache efficiency
New Features
- Tunable pipeline gating based on offset-table characteristics for steadier hot-path execution
Tests
- Unit tests validating lookahead prefetch behavior and offset-share computation
Documentation
- Clarified offset-history and pipeline-gate semantics in decoder docs

Port the donor ZSTD_decompressSequencesLong_body shape to the fused decode+execute path: 4-slot software pipeline that decodes 4 sequences ahead of execute, prefetching each match source line at decode time. By the time buffer.repeat() reaches the actual load for seq[i], the prefetch issued 4 iterations earlier has had time to pull the line into L1/L2 — hiding DRAM latency on long-distance matches whose source is beyond cache residency. decode_buffer.rs: - New prefetch_lookahead_match_source(start_idx): issues the prefetch hint at a logical buffer position, bound-checks against buffer.len() so out-of-range indices (intra-block self-overlap, wrapping_sub underflow on a malformed sequence, dictionary-sourced matches predating the current frame) silently drop the hint. Prefetches up to two 64 B cache lines per call (donor PREFETCH parity), bounded by PREFETCH_EXTENT = 128. sequence_section_decoder.rs: - 4-slot software pipeline: prefill 4 → steady (decode N, prefetch using a read-only offset estimate, execute N−4) → drain 4. - Ring stores raw [Sequence; 4]. do_offset_history stays in execute_one_sequence so a mid-loop error preserves the legacy 'partial output, no history rewind' semantics — the in-flight ring slots haven't mutated offset_hist yet. - For prefetch addressing only, a separate estimate_actual_offset resolves the seq's actual_offset against the current offset_hist WITHOUT mutating it. The estimate is exact for seq.of >= 4 (every fresh non-repcode offset — the dominant case on long-distance workloads); for repcode 1..=3 it mirrors do_offset_history_repcode's selection table read-only, lagging by at most ADVANCE sequences on close-range matches whose source is already L1-warm. - Opt-in by num_sequences >= 2 * ADVANCE. Short blocks fall back to the fused single-pass loop (existing in-loop prefetch retained). - Compile-time guard on ADVANCE.is_power_of_two() so the i & ADVANCE_MASK ring indexing stays equivalent to i % ADVANCE. Donor uses STORED_SEQS = 8; this PR starts at 4 to keep ring on the stack within register pressure. Future iterations can dial up after profiling confirms headroom. Closes #208.

coderabbitai · 2026-05-22T09:19:08Z

Warning

Rate limit exceeded

@polaz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 56 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f6e34e23-f57d-43f9-a518-5f8fbd2c80b6

📥 Commits

Reviewing files that changed from the base of the PR and between 72e0754 and 64ea74a.

📒 Files selected for processing (1)

zstd/src/decoding/sequence_section_decoder.rs

📝 Walkthrough

Walkthrough

Adds a 4-stage lookahead predecode/prefetch pipeline for long-distance match reads: new DecodeBuffer::prefetch_lookahead_match_source issues L1/L2 prefetches, decode_and_execute_sequences conditionally uses an 8-slot decode-ahead ring with resolved offsets and lookahead prefetching, and supporting primitives and metadata caching are added.

Changes

Match-source prefetch pipeline for long-distance matches

Layer / File(s)	Summary
DecodeBuffer repeat refactor & lookahead prefetch `zstd/src/decoding/decode_buffer.rs`	Refactors `repeat` into `repeat_inner<const SKIP_PREFETCH>`, adds `repeat_lookahead_prefetched`, implements `prefetch_lookahead_match_source` with wrap-aware `as_slices()` prefetching (up to 128 bytes), and adds unit tests for in-range, out-of-range, empty, and wrap cases.
Prefetch primitive implementations `zstd/src/decoding/prefetch.rs`	Adds `prefetch_first_line_l1(slice)` wrapper and per-architecture `prefetch_first_line_l1_impl(ptr)` implementations for `x86_64`, `x86` (SSE), and `aarch64`, plus a no-op fallback.
FSE scratch: offsets_long_share cache `zstd/src/decoding/scratch.rs`	Adds `offsets_long_share: u32` to `FSEScratch`, initializes/reset to 0 in constructors and reset, and recomputes it during `reinit_from`.
Sequence decoder pipeline and metadata `zstd/src/decoding/sequence_section_decoder.rs`	Documents `offset_hist` mutation semantics, adds `execute_one_sequence_pipelined`, implements conditional 8-slot ring decode-ahead pipeline with `shadow_hist` offset resolution and lookahead prefetch, adds `compute_offsets_long_share`, and updates `maybe_update_fse_tables` to populate the cached value.

Sequence Diagram

sequenceDiagram
  participant Decoder as decode_and_execute_sequences
  participant Ring as SequenceRing[8]
  participant Shadow as shadow_hist
  participant Prefetch as prefetch_lookahead_match_source
  participant Executor as execute_one_sequence_pipelined / repeat

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Ring: Prefill phase (i = 0 .. ADVANCE-1)
    loop i = 0 to ADVANCE-1
      Decoder->>Ring: decode_sequence[i] -> ring_slot[i]
      Decoder->>Shadow: resolve offset for sequence[i] -> resolved_offset[i]
      alt resolved_offset > threshold
        Decoder->>Prefetch: prefetch match source for resolved_offset[i]
      end
    end
  end

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Executor: Steady-state pipeline
    loop j = ADVANCE to num_sequences - 1
      Decoder->>Ring: decode_sequence[j] -> ring_slot[j % ADVANCE]
      Decoder->>Shadow: resolve offset for sequence[j]
      alt resolved_offset > threshold
        Decoder->>Prefetch: prefetch match source for resolved_offset[j]
      end
      Decoder->>Executor: execute sequence[j - ADVANCE] using resolved_offset (no history mutation per-iteration)
    end
  end

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Executor: Drain remaining ring slots
    loop drain remaining slots
      Decoder->>Executor: execute remaining sequence[k]
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

structured-world/structured-zstd#42: Also modifies decode_buffer.rs prefetching and integrates prefetch into the decode path.
structured-world/structured-zstd#90: Related changes to prefetch helpers and prefetch_slice_t1 used around the same hot paths.
structured-world/structured-zstd#194: Prior work altering the decode hot path and checkpoint/rollback behavior adjacent to this pipeline change.

🐰
I hop ahead to warm cold lanes,
I plant soft hints where cache remains,
Four little steps, a tiny ring,
Then bytes arrive and decoders sing,
I nibble latency, speed gains.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'perf(decode): 8-slot software prefetch pipeline for sequence execution' directly and accurately describes the main change: introducing an 8-slot prefetch pipeline for performance improvement in sequence decoding.
Linked Issues check	✅ Passed	The PR successfully addresses all coding requirements from issue `#208`: implements the multi-stage software prefetch pipeline (8-slot vs. 4-stage proposed), adds lookahead prefetch helpers with bounds checking, gates the pipeline based on offset-table long-share metrics and sequence count thresholds, and maintains correctness with proper offset history handling across error paths.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing the prefetch pipeline: decode_buffer refactoring for lookahead prefetch support, sequence_section_decoder pipelined execution logic, prefetch.rs helper functions, and scratch.rs cache field for pipeline gating. No unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/#208-match-prefetch

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-22T09:21:11Z

Codecov Report

❌ Patch coverage is 91.48936% with 24 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
zstd/src/decoding/sequence_section_decoder.rs	92.97%	13 Missing ⚠️
zstd/src/decoding/decode_buffer.rs	87.95%	10 Missing ⚠️
zstd/src/decoding/prefetch.rs	88.88%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Two doc consistency findings from the pre-squash review: - execute_one_sequence comment said 'the pipeline above intentionally defers...' but the pipelined decode loop is BELOW this helper (execute_one_sequence is a nested fn defined at the top of the outer function; the loop using it comes after). Reword as 'pipelined decode loop below' so the directional reference matches the file layout. - prefetch_lookahead_match_source's body comment claimed 'donor PREFETCH_L1 parity' but the implementation calls prefetch_slice_t1, which uses _MM_HINT_T1 (L2 destination) on x86. Document the deviation explicitly: ADVANCE iterations of decode sit between prefetch issue and the matching repeat() load, so an L1-bound line risks eviction before consumption — L2 has the capacity to keep the line resident across that window. The pre-existing in-loop prefetch_match_source makes the same T1 choice for the same reason; surfacing it here prevents the same review comment next iteration.

Copilot

Pull request overview

This PR ports the donor zstd “long sequences” shape into the fused decode+execute path by introducing a 4-slot software pipeline that decodes sequences ahead of execution and issues prefetch hints early to better hide DRAM latency on long-distance match copies.

Changes:

Adds a 4-slot decode-ahead / execute-behind ring pipeline in decode_and_execute_sequences, with prefill/steady-state/drain phases.
Introduces a new DecodeBuffer::prefetch_lookahead_match_source(start_idx) helper that bounds-checks a logical buffer index and prefetches up to 128 bytes (two cache lines).
Keeps offset-history mutation in the execution step and adds a read-only repcode offset estimator used only for prefetch addressing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
zstd/src/decoding/sequence_section_decoder.rs	Implements the 4-slot software pipeline and offset-estimation logic for lookahead prefetching.
zstd/src/decoding/decode_buffer.rs	Adds a bounded lookahead prefetch helper for match-source addresses computed ahead of execution.

…uard Two findings from the post-squash review: - Issue #208 explicitly calls for the prefetch pipeline to be triggered only when the match offset exceeds an L1/L2 residency proxy (~32 KiB), so short-offset matches whose source is already warm don't burn issue-port pressure. The pipeline was prefetching every sequence unconditionally — add a PREFETCH_OFFSET_THRESHOLD (32 KiB) gate to both the prefill and steady-state loops, using the existing read-only est_offset (exact for seq.of >= 4 which covers every fresh long-distance match the gate targets). - The prefill loop's "if k + 1 < num_sequences" FSE-state-update guard was dead under the outer "num_sequences >= ADVANCE * 2" precondition (k < ADVANCE implies k + 1 <= ADVANCE < num_sequences), so the update fires every iteration regardless. Drop the guard and update the comment so the redundancy doesn't read as a pending edge case to a future maintainer; the steady-state loop keeps its conditional because i + 1 == num_sequences is reachable there.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

…-stable prefetch_pos / match_start derived `seq.ll`, `seq.ml` straight from the bitstream. On a malformed/corrupt frame those values can be huge enough to overflow usize when added, panicking debug builds (fuzz, CI). The result feeds only the prefetch hint — and prefetch_lookahead_match_source bound-checks the logical position against buffer.len() so a wrap-derived garbage index is silently dropped. Switch the two cursor increments (prefill loop + steady-state loop, four sites in total) to wrapping_add so a malformed input becomes a bogus-but-dropped prefetch instead of a panic.

polaz · 2026-05-22T10:49:03Z

@coderabbitai review

coderabbitai · 2026-05-22T10:49:31Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 230-342: The pipelined branch is entered solely on num_sequences
>= ADVANCE * 2 even when no sequence actually needs a prefetch (all est_offset <
PREFETCH_OFFSET_THRESHOLD), causing unnecessary ring/prefill/drain overhead;
change the logic around the ADVANCE pipeline (the prefill loop that calls
decode_one_sequence_inline and estimate_actual_offset, the steady-state loop
using ring and execute_one_sequence, and the final drain) to detect whether any
estimated offset crosses PREFETCH_OFFSET_THRESHOLD (track a bool like
saw_prefetch_needed while pre-filling/decoding the first ADVANCE sequences), and
only take the pipelined path (using ring, prefetch_pos,
buffer.prefetch_lookahead_match_source, etc.) when that flag is true—otherwise
fall back to the single-pass loop that decodes and immediately calls
execute_one_sequence for each sequence without prefill/ring overhead. Ensure the
same FSE state updates (ll_dec/ml_dec/of_dec.update_state_fast and
br.ensure_bits) remain consistent when choosing either path.
- Around line 205-227: The decode-ahead path uses the live offset_hist in
estimate_actual_offset (and the adjacent logic around the
PREFETCH_OFFSET_THRESHOLD check) but never advances the decode-side shadow for
ADVANCE-queued sequences, so repcodes after a large new offset see stale history
and can suppress long-distance prefetches; fix by introducing and advancing a
decode-side shadow history copy (the same structure used by the live resolver)
during decode-ahead/ADVANCE handling and use that shadow both inside
estimate_actual_offset and for the PREFETCH_OFFSET_THRESHOLD decision (and the
equivalent checks at the other nearby blocks referenced) so the prefetch
estimate and threshold check reflect queued updates consistently with
do_offset_history_repcode behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3ee2d816-89fb-4437-ad1e-af51584b3dfa

📥 Commits

Reviewing files that changed from the base of the PR and between c22b8b1 and a3f4a78.

📒 Files selected for processing (2)

zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/sequence_section_decoder.rs

… path Two CR Major findings on the prefetch pipeline: 1. Stale repcode history could suppress the long-distance prefetch. The previous estimate read the live offset_hist but never advanced it for the ADVANCE-queued sequences, so a fresh huge offset immediately followed by a repcode (1..=3) would see pre-update history and resolve to the OLD value — exactly the scenario the threshold gate then refused to prefetch, killing the gain on the workloads this pipeline targets. Replace estimate_actual_offset() with a local shadow_hist:[u32;3] advanced by do_offset_history() per decode-ahead. The real offset_hist still only mutates inside execute_one_sequence (the single committing site, preserving rollback semantics), but the shadow gives the prefetch path the exact post-resolution value for every repcode case. estimate_actual_offset is now removed. 2. Pipelined path ran on large all-short-offset blocks too. num_sequences >= ADVANCE * 2 was the only gate, so a long block of close-range matches paid ring/prefill/drain overhead while issuing zero prefetches. Track saw_prefetch_needed during prefill — flips true on the first actual_offset >= PREFETCH_OFFSET_THRESHOLD. If it stays false through all ADVANCE prefills (the block is dominated by warm short-offset matches), drain the queued ring sequences in order and continue with the single-pass loop for the rest, skipping the ring overhead entirely. If even one prefill sequence crosses the threshold, stay on the pipelined steady-state path.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

…ail-slice guard Three Copilot consistency findings on the previous shadow_hist patch: - The PREFETCH_OFFSET_THRESHOLD comment still described the gate as reading the `est_offset` variable that no longer exists — the current code computes `actual_offset` via do_offset_history on the shadow hist. Reword to match. - The shadow_hist allocation note called it a "usize-triple" but it is actually `[u32; 3]` (12 bytes, not 24). Fix the size claim so cost reasoning is accurate. - prefetch_lookahead_match_source's else branch had an explicit `if idx < s2.len()` guard around the s2 prefetch path. With the top-of-function `start_idx >= self.buffer.len()` early-return and `buffer.len() == s1.len() + s2.len()`, the else branch (entered only when start_idx >= s1.len()) guarantees idx < s2.len() by construction — the explicit guard was a dead branch in a hot helper. Drop it, add a comment explaining the invariant.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

…donor parity) Bench against main showed the pipeline as it stood (gated only by num_sequences >= 8 + per-seq actual_offset >= 32 KiB) regressed high-entropy / random-data decode by 2-5% while the target long-window workload (decodecorpus) was essentially zero. Reading donor `ZSTD_decompressBlock_internal` revealed we'd missed the front-of-block selector entirely: donor decides Long-vs-Short pipeline ONCE per block from the offsets FSE table shape, not per-sequence. Implement the donor `ZSTD_getOffsetInfo` equivalent: - Walk fse.offsets.decode up-front; count entries whose offset code is > 22 (raw offset >= 2^23 = 8 MiB). - Scale the count to donor's OffFSELog = 8 (256-entry reference) via `raw << (OFFSET_FSE_LOG - table_log)` so a finely-tuned table still registers meaningful share. - Engage the Long pipeline only when `num_sequences >= ADVANCE * 2` AND `scaled_share >= 7` (~2.73 % of the table — donor's 64-bit `minShare`). Otherwise drop to the single-pass loop. Side cleanups the new gate makes redundant: - The mid-block `saw_prefetch_needed` opt-out is gone — the FSE-table decision is taken before any sequence decode and applies to the whole block, addressing the CR Major (the previous opt-out couldn't engage the pipeline for blocks whose long-distance offsets only appeared past the first ADVANCE prefills). - The per-sequence `PREFETCH_OFFSET_THRESHOLD` constant is gone — donor `ZSTD_decompressSequencesLong_body` prefetches every queued slot unconditionally inside the Long pipeline. Per-seq gating risked skipping the prefetch for the slot whose source was the actual cache miss; gone for donor parity + simpler control flow.

…he ring wrap Previously `prefetch_lookahead_match_source` issued the prefetch hint from exactly one of the wrap-aware ring buffer's two slices. When the match source landed near the end of `s1` and the tail in `s1` was shorter than `PREFETCH_EXTENT` (128 B), the helper prefetched only what fit in s1 and the rest of the line silently went unfetched — the donor "up to two cache lines per match" intent collapsed to one or even zero useful prefetch hints at exactly the wrap boundary. Top up from `s2[0..]` when the s1 tail is shorter than the target extent, so the wrap case ends up with the same total prefetch coverage as a non-wrap case.

…on table rebuild The per-block walk of the offsets FSE decode table (counting entries whose symbol > 22) was paid even on Repeat-mode blocks where the table didn't change between blocks. On a corpus where REPEAT dominates after the first non-trivial block (typical for compressed streams that lock onto an FSE distribution and stick with it), this was 32–256 entry comparisons per block for zero new information. Move the share computation to a `compute_offsets_long_share` helper called from `maybe_update_fse_tables` ONLY when the FSE / Predefined branches actually rebuild the offsets table. Cache the result in a new `FSEScratch::offsets_long_share` field; Repeat-mode blocks read the cached value as-is — donor parity (donor's ddict-cold path also caches the share signal) plus removal of the per-block O(table_size) overhead from the pipeline-gate decision. The pipeline gate becomes `num_sequences >= ADVANCE*2 && fse.offsets_long_share >= MIN_LONG_OFFSET_SHARE` — one branch + one load, no table walk.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

…ch_source coverage Three doc / scope updates plus one new test module section in response to CR notes on the prefetch pipeline: - Reword the 'offset_hist mutates here and only here' comment to acknowledge that the post-loop rollback path also assigns to it (recovery, not the normal mutation site that the comment was about). - Reword the 'AT MOST two prefetch issues per match' comment to acknowledge the wrap-boundary case where the 128 B budget is split across s1_tail + s2[0..], emitting up to four cache-line prefetches total (still bounded, still L1, still under the helper's MAX_LINES = 4 ceiling). - Add three unit tests for prefetch_lookahead_match_source covering in-range probes, out-of-range / empty-buffer probes (early-return bound check), and the wrap-boundary layout where as_slices() returns two non-empty halves (exercises the prefetch_first_line_l1 short-tail fallback path).

…ring (A+B+C) Three donor-vs-ours overhead trims in the pipelined decode hot path, each gated by the existing FSE-long-share pipeline-engage decision so the short-block fused path keeps its existing repeat() semantics unchanged. A. New `DecodeBuffer::repeat_lookahead_prefetched` variant. The regular `repeat` issues an in-loop `prefetch_match_source` (T1 hint) at the match source — sensible for the no-pipeline path that has no other prefetch source. When the long pipeline is active, the lookahead PREFETCH_L1 was already issued ADVANCE iterations earlier; the second in-loop prefetch is pure issue-port pressure on top of a now-warm L1 line. The pipelined variant drops that second prefetch. B. Pre-resolved offset stored in the ring slot. Ring goes from `[Sequence; 8]` to `[(Sequence, u32); 8]`. The decode-ahead phase already walks `shadow_hist` via `do_offset_history` to resolve the prefetch's source offset exactly; we now store that resolved offset alongside the raw sequence. The execute phase consumes the pre-resolved value via a new `execute_one_sequence_pipelined` helper and skips `do_offset_history` entirely — saves one function call and one cache write on the real `offset_hist` per sequence (real hist was the only mutable shared state in the execute path; pipelined now leaves it strictly local until the single end-of-block sync). The committing point becomes `*offset_hist = shadow_hist` ONCE after a successful drain. Rollback on post-loop bitstream failure still restores from `saved_offset_hist` — the new shape just makes that restore a no-op on the pipelined path (real hist was never mutated mid-loop) while preserving correctness on the non-pipelined fallback. C. `repeat_lookahead_prefetched` also drops the per-call `self.buffer.reserve(match_length)`. The pipelined caller already reserved `MAX_BLOCK_SIZE` upfront before entering the pipeline, so each per-sequence reserve was a redundant capacity-check branch. The variant relies on that upfront reserve exclusively — saves one branch per executed sequence.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

…of offsets_long_share Three CR findings on the A+B+C pipelined-repeat patch: - repeat_lookahead_prefetched skipped buffer.reserve(match_length). On malformed input where match_length expands past the upfront reserve(MAX_BLOCK_SIZE), extend_from_within_unchecked* assumes the destination capacity and only debug_assert checks it — release builds would have written out of bounds. The reserve is amortised by the caller's upfront reservation anyway, so this is just a cheap capacity-check branch on the hot path, not an allocation. Restore it; only the in-loop prefetch_match_source stays skipped (that's the actual win). - Rename the SKIP_PREFETCH const param to match what it actually controls now that reserve is unconditional. - offsets_long_share doc said the cached value counts codes '≥ LONG_OFFSET_CODE_THRESHOLD' but the helper uses strictly '>' (donor parity — '> 22' means codes 23..MAX, i.e. raw offsets ≥ 2^23). Fix the doc string. - DecoderScratch::reset() reset every FSE-related field except offsets_long_share, leaving a stale gate signal across scratch reuse. A new frame opening with Repeat-mode tables would inherit the previous frame's pipeline-engage decision. Reset alongside ll_rle / ml_rle / of_rle.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

…hape + offsets_long_share tests Three doc + test additions in response to CR notes on the A+B+C pipelined-repeat patch: - The execute_one_sequence_pipelined doc claimed it skipped both prefetch and reserve, but repeat_lookahead_prefetched keeps the reserve (memory-safety against malformed match_length). Rewrite the doc so only the in-loop prefetch is described as skipped. - The on-stack ring comment still said '[Sequence; 8] = 96 bytes' after the A+B+C refactor stored '(Sequence, u32)' pairs (128 B on a 64-bit target). Update the size + element type description. - Add four unit tests for compute_offsets_long_share covering: pure short-offset tables resolve to zero share; symbols at the threshold (=22) don't count (strict-greater predicate, donor parity); raw count scales by (OFFSET_FSE_LOG - table_log) so a single long-code entry in a 32-entry table lands at share 8; raw count passes through unscaled when table is already at OFFSET_FSE_LOG = 8. Uses a synthetic_offsets_table helper that bypasses build_from_probabilities — the share helper only reads decode[*].symbol + accuracy_log, so we can construct the table by hand without a full bitstream.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Dictionary-init flow (`Dictionary::decode_dict` → `DecoderScratch::init_from_dict` → `FSEScratch::reinit_from`) builds the offsets FSE table from the dictionary's entropy section without ever calling the sequence-decoder path that updates the cache, so `other.offsets_long_share` was the default 0. Verbatim-copying that into `self.offsets_long_share` left Repeat-mode blocks (which reuse the dictionary's offsets table) gated out of the long-pipeline path regardless of the table's actual offset distribution. Recompute the share inside `reinit_from` from the just-copied offsets table via the existing `compute_offsets_long_share` helper (bumped from private to `pub(crate)`). The walk is bounded by `OF_MAX_LOG = 8` (≤ 256 entries) and runs once per scratch reinit — negligible vs the rest of the per-frame setup.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

The pipelined branch deferred the real-`offset_hist` commit to a single `*offset_hist = shadow_hist` at the end of a successful drain. On an `execute_one_sequence_pipelined` Err propagated via `?` (NotEnoughBytesForSequence, ZeroOffset, OOB match copy) that commit never ran — the function exited with `*offset_hist` still at its pre-block value while the buffer had N-1 partial writes. That diverged from the non-pipelined path: `execute_one_sequence` mutates real hist in lockstep per executed sequence, so on mid-loop Err the caller sees N-1 hist updates + N-1 buffer writes (in lockstep, internally consistent). The pipelined path left the two out of sync, breaking scratch reuse after any Err. Wrap the entire pipelined work in an IIFE that returns Result, so a single `if let Err(e) = pipeline_result` site at the end catches mid-loop Errs uniformly and routes them through the same buffer-checkpoint-rollback path the post-loop bitstream validation already used. `*offset_hist = saved_offset_hist` after the rollback is a no-op on the pipelined path (real hist was never touched mid-loop) but stays explicit so future refactors that move the commit earlier remain safe.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

polaz · 2026-05-22T17:57:09Z

+    if use_long_pipeline {
+        // The pipelined branch must roll `offset_hist` back to
+        // `saved_offset_hist` on ANY mid-loop error, not just the
+        // post-loop bitstream-validation path. Without this, an
+        // `execute_one_sequence_pipelined` Err (NotEnoughBytesForSequence
+        // / ZeroOffset / OOB match) propagated via `?` would exit with
+        // `*offset_hist` still at its pre-block value while the buffer
+        // had N-1 partial writes — diverging from the non-pipelined
+        // path (which mutates hist in lockstep per executed sequence)
+        // and leaving scratch internally inconsistent for any
+        // post-Err reuse. Wrap the entire pipelined work in an IIFE so
+        // a single rollback site catches all mid-loop Errs uniformly.
+        let pipeline_result: Result<(), DecompressBlockError> = (|| {
+            // `prefetch_pos` is the logical buffer index (same frame as
+            // `buffer.len()`) at which the NEXT not-yet-decoded sequence
+            // will start pushing literals. We pre-decode `ADVANCE` ahead, so we
+            // accumulate (ll + ml) per decoded seq to keep this position
+            // synchronised with where execute will eventually be.
+            let mut prefetch_pos: usize = old_buffer_size;
+            // Shadow copy of `offset_hist`, advanced by
+            // `do_offset_history` for every decoded-ahead sequence. The
+            // REAL `offset_hist` is only mutated inside
+            // `execute_one_sequence` (preserving the legacy 'partial
+            // output, no rewound history' rollback contract), but the
+            // prefetch needs the exact post-resolution offset for repcode
+            // 1..=3 cases that read history — a stale read would skip
+            // the long-distance prefetch precisely when a fresh huge
+            // offset is followed by a repcode that aliases it. The
+            // shadow is a local `[u32; 3]` (12 bytes) so the simulation cost
+            // is negligible.
+            let mut shadow_hist: [u32; 3] = *offset_hist;
+            // Stack ring of `(decoded_seq, resolved_offset)` pairs. The
+            // decode-ahead phase resolves repcodes against `shadow_hist`
+            // and stores the resolved offset alongside the raw sequence,
+            // so the execute phase consumes a pre-resolved offset and
+            // skips `do_offset_history` entirely — saves one function
+            // call + one cache write on real `offset_hist` per sequence.
+            // The real `offset_hist` is updated ONCE from `shadow_hist`
+            // after a successful drain (below); on a malformed-block
+            // rollback the saved snapshot is restored, so real hist is
+            // never observed in a partial mid-pipeline state.
+            let mut ring: [(Sequence, u32); ADVANCE] = [(
+                Sequence {
+                    ll: 0,
+                    ml: 0,
+                    of: 0,
+                },
+                0u32,
+            ); ADVANCE];
+
+            // Pre-fill the ring. The outer `num_sequences >= ADVANCE * 2`
+            // gate guarantees `num_sequences > ADVANCE`, so the FSE
+            // state update is needed after every prefill decode — no
+            // `isLastSeq` guard required here, only in the steady-state
+            // loop where `i + 1 == num_sequences` is reachable.
+            for slot in ring.iter_mut() {
+                let seq =
+                    decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br);
+                // EXACT actual_offset via shadow history.
+                let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist);
+                // wrapping_add: prefetch_pos / seq.ll / seq.ml are
+                // derived from the bitstream, so a malformed frame can
+                // present values that would overflow usize and panic
+                // under debug. The result feeds only the prefetch
+                // hint — `prefetch_lookahead_match_source` bound-checks
+                // the logical position against `buffer.len()` and drops
+                // wrap-derived garbage indices, so the wrap is harmless
+                // here while keeping the decoder fuzz-stable.
+                let match_start = prefetch_pos.wrapping_add(seq.ll as usize);
+                let source_idx = match_start.wrapping_sub(actual_offset as usize);
+                buffer.prefetch_lookahead_match_source(source_idx);
+                prefetch_pos = match_start.wrapping_add(seq.ml as usize);
+                *slot = (seq, actual_offset);
+                br.ensure_bits(max_update_bits);
+                ll_dec.update_state_fast(&mut br);
+                ml_dec.update_state_fast(&mut br);
+                of_dec.update_state_fast(&mut br);
+            }
+
+            // Steady state: decode next, prefetch its source, execute
+            // the oldest slot in the ring (with its pre-resolved offset).
+            for i in ADVANCE..num_sequences {
+                let seq =
+                    decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br);
+                let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist);
+                let match_start = prefetch_pos.wrapping_add(seq.ll as usize);
+                let source_idx = match_start.wrapping_sub(actual_offset as usize);
+                buffer.prefetch_lookahead_match_source(source_idx);
+                prefetch_pos = match_start.wrapping_add(seq.ml as usize);
+
+                let slot = i & ADVANCE_MASK;
+                let (exec_seq, exec_offset) = ring[slot];
+                ring[slot] = (seq, actual_offset);
+
+                execute_one_sequence_pipelined(
+                    buffer,
+                    literals_buffer,
+                    &mut lit_cur,
+                    literals_buffer_len,
+                    exec_seq,
+                    exec_offset,
+                )?;
+                seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml);
+
+                if i + 1 < num_sequences {
+                    br.ensure_bits(max_update_bits);
+                    ll_dec.update_state_fast(&mut br);
+                    ml_dec.update_state_fast(&mut br);
+                    of_dec.update_state_fast(&mut br);
+                }
+            }
+
+            // Drain: execute remaining ADVANCE sequences with their
+            // pre-resolved offsets. Iteration order matches the ring
+            // slot they occupy from the steady-state loop's final write.
+            for k in 0..ADVANCE {
+                let slot = (num_sequences + k) & ADVANCE_MASK;
+                let (exec_seq, exec_offset) = ring[slot];
+                execute_one_sequence_pipelined(
+                    buffer,
+                    literals_buffer,
+                    &mut lit_cur,
+                    literals_buffer_len,
+                    exec_seq,
+                    exec_offset,
+                )?;
+                seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml);
+            }
+
+            // Single committing point for real offset history on the
+            // pipelined success path. Shadow walked every queued
+            // sequence already; copy that state back so the next
+            // block sees the post-block repcodes. Rollback on a later
+            // bitstream-failure overwrites this with
+            // `saved_offset_hist`, undoing the commit.
+            *offset_hist = shadow_hist;
+            Ok(())
+        })();
+        if let Err(e) = pipeline_result {
+            // Mid-loop execute Err: rollback buffer + hist so post-Err
+            // scratch reuse stays consistent. `*offset_hist` is still
+            // at its pre-block value (the success-only commit above
+            // never ran), so restoring from `saved_offset_hist` is
+            // effectively a no-op on the hist side — the explicit
+            // assignment makes the intent unambiguous and protects
+            // against any future refactor that moves the commit
+            // earlier in the pipelined flow.
+            if buffer.try_restore_checkpoint(buffer_checkpoint) {
+                *offset_hist = saved_offset_hist;
+            }
+            return Err(e);


The rollback-on-Err shape here is intentional donor parity, not a divergence.

Donor ZSTD_decompressSequencesLong_body (lib/decompress/zstd_decompress_block.c:1832-1990) commits repcodes ONCE at end of block, AFTER RETURN_ERROR_IF(!BIT_endOfDStream(...)):

/* save reps for next block */ { U32 i; for (i=0; i<ZSTD_REP_NUM; i++) dctx->entropy.rep[i] = (U32)(seqState.prevOffset[i]); }

If any mid-loop ZSTD_execSequence returns an error (oneSeqSize is error → if (ZSTD_isError(oneSeqSize)) return oneSeqSize;), or the post-loop bitstream-end check fails, that save never runs and dctx->entropy.rep[i] stays at its pre-block value. The same applies to ZSTD_decompressSequences_body (zstd_decompress_block.c:1717-1797) — same end-of-block save pattern, same skip on early return.

Our non-pipelined execute_one_sequence currently mutates real offset_hist per sequence (an older pattern from the pre-fused-executor era). That's the path that diverges from donor — not the pipelined path I just landed. Bringing the non-pipelined path to donor parity is in scope of a separate cleanup, not this PR.

On the try_restore_checkpoint cap-changed edge case: the conditional if buffer.try_restore_checkpoint(...) { *offset_hist = saved_offset_hist; } is the same shape used by the existing post-loop bitstream-validation rollback (lines 426-428, pre-existing). Both branches share the "only restore hist when buffer rollback actually succeeded" rule so the hist+buffer pair stays in lockstep — restoring hist while leaving buffer partial would be the actual inconsistency. The cap-changed case implies the frame is already malformed past MAX_BLOCK_SIZE; the partial data left in the buffer is discarded by the propagated Err anyway.

Copilot AI review requested due to automatic review settings May 22, 2026 09:19

Copilot started reviewing on behalf of polaz May 22, 2026 09:19 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

polaz added 2 commits May 22, 2026 12:29

Merge branch 'main' into perf/#208-match-prefetch

07d0e26

Copilot AI review requested due to automatic review settings May 22, 2026 09:39

Copilot started reviewing on behalf of polaz May 22, 2026 09:39 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz requested a review from Copilot May 22, 2026 10:49

Copilot started reviewing on behalf of polaz May 22, 2026 10:49 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz requested a review from Copilot May 22, 2026 11:06

Copilot started reviewing on behalf of polaz May 22, 2026 11:06 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/decode_buffer.rs Outdated

polaz requested a review from Copilot May 22, 2026 11:37

Copilot started reviewing on behalf of polaz May 22, 2026 11:37 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/decode_buffer.rs

polaz added 2 commits May 22, 2026 14:51

polaz requested a review from Copilot May 22, 2026 12:07

polaz requested a review from Copilot May 22, 2026 13:48

Copilot started reviewing on behalf of polaz May 22, 2026 13:48 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/decode_buffer.rs

Comment thread zstd/src/decoding/decode_buffer.rs

polaz added 2 commits May 22, 2026 17:18

Copilot AI review requested due to automatic review settings May 22, 2026 14:23

Copilot started reviewing on behalf of polaz May 22, 2026 14:23 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/decode_buffer.rs Outdated

Comment thread zstd/src/decoding/scratch.rs Outdated

Comment thread zstd/src/decoding/scratch.rs

polaz added 2 commits May 22, 2026 17:32

Merge branch 'main' into perf/#208-match-prefetch

45be345

Copilot AI review requested due to automatic review settings May 22, 2026 15:08

Copilot started reviewing on behalf of polaz May 22, 2026 15:08 View session

polaz mentioned this pull request May 22, 2026

Roadmap: structured-zstd feature parity with C zstd #28

Open

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz requested a review from Copilot May 22, 2026 16:28

Copilot started reviewing on behalf of polaz May 22, 2026 16:29 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/scratch.rs

polaz requested a review from Copilot May 22, 2026 16:52

Copilot started reviewing on behalf of polaz May 22, 2026 16:52 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated

polaz requested a review from Copilot May 22, 2026 17:20

Copilot started reviewing on behalf of polaz May 22, 2026 17:20 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

polaz merged commit d7bb566 into main May 22, 2026
25 of 26 checks passed

polaz deleted the perf/#208-match-prefetch branch May 22, 2026 18:04

sw-release-bot Bot mentioned this pull request May 22, 2026

chore: release v0.0.23 #203

Open

Conversation

polaz commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

zstd/src/decoding/decode_buffer.rs

zstd/src/decoding/sequence_section_decoder.rs

zstd/src/decoding/prefetch.rs

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Uh oh!

codecov Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

polaz commented May 22, 2026

Uh oh!

coderabbitai Bot commented May 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

polaz commented May 22, 2026 •

edited by coderabbitai Bot

Loading

`zstd/src/decoding/decode_buffer.rs`

`zstd/src/decoding/sequence_section_decoder.rs`

`zstd/src/decoding/prefetch.rs`

coderabbitai Bot commented May 22, 2026 •

edited

Loading

codecov Bot commented May 22, 2026 •

edited

Loading