Skip to content

perf(decode): 8-slot software prefetch pipeline for sequence execution (#208)#227

Merged
polaz merged 23 commits into
mainfrom
perf/#208-match-prefetch
May 22, 2026
Merged

perf(decode): 8-slot software prefetch pipeline for sequence execution (#208)#227
polaz merged 23 commits into
mainfrom
perf/#208-match-prefetch

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 22, 2026

Summary

Port the donor ZSTD_decompressSequencesLong_body shape to the fused decode+execute path: 8-slot software pipeline (donor STORED_SEQS = 8) that decodes ADVANCE sequences ahead of execute, prefetching each match source line at decode time. By the time buffer.repeat() reaches the actual load for seq[i], the prefetch issued ADVANCE iterations earlier has had time to pull the line into L1 — hiding DRAM latency on long-distance matches whose source is beyond cache residency.

Closes #208.

Changes

zstd/src/decoding/decode_buffer.rs

  • New prefetch_lookahead_match_source(start_idx) — issues the prefetch hint at a logical buffer position, bound-checks against buffer.len() so out-of-range indices (intra-block self-overlap, wrapping_sub underflow on a malformed sequence, dictionary-sourced matches predating the current frame) silently drop the hint.
  • Wrap-boundary fallback: when the match source straddles the ring's as_slices() boundary, prefetches the s1 tail and tops up from s2[0..] so the donor's "up to two cache lines per match" intent doesn't collapse to one (or zero) at the wrap. Short-tail (< 64 B) cases fall back to prefetch_first_line_l1 so the cache line containing start_idx is always hinted.
  • Routed through prefetch_slice (_MM_HINT_T0 / pldl1keep → L1) matching donor PREFETCH_L1.

zstd/src/decoding/sequence_section_decoder.rs

  • 8-slot software pipeline: prefill ADVANCE → steady (decode N, prefetch using a shadow_hist mirror of offset_hist, execute N−ADVANCE) → drain ADVANCE.
  • Ring stores raw [Sequence; 8]. do_offset_history runs both on the local shadow_hist during decode-ahead (for exact, not estimated, repcode resolution in the prefetch path) AND on the real offset_hist in execute_one_sequence. The real offset_hist is rolled back from saved_offset_hist if the post-loop bitstream check rejects the block, preserving the legacy "partial output, no rewound history" rollback contract on errors.
  • Donor ZSTD_getOffsetInfo parity gate: pipeline only engages when num_sequences >= 2 * ADVANCE AND the FSE offsets table shape shows enough long-offset codes — donor's minShare (7/256 on 64-bit, 20/256 on 32-bit). The share is computed once when the offsets table is rebuilt (FSE / Predefined modes in maybe_update_fse_tables) and cached in FSEScratch::offsets_long_share; Repeat-mode blocks reuse the cached value without re-walking the table.
  • Compile-time guard on ADVANCE.is_power_of_two() so the i & ADVANCE_MASK ring indexing stays equivalent to i % ADVANCE if ADVANCE is ever changed.

zstd/src/decoding/prefetch.rs

  • New prefetch_first_line_l1 helper that issues exactly one L1 prefetch at the first byte of a slice regardless of slice length. prefetch_slice no-ops on slices < CACHE_LINE; that's right for bulk prefetch but wrong for the wrap-boundary case where the cache line containing start_idx is still the line we need warmed. x86_64 / x86 / aarch64 implementations.

Test plan

  • full nextest suite — 576 passed, 3 skipped
  • lint+typecheck clean
  • Bench on Linux server (i9-9900K, BMI2/AVX2) on level_3 + level_19 corpora — mixed results, bench variance dominates over the small expected gains on this hardware
  • Flamegraph on large-window corpus to confirm DRAM stall reduction

Summary by CodeRabbit

  • Performance
    • Pipelined sequence execution for large blocks to boost decompression throughput
    • Lookahead prefetching and L1-first-line prefetching to reduce memory latency and improve cache efficiency
  • New Features
    • Tunable pipeline gating based on offset-table characteristics for steadier hot-path execution
  • Tests
    • Unit tests validating lookahead prefetch behavior and offset-share computation
  • Documentation
    • Clarified offset-history and pipeline-gate semantics in decoder docs

Review Change Stack

Port the donor ZSTD_decompressSequencesLong_body shape to the fused
decode+execute path: 4-slot software pipeline that decodes 4
sequences ahead of execute, prefetching each match source line at
decode time. By the time buffer.repeat() reaches the actual load for
seq[i], the prefetch issued 4 iterations earlier has had time to
pull the line into L1/L2 — hiding DRAM latency on long-distance
matches whose source is beyond cache residency.

decode_buffer.rs:

- New prefetch_lookahead_match_source(start_idx): issues the
  prefetch hint at a logical buffer position, bound-checks against
  buffer.len() so out-of-range indices (intra-block self-overlap,
  wrapping_sub underflow on a malformed sequence, dictionary-sourced
  matches predating the current frame) silently drop the hint.
  Prefetches up to two 64 B cache lines per call (donor PREFETCH
  parity), bounded by PREFETCH_EXTENT = 128.

sequence_section_decoder.rs:

- 4-slot software pipeline: prefill 4 → steady (decode N, prefetch
  using a read-only offset estimate, execute N−4) → drain 4.
- Ring stores raw [Sequence; 4]. do_offset_history stays in
  execute_one_sequence so a mid-loop error preserves the legacy
  'partial output, no history rewind' semantics — the in-flight
  ring slots haven't mutated offset_hist yet.
- For prefetch addressing only, a separate estimate_actual_offset
  resolves the seq's actual_offset against the current offset_hist
  WITHOUT mutating it. The estimate is exact for seq.of >= 4 (every
  fresh non-repcode offset — the dominant case on long-distance
  workloads); for repcode 1..=3 it mirrors do_offset_history_repcode's
  selection table read-only, lagging by at most ADVANCE sequences
  on close-range matches whose source is already L1-warm.
- Opt-in by num_sequences >= 2 * ADVANCE. Short blocks fall back to
  the fused single-pass loop (existing in-loop prefetch retained).
- Compile-time guard on ADVANCE.is_power_of_two() so the
  i & ADVANCE_MASK ring indexing stays equivalent to i % ADVANCE.

Donor uses STORED_SEQS = 8; this PR starts at 4 to keep ring on the
stack within register pressure. Future iterations can dial up after
profiling confirms headroom.

Closes #208.
Copilot AI review requested due to automatic review settings May 22, 2026 09:19
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Warning

Rate limit exceeded

@polaz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 56 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f6e34e23-f57d-43f9-a518-5f8fbd2c80b6

📥 Commits

Reviewing files that changed from the base of the PR and between 72e0754 and 64ea74a.

📒 Files selected for processing (1)
  • zstd/src/decoding/sequence_section_decoder.rs
📝 Walkthrough

Walkthrough

Adds a 4-stage lookahead predecode/prefetch pipeline for long-distance match reads: new DecodeBuffer::prefetch_lookahead_match_source issues L1/L2 prefetches, decode_and_execute_sequences conditionally uses an 8-slot decode-ahead ring with resolved offsets and lookahead prefetching, and supporting primitives and metadata caching are added.

Changes

Match-source prefetch pipeline for long-distance matches

Layer / File(s) Summary
DecodeBuffer repeat refactor & lookahead prefetch
zstd/src/decoding/decode_buffer.rs
Refactors repeat into repeat_inner<const SKIP_PREFETCH>, adds repeat_lookahead_prefetched, implements prefetch_lookahead_match_source with wrap-aware as_slices() prefetching (up to 128 bytes), and adds unit tests for in-range, out-of-range, empty, and wrap cases.
Prefetch primitive implementations
zstd/src/decoding/prefetch.rs
Adds prefetch_first_line_l1(slice) wrapper and per-architecture prefetch_first_line_l1_impl(ptr) implementations for x86_64, x86 (SSE), and aarch64, plus a no-op fallback.
FSE scratch: offsets_long_share cache
zstd/src/decoding/scratch.rs
Adds offsets_long_share: u32 to FSEScratch, initializes/reset to 0 in constructors and reset, and recomputes it during reinit_from.
Sequence decoder pipeline and metadata
zstd/src/decoding/sequence_section_decoder.rs
Documents offset_hist mutation semantics, adds execute_one_sequence_pipelined, implements conditional 8-slot ring decode-ahead pipeline with shadow_hist offset resolution and lookahead prefetch, adds compute_offsets_long_share, and updates maybe_update_fse_tables to populate the cached value.

Sequence Diagram

sequenceDiagram
  participant Decoder as decode_and_execute_sequences
  participant Ring as SequenceRing[8]
  participant Shadow as shadow_hist
  participant Prefetch as prefetch_lookahead_match_source
  participant Executor as execute_one_sequence_pipelined / repeat

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Ring: Prefill phase (i = 0 .. ADVANCE-1)
    loop i = 0 to ADVANCE-1
      Decoder->>Ring: decode_sequence[i] -> ring_slot[i]
      Decoder->>Shadow: resolve offset for sequence[i] -> resolved_offset[i]
      alt resolved_offset > threshold
        Decoder->>Prefetch: prefetch match source for resolved_offset[i]
      end
    end
  end

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Executor: Steady-state pipeline
    loop j = ADVANCE to num_sequences - 1
      Decoder->>Ring: decode_sequence[j] -> ring_slot[j % ADVANCE]
      Decoder->>Shadow: resolve offset for sequence[j]
      alt resolved_offset > threshold
        Decoder->>Prefetch: prefetch match source for resolved_offset[j]
      end
      Decoder->>Executor: execute sequence[j - ADVANCE] using resolved_offset (no history mutation per-iteration)
    end
  end

  rect rgba(100, 150, 200, 0.5)
    Note over Decoder,Executor: Drain remaining ring slots
    loop drain remaining slots
      Decoder->>Executor: execute remaining sequence[k]
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🐰
I hop ahead to warm cold lanes,
I plant soft hints where cache remains,
Four little steps, a tiny ring,
Then bytes arrive and decoders sing,
I nibble latency, speed gains.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf(decode): 8-slot software prefetch pipeline for sequence execution' directly and accurately describes the main change: introducing an 8-slot prefetch pipeline for performance improvement in sequence decoding.
Linked Issues check ✅ Passed The PR successfully addresses all coding requirements from issue #208: implements the multi-stage software prefetch pipeline (8-slot vs. 4-stage proposed), adds lookahead prefetch helpers with bounds checking, gates the pipeline based on offset-table long-share metrics and sequence count thresholds, and maintains correctness with proper offset history handling across error paths.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the prefetch pipeline: decode_buffer refactoring for lookahead prefetch support, sequence_section_decoder pipelined execution logic, prefetch.rs helper functions, and scratch.rs cache field for pipeline gating. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#208-match-prefetch

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 91.48936% with 24 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/decoding/sequence_section_decoder.rs 92.97% 13 Missing ⚠️
zstd/src/decoding/decode_buffer.rs 87.95% 10 Missing ⚠️
zstd/src/decoding/prefetch.rs 88.88% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Two doc consistency findings from the pre-squash review:

- execute_one_sequence comment said 'the pipeline above intentionally
  defers...' but the pipelined decode loop is BELOW this helper
  (execute_one_sequence is a nested fn defined at the top of the
  outer function; the loop using it comes after). Reword as
  'pipelined decode loop below' so the directional reference
  matches the file layout.

- prefetch_lookahead_match_source's body comment claimed 'donor
  PREFETCH_L1 parity' but the implementation calls
  prefetch_slice_t1, which uses _MM_HINT_T1 (L2 destination) on
  x86. Document the deviation explicitly: ADVANCE iterations of
  decode sit between prefetch issue and the matching repeat()
  load, so an L1-bound line risks eviction before consumption —
  L2 has the capacity to keep the line resident across that
  window. The pre-existing in-loop prefetch_match_source makes
  the same T1 choice for the same reason; surfacing it here
  prevents the same review comment next iteration.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ports the donor zstd “long sequences” shape into the fused decode+execute path by introducing a 4-slot software pipeline that decodes sequences ahead of execution and issues prefetch hints early to better hide DRAM latency on long-distance match copies.

Changes:

  • Adds a 4-slot decode-ahead / execute-behind ring pipeline in decode_and_execute_sequences, with prefill/steady-state/drain phases.
  • Introduces a new DecodeBuffer::prefetch_lookahead_match_source(start_idx) helper that bounds-checks a logical buffer index and prefetches up to 128 bytes (two cache lines).
  • Keeps offset-history mutation in the execution step and adds a read-only repcode offset estimator used only for prefetch addressing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
zstd/src/decoding/sequence_section_decoder.rs Implements the 4-slot software pipeline and offset-estimation logic for lookahead prefetching.
zstd/src/decoding/decode_buffer.rs Adds a bounded lookahead prefetch helper for match-source addresses computed ahead of execution.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
polaz added 2 commits May 22, 2026 12:29
…uard

Two findings from the post-squash review:

- Issue #208 explicitly calls for the prefetch pipeline to be
  triggered only when the match offset exceeds an L1/L2 residency
  proxy (~32 KiB), so short-offset matches whose source is already
  warm don't burn issue-port pressure. The pipeline was prefetching
  every sequence unconditionally — add a PREFETCH_OFFSET_THRESHOLD
  (32 KiB) gate to both the prefill and steady-state loops, using
  the existing read-only est_offset (exact for seq.of >= 4 which
  covers every fresh long-distance match the gate targets).

- The prefill loop's "if k + 1 < num_sequences" FSE-state-update
  guard was dead under the outer "num_sequences >= ADVANCE * 2"
  precondition (k < ADVANCE implies k + 1 <= ADVANCE < num_sequences),
  so the update fires every iteration regardless. Drop the guard
  and update the comment so the redundancy doesn't read as a
  pending edge case to a future maintainer; the steady-state loop
  keeps its conditional because i + 1 == num_sequences is reachable
  there.
Copilot AI review requested due to automatic review settings May 22, 2026 09:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
…-stable

prefetch_pos / match_start derived `seq.ll`, `seq.ml` straight from
the bitstream. On a malformed/corrupt frame those values can be
huge enough to overflow usize when added, panicking debug builds
(fuzz, CI). The result feeds only the prefetch hint — and
prefetch_lookahead_match_source bound-checks the logical position
against buffer.len() so a wrap-derived garbage index is silently
dropped. Switch the two cursor increments (prefill loop +
steady-state loop, four sites in total) to wrapping_add so a
malformed input becomes a bogus-but-dropped prefetch instead of a
panic.
@polaz
Copy link
Copy Markdown
Member Author

polaz commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 230-342: The pipelined branch is entered solely on num_sequences
>= ADVANCE * 2 even when no sequence actually needs a prefetch (all est_offset <
PREFETCH_OFFSET_THRESHOLD), causing unnecessary ring/prefill/drain overhead;
change the logic around the ADVANCE pipeline (the prefill loop that calls
decode_one_sequence_inline and estimate_actual_offset, the steady-state loop
using ring and execute_one_sequence, and the final drain) to detect whether any
estimated offset crosses PREFETCH_OFFSET_THRESHOLD (track a bool like
saw_prefetch_needed while pre-filling/decoding the first ADVANCE sequences), and
only take the pipelined path (using ring, prefetch_pos,
buffer.prefetch_lookahead_match_source, etc.) when that flag is true—otherwise
fall back to the single-pass loop that decodes and immediately calls
execute_one_sequence for each sequence without prefill/ring overhead. Ensure the
same FSE state updates (ll_dec/ml_dec/of_dec.update_state_fast and
br.ensure_bits) remain consistent when choosing either path.
- Around line 205-227: The decode-ahead path uses the live offset_hist in
estimate_actual_offset (and the adjacent logic around the
PREFETCH_OFFSET_THRESHOLD check) but never advances the decode-side shadow for
ADVANCE-queued sequences, so repcodes after a large new offset see stale history
and can suppress long-distance prefetches; fix by introducing and advancing a
decode-side shadow history copy (the same structure used by the live resolver)
during decode-ahead/ADVANCE handling and use that shadow both inside
estimate_actual_offset and for the PREFETCH_OFFSET_THRESHOLD decision (and the
equivalent checks at the other nearby blocks referenced) so the prefetch
estimate and threshold check reflect queued updates consistently with
do_offset_history_repcode behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3ee2d816-89fb-4437-ad1e-af51584b3dfa

📥 Commits

Reviewing files that changed from the base of the PR and between c22b8b1 and a3f4a78.

📒 Files selected for processing (2)
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/sequence_section_decoder.rs

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
… path

Two CR Major findings on the prefetch pipeline:

1. Stale repcode history could suppress the long-distance prefetch.
   The previous estimate read the live offset_hist but never advanced
   it for the ADVANCE-queued sequences, so a fresh huge offset
   immediately followed by a repcode (1..=3) would see pre-update
   history and resolve to the OLD value — exactly the scenario the
   threshold gate then refused to prefetch, killing the gain on the
   workloads this pipeline targets.

   Replace estimate_actual_offset() with a local shadow_hist:[u32;3]
   advanced by do_offset_history() per decode-ahead. The real
   offset_hist still only mutates inside execute_one_sequence (the
   single committing site, preserving rollback semantics), but the
   shadow gives the prefetch path the exact post-resolution value
   for every repcode case. estimate_actual_offset is now removed.

2. Pipelined path ran on large all-short-offset blocks too.
   num_sequences >= ADVANCE * 2 was the only gate, so a long block
   of close-range matches paid ring/prefill/drain overhead while
   issuing zero prefetches.

   Track saw_prefetch_needed during prefill — flips true on the first
   actual_offset >= PREFETCH_OFFSET_THRESHOLD. If it stays false
   through all ADVANCE prefills (the block is dominated by warm
   short-offset matches), drain the queued ring sequences in order
   and continue with the single-pass loop for the rest, skipping
   the ring overhead entirely. If even one prefill sequence crosses
   the threshold, stay on the pipelined steady-state path.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs Outdated
…ail-slice guard

Three Copilot consistency findings on the previous shadow_hist patch:

- The PREFETCH_OFFSET_THRESHOLD comment still described the gate as
  reading the `est_offset` variable that no longer exists — the
  current code computes `actual_offset` via do_offset_history on the
  shadow hist. Reword to match.
- The shadow_hist allocation note called it a "usize-triple" but it
  is actually `[u32; 3]` (12 bytes, not 24). Fix the size claim so
  cost reasoning is accurate.
- prefetch_lookahead_match_source's else branch had an explicit
  `if idx < s2.len()` guard around the s2 prefetch path. With the
  top-of-function `start_idx >= self.buffer.len()` early-return and
  `buffer.len() == s1.len() + s2.len()`, the else branch (entered
  only when start_idx >= s1.len()) guarantees idx < s2.len() by
  construction — the explicit guard was a dead branch in a hot
  helper. Drop it, add a comment explaining the invariant.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs
polaz added 2 commits May 22, 2026 14:51
…donor parity)

Bench against main showed the pipeline as it stood (gated only by
num_sequences >= 8 + per-seq actual_offset >= 32 KiB) regressed
high-entropy / random-data decode by 2-5% while the target
long-window workload (decodecorpus) was essentially zero. Reading
donor `ZSTD_decompressBlock_internal` revealed we'd missed the
front-of-block selector entirely: donor decides Long-vs-Short
pipeline ONCE per block from the offsets FSE table shape, not
per-sequence.

Implement the donor `ZSTD_getOffsetInfo` equivalent:

- Walk fse.offsets.decode up-front; count entries whose offset code
  is > 22 (raw offset >= 2^23 = 8 MiB).
- Scale the count to donor's OffFSELog = 8 (256-entry reference)
  via `raw << (OFFSET_FSE_LOG - table_log)` so a finely-tuned table
  still registers meaningful share.
- Engage the Long pipeline only when `num_sequences >= ADVANCE * 2`
  AND `scaled_share >= 7` (~2.73 % of the table — donor's 64-bit
  `minShare`). Otherwise drop to the single-pass loop.

Side cleanups the new gate makes redundant:

- The mid-block `saw_prefetch_needed` opt-out is gone — the FSE-table
  decision is taken before any sequence decode and applies to the
  whole block, addressing the CR Major (the previous opt-out
  couldn't engage the pipeline for blocks whose long-distance
  offsets only appeared past the first ADVANCE prefills).
- The per-sequence `PREFETCH_OFFSET_THRESHOLD` constant is gone —
  donor `ZSTD_decompressSequencesLong_body` prefetches every queued
  slot unconditionally inside the Long pipeline. Per-seq gating
  risked skipping the prefetch for the slot whose source was the
  actual cache miss; gone for donor parity + simpler control flow.
…he ring wrap

Previously `prefetch_lookahead_match_source` issued the prefetch
hint from exactly one of the wrap-aware ring buffer's two slices.
When the match source landed near the end of `s1` and the tail
in `s1` was shorter than `PREFETCH_EXTENT` (128 B), the helper
prefetched only what fit in s1 and the rest of the line silently
went unfetched — the donor "up to two cache lines per match"
intent collapsed to one or even zero useful prefetch hints at
exactly the wrap boundary.

Top up from `s2[0..]` when the s1 tail is shorter than the
target extent, so the wrap case ends up with the same total
prefetch coverage as a non-wrap case.
@polaz polaz requested a review from Copilot May 22, 2026 12:07
…on table rebuild

The per-block walk of the offsets FSE decode table (counting entries
whose symbol > 22) was paid even on Repeat-mode blocks where the
table didn't change between blocks. On a corpus where REPEAT
dominates after the first non-trivial block (typical for compressed
streams that lock onto an FSE distribution and stick with it), this
was 32–256 entry comparisons per block for zero new information.

Move the share computation to a `compute_offsets_long_share` helper
called from `maybe_update_fse_tables` ONLY when the FSE / Predefined
branches actually rebuild the offsets table. Cache the result in a
new `FSEScratch::offsets_long_share` field; Repeat-mode blocks read
the cached value as-is — donor parity (donor's ddict-cold path also
caches the share signal) plus removal of the per-block O(table_size)
overhead from the pipeline-gate decision.

The pipeline gate becomes `num_sequences >= ADVANCE*2 && fse.offsets_long_share >= MIN_LONG_OFFSET_SHARE`
— one branch + one load, no table walk.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/decode_buffer.rs
Comment thread zstd/src/decoding/decode_buffer.rs
polaz added 2 commits May 22, 2026 17:18
…ch_source coverage

Three doc / scope updates plus one new test module section in response
to CR notes on the prefetch pipeline:

- Reword the 'offset_hist mutates here and only here' comment to
  acknowledge that the post-loop rollback path also assigns to it
  (recovery, not the normal mutation site that the comment was
  about).
- Reword the 'AT MOST two prefetch issues per match' comment to
  acknowledge the wrap-boundary case where the 128 B budget is
  split across s1_tail + s2[0..], emitting up to four cache-line
  prefetches total (still bounded, still L1, still under the
  helper's MAX_LINES = 4 ceiling).
- Add three unit tests for prefetch_lookahead_match_source covering
  in-range probes, out-of-range / empty-buffer probes (early-return
  bound check), and the wrap-boundary layout where as_slices()
  returns two non-empty halves (exercises the prefetch_first_line_l1
  short-tail fallback path).
…ring (A+B+C)

Three donor-vs-ours overhead trims in the pipelined decode hot path,
each gated by the existing FSE-long-share pipeline-engage decision so
the short-block fused path keeps its existing repeat() semantics
unchanged.

A. New `DecodeBuffer::repeat_lookahead_prefetched` variant.
   The regular `repeat` issues an in-loop `prefetch_match_source`
   (T1 hint) at the match source — sensible for the no-pipeline path
   that has no other prefetch source. When the long pipeline is
   active, the lookahead PREFETCH_L1 was already issued ADVANCE
   iterations earlier; the second in-loop prefetch is pure
   issue-port pressure on top of a now-warm L1 line. The pipelined
   variant drops that second prefetch.

B. Pre-resolved offset stored in the ring slot.
   Ring goes from `[Sequence; 8]` to `[(Sequence, u32); 8]`. The
   decode-ahead phase already walks `shadow_hist` via
   `do_offset_history` to resolve the prefetch's source offset
   exactly; we now store that resolved offset alongside the raw
   sequence. The execute phase consumes the pre-resolved value via
   a new `execute_one_sequence_pipelined` helper and skips
   `do_offset_history` entirely — saves one function call and one
   cache write on the real `offset_hist` per sequence (real hist
   was the only mutable shared state in the execute path; pipelined
   now leaves it strictly local until the single end-of-block sync).
   The committing point becomes `*offset_hist = shadow_hist` ONCE
   after a successful drain. Rollback on post-loop bitstream failure
   still restores from `saved_offset_hist` — the new shape just
   makes that restore a no-op on the pipelined path (real hist was
   never mutated mid-loop) while preserving correctness on the
   non-pipelined fallback.

C. `repeat_lookahead_prefetched` also drops the per-call
   `self.buffer.reserve(match_length)`. The pipelined caller
   already reserved `MAX_BLOCK_SIZE` upfront before entering the
   pipeline, so each per-sequence reserve was a redundant
   capacity-check branch. The variant relies on that upfront reserve
   exclusively — saves one branch per executed sequence.
Copilot AI review requested due to automatic review settings May 22, 2026 14:23
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/decoding/decode_buffer.rs Outdated
Comment thread zstd/src/decoding/scratch.rs Outdated
Comment thread zstd/src/decoding/scratch.rs
polaz added 2 commits May 22, 2026 17:32
…of offsets_long_share

Three CR findings on the A+B+C pipelined-repeat patch:

- repeat_lookahead_prefetched skipped buffer.reserve(match_length).
  On malformed input where match_length expands past the upfront
  reserve(MAX_BLOCK_SIZE), extend_from_within_unchecked* assumes
  the destination capacity and only debug_assert checks it —
  release builds would have written out of bounds. The reserve is
  amortised by the caller's upfront reservation anyway, so this is
  just a cheap capacity-check branch on the hot path, not an
  allocation. Restore it; only the in-loop prefetch_match_source
  stays skipped (that's the actual win).
- Rename the SKIP_PREFETCH const param to match what it actually
  controls now that reserve is unconditional.
- offsets_long_share doc said the cached value counts codes
  '≥ LONG_OFFSET_CODE_THRESHOLD' but the helper uses strictly '>'
  (donor parity — '> 22' means codes 23..MAX, i.e. raw offsets ≥
  2^23). Fix the doc string.
- DecoderScratch::reset() reset every FSE-related field except
  offsets_long_share, leaving a stale gate signal across scratch
  reuse. A new frame opening with Repeat-mode tables would inherit
  the previous frame's pipeline-engage decision. Reset alongside
  ll_rle / ml_rle / of_rle.
Copilot AI review requested due to automatic review settings May 22, 2026 15:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
…hape + offsets_long_share tests

Three doc + test additions in response to CR notes on the A+B+C
pipelined-repeat patch:

- The execute_one_sequence_pipelined doc claimed it skipped both
  prefetch and reserve, but repeat_lookahead_prefetched keeps the
  reserve (memory-safety against malformed match_length). Rewrite
  the doc so only the in-loop prefetch is described as skipped.
- The on-stack ring comment still said '[Sequence; 8] = 96 bytes'
  after the A+B+C refactor stored '(Sequence, u32)' pairs (128 B
  on a 64-bit target). Update the size + element type description.
- Add four unit tests for compute_offsets_long_share covering:
  pure short-offset tables resolve to zero share; symbols at the
  threshold (=22) don't count (strict-greater predicate, donor
  parity); raw count scales by (OFFSET_FSE_LOG - table_log) so a
  single long-code entry in a 32-entry table lands at share 8;
  raw count passes through unscaled when table is already at
  OFFSET_FSE_LOG = 8. Uses a synthetic_offsets_table helper that
  bypasses build_from_probabilities — the share helper only reads
  decode[*].symbol + accuracy_log, so we can construct the table
  by hand without a full bitstream.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/scratch.rs
Dictionary-init flow (`Dictionary::decode_dict` → `DecoderScratch::init_from_dict` → `FSEScratch::reinit_from`) builds the offsets FSE table from the dictionary's entropy section without ever calling the sequence-decoder path that updates the cache, so `other.offsets_long_share` was the default 0. Verbatim-copying that into `self.offsets_long_share` left Repeat-mode blocks (which reuse the dictionary's offsets table) gated out of the long-pipeline path regardless of the table's actual offset distribution.

Recompute the share inside `reinit_from` from the just-copied offsets table via the existing `compute_offsets_long_share` helper (bumped from private to `pub(crate)`). The walk is bounded by `OF_MAX_LOG = 8` (≤ 256 entries) and runs once per scratch reinit — negligible vs the rest of the per-frame setup.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
The pipelined branch deferred the real-`offset_hist` commit to a
single `*offset_hist = shadow_hist` at the end of a successful
drain. On an `execute_one_sequence_pipelined` Err propagated via
`?` (NotEnoughBytesForSequence, ZeroOffset, OOB match copy) that
commit never ran — the function exited with `*offset_hist` still
at its pre-block value while the buffer had N-1 partial writes.

That diverged from the non-pipelined path: `execute_one_sequence`
mutates real hist in lockstep per executed sequence, so on
mid-loop Err the caller sees N-1 hist updates + N-1 buffer writes
(in lockstep, internally consistent). The pipelined path left the
two out of sync, breaking scratch reuse after any Err.

Wrap the entire pipelined work in an IIFE that returns Result, so
a single `if let Err(e) = pipeline_result` site at the end
catches mid-loop Errs uniformly and routes them through the same
buffer-checkpoint-rollback path the post-loop bitstream validation
already used. `*offset_hist = saved_offset_hist` after the
rollback is a no-op on the pipelined path (real hist was never
touched mid-loop) but stays explicit so future refactors that
move the commit earlier remain safe.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment on lines +256 to +406
if use_long_pipeline {
// The pipelined branch must roll `offset_hist` back to
// `saved_offset_hist` on ANY mid-loop error, not just the
// post-loop bitstream-validation path. Without this, an
// `execute_one_sequence_pipelined` Err (NotEnoughBytesForSequence
// / ZeroOffset / OOB match) propagated via `?` would exit with
// `*offset_hist` still at its pre-block value while the buffer
// had N-1 partial writes — diverging from the non-pipelined
// path (which mutates hist in lockstep per executed sequence)
// and leaving scratch internally inconsistent for any
// post-Err reuse. Wrap the entire pipelined work in an IIFE so
// a single rollback site catches all mid-loop Errs uniformly.
let pipeline_result: Result<(), DecompressBlockError> = (|| {
// `prefetch_pos` is the logical buffer index (same frame as
// `buffer.len()`) at which the NEXT not-yet-decoded sequence
// will start pushing literals. We pre-decode `ADVANCE` ahead, so we
// accumulate (ll + ml) per decoded seq to keep this position
// synchronised with where execute will eventually be.
let mut prefetch_pos: usize = old_buffer_size;
// Shadow copy of `offset_hist`, advanced by
// `do_offset_history` for every decoded-ahead sequence. The
// REAL `offset_hist` is only mutated inside
// `execute_one_sequence` (preserving the legacy 'partial
// output, no rewound history' rollback contract), but the
// prefetch needs the exact post-resolution offset for repcode
// 1..=3 cases that read history — a stale read would skip
// the long-distance prefetch precisely when a fresh huge
// offset is followed by a repcode that aliases it. The
// shadow is a local `[u32; 3]` (12 bytes) so the simulation cost
// is negligible.
let mut shadow_hist: [u32; 3] = *offset_hist;
// Stack ring of `(decoded_seq, resolved_offset)` pairs. The
// decode-ahead phase resolves repcodes against `shadow_hist`
// and stores the resolved offset alongside the raw sequence,
// so the execute phase consumes a pre-resolved offset and
// skips `do_offset_history` entirely — saves one function
// call + one cache write on real `offset_hist` per sequence.
// The real `offset_hist` is updated ONCE from `shadow_hist`
// after a successful drain (below); on a malformed-block
// rollback the saved snapshot is restored, so real hist is
// never observed in a partial mid-pipeline state.
let mut ring: [(Sequence, u32); ADVANCE] = [(
Sequence {
ll: 0,
ml: 0,
of: 0,
},
0u32,
); ADVANCE];

// Pre-fill the ring. The outer `num_sequences >= ADVANCE * 2`
// gate guarantees `num_sequences > ADVANCE`, so the FSE
// state update is needed after every prefill decode — no
// `isLastSeq` guard required here, only in the steady-state
// loop where `i + 1 == num_sequences` is reachable.
for slot in ring.iter_mut() {
let seq =
decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br);
// EXACT actual_offset via shadow history.
let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist);
// wrapping_add: prefetch_pos / seq.ll / seq.ml are
// derived from the bitstream, so a malformed frame can
// present values that would overflow usize and panic
// under debug. The result feeds only the prefetch
// hint — `prefetch_lookahead_match_source` bound-checks
// the logical position against `buffer.len()` and drops
// wrap-derived garbage indices, so the wrap is harmless
// here while keeping the decoder fuzz-stable.
let match_start = prefetch_pos.wrapping_add(seq.ll as usize);
let source_idx = match_start.wrapping_sub(actual_offset as usize);
buffer.prefetch_lookahead_match_source(source_idx);
prefetch_pos = match_start.wrapping_add(seq.ml as usize);
*slot = (seq, actual_offset);
br.ensure_bits(max_update_bits);
ll_dec.update_state_fast(&mut br);
ml_dec.update_state_fast(&mut br);
of_dec.update_state_fast(&mut br);
}

// Steady state: decode next, prefetch its source, execute
// the oldest slot in the ring (with its pre-resolved offset).
for i in ADVANCE..num_sequences {
let seq =
decode_one_sequence_inline(&mut ll_dec, &mut ml_dec, &mut of_dec, &mut br);
let actual_offset = do_offset_history(seq.of, seq.ll, &mut shadow_hist);
let match_start = prefetch_pos.wrapping_add(seq.ll as usize);
let source_idx = match_start.wrapping_sub(actual_offset as usize);
buffer.prefetch_lookahead_match_source(source_idx);
prefetch_pos = match_start.wrapping_add(seq.ml as usize);

let slot = i & ADVANCE_MASK;
let (exec_seq, exec_offset) = ring[slot];
ring[slot] = (seq, actual_offset);

execute_one_sequence_pipelined(
buffer,
literals_buffer,
&mut lit_cur,
literals_buffer_len,
exec_seq,
exec_offset,
)?;
seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml);

if i + 1 < num_sequences {
br.ensure_bits(max_update_bits);
ll_dec.update_state_fast(&mut br);
ml_dec.update_state_fast(&mut br);
of_dec.update_state_fast(&mut br);
}
}

// Drain: execute remaining ADVANCE sequences with their
// pre-resolved offsets. Iteration order matches the ring
// slot they occupy from the steady-state loop's final write.
for k in 0..ADVANCE {
let slot = (num_sequences + k) & ADVANCE_MASK;
let (exec_seq, exec_offset) = ring[slot];
execute_one_sequence_pipelined(
buffer,
literals_buffer,
&mut lit_cur,
literals_buffer_len,
exec_seq,
exec_offset,
)?;
seq_sum = seq_sum.wrapping_add(exec_seq.ll).wrapping_add(exec_seq.ml);
}

// Single committing point for real offset history on the
// pipelined success path. Shadow walked every queued
// sequence already; copy that state back so the next
// block sees the post-block repcodes. Rollback on a later
// bitstream-failure overwrites this with
// `saved_offset_hist`, undoing the commit.
*offset_hist = shadow_hist;
Ok(())
})();
if let Err(e) = pipeline_result {
// Mid-loop execute Err: rollback buffer + hist so post-Err
// scratch reuse stays consistent. `*offset_hist` is still
// at its pre-block value (the success-only commit above
// never ran), so restoring from `saved_offset_hist` is
// effectively a no-op on the hist side — the explicit
// assignment makes the intent unambiguous and protects
// against any future refactor that moves the commit
// earlier in the pipelined flow.
if buffer.try_restore_checkpoint(buffer_checkpoint) {
*offset_hist = saved_offset_hist;
}
return Err(e);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback-on-Err shape here is intentional donor parity, not a divergence.

Donor ZSTD_decompressSequencesLong_body (lib/decompress/zstd_decompress_block.c:1832-1990) commits repcodes ONCE at end of block, AFTER RETURN_ERROR_IF(!BIT_endOfDStream(...)):

/* save reps for next block */
{ U32 i; for (i=0; i<ZSTD_REP_NUM; i++) dctx->entropy.rep[i] = (U32)(seqState.prevOffset[i]); }

If any mid-loop ZSTD_execSequence returns an error (oneSeqSize is error → if (ZSTD_isError(oneSeqSize)) return oneSeqSize;), or the post-loop bitstream-end check fails, that save never runs and dctx->entropy.rep[i] stays at its pre-block value. The same applies to ZSTD_decompressSequences_body (zstd_decompress_block.c:1717-1797) — same end-of-block save pattern, same skip on early return.

Our non-pipelined execute_one_sequence currently mutates real offset_hist per sequence (an older pattern from the pre-fused-executor era). That's the path that diverges from donor — not the pipelined path I just landed. Bringing the non-pipelined path to donor parity is in scope of a separate cleanup, not this PR.

On the try_restore_checkpoint cap-changed edge case: the conditional if buffer.try_restore_checkpoint(...) { *offset_hist = saved_offset_hist; } is the same shape used by the existing post-loop bitstream-validation rollback (lines 426-428, pre-existing). Both branches share the "only restore hist when buffer rollback actually succeeded" rule so the hist+buffer pair stays in lockstep — restoring hist while leaving buffer partial would be the actual inconsistency. The cap-changed case implies the frame is already malformed past MAX_BLOCK_SIZE; the partial data left in the buffer is discarded by the propagated Err anyway.

@polaz polaz merged commit d7bb566 into main May 22, 2026
25 of 26 checks passed
@polaz polaz deleted the perf/#208-match-prefetch branch May 22, 2026 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(decode): match prefetch 4-stage pipeline for long-distance matches

2 participants