perf(decoding): branchless offset history, stride prefetch, BMI2 pext #69

@polaz

Description

Problem

Three independent micro-optimizations in the decode hot path that together are expected to yield a 15-30% throughput improvement on sequence-heavy data.

1. Branchless offset history (`sequence_execution.rs:68-124`)

`do_offset_history()` currently uses 6 branches (nested match plus if/else) to resolve repeat offsets. The decision space is only six cases: `offset_value` (1-3) crossed with `lit_len > 0`, which makes it a natural candidate for a branchless lookup table.

Current code (simplified):
```rust
match offset_value {
    1 if lit_len == 0 => { /* swap scratch[0], scratch[1] */ }
    1 => { /* use scratch[0] */ }
    2 if lit_len == 0 => { /* swap scratch[0], scratch[2] */ }
    2 => { /* use scratch[1], rotate */ }
    3 => { /* use scratch[2] - 1, rotate */ }
    _ => unreachable!(),
}
```

Target: Pre-computed lookup table indexed by `(offset_value - 1) * 2 + (lit_len > 0) as usize`. Each entry encodes which scratch slot to use and how to rotate. Eliminates branch mispredictions on data with mixed offset patterns.

Expected gain: 10-20% on sequence-heavy blocks with varied offset distribution.
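A minimal sketch of the table shape, mirroring the simplified match above (names like `resolve_offset` and `OFFSET_TABLE` are illustrative, not the crate's actual API). The key observation is that in every case the offset actually used becomes the new front of the history, so each entry only needs the source slot, the two surviving slots, and an optional `-1` adjustment:

```rust
/// Hypothetical branchless resolver for the simplified match above.
/// Entry: (source_slot, kept_slot_a, kept_slot_b, delta).
/// The resolved offset always becomes the new scratch[0].
const OFFSET_TABLE: [(usize, usize, usize, u32); 6] = [
    (1, 0, 2, 0), // offset 1, lit_len == 0: swap scratch[0] <-> scratch[1]
    (0, 1, 2, 0), // offset 1, lit_len > 0:  use scratch[0], history unchanged
    (2, 1, 0, 0), // offset 2, lit_len == 0: swap scratch[0] <-> scratch[2]
    (1, 0, 2, 0), // offset 2, lit_len > 0:  use scratch[1], rotate
    (2, 0, 1, 1), // offset 3, lit_len == 0: use scratch[2] - 1, rotate
    (2, 0, 1, 1), // offset 3, lit_len > 0:  use scratch[2] - 1, rotate
];

fn resolve_offset(scratch: &mut [u32; 3], offset_value: u32, lit_len: u32) -> u32 {
    let idx = ((offset_value - 1) * 2 + (lit_len > 0) as u32) as usize;
    debug_assert!(idx < 6);
    let (src, keep_a, keep_b, delta) = OFFSET_TABLE[idx];
    let offset = scratch[src] - delta;
    // Rebuild the history with a single store. The only remaining branch is
    // the table bounds check, which can be eliminated with a mask or
    // `get_unchecked` if the disassembly shows it survives.
    *scratch = [offset, scratch[keep_a], scratch[keep_b]];
    offset
}
```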

2. Stride prefetch pipeline (`prefetch.rs` + `sequence_execution.rs`)

Current `prefetch_slice()` issues a single `_mm_prefetch` for the first 64 bytes of each literal block. For blocks > 64B, subsequent cache lines miss.

Improvements:

  • Multi-line prefetch: For literals > 64B, prefetch every 64B (up to 4 cache lines ahead).
  • N+2 sequence prefetch: Prefetch literal data for the sequence 2 iterations ahead, not the current one. This gives the prefetch time to resolve before the data is needed.
  • Match offset prefetch: Issue `_MM_HINT_T1` (L2) prefetch for the back-reference source address before the match copy.
  • ARM: Use `__prefetch` / `prfm pldl1keep` (AArch64 equivalent).

Expected gain: 5-15% on data with large literal blocks (logs, JSON, structured text).
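A sketch of the multi-line stride plus N+2 lookahead shape. The `Sequence` struct, `prefetch_slice()` signature, and loop body are illustrative assumptions, not the crate's real layout:

```rust
/// Prefetch every 64-byte cache line covering `data`, up to 4 lines ahead
/// (sketch; the real `prefetch_slice()` signature may differ).
#[inline]
fn prefetch_slice(data: &[u8]) {
    let lines = ((data.len() + 63) / 64).min(4);
    for i in 0..lines {
        // i * 64 < data.len() holds for every i, so the pointer stays in bounds.
        prefetch_line(unsafe { data.as_ptr().add(i * 64) });
    }
}

#[inline]
fn prefetch_line(p: *const u8) {
    #[cfg(target_arch = "x86_64")]
    // SAFETY: prefetch is only a hint to the cache hierarchy; it never faults.
    unsafe {
        std::arch::x86_64::_mm_prefetch::<{ std::arch::x86_64::_MM_HINT_T0 }>(p as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = p; // AArch64 would issue `prfm pldl1keep` via inline asm here
}

/// Illustrative sequence record for the lookahead sketch.
struct Sequence {
    lit_start: usize,
    lit_len: usize,
}

fn execute_sequences(seqs: &[Sequence], literals: &[u8]) {
    for (i, seq) in seqs.iter().enumerate() {
        // Prefetch the literals of the sequence two iterations ahead, so the
        // lines have time to arrive before they are copied.
        if let Some(ahead) = seqs.get(i + 2) {
            prefetch_slice(&literals[ahead.lit_start..ahead.lit_start + ahead.lit_len]);
        }
        // ... copy this sequence's literals and match here ...
        let _ = seq;
    }
}
```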

3. BMI2 `pext` for FSE multi-field extraction (`bit_reader_reverse.rs`)

`get_bits_triple(n1, n2, n3)` currently performs three sequential shift+mask operations. On BMI2-capable CPUs, each field can instead be extracted with a single `_pext_u64` (Parallel Bit Extract) against a precomputed mask, cutting the shift/mask overhead in the hot path.

Current:
```rust
let f1 = container & mask(n1);
let f2 = (container >> n1) & mask(n2);
let f3 = (container >> (n1 + n2)) & mask(n3);
```

With pext:
```rust
let f1 = _pext_u64(container, mask1);
let f2 = _pext_u64(container, mask2);
let f3 = _pext_u64(container, mask3);
```

Caveat: on AMD Zen1/Zen2, `pext` is microcoded (roughly 18 cycles, worse for dense masks). Must use runtime dispatch and fall back to the current approach on pre-Zen3 AMD. On Intel Haswell+ and AMD Zen3+ it is a single fast uop (about 3-cycle latency, fully pipelined).

Expected gain: 5-10% on FSE-heavy decode paths (BMI2 + Zen3/Intel only).
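A runtime-dispatch sketch (function names are illustrative; assumes `n1 + n2 + n3 <= 63`, and the Zen1/Zen2 exclusion would additionally need a CPUID family/model check, omitted here):

```rust
/// Dispatch at runtime: pext path when BMI2 is available, scalar otherwise.
fn get_bits_triple(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("bmi2") {
        // NOTE: a real implementation would also inspect CPUID family/model
        // to skip microcoded pext on AMD Zen1/Zen2.
        return unsafe { get_bits_triple_pext(container, n1, n2, n3) };
    }
    get_bits_triple_scalar(container, n1, n2, n3)
}

/// Scalar fallback: three sequential shift+mask extractions.
fn get_bits_triple_scalar(c: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    let mask = |n: u32| (1u64 << n) - 1; // assumes n < 64
    (c & mask(n1), (c >> n1) & mask(n2), (c >> (n1 + n2)) & mask(n3))
}

/// BMI2 path: one pext per field against a precomputed mask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn get_bits_triple_pext(c: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    use std::arch::x86_64::_pext_u64;
    let m1 = (1u64 << n1) - 1;
    let m2 = ((1u64 << n2) - 1) << n1;
    let m3 = ((1u64 << n3) - 1) << (n1 + n2);
    (_pext_u64(c, m1), _pext_u64(c, m2), _pext_u64(c, m3))
}
```

Both paths must produce byte-exact results, so the scalar fallback doubles as a cross-check in tests.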

Acceptance criteria

  • Branchless offset history: no branches in `do_offset_history()`, verified via `cargo-asm` or disassembly.
  • Stride prefetch: multi-line prefetch for literals > 64B, N+2 lookahead in sequence loop.
  • BMI2 pext: runtime-dispatched, falls back to current approach on non-BMI2 or slow-pext CPUs.
  • All three changes pass existing decode test suite with byte-exact output.
  • Combined benchmark improvement measurable on sequence-heavy corpus.

Files involved

  • `zstd/src/decoding/sequence_execution.rs` (offset history, prefetch integration)
  • `zstd/src/decoding/prefetch.rs` (stride prefetch, ARM support)
  • `zstd/src/bit_io/bit_reader_reverse.rs` (pext path)

Dependencies

Estimate

1d 4h

    Labels

  • P2-medium (Medium priority, important improvement)
  • enhancement (New feature or request)
  • performance (Performance optimization)
