Problem
Three independent micro-optimizations in the decode hot path that together yield 15-30% throughput improvement on sequence-heavy data.
1. Branchless offset history (`sequence_execution.rs:68-124`)
`do_offset_history()` currently uses 6 branches (nested match + if/else) to resolve repeat offsets. The decision space is tiny: `offset_value` (1-3) crossed with whether `lit_len > 0`, six states in all, making this a natural branchless lookup-table candidate.
Current code (simplified):
```rust
match offset_value {
    1 if lit_len == 0 => { /* swap scratch[0], scratch[1] */ }
    1 => { /* use scratch[0] */ }
    2 if lit_len == 0 => { /* swap scratch[0], scratch[2] */ }
    2 => { /* use scratch[1], rotate */ }
    3 => { /* use scratch[2] - 1, rotate */ }
    _ => unreachable!(),
}
```
Target: Pre-computed lookup table indexed by `(offset_value - 1) * 2 + (lit_len > 0) as usize`. Each entry encodes which scratch slot to use and how to rotate. Eliminates branch mispredictions on data with mixed offset patterns.
Expected gain: 10-20% on sequence-heavy blocks with varied offset distribution.
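The table design above can be sketched as follows. The `(permutation, delta)` entry layout and the names `OFFSET_TABLE` / `resolve_offset` are illustrative, and the per-entry mapping is read off the simplified match shown here rather than the verified `sequence_execution.rs` semantics:

```rust
/// Illustrative table: index = (offset_value - 1) * 2 + (lit_len > 0) as usize.
/// Each entry is a permutation of the three history slots plus a delta applied
/// to the chosen offset. The mapping mirrors the simplified match above and may
/// not be the exact behavior of sequence_execution.rs.
const OFFSET_TABLE: [([usize; 3], u64); 6] = [
    ([1, 0, 2], 0), // offset 1, lit_len == 0: swap scratch[0], scratch[1]
    ([0, 1, 2], 0), // offset 1, lit_len > 0:  use scratch[0], no rotation
    ([2, 1, 0], 0), // offset 2, lit_len == 0: swap scratch[0], scratch[2]
    ([1, 0, 2], 0), // offset 2, lit_len > 0:  use scratch[1], rotate
    ([2, 0, 1], 1), // offset 3, lit_len == 0: use scratch[2] - 1, rotate
    ([2, 0, 1], 1), // offset 3, lit_len > 0:  use scratch[2] - 1, rotate
];

/// Branchless repeat-offset resolution: one table load, no data-dependent
/// branches on (offset_value, lit_len).
fn resolve_offset(scratch: &mut [u64; 3], offset_value: usize, lit_len: usize) -> u64 {
    let idx = (offset_value - 1) * 2 + (lit_len > 0) as usize;
    let (perm, delta) = OFFSET_TABLE[idx];
    let s = *scratch;
    let offset = s[perm[0]] - delta;
    *scratch = [offset, s[perm[1]], s[perm[2]]];
    offset
}
```

Encoding the rotation as a full permutation keeps the update branch-free: every case writes all three slots, so the only data-dependent work is the table load.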
2. Stride prefetch pipeline (`prefetch.rs` + `sequence_execution.rs`)
Current `prefetch_slice()` issues a single `_mm_prefetch` for the first 64 bytes of each literal block. For blocks > 64B, subsequent cache lines miss.
Improvements:
- Multi-line prefetch: For literals > 64B, prefetch every 64B (up to 4 cache lines ahead).
- N+2 sequence prefetch: Prefetch literal data for the sequence 2 iterations ahead, not the current one. This gives the prefetch time to resolve before the data is needed.
- Match offset prefetch: Issue `_MM_HINT_T1` (L2) prefetch for the back-reference source address before the match copy.
- ARM: Use `__prefetch` / `prfm pldl1keep` (AArch64 equivalent).
Expected gain: 5-15% on data with large literal blocks (logs, JSON, structured text).
3. BMI2 `pext` for FSE multi-field extraction (`bit_reader_reverse.rs`)
`get_bits_triple(n1, n2, n3)` currently performs 3 sequential mask+shift operations. On BMI2-capable CPUs, `_pext_u64` (Parallel Bit Extract) pulls each field out of `container` in a single instruction against a precomputed mask, shortening the shift/mask dependency chain in the hot path.
Current:
```rust
let f1 = container & mask(n1);
let f2 = (container >> n1) & mask(n2);
let f3 = (container >> (n1 + n2)) & mask(n3);
```
With pext:
```rust
let f1 = _pext_u64(container, mask1);
let f2 = _pext_u64(container, mask2);
let f3 = _pext_u64(container, mask3);
```
Caveat: AMD Zen1/Zen2 `pext` is microcoded (~18 cycles). Must use runtime dispatch and fall back to current approach on pre-Zen3 AMD. Intel Haswell+ and AMD Zen3+ = 1 cycle.
Expected gain: 5-10% on FSE-heavy decode paths (BMI2 + Zen3/Intel only).
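An x86_64-only sketch of the dispatch. The helper names are illustrative, the dispatch here gates only on BMI2 presence (the Zen1/Zen2 exclusion would additionally need a CPUID vendor/family check), and in the real `bit_reader_reverse.rs` the feature test should be hoisted out of the per-symbol loop, e.g. into a function pointer set once:

```rust
use core::arch::x86_64::_pext_u64;

#[inline]
fn mask(n: u32) -> u64 {
    (1u64 << n) - 1 // assumes n < 64, as in the FSE hot path
}

/// Portable fallback: the current sequential mask+shift chain.
fn get_bits_triple_scalar(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    (
        container & mask(n1),
        (container >> n1) & mask(n2),
        (container >> (n1 + n2)) & mask(n3),
    )
}

/// BMI2 path: one pext per field against a precomputed (shifted) mask.
#[target_feature(enable = "bmi2")]
unsafe fn get_bits_triple_pext(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    (
        _pext_u64(container, mask(n1)),
        _pext_u64(container, mask(n2) << n1),
        _pext_u64(container, mask(n3) << (n1 + n2)),
    )
}

/// Runtime dispatch. A complete version must also exclude AMD Zen1/Zen2
/// (microcoded pext); that CPUID check is omitted from this sketch.
fn get_bits_triple(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    if is_x86_feature_detected!("bmi2") {
        unsafe { get_bits_triple_pext(container, n1, n2, n3) }
    } else {
        get_bits_triple_scalar(container, n1, n2, n3)
    }
}
```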
Acceptance criteria
Files involved
- `zstd/src/decoding/sequence_execution.rs` (offset history, prefetch integration)
- `zstd/src/decoding/prefetch.rs` (stride prefetch, ARM support)
- `zstd/src/bit_io/bit_reader_reverse.rs` (pext path)
Dependencies
Estimate
1d 4h