Problem
Three independent micro-optimizations in the decode hot path that together yield 15-30% throughput improvement on sequence-heavy data.
1. Branchless offset history (`sequence_execution.rs:68-124`)
`do_offset_history()` currently uses 6 branches (nested match + if/else) to resolve repeat offsets. The decision space is tiny: `offset_value` (1-3) crossed with whether `lit_len > 0`, six states in all, making this a natural branchless lookup-table candidate.
Current code (simplified):
```rust
match offset_value {
    1 if lit_len == 0 => { /* swap scratch[0], scratch[1] */ }
    1 => { /* use scratch[0] */ }
    2 if lit_len == 0 => { /* swap scratch[0], scratch[2] */ }
    2 => { /* use scratch[1], rotate */ }
    3 => { /* use scratch[2] - 1, rotate */ }
    _ => unreachable!(),
}
```
Target: Pre-computed lookup table indexed by `(offset_value - 1) * 2 + (lit_len > 0) as usize`. Each entry encodes which scratch slot to use and how to rotate. Eliminates branch mispredictions on data with mixed offset patterns.
Expected gain: 10-20% on sequence-heavy blocks with varied offset distribution.
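The table design above can be sketched as follows. The `(permutation, delta)` entry layout and the names `OFFSET_TABLE` / `resolve_offset` are illustrative, and the per-entry mapping is read off the simplified match shown here rather than the verified `sequence_execution.rs` semantics:

```rust
/// Illustrative table: index = (offset_value - 1) * 2 + (lit_len > 0) as usize.
/// Each entry is a permutation of the three history slots plus a delta applied
/// to the chosen offset. The mapping mirrors the simplified match above and may
/// not be the exact behavior of sequence_execution.rs.
const OFFSET_TABLE: [([usize; 3], u64); 6] = [
    ([1, 0, 2], 0), // offset 1, lit_len == 0: swap scratch[0], scratch[1]
    ([0, 1, 2], 0), // offset 1, lit_len > 0:  use scratch[0], no rotation
    ([2, 1, 0], 0), // offset 2, lit_len == 0: swap scratch[0], scratch[2]
    ([1, 0, 2], 0), // offset 2, lit_len > 0:  use scratch[1], rotate
    ([2, 0, 1], 1), // offset 3, lit_len == 0: use scratch[2] - 1, rotate
    ([2, 0, 1], 1), // offset 3, lit_len > 0:  use scratch[2] - 1, rotate
];

/// Branchless repeat-offset resolution: one table load, no data-dependent
/// branches on (offset_value, lit_len).
fn resolve_offset(scratch: &mut [u64; 3], offset_value: usize, lit_len: usize) -> u64 {
    let idx = (offset_value - 1) * 2 + (lit_len > 0) as usize;
    let (perm, delta) = OFFSET_TABLE[idx];
    let s = *scratch;
    let offset = s[perm[0]] - delta;
    *scratch = [offset, s[perm[1]], s[perm[2]]];
    offset
}
```

Encoding the rotation as a full permutation keeps the update branch-free: every case writes all three slots, so the only data-dependent work is the table load.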
2. Stride prefetch pipeline (`prefetch.rs` + `sequence_execution.rs`)
Current `prefetch_slice()` issues a single `_mm_prefetch` for the first 64 bytes of each literal block. For blocks > 64B, subsequent cache lines miss.
Improvements:
- Multi-line prefetch: For literals > 64B, prefetch every 64B (up to 4 cache lines ahead).
- N+2 sequence prefetch: Prefetch literal data for the sequence 2 iterations ahead, not the current one. This gives the prefetch time to resolve before the data is needed.
- Match offset prefetch: Issue `_MM_HINT_T1` (L2) prefetch for the back-reference source address before the match copy.
- ARM: Use `__prefetch` / `prfm pldl1keep` (AArch64 equivalent).
Expected gain: 5-15% on data with large literal blocks (logs, JSON, structured text).
3. BMI2 `pext` for FSE multi-field extraction (`bit_reader_reverse.rs`)
`get_bits_triple(n1, n2, n3)` currently performs 3 sequential mask+shift operations. On BMI2-capable CPUs, `_pext_u64` (Parallel Bit Extract) pulls each field out of `container` in a single instruction against a precomputed mask, shortening the shift/mask dependency chain in the hot path.
Current:
```rust
let f1 = container & mask(n1);
let f2 = (container >> n1) & mask(n2);
let f3 = (container >> (n1 + n2)) & mask(n3);
```
With pext:
```rust
let f1 = _pext_u64(container, mask1);
let f2 = _pext_u64(container, mask2);
let f3 = _pext_u64(container, mask3);
```
Caveat: AMD Zen1/Zen2 `pext` is microcoded (~18 cycles). Must use runtime dispatch and fall back to current approach on pre-Zen3 AMD. Intel Haswell+ and AMD Zen3+ = 1 cycle.
Expected gain: 5-10% on FSE-heavy decode paths (BMI2 + Zen3/Intel only).
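An x86_64-only sketch of the dispatch. The helper names are illustrative, the dispatch here gates only on BMI2 presence (the Zen1/Zen2 exclusion would additionally need a CPUID vendor/family check), and in the real `bit_reader_reverse.rs` the feature test should be hoisted out of the per-symbol loop, e.g. into a function pointer set once:

```rust
use core::arch::x86_64::_pext_u64;

#[inline]
fn mask(n: u32) -> u64 {
    (1u64 << n) - 1 // assumes n < 64, as in the FSE hot path
}

/// Portable fallback: the current sequential mask+shift chain.
fn get_bits_triple_scalar(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    (
        container & mask(n1),
        (container >> n1) & mask(n2),
        (container >> (n1 + n2)) & mask(n3),
    )
}

/// BMI2 path: one pext per field against a precomputed (shifted) mask.
#[target_feature(enable = "bmi2")]
unsafe fn get_bits_triple_pext(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    (
        _pext_u64(container, mask(n1)),
        _pext_u64(container, mask(n2) << n1),
        _pext_u64(container, mask(n3) << (n1 + n2)),
    )
}

/// Runtime dispatch. A complete version must also exclude AMD Zen1/Zen2
/// (microcoded pext); that CPUID check is omitted from this sketch.
fn get_bits_triple(container: u64, n1: u32, n2: u32, n3: u32) -> (u64, u64, u64) {
    if is_x86_feature_detected!("bmi2") {
        unsafe { get_bits_triple_pext(container, n1, n2, n3) }
    } else {
        get_bits_triple_scalar(container, n1, n2, n3)
    }
}
```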
Acceptance criteria
Files involved
- `zstd/src/decoding/sequence_execution.rs` (offset history, prefetch integration)
- `zstd/src/decoding/prefetch.rs` (stride prefetch, ARM support)
- `zstd/src/bit_io/bit_reader_reverse.rs` (pext path)
Dependencies
Estimate
1d 4h