Context
Follow-up to #199 / #201 (HUF 4-stream burst port). Step F from the original
adaptation table, deferred from #201 scope as a stretch goal pending profile
evidence.
Hypothesis
The donor burst body in decompress_literals runs ~4 packed-table reads + 4
left-shifts + 4 sentinel trailing_zeros per outer iteration. LLVM should
generate near-optimal code for this on aarch64. On x86-64 with BMI2 (pdep,
pext, bzhi, plus andn from BMI1), hand-rolled inline asm could:
- Use pext for the symbol-bits extraction (instead of bits[s] >> table_shift)
- Use lzcnt / tzcnt for the sentinel position
- Schedule the 4-stream interleave to issue port-balanced uops
This is a stretch experiment — only worth the maintenance cost if profile
on a Zen3/Zen4/Sapphire Rapids box shows the burst body as a measurable
fraction of decode_literals self-time AND the LLVM-generated code leaves
≥5% on the table.
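
For readers unfamiliar with the pattern, here is a minimal single-stream sketch of the three operations the hypothesis counts: a packed-table read, a left-shift, and a sentinel trailing_zeros. It is not the code in literals_section_decoder.rs. The entry layout (low 8 bits = code length, high 8 bits = symbol), the toy 2-bit table, and the names entry, decode_one and decode_demo are assumptions made for illustration.

```rust
/// Hypothetical packed entry: low 8 bits = code length, high 8 bits = symbol.
const fn entry(symbol: u8, len: u8) -> u16 {
    ((symbol as u16) << 8) | len as u16
}

/// One decode step: index the table with the top `table_log` bits
/// (the `bits >> table_shift` read), then shift the consumed bits out.
fn decode_one(bits: &mut u64, table: &[u16], table_log: u32) -> u8 {
    let index = (*bits >> (64 - table_log)) as usize;
    let e = table[index];
    *bits <<= u32::from(e & 0xFF);
    (e >> 8) as u8
}

/// Decode four symbols from a toy container and report how many bits
/// were consumed, recovered from the sentinel position alone.
fn decode_demo() -> (Vec<u8>, u32) {
    const TABLE_LOG: u32 = 2;
    // Toy prefix code: 'a' = "0", 'b' = "10", 'c' = "11".
    let table = [entry(b'a', 1), entry(b'a', 1), entry(b'b', 2), entry(b'c', 2)];
    // Payload "0 10 11 0" (a, b, c, a) at the top of the word, sentinel at bit 0.
    let mut bits: u64 = (0b010110u64 << 58) | 1;
    let symbols: Vec<u8> = (0..4)
        .map(|_| decode_one(&mut bits, &table, TABLE_LOG))
        .collect();
    // Each left-shift moves the sentinel up, so its position equals
    // the total number of bits consumed so far.
    (symbols, bits.trailing_zeros())
}
```

In the real burst body this step would run once per stream, four times per outer iteration. The sketch only shows why the per-iteration instruction count is already small, which is the reason the gates below ask for profile evidence first.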
Gating criteria (must be met before opening PR)
- Run cargo flamegraph on decompress/level_3_dfast/decodecorpus-z000033/c_stream
  on x86-64 BMI2 hardware
- Identify if the burst body is ≥10% of decode_literals self-time
- Manually inspect the LLVM-generated code for the burst body via
  cargo asm --no-color -p structured-zstd decoding::literals_section_decoder
- Confirm ≥5% headroom over what LLVM produces
Scope (if gates pass)
- Add #[cfg(target_arch = "x86_64")] #[target_feature(enable = "bmi2")]
  variant of the 4-stream burst body
- Hide the asm behind is_x86_feature_detected!("bmi2") runtime gate at
  decode_literals entry, with the current safe Rust path as fallback
- Bench on Zen3 / Zen4 / Sapphire Rapids
- Document required tooling (assembler version, target_feature set)
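
The dispatch shape described above could look roughly like the sketch below. It uses the _pext_u64 intrinsic rather than inline asm to stay short, and the function names (pext_portable, pext_bmi2, extract_bits) are hypothetical, not existing items in the crate. If the asm path also used tzcnt or lzcnt, the gate would presumably need to check the "bmi1" and "lzcnt" features as well, since those instructions are not part of BMI2.

```rust
/// Portable fallback: gather the bits of `src` selected by `mask`
/// into the low bits of the result (software equivalent of pext).
fn pext_portable(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if src & lowest != 0 {
            out |= 1u64 << k;
        }
        k += 1;
        mask &= mask - 1; // clear lowest set mask bit
    }
    out
}

/// BMI2 variant; only sound to call once BMI2 support is confirmed.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn pext_bmi2(src: u64, mask: u64) -> u64 {
    // SAFETY: the caller guarantees that BMI2 is available.
    unsafe { core::arch::x86_64::_pext_u64(src, mask) }
}

/// Runtime-gated entry point: BMI2 path when detected, portable path otherwise.
pub fn extract_bits(src: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified on the line above.
            return unsafe { pext_bmi2(src, mask) };
        }
    }
    pext_portable(src, mask)
}
```

For a contiguous top-of-word field, pext returns the same value as the plain shift: with an 11-bit table index, extract_bits(bits, u64::MAX << 53) equals bits >> 53. Any gain over the shift therefore has to come from scheduling or from fusing other work, which is what the gating measurements are meant to establish. In a real port the feature check would sit once at decode_literals entry, as the scope item says, not once per extraction.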
Anti-criteria (if gates fail)
If profile shows the burst body is sub-5% of decode_literals time, OR
LLVM output is already at the data-dependency floor, close this issue
without opening a PR — maintenance cost of hand-rolled asm exceeds the
expected gain.
Files involved
- zstd/src/decoding/literals_section_decoder.rs (burst body)
References
PR #201 "Out of scope" Step F, Issue #199 adaptation table.