perf(decode): HUF burst body x86-64 inline asm experiment (BMI2) #205

@polaz

Context

Follow-up to #199 / #201 (HUF 4-stream burst port). Step F from the original
adaptation table, deferred from #201 scope as a stretch goal pending profile
evidence.

Hypothesis

The donor burst body in decompress_literals runs ~4 packed-table reads + 4
left-shifts + 4 sentinel trailing_zeros per outer iteration. LLVM should
generate near-optimal code for this on aarch64. On x86-64 with BMI2 (pdep,
pext, bzhi; andn belongs to BMI1), hand-rolled inline asm could:

  1. Use pext for the symbol-bits extraction (instead of bits[s] >> table_shift)
  2. Use lzcnt / tzcnt for the sentinel position (tzcnt is a BMI1 instruction
     and lzcnt has its own LZCNT feature flag, both separate from BMI2)
  3. Schedule the 4-stream interleave to issue port-balanced uops
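
To make items 1 and 2 concrete, here is a minimal sketch using the `_pext_u64` intrinsic rather than inline asm. The function names (`index_shift`, `index_pext`, `index_any`, `sentinel_pos`) and the 64-bit container layout are assumptions for illustration, not the crate's actual code. It shows that `pext` with a mask over the top `table_log` bits yields the same index as the shift form.

```rust
/// Shift form from the issue text: the top `table_log` bits index the table.
/// Assumes 1 <= table_log <= 63 so the shift amount stays in range.
fn index_shift(bits: u64, table_log: u32) -> u64 {
    bits >> (64 - table_log)
}

/// `pext` form: gather the bits selected by a mask over the top `table_log`
/// bits into the low bits of the result.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn index_pext(bits: u64, table_log: u32) -> u64 {
    use core::arch::x86_64::_pext_u64;
    let mask = !0u64 << (64 - table_log);
    _pext_u64(bits, mask)
}

/// Uses `pext` when the CPU reports BMI2, otherwise the shift form.
fn index_any(bits: u64, table_log: u32) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified on the line above.
            return unsafe { index_pext(bits, table_log) };
        }
    }
    index_shift(bits, table_log)
}

/// Sentinel position as the index of the lowest set bit.
/// (Hypothetical helper; the real sentinel convention may differ.)
fn sentinel_pos(bits: u64) -> u32 {
    bits.trailing_zeros()
}
```

With a contiguous mask like this one, the two forms compute the same value, so any win from `pext` would have to come from instruction scheduling or from extracting non-contiguous fields. That is one more reason to inspect the LLVM output before committing to asm.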

This is a stretch experiment: it is only worth the maintenance cost if a
profile on a Zen3/Zen4/Sapphire Rapids box shows the burst body as a
measurable fraction of decode_literals self-time AND the LLVM-generated code
leaves ≥5% on the table.

Gating criteria (must be met before opening PR)

  • Run cargo flamegraph on decompress/level_3_dfast/decodecorpus-z000033/c_stream
    on x86-64 BMI2 hardware
  • Determine whether the burst body is ≥10% of decode_literals self-time
  • Manually inspect the LLVM-generated code for the burst body via
    cargo asm --no-color -p structured-zstd decoding::literals_section_decoder
  • Confirm ≥5% headroom over what LLVM produces

Scope (if gates pass)

  • Add #[cfg(target_arch = "x86_64")] #[target_feature(enable = "bmi2")]
    variant of the 4-stream burst body
  • Hide the asm behind is_x86_feature_detected!("bmi2") runtime gate at
    decode_literals entry, with the current safe Rust path as fallback
  • Bench on Zen3 / Zen4 / Sapphire Rapids
  • Document required tooling (assembler version, target_feature set)
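
The gating pattern described in this list could look roughly like the sketch below. Everything here is a hypothetical stand-in: `burst_scalar`, `burst_bmi2`, `decode_entry`, and the one-index-per-word "burst body" are illustrative only and do not reflect the real decoder. The point is the shape: feature detection happens once at the entry point, the asm lives in a `#[target_feature(enable = "bmi2")]` function, and the safe Rust path stays as the fallback on every other CPU and architecture.

```rust
/// Hypothetical stand-in for the safe Rust burst body: one table index per word.
/// Assumes 1 <= table_log <= 16 so each index fits in a u16.
fn burst_scalar(words: &[u64], table_log: u32, dst: &mut Vec<u16>) {
    for &bits in words {
        dst.push((bits >> (64 - table_log)) as u16);
    }
}

/// Same computation with the extraction done by a `pext` in inline asm.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn burst_bmi2(words: &[u64], table_log: u32, dst: &mut Vec<u16>) {
    use core::arch::asm;
    let mask = !0u64 << (64 - table_log);
    for &bits in words {
        let idx: u64;
        // SAFETY: the caller guarantees BMI2; `pext` reads two registers,
        // writes one, and touches neither memory nor the stack.
        unsafe {
            asm!(
                "pext {idx}, {src}, {mask}",
                idx = out(reg) idx,
                src = in(reg) bits,
                mask = in(reg) mask,
                options(pure, nomem, nostack)
            );
        }
        dst.push(idx as u16);
    }
}

/// Entry point: detect once, then run the whole loop on the chosen path.
pub fn decode_entry(words: &[u64], table_log: u32) -> Vec<u16> {
    let mut dst = Vec::with_capacity(words.len());
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified on the line above.
            unsafe { burst_bmi2(words, table_log, &mut dst) };
            return dst;
        }
    }
    burst_scalar(words, table_log, &mut dst);
    dst
}
```

Checking the feature at the entry point rather than per symbol keeps the detection branch out of the hot loop. If the asm also used tzcnt or lzcnt, the `bmi1` and `lzcnt` features would presumably need to be added to both the attribute and the runtime check.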

Anti-criteria (if gates fail)

If the profile shows the burst body is sub-5% of decode_literals time, OR
LLVM output is already at the data-dependency floor, close this issue
without opening a PR: the maintenance cost of hand-rolled asm exceeds the
expected gain.

Files involved

  • zstd/src/decoding/literals_section_decoder.rs (burst body)

References

PR #201 "Out of scope" Step F, Issue #199 adaptation table.

Labels: P2-medium (medium priority, important improvement), enhancement (new feature or request), performance (performance optimization)
