
perf: ARM platform optimizations (CRC32 hash, NEON copy, SVE2 histcnt) #71

@polaz

Description

Problem

All current platform-specific optimizations target x86-64 only (BMI2 bzhi, SSE _mm_prefetch). ARM/AArch64 falls back to scalar paths everywhere. As ARM becomes a primary deployment target (Apple M-series, AWS Graviton, Android), dedicated ARM code paths are needed.

Goal

Add architecture-specific optimizations across encode and decode paths, with runtime CPU-feature gating and scalar fallback.

Implementation plan

1. CRC32 intrinsics for hash computation (encoding, AArch64)

ARM provides hardware CRC32 instructions, exposed through the __crc32cw / __crc32cd intrinsics, that can reduce hash-mix cost in fast matcher paths.

Target: AArch64 CRC path in match_generator.rs with runtime (is_aarch64_feature_detected!("crc")) and compile-time fallback.
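A minimal sketch of the dispatch shape, assuming a recent Rust toolchain where the AArch64 CRC intrinsics are stable; `hash5`, `TABLE_BITS`, and the scalar mixing constant are illustrative, not taken from match_generator.rs:

```rust
const TABLE_BITS: u32 = 17; // illustrative hash-table size

/// Scalar fallback: Fibonacci-style multiplicative hash of the low 5 bytes.
fn hash5_scalar(bytes: u64) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87; // 64-bit golden-ratio constant
    ((bytes << 24).wrapping_mul(PRIME) >> (64 - TABLE_BITS)) as u32
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "crc")]
unsafe fn hash5_crc(bytes: u64) -> u32 {
    use std::arch::aarch64::__crc32cd;
    __crc32cd(!0u32, bytes << 24) // mix only the 5 relevant bytes
}

#[cfg(target_arch = "aarch64")]
fn hash5(bytes: u64) -> u32 {
    if std::arch::is_aarch64_feature_detected!("crc") {
        // SAFETY: guarded by the runtime feature check above.
        return unsafe { hash5_crc(bytes) } & ((1 << TABLE_BITS) - 1);
    }
    hash5_scalar(bytes)
}

#[cfg(not(target_arch = "aarch64"))]
fn hash5(bytes: u64) -> u32 {
    hash5_scalar(bytes)
}

fn main() {
    let h = hash5(0x0000_0012_3456_789A);
    assert!(h < (1 << TABLE_BITS));
    println!("hash = {h:#x}");
}
```

Note that the CRC path and the scalar path produce different hash values; that is acceptable for a match finder, where the hash only needs to be consistent within a single compression run.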

1b. SSE4.2 CRC32 intrinsics for hash computation (encoding, x86_64)

Extend the same hash-mix strategy to x86_64 using SSE4.2 CRC32 (_mm_crc32_u64) with runtime feature gating.

Target: x86_64 SSE4.2 path in match_generator.rs with runtime (is_x86_feature_detected!("sse4.2")) and compile-time fallback.
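The x86_64 side can follow the same dispatch shape; again a sketch with illustrative names (`hash5`, `TABLE_BITS`), not the crate's actual code:

```rust
const TABLE_BITS: u32 = 17; // illustrative hash-table size

/// Scalar fallback, shared by all architectures.
fn hash5_scalar(bytes: u64) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87; // 64-bit golden-ratio constant
    ((bytes << 24).wrapping_mul(PRIME) >> (64 - TABLE_BITS)) as u32
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn hash5_crc(bytes: u64) -> u32 {
    use std::arch::x86_64::_mm_crc32_u64;
    _mm_crc32_u64(u64::MAX, bytes << 24) as u32 // mix only the 5 relevant bytes
}

#[cfg(target_arch = "x86_64")]
fn hash5(bytes: u64) -> u32 {
    if std::arch::is_x86_feature_detected!("sse4.2") {
        // SAFETY: guarded by the runtime feature check above.
        return unsafe { hash5_crc(bytes) } & ((1 << TABLE_BITS) - 1);
    }
    hash5_scalar(bytes)
}

#[cfg(not(target_arch = "x86_64"))]
fn hash5(bytes: u64) -> u32 {
    hash5_scalar(bytes)
}

fn main() {
    assert!(hash5(42) < (1 << TABLE_BITS));
    assert!(hash5(42) == hash5(42)); // deterministic within one run
}
```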

2. NEON wildcopy for decode buffer (decoding)

This overlaps with #68 (SIMD wildcopy) but specifies the ARM-specific implementation:

  • vld1q_u8 / vst1q_u8 for 16-byte bulk copy
  • vdupq_n_u8 for RLE broadcast (offset=1 overlap copy)
  • vqtbl1q_u8 for short-offset pattern repeat (offset 2-15)

Note: Implementation goes into #68's simd_copy.rs as the AArch64 backend. This issue tracks ARM-specific design decisions; #68 tracks the overall integration.
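A sketch of the 16-byte bulk-copy primitive described above; `wildcopy16` is a hypothetical name, and the non-AArch64 branch is a scalar stand-in so the sketch compiles anywhere:

```rust
/// Copy `len` bytes from `src` to `dst` in 16-byte steps, possibly writing
/// up to 15 bytes past `dst + len` (callers must reserve slack space, and
/// `src` must have at least the same rounded-up length readable).
#[cfg(target_arch = "aarch64")]
unsafe fn wildcopy16(mut src: *const u8, mut dst: *mut u8, len: usize) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};
    let end = dst.add(len);
    while dst < end {
        vst1q_u8(dst, vld1q_u8(src)); // unaligned 16-byte load/store
        src = src.add(16);
        dst = dst.add(16);
    }
}

#[cfg(not(target_arch = "aarch64"))]
unsafe fn wildcopy16(mut src: *const u8, mut dst: *mut u8, len: usize) {
    let end = dst.add(len);
    while dst < end {
        std::ptr::copy_nonoverlapping(src, dst, 16); // scalar stand-in
        src = src.add(16);
        dst = dst.add(16);
    }
}

fn main() {
    let src = [7u8; 64];
    let mut dst = vec![0u8; 40 + 15]; // 15 bytes of slack for the overcopy
    unsafe { wildcopy16(src.as_ptr(), dst.as_mut_ptr(), 40) };
    assert!(dst[..40].iter().all(|&b| b == 7));
}
```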

3. SVE2 histogram for frequency counting (encoding)

The SVE2 HISTCNT instruction counts matching elements across vector lanes in a single operation, enabling vectorized histogram updates. This accelerates:

  • Huffman frequency counting in huff0_encoder.rs
  • FSE symbol frequency analysis in fse_encoder.rs
  • Dictionary training frequency tracking in dictionary/frequency.rs

Available on: Armv9-A cores that implement SVE2, e.g. AWS Graviton 4 (Neoverse V2). SVE2 is part of the Armv9-A baseline, but not every recent ARM core exposes it, so the path must stay behind runtime feature detection.
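Rust's std::arch does not yet expose SVE2 intrinsics, so the HISTCNT path would likely go through inline assembly or a C shim. For reference, this is the scalar hot loop it is meant to accelerate (function name illustrative):

```rust
/// Count byte frequencies over `data` — the hot loop behind Huffman table
/// construction and FSE symbol analysis.
fn byte_histogram(data: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &b in data {
        // SVE2 HISTCNT would process a whole vector of symbols per
        // iteration, resolving duplicate lanes without scalar conflicts.
        counts[b as usize] += 1;
    }
    counts
}

fn main() {
    let counts = byte_histogram(b"abracadabra");
    assert_eq!(counts[b'a' as usize], 5);
    assert_eq!(counts[b'b' as usize], 2);
    println!(
        "{} distinct symbols",
        counts.iter().filter(|&&c| c > 0).count()
    );
}
```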

4. ARM prefetch (prfm)

Replace x86-only _mm_prefetch assumptions with ARM-native prefetch:

  • prfm pldl1keep for L1 temporal prefetch (literal data)
  • prfm pldl2keep for L2 prefetch (match back-references)
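A sketch of how the prfm hint could be issued from Rust via inline assembly; `prefetch_l1` is a hypothetical helper name, and the non-AArch64 branch keeps the existing no-op behavior:

```rust
/// Hint the core to pull the cache line at `ptr` into L1 (temporal).
/// Purely advisory: no architectural effect on data, safe on any address
/// within an allocated object.
#[inline(always)]
fn prefetch_l1(ptr: *const u8) {
    #[cfg(target_arch = "aarch64")]
    unsafe {
        // PLDL1KEEP: prefetch for load, L1, temporal (keep in cache).
        std::arch::asm!(
            "prfm pldl1keep, [{0}]",
            in(reg) ptr,
            options(nostack, readonly)
        );
    }
    #[cfg(not(target_arch = "aarch64"))]
    let _ = ptr; // no-op fallback on other architectures
}

fn main() {
    let buf = vec![0u8; 4096];
    prefetch_l1(buf.as_ptr()); // hint only; the read below is unaffected
    println!("first byte: {}", buf[0]);
}
```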

Acceptance criteria

  • CRC32 hash path used on AArch64 with runtime feature detection.
  • CRC32 hash path used on x86_64 SSE4.2 with runtime feature detection.
  • NEON wildcopy backend integrated into #68's simd_copy.rs architecture.
  • SVE2 histcnt path for frequency counting (gated behind feature detection).
  • ARM prefetch instructions replace the current no-op on AArch64.
  • All optimizations keep scalar fallback.
  • CPU-gated tests exist for CRC hash paths (AArch64 CRC and x86_64 SSE4.2).
  • Cross-validation tests pass; on architectures without a dedicated runner, CPU-gated tests must skip safely when the feature is absent.

Files involved

  • zstd/src/encoding/match_generator.rs (AArch64 CRC + x86_64 SSE4.2 CRC hash mix)
  • zstd/src/decoding/simd_copy.rs (NEON wildcopy, via #68)
  • zstd/src/huff0/huff0_encoder.rs (SVE2 histcnt)
  • zstd/src/fse/fse_encoder.rs (SVE2 histcnt)
  • zstd/src/decoding/prefetch.rs (ARM prefetch)

Dependencies

Estimate

2d 4h

Metadata

Assignees

No one assigned

    Labels

    P2-medium (Medium priority: important improvement), enhancement (New feature or request), performance (Performance optimization)
