Problem
All current platform-specific optimizations target x86-64 only (BMI2 bzhi, SSE _mm_prefetch). ARM/AArch64 falls back to scalar paths everywhere. As ARM becomes a primary deployment target (Apple M-series, AWS Graviton, Android), dedicated ARM code paths are needed.
Goal
Add architecture-specific optimizations across encode and decode paths, with runtime CPU-feature gating and scalar fallback.
Implementation plan
1. CRC32 intrinsics for hash computation (encoding, AArch64)
ARM has hardware CRC32 instructions (__crc32cw, __crc32cd) that can reduce hash-mix cost in fast matcher paths.
Target: AArch64 CRC path in match_generator.rs with runtime (is_aarch64_feature_detected!("crc")) and compile-time fallback.
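A minimal sketch of what the dispatch could look like. The function name `hash_u64`, the `bits` parameter, and the fallback constant are illustrative assumptions, not the crate's actual API; only `__crc32cd` and the feature-detection macro are real `std::arch` items.

```rust
// Hypothetical hash-mix dispatch; `hash_u64`/`bits` are illustrative names.
#[cfg(target_arch = "aarch64")]
fn hash_u64(value: u64, bits: u32) -> u32 {
    if std::arch::is_aarch64_feature_detected!("crc") {
        #[target_feature(enable = "crc")]
        unsafe fn crc_mix(v: u64) -> u32 {
            // CRC32-C over all 8 input bytes: one instruction, good mixing.
            std::arch::aarch64::__crc32cd(0, v)
        }
        // Safe to call: the "crc" feature was just verified at runtime.
        unsafe { crc_mix(value) >> (32 - bits) }
    } else {
        hash_u64_scalar(value, bits)
    }
}

#[cfg(not(target_arch = "aarch64"))]
fn hash_u64(value: u64, bits: u32) -> u32 {
    hash_u64_scalar(value, bits)
}

// Portable multiplicative-hash fallback; keeps the top `bits` bits
// (assumes 1 <= bits <= 32).
fn hash_u64_scalar(value: u64, bits: u32) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87;
    (value.wrapping_mul(PRIME) >> (64 - bits)) as u32
}
```

The runtime check sits outside the hot loop in practice; the matcher would select one of the two paths once per compression call.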
1b. SSE4.2 CRC32 intrinsics for hash computation (encoding, x86_64)
Extend the same hash-mix strategy to x86_64 using SSE4.2 CRC32 (_mm_crc32_u64) with runtime feature gating.
Target: x86_64 SSE4.2 path in match_generator.rs with runtime (is_x86_feature_detected!("sse4.2")) and compile-time fallback.
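The x86_64 variant mirrors the AArch64 sketch above; again, `hash_u64` and `bits` are illustrative names. `_mm_crc32_u64` takes and returns `u64` (the CRC lives in the low 32 bits), so the result is truncated before shifting.

```rust
// Hypothetical hash-mix dispatch for x86_64; names are illustrative.
#[cfg(target_arch = "x86_64")]
fn hash_u64(value: u64, bits: u32) -> u32 {
    if std::arch::is_x86_feature_detected!("sse4.2") {
        #[target_feature(enable = "sse4.2")]
        unsafe fn crc_mix(v: u64) -> u32 {
            // CRC32-C of the 8 input bytes; result fits in 32 bits.
            std::arch::x86_64::_mm_crc32_u64(0, v) as u32
        }
        // Safe to call: "sse4.2" was just verified at runtime.
        unsafe { crc_mix(value) >> (32 - bits) }
    } else {
        hash_u64_scalar(value, bits)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn hash_u64(value: u64, bits: u32) -> u32 {
    hash_u64_scalar(value, bits)
}

// Same portable fallback as on ARM (assumes 1 <= bits <= 32).
fn hash_u64_scalar(value: u64, bits: u32) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87;
    (value.wrapping_mul(PRIME) >> (64 - bits)) as u32
}
```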
2. NEON wildcopy for decode buffer (decoding)
This overlaps with #68 (SIMD wildcopy) but specifies the ARM-specific implementation:
vld1q_u8 / vst1q_u8 for 16-byte bulk copy
vdupq_n_u8 for RLE broadcast (offset=1 overlap copy)
vqtbl1q_u8 for short-offset pattern repeat (offset 2-15)
Note: Implementation goes into #68's simd_copy.rs as the AArch64 backend. This issue tracks ARM-specific design decisions; #68 tracks the overall integration.
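The 16-byte bulk-copy case could be sketched as follows. Function names and the dispatch shape are illustrative assumptions; the RLE (`vdupq_n_u8`) and short-offset (`vqtbl1q_u8`) cases are omitted for brevity. Note the usual wildcopy contract: the loop may write up to 15 bytes past `dst + len`, so the caller must reserve slack in both buffers.

```rust
/// Copy `len` bytes in 16-byte strides. May read/write up to 15 bytes
/// past `len`, so the caller must guarantee slack at the end of both
/// buffers (the standard wildcopy contract). Illustrative sketch.
unsafe fn wildcopy16(mut dst: *mut u8, mut src: *const u8, len: usize) {
    let end = dst.add(len);
    while dst < end {
        copy16(dst, src);
        dst = dst.add(16);
        src = src.add(16);
    }
}

#[cfg(target_arch = "aarch64")]
#[inline(always)]
unsafe fn copy16(dst: *mut u8, src: *const u8) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};
    // One 128-bit NEON load + store per 16-byte chunk.
    vst1q_u8(dst, vld1q_u8(src));
}

#[cfg(not(target_arch = "aarch64"))]
#[inline(always)]
unsafe fn copy16(dst: *mut u8, src: *const u8) {
    // Portable fallback; compilers usually lower this to SIMD anyway.
    std::ptr::copy_nonoverlapping(src, dst, 16);
}
```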
3. SVE2 histogram for frequency counting (encoding)
The SVE2 HISTCNT instruction computes a per-element histogram in a single operation. This accelerates:
Huffman frequency counting in huff0_encoder.rs
FSE symbol frequency analysis in fse_encoder.rs
Dictionary training frequency tracking in dictionary/frequency.rs
Available on: Graviton 3+, Apple M4+ (SVE2 mandatory in ARMv9).
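Stable Rust does not currently expose SVE/SVE2 intrinsics in std::arch, so a HISTCNT kernel would likely need inline assembly or a C shim behind runtime detection. The portable baseline it would replace, with the common multi-histogram trick to break counter store-to-load dependencies, looks roughly like this (illustrative sketch, not the crate's actual code):

```rust
// Portable scalar frequency count. Four interleaved sub-histograms
// reduce serial dependence on counter reloads, a common trick in
// entropy coders; an SVE2 HISTCNT kernel would replace this loop.
fn count_frequencies(data: &[u8]) -> [u32; 256] {
    let mut sub = [[0u32; 256]; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        sub[0][chunk[0] as usize] += 1;
        sub[1][chunk[1] as usize] += 1;
        sub[2][chunk[2] as usize] += 1;
        sub[3][chunk[3] as usize] += 1;
    }
    for &b in chunks.remainder() {
        sub[0][b as usize] += 1;
    }
    // Merge the sub-histograms into the final counts.
    let mut counts = [0u32; 256];
    for i in 0..256 {
        counts[i] = sub[0][i] + sub[1][i] + sub[2][i] + sub[3][i];
    }
    counts
}
```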
4. ARM prefetch (prfm)
Replace x86-only _mm_prefetch assumptions with ARM-native prefetch:
prfm pldl1keep for L1 temporal prefetch (literal data)
prfm pldl2keep for L2 prefetch (match back-references)
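There is no stable AArch64 prefetch intrinsic in std::arch, so the prfm hints would go through inline assembly. A hedged sketch of a cross-platform helper (the function name is illustrative; a real prefetch.rs would presumably expose L1 and L2 variants matching the hints above):

```rust
/// Best-effort prefetch of the cache line at `ptr` into L1 for reading.
/// Illustrative helper; name and placement are assumptions.
#[inline(always)]
fn prefetch_read_l1(ptr: *const u8) {
    #[cfg(target_arch = "aarch64")]
    unsafe {
        // prfm is a pure hint: no architectural side effects, no faults.
        core::arch::asm!(
            "prfm pldl1keep, [{p}]",
            p = in(reg) ptr,
            options(nostack, preserves_flags, readonly)
        );
    }
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8);
    }
    #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
    let _ = ptr; // no-op on other targets
}
```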
Acceptance criteria
CRC32 hash path used on AArch64 with runtime feature detection.
CRC32 hash path used on x86_64 SSE4.2 with runtime feature detection.
Files involved
zstd/src/encoding/match_generator.rs (AArch64 CRC + x86_64 SSE4.2 CRC hash mix)
zstd/src/decoding/simd_copy.rs (NEON wildcopy, via perf(decoding): SIMD wildcopy for literal and match memcpy #68)
zstd/src/huff0/huff0_encoder.rs (SVE2 histcnt)
zstd/src/fse/fse_encoder.rs (SVE2 histcnt)
zstd/src/decoding/prefetch.rs (ARM prefetch)
Dependencies
Estimate
2d 4h