
perf: ARM platform optimizations (CRC32 hash, NEON copy, SVE2 histcnt) #71

@polaz

Description

Problem

All current platform-specific optimizations target x86-64 only (BMI2 bzhi, SSE _mm_prefetch). ARM/AArch64 falls back to scalar paths everywhere. As ARM becomes a primary deployment target (Apple M-series, AWS Graviton, Android), dedicated ARM code paths are needed.

Goal

Add architecture-specific optimizations across encode and decode paths, with runtime CPU-feature gating and scalar fallback.

Implementation plan

1. CRC32 intrinsics for hash computation (encoding, AArch64)

ARM provides hardware CRC32 instructions, exposed through the __crc32cw / __crc32cd intrinsics, that can reduce hash-mix cost in fast matcher paths.

Target: AArch64 CRC path in match_generator.rs with runtime (is_aarch64_feature_detected!("crc")) and compile-time fallback.
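A minimal sketch of the dispatch shape, assuming a recent Rust toolchain where the AArch64 CRC intrinsics are stable; `hash5`, `TABLE_BITS`, and the scalar mixing constant are illustrative, not taken from match_generator.rs:

```rust
const TABLE_BITS: u32 = 17; // illustrative hash-table size

/// Scalar fallback: Fibonacci-style multiplicative hash of the low 5 bytes.
fn hash5_scalar(bytes: u64) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87; // 64-bit golden-ratio constant
    ((bytes << 24).wrapping_mul(PRIME) >> (64 - TABLE_BITS)) as u32
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "crc")]
unsafe fn hash5_crc(bytes: u64) -> u32 {
    use std::arch::aarch64::__crc32cd;
    __crc32cd(!0u32, bytes << 24) // mix only the 5 relevant bytes
}

#[cfg(target_arch = "aarch64")]
fn hash5(bytes: u64) -> u32 {
    if std::arch::is_aarch64_feature_detected!("crc") {
        // SAFETY: guarded by the runtime feature check above.
        return unsafe { hash5_crc(bytes) } & ((1 << TABLE_BITS) - 1);
    }
    hash5_scalar(bytes)
}

#[cfg(not(target_arch = "aarch64"))]
fn hash5(bytes: u64) -> u32 {
    hash5_scalar(bytes)
}

fn main() {
    let h = hash5(0x0000_0012_3456_789A);
    assert!(h < (1 << TABLE_BITS));
    println!("hash = {h:#x}");
}
```

Note that the CRC path and the scalar path produce different hash values; that is acceptable for a match finder, where the hash only needs to be consistent within a single compression run.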

1b. SSE4.2 CRC32 intrinsics for hash computation (encoding, x86_64)

Extend the same hash-mix strategy to x86_64 using SSE4.2 CRC32 (_mm_crc32_u64) with runtime feature gating.

Target: x86_64 SSE4.2 path in match_generator.rs with runtime (is_x86_feature_detected!("sse4.2")) and compile-time fallback.
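The x86_64 side can follow the same dispatch shape; again a sketch with illustrative names (`hash5`, `TABLE_BITS`), not the crate's actual code:

```rust
const TABLE_BITS: u32 = 17; // illustrative hash-table size

/// Scalar fallback, shared by all architectures.
fn hash5_scalar(bytes: u64) -> u32 {
    const PRIME: u64 = 0x9E37_79B1_85EB_CA87; // 64-bit golden-ratio constant
    ((bytes << 24).wrapping_mul(PRIME) >> (64 - TABLE_BITS)) as u32
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn hash5_crc(bytes: u64) -> u32 {
    use std::arch::x86_64::_mm_crc32_u64;
    _mm_crc32_u64(u64::MAX, bytes << 24) as u32 // mix only the 5 relevant bytes
}

#[cfg(target_arch = "x86_64")]
fn hash5(bytes: u64) -> u32 {
    if std::arch::is_x86_feature_detected!("sse4.2") {
        // SAFETY: guarded by the runtime feature check above.
        return unsafe { hash5_crc(bytes) } & ((1 << TABLE_BITS) - 1);
    }
    hash5_scalar(bytes)
}

#[cfg(not(target_arch = "x86_64"))]
fn hash5(bytes: u64) -> u32 {
    hash5_scalar(bytes)
}

fn main() {
    assert!(hash5(42) < (1 << TABLE_BITS));
    assert!(hash5(42) == hash5(42)); // deterministic within one run
}
```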

2. NEON wildcopy for decode buffer (decoding)

This overlaps with #68 (SIMD wildcopy) but specifies the ARM-specific implementation:

  • vld1q_u8 / vst1q_u8 for 16-byte bulk copy
  • vdupq_n_u8 for RLE broadcast (offset=1 overlap copy)
  • vqtbl1q_u8 for short-offset pattern repeat (offset 2-15)

Note: Implementation goes into #68's simd_copy.rs as the AArch64 backend. This issue tracks ARM-specific design decisions; #68 tracks the overall integration.
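A sketch of the 16-byte bulk-copy primitive described above; `wildcopy16` is a hypothetical name, and the non-AArch64 branch is a scalar stand-in so the sketch compiles anywhere:

```rust
/// Copy `len` bytes from `src` to `dst` in 16-byte steps, possibly writing
/// up to 15 bytes past `dst + len` (callers must reserve slack space, and
/// `src` must have at least the same rounded-up length readable).
#[cfg(target_arch = "aarch64")]
unsafe fn wildcopy16(mut src: *const u8, mut dst: *mut u8, len: usize) {
    use std::arch::aarch64::{vld1q_u8, vst1q_u8};
    let end = dst.add(len);
    while dst < end {
        vst1q_u8(dst, vld1q_u8(src)); // unaligned 16-byte load/store
        src = src.add(16);
        dst = dst.add(16);
    }
}

#[cfg(not(target_arch = "aarch64"))]
unsafe fn wildcopy16(mut src: *const u8, mut dst: *mut u8, len: usize) {
    let end = dst.add(len);
    while dst < end {
        std::ptr::copy_nonoverlapping(src, dst, 16); // scalar stand-in
        src = src.add(16);
        dst = dst.add(16);
    }
}

fn main() {
    let src = [7u8; 64];
    let mut dst = vec![0u8; 40 + 15]; // 15 bytes of slack for the overcopy
    unsafe { wildcopy16(src.as_ptr(), dst.as_mut_ptr(), 40) };
    assert!(dst[..40].iter().all(|&b| b == 7));
}
```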

3. SVE2 histogram for frequency counting (encoding)

The SVE2 HISTCNT instruction counts matching elements across vector lanes in a single operation, enabling vectorized histogram updates. This accelerates:

  • Huffman frequency counting in huff0_encoder.rs
  • FSE symbol frequency analysis in fse_encoder.rs
  • Dictionary training frequency tracking in dictionary/frequency.rs

Available on: Armv9-A cores that implement SVE2, e.g. AWS Graviton 4 (Neoverse V2). SVE2 is part of the Armv9-A baseline, but not every recent ARM core exposes it, so the path must stay behind runtime feature detection.
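Rust's std::arch does not yet expose SVE2 intrinsics, so the HISTCNT path would likely go through inline assembly or a C shim. For reference, this is the scalar hot loop it is meant to accelerate (function name illustrative):

```rust
/// Count byte frequencies over `data` — the hot loop behind Huffman table
/// construction and FSE symbol analysis.
fn byte_histogram(data: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &b in data {
        // SVE2 HISTCNT would process a whole vector of symbols per
        // iteration, resolving duplicate lanes without scalar conflicts.
        counts[b as usize] += 1;
    }
    counts
}

fn main() {
    let counts = byte_histogram(b"abracadabra");
    assert_eq!(counts[b'a' as usize], 5);
    assert_eq!(counts[b'b' as usize], 2);
    println!(
        "{} distinct symbols",
        counts.iter().filter(|&&c| c > 0).count()
    );
}
```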

4. ARM prefetch (prfm)

Replace x86-only _mm_prefetch assumptions with ARM-native prefetch:

  • prfm pldl1keep for L1 temporal prefetch (literal data)
  • prfm pldl2keep for L2 prefetch (match back-references)
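A sketch of how the prfm hint could be issued from Rust via inline assembly; `prefetch_l1` is a hypothetical helper name, and the non-AArch64 branch keeps the existing no-op behavior:

```rust
/// Hint the core to pull the cache line at `ptr` into L1 (temporal).
/// Purely advisory: no architectural effect on data, safe on any address
/// within an allocated object.
#[inline(always)]
fn prefetch_l1(ptr: *const u8) {
    #[cfg(target_arch = "aarch64")]
    unsafe {
        // PLDL1KEEP: prefetch for load, L1, temporal (keep in cache).
        std::arch::asm!(
            "prfm pldl1keep, [{0}]",
            in(reg) ptr,
            options(nostack, readonly)
        );
    }
    #[cfg(not(target_arch = "aarch64"))]
    let _ = ptr; // no-op fallback on other architectures
}

fn main() {
    let buf = vec![0u8; 4096];
    prefetch_l1(buf.as_ptr()); // hint only; the read below is unaffected
    println!("first byte: {}", buf[0]);
}
```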

Acceptance criteria

  • CRC32 hash path used on AArch64 with runtime feature detection.
  • CRC32 hash path used on x86_64 SSE4.2 with runtime feature detection.
  • NEON wildcopy backend integrated into #68's simd_copy.rs architecture.
  • SVE2 histcnt path for frequency counting (gated behind feature detection).
  • ARM prefetch instructions replace the current no-op on AArch64.
  • All optimizations keep scalar fallback.
  • CPU-gated tests exist for CRC hash paths (AArch64 CRC and x86_64 SSE4.2).
  • Cross-validation tests pass; on architectures without a dedicated runner, CPU-gated tests must skip safely when the feature is absent.

Files involved

  • zstd/src/encoding/match_generator.rs (AArch64 CRC + x86_64 SSE4.2 CRC hash mix)
  • zstd/src/decoding/simd_copy.rs (NEON wildcopy, via #68)
  • zstd/src/huff0/huff0_encoder.rs (SVE2 histcnt)
  • zstd/src/fse/fse_encoder.rs (SVE2 histcnt)
  • zstd/src/decoding/prefetch.rs (ARM prefetch)

Dependencies

Estimate

2d 4h

Metadata

Assignees

No one assigned

    Labels

    P2-medium (Medium priority: important improvement), enhancement (New feature or request), performance (Performance optimization)
