
perf(wasm): SIMD-vectorize 8 inference DSP hot loops #683

Open
czoli1976 wants to merge 4 commits into Rikorose:main from czoli1976:ship-upstream/wasm-compute-band-corr-simd

Conversation


@czoli1976 czoli1976 commented May 3, 2026

Summary

Adds f32x4 SIMD vectorization (via core::arch::wasm32 intrinsics under #[cfg(target_arch = "wasm32")]) for eight loops on which auto-vectorisation did not fire, even with +simd128. All eight are called per-frame inside df_process_frame. Native scalar fallbacks are retained via #[cfg(not(target_arch = "wasm32"))] for every helper.

| Loop | Hot path | What it does |
| --- | --- | --- |
| `compute_band_corr` | `feat_erb` | Sum + scale ERB-band cross-correlation |
| `band_mean_norm_erb` | `feat_erb` | Per-bin IIR mean-norm |
| `band_unit_norm` | `feat_cplx` | Per-bin `Complex32` unit-norm IIR + interleaved divide |
| `band_unit_norm_t` | `feat_cplx_t` | Same norm, writes to split-halves output |
| `apply_band_gain` | `DFState::apply_mask` (post-network) | `Complex32` × f32 mul-in-place per ERB band |
| `apply_window_in_place` | `frame_synthesis` | f32 mul-in-place |
| Inline overlap-add to output | `frame_synthesis` | `out[i] = x_first[i] + synthesis_mem[i]` |
| Inline `synthesis_mem` update | `frame_synthesis` | `s_first[i] += xs_first[i]` (overlap for next frame) |

The band_unit_norm* pair was the trickiest: load 4 Complex32 as 2 v128s, use i32x4_shuffle to build pure-real and pure-imag vectors, compute (re²+im²).sqrt() lane-wise, update state, then divide xs by sqrt(state). For band_unit_norm the divisor is re-interleaved via two more shuffles to match the [re,im,re,im] layout; for band_unit_norm_t the output halves are contiguous so no re-interleave is needed.
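The shuffle strategy above can be modelled in portable scalar Rust. This is an illustrative sketch, not the PR's code: the `re`/`im` arrays stand in for the shuffled v128 lanes, and the alpha-blend form of the IIR state update is an assumption for illustration (the real update follows libDF's `band_unit_norm`).

```rust
// Scalar model of the wasm32 f32x4 strategy in band_unit_norm: load 4
// Complex32 (8 f32) as two 4-lane vectors, shuffle lanes [0,2,4,6] into a
// pure-real vector and [1,3,5,7] into a pure-imag vector, compute the norm
// lane-wise, update the per-bin IIR state, then divide xs by sqrt(state).
// `alpha` and the blend formula are illustrative, not the crate's actual code.
fn band_unit_norm_sketch(xs: &mut [f32], state: &mut [f32], alpha: f32) {
    // xs is the [re, im, re, im, ...] view of &mut [Complex32].
    for (chunk, st) in xs.chunks_exact_mut(8).zip(state.chunks_exact_mut(4)) {
        // "Shuffle": lanes 0,2,4,6 -> reals; lanes 1,3,5,7 -> imags.
        let re = [chunk[0], chunk[2], chunk[4], chunk[6]];
        let im = [chunk[1], chunk[3], chunk[5], chunk[7]];
        for lane in 0..4 {
            // Lane-wise norm, no libm::hypotf.
            let norm = (re[lane] * re[lane] + im[lane] * im[lane]).sqrt();
            // Per-bin IIR update (no recurrence between bins); illustrative form.
            st[lane] = norm * alpha + st[lane] * (1.0 - alpha);
            let div = st[lane].sqrt();
            // Re-interleave the divisor to match the [re, im] xs layout.
            chunk[2 * lane] /= div;
            chunk[2 * lane + 1] /= div;
        }
    }
    // Scalar tail for len % 4 trailing bins omitted for brevity.
}
```

For `band_unit_norm_t` the same norm computation writes to contiguous real/imag halves, so the re-interleave of the divisor drops out.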

apply_band_gain is structurally identical to the public generic apply_interp_band_gain<T> for T=Complex32. The SIMD version reinterprets &mut [Complex32] as &mut [f32] of length 2N and multiplies each f32 lane by the band scalar b. DFState::apply_mask is redirected to call apply_band_gain directly so the inference path picks up the SIMD version; apply_interp_band_gain<T> stays generic for its transforms.rs (training-side) callers.
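The reinterpret step can be sketched in portable Rust. Illustrative only: `Complex32` here is a local #[repr(C)] stand-in for the crate's type, and on wasm32 the inner loop becomes 4-wide f32x4 multiplies instead of the scalar loop shown.

```rust
// Sketch of the reinterpret trick in apply_band_gain: view &mut [Complex32]
// as &mut [f32] of length 2N, then multiply every f32 lane by the band
// scalar b.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Complex32 { re: f32, im: f32 } // stand-in for the crate's Complex32

fn apply_band_gain_sketch(band: &mut [Complex32], b: f32) {
    // SAFETY: Complex32 is #[repr(C)] with two f32 fields, so a slice of N
    // Complex32 has the same layout as a slice of 2N f32.
    let flat: &mut [f32] = unsafe {
        core::slice::from_raw_parts_mut(band.as_mut_ptr() as *mut f32, band.len() * 2)
    };
    for x in flat.iter_mut() {
        *x *= b; // 4-wide f32x4_mul per iteration on wasm32, scalar elsewhere
    }
}
```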

apply_window_in_place's signature changed from the generic IntoIterator<Item=&'a f32> to &[f32]. The sole call site already passed a slice, so no call-site changes were needed.

The two frame_synthesis inline loops vectorised in commit 4 are interesting. LLVM auto-vectorises the analogous izip!() loops in frame_analysis (I left those untouched; a negative bench result confirmed explicit SIMD adds nothing there), but the nested zip().zip() iterator pattern in frame_synthesis defeats auto-vectorization. Same SIMD pattern (4-wide f32x4_add); biggest single-commit speedup of the bundle.

Why

wasm-objdump on a release wasm32 build (+simd128 enabled) shows the eight loop bodies emitted entirely in scalar f32.load / f32.mul / f32.add / f32.sqrt ops — auto-vec didn't fire. They're each called per-frame in df_process_frame, so the cost compounds.

The SIMD shape is the standard f32x4 reduction / element-wise pattern: chunked 4-wide v128_load + arithmetic + v128_store, with a scalar tail (0–3 trailing elements) handled at the end of each helper.
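That shape can be modelled as a scalar dot-product reduction (illustrative, not the PR's code; on wasm32 the four partial sums live in a single v128 accumulator with a horizontal sum at the end):

```rust
// The general shape used by the eight helpers, shown for a reduction:
// process 4 lanes per iteration (v128_load + f32x4 arithmetic on wasm32)
// and finish the 0-3 trailing elements with a scalar tail.
fn dot_4wide(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let n4 = a.len() / 4 * 4;
    let mut acc = [0.0f32; 4]; // models the f32x4 accumulator
    for (ca, cb) in a[..n4].chunks_exact(4).zip(b[..n4].chunks_exact(4)) {
        for lane in 0..4 {
            acc[lane] += ca[lane] * cb[lane]; // f32x4_mul + f32x4_add
        }
    }
    // Horizontal sum of the 4 lanes, then the scalar tail.
    let mut sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (x, y) in a[n4..].iter().zip(&b[n4..]) {
        sum += x * y;
    }
    sum
}
```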

For band_unit_norm*, I used (re²+im²).sqrt() directly instead of Complex32::norm()'s libm::hypotf. For DFN3's audio-spectrum magnitudes (no overflow/underflow regime), both produce identical bits, verified by the FNV-1a hash equality below.
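The distinction hypot exists for can be seen directly: the naive form overflows in the intermediate squares where hypot does not, while both agree on in-range magnitudes. An illustrative check, not a proof of bit-equality in general:

```rust
// Naive complex norm without libm hypot: fine for audio-spectrum-scale
// magnitudes, but re*re can overflow f32 for extreme inputs where
// hypot's rescaling keeps the result finite.
fn naive_norm(re: f32, im: f32) -> f32 {
    (re * re + im * im).sqrt()
}
```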

Test plan — quality gate

Bit-identical output verified across all 8 vectorisations via FNV-1a hash of the df_process_frame output stream over 3000 frames of deterministic seeded input. Same hash across every commit:

| Variant | FNV-1a hash |
| --- | --- |
| Rikorose main (baseline, no SIMD) | `53ae8dfc3595faf0` |
| `compute_band_corr` only (commit 1) | `53ae8dfc3595faf0` |
| 4 functions vectorised (commit 2) | `53ae8dfc3595faf0` |
| 6 functions vectorised (commit 3) | `53ae8dfc3595faf0` |
| 8 loops vectorised (commit 4) | `53ae8dfc3595faf0` |

The hash matched across multiple independent bench runs on Node v20.11.1 / V8.

  • Native build: cargo check -p deep_filter --features tract,default-model passes clean (only 4 pre-existing lifetime-style warnings).
  • WASM build: wasm-pack build libDF --target no-modules --release --features wasm succeeds. wasm-opt -Oz delta over baseline: +1489 bytes total for all 8 vectorisations (commit 4 is actually -24 bytes vs commit 3 because copy_from_slice emits less code than the explicit loop it replaces).
  • Bit-identical output on N=3000 deterministic frames, verified above.
  • Browser engines: not yet re-tested at the 8-loop commit (was tested for compute_band_corr alone on Chromium / WebKit / Firefox single + 4-thread). Happy to re-run if you'd like before merge — the change is wasm32-target-conditional, so engine differences should be minimal.

Speed (RTF)

Same-machine bench, Node v20.11.1, 3000 frames per run, multiple independent runs:

| Stage | Median RTF (over 5+ runs) | Δ vs Rikorose main |
| --- | --- | --- |
| Rikorose main (no SIMD) | ~0.0608 | (baseline) |
| `compute_band_corr` only (commit 1) | ~0.0598 | -1.6% |
| 4 functions (commit 2) | ~0.0598 | -1.7% |
| 6 functions (commit 3) | ~0.0594 | -2.3% |
| 8 loops (this PR head) | ~0.0570 | ~-6% |

The 4th commit (frame_synthesis loops) gives the biggest single speedup of the bundle. Variance: typical run-to-run jitter on a single machine is ~2-3%, so absolute deltas under 1% should be read as direction-only. Median is more reliable than mean here (one outlier run hit a CPU spike).

The frame_synthesis improvements come from real auto-vectorization gaps (LLVM didn't pick up the nested zip().zip() pattern), not from the marginal libDF helpers — the headline gain is concentrated in commit 4. The first 6 functions add a smaller but consistent ~1-2% on top.

Caveats

  • 4 commits — happy to split per-loop or squash if you prefer that shape for review.
  • The redirect of DFState::apply_mask from apply_interp_band_gain<Complex32> to apply_band_gain is a behavioural no-op (the two functions had identical bodies) but worth flagging as a code-organisation change.
  • band_unit_norm* uses (re²+im²).sqrt() instead of Complex32::norm(). Bit-equality verified empirically; the two diverge only on extreme inputs (near f32::MAX or denormal regime) which DFN3's signal path doesn't hit.
  • Skipped post_filter even though it's hot: it has scalar .sin() calls per element and SIMD trig requires polynomial approximations that wouldn't preserve bit-equality.
  • Skipped the frame_analysis windowing loops (different from the synthesis ones in this PR): empirical bench showed they're already auto-vectorised by LLVM, so explicit SIMD added function-call overhead without benefit. Negative result; not included.

czoli1976 and others added 2 commits May 3, 2026 16:20
Hot loop on the per-frame ERB feature path: dot-product over a band
of Complex32 against itself (or a reference). The wasm32 build with
`+simd128` was leaving this loop scalar — `wasm-objdump` shows zero
v128 ops for the function body in the production build.

Replace the inner accumulator with a 4-wide f32x4 reduction using
`core::arch::wasm32` intrinsics. Output is bit-exact identical
(FNV-1a 20ea4579c427f925 unchanged across Chromium / WebKit /
Firefox, single-threaded and 4-thread).

Same-machine focused bench, Chromium, 5-run alternated, 300 iter
× 20 frames per measurement (t-test):
  vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity)
  my_mt_1t:             3.748 -> 3.723 ms (-0.67%, t=2.22)
  my_mt_4t:             4.679 -> 4.646 ms (-0.71%, t=2.45)

Native builds use the existing scalar reduction via cfg gating;
no behaviour change off wasm32.
Adds f32x4 vectorization for three more hot DSP functions in the
df_process_frame inference path, on top of the compute_band_corr
work in this PR's first commit:

  * band_mean_norm_erb (called from feat_erb per frame): per-bin
    IIR mean-norm. State is per-bin (no recurrence between bins) so
    straightforward 4-wide SIMD over all ERB bins.

  * apply_band_gain (called from apply_mask post-network): Complex32
    x f32 scalar mul-in-place per ERB band. Reinterprets
    &mut [Complex32] as &mut [f32] of length 2N (Complex32 is
    #[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies.
    Also redirects DFState::apply_mask to call apply_band_gain (the
    Complex32 specialisation) instead of the generic
    apply_interp_band_gain<T>, since the existing apply_band_gain
    function is already structurally identical.

  * apply_window_in_place (called from frame_synthesis per frame):
    f32 mul-in-place. Signature changed from generic
    IntoIterator<Item=&'a f32> to &[f32] (the sole caller already
    passes &state.window which IS a slice). 4-wide SIMD multiplies.

Each function keeps the original scalar implementation as the
non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].

Bit-identical output verified: FNV-1a hash of df_process_frame
output stream over 3000 random frames matches the Rikorose main
baseline exactly across all 3 independent bench runs on
Node v20.11.1 / V8.

Wasm size delta vs baseline: +835 bytes total (compute_band_corr
+699; the 3 new helpers add net +136 bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize compute_band_corr inner loop perf(wasm): SIMD-vectorize 4 inference DSP hot loops May 3, 2026
Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx /
feat_cplx_t (called per frame inside df_process_frame).

The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...]
layout so we can compute the per-bin norm (sqrt(re^2 + im^2))
lane-wise. Strategy: load 4 Complex32 (8 f32) as 2 v128s, use
i32x4_shuffle to build pure-real and pure-imag vectors, compute
norm in 4-wide SIMD, update state, then divide xs by sqrt(state).

  * band_unit_norm (xs: &mut [Complex32]) — re-interleaves the
    per-bin sqrt(state) divisor via two i32x4_shuffles to match
    the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32)
    at a time.

  * band_unit_norm_t (xs: &[Complex32], out: &mut [f32]) — same
    norm computation but writes to o_re / o_im split halves of
    out (CONTIGUOUS), so no re-interleave step is needed for the
    divide.

Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm
hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow
regime), both produce identical bits — verified by FNV-1a hash of
df_process_frame output stream over N=3000 deterministic random
frames matching baseline exactly across 5 independent runs on
Node v20.11.1 / V8.

Wasm size delta: +678 bytes vs the 4-function bundle commit.
Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize 4 inference DSP hot loops perf(wasm): SIMD-vectorize 6 inference DSP hot loops May 3, 2026
Three more loops in frame_synthesis emit scalar code on wasm32
despite +simd128 (unlike the frame_analysis windowing loops which
LLVM auto-vec'd; something about the nested zip().zip() iterator
pattern in frame_synthesis vs the izip!() pattern in frame_analysis
defeats auto-vectorization).

Three changes:

  * out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output)
    — new f32_add_to(a, b, out) helper, three-slice element-wise
    add via 4-wide v128 + f32x4_add.

  * s_first[i] += xs_first[i] (overlap-add for next frame, in-place)
    — new f32_add_inplace(xs, ys) helper, two-slice element-wise
    in-place add.

  * s_second[i] = xs_second[i] (override left-shifted buffer)
    — replaced the explicit loop with copy_from_slice; the compiler
    likely emitted memcpy already, but the stdlib idiom is clearer
    and lets the optimiser pick the best implementation.
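The first two changes can be modelled in portable scalar Rust. This sketch mirrors the described f32_add_to shape (4 lanes per iteration, 0-3 element scalar tail); it is an illustration, not the PR's actual helper:

```rust
// Scalar model of an f32_add_to-style helper: out[i] = a[i] + b[i],
// three-slice element-wise add. On wasm32 each 4-lane iteration is two
// v128_loads, one f32x4_add, and one v128_store.
fn f32_add_to_sketch(a: &[f32], b: &[f32], out: &mut [f32]) {
    debug_assert!(a.len() == b.len() && a.len() == out.len());
    let n4 = a.len() / 4 * 4;
    for i in (0..n4).step_by(4) {
        for lane in 0..4 {
            out[i + lane] = a[i + lane] + b[i + lane]; // f32x4_add on wasm32
        }
    }
    for i in n4..a.len() {
        out[i] = a[i] + b[i]; // scalar tail, 0-3 elements
    }
}
```

The in-place variant (s_first[i] += xs_first[i]) follows the same shape with two slices instead of three.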

Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0
unchanged across N=3000 deterministic frames over 6 independent
bench runs.

Speed: median bundle_synth vs the previous 6-function bundle is
-1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to
-11% additional gain (those runs had background CPU activity that
hit the previous bundle harder). Real direction, modest absolute
gain, no quality cost.

Wasm size delta: -24 bytes vs previous bundle (copy_from_slice
emits less code than the explicit loop). Net total: +1489 bytes
over the no-SIMD baseline for all 8 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize 6 inference DSP hot loops perf(wasm): SIMD-vectorize 8 inference DSP hot loops May 4, 2026