
perf(wasm): SIMD-vectorize 8 inference DSP hot loops #683

Open
czoli1976 wants to merge 4 commits into Rikorose:main from czoli1976:ship-upstream/wasm-compute-band-corr-simd

Conversation


@czoli1976 czoli1976 commented May 3, 2026

Summary

Adds f32x4 SIMD vectorization (via core::arch::wasm32 intrinsics under #[cfg(target_arch = "wasm32")]) for eight loops on which auto-vectorisation did not fire, even with +simd128. All eight are called per-frame inside df_process_frame. Native scalar fallbacks are retained via #[cfg(not(target_arch = "wasm32"))] for every helper.

| Loop | Hot path | What it does |
| --- | --- | --- |
| `compute_band_corr` | `feat_erb` | Sum + scale ERB-band cross-correlation |
| `band_mean_norm_erb` | `feat_erb` | Per-bin IIR mean-norm |
| `band_unit_norm` | `feat_cplx` | Per-bin `Complex32` unit-norm IIR + interleaved divide |
| `band_unit_norm_t` | `feat_cplx_t` | Same norm, writes to split-halves output |
| `apply_band_gain` | `DFState::apply_mask` (post-network) | `Complex32` × f32 mul-in-place per ERB band |
| `apply_window_in_place` | `frame_synthesis` | f32 mul-in-place |
| Inline overlap-add to output | `frame_synthesis` | `out[i] = x_first[i] + synthesis_mem[i]` |
| Inline `synthesis_mem` update | `frame_synthesis` | `s_first[i] += xs_first[i]` (overlap for next frame) |

The band_unit_norm* pair was the trickiest: load 4 Complex32 as 2 v128s, use i32x4_shuffle to build pure-real and pure-imag vectors, compute (re²+im²).sqrt() lane-wise, update state, then divide xs by sqrt(state). For band_unit_norm the divisor is re-interleaved via two more shuffles to match the [re,im,re,im] layout; for band_unit_norm_t the output halves are contiguous so no re-interleave is needed.
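The shuffle strategy above can be modelled in portable scalar Rust. This is an illustrative sketch, not the PR's code: the `re`/`im` arrays stand in for the shuffled v128 lanes, and the alpha-blend form of the IIR state update is an assumption for illustration (the real update follows libDF's `band_unit_norm`).

```rust
// Scalar model of the wasm32 f32x4 strategy in band_unit_norm: load 4
// Complex32 (8 f32) as two 4-lane vectors, shuffle lanes [0,2,4,6] into a
// pure-real vector and [1,3,5,7] into a pure-imag vector, compute the norm
// lane-wise, update the per-bin IIR state, then divide xs by sqrt(state).
// `alpha` and the blend formula are illustrative, not the crate's actual code.
fn band_unit_norm_sketch(xs: &mut [f32], state: &mut [f32], alpha: f32) {
    // xs is the [re, im, re, im, ...] view of &mut [Complex32].
    for (chunk, st) in xs.chunks_exact_mut(8).zip(state.chunks_exact_mut(4)) {
        // "Shuffle": lanes 0,2,4,6 -> reals; lanes 1,3,5,7 -> imags.
        let re = [chunk[0], chunk[2], chunk[4], chunk[6]];
        let im = [chunk[1], chunk[3], chunk[5], chunk[7]];
        for lane in 0..4 {
            // Lane-wise norm, no libm::hypotf.
            let norm = (re[lane] * re[lane] + im[lane] * im[lane]).sqrt();
            // Per-bin IIR update (no recurrence between bins); illustrative form.
            st[lane] = norm * alpha + st[lane] * (1.0 - alpha);
            let div = st[lane].sqrt();
            // Re-interleave the divisor to match the [re, im] xs layout.
            chunk[2 * lane] /= div;
            chunk[2 * lane + 1] /= div;
        }
    }
    // Scalar tail for len % 4 trailing bins omitted for brevity.
}
```

For `band_unit_norm_t` the same norm computation writes to contiguous real/imag halves, so the re-interleave of the divisor drops out.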

apply_band_gain is structurally identical to the public generic apply_interp_band_gain<T> for T=Complex32. The SIMD version reinterprets &mut [Complex32] as &mut [f32] of length 2N and multiplies each f32 lane by the band scalar b. DFState::apply_mask is redirected to call apply_band_gain directly so the inference path picks up the SIMD version; apply_interp_band_gain<T> stays generic for its transforms.rs (training-side) callers.
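The reinterpret step can be sketched in portable Rust. Illustrative only: `Complex32` here is a local #[repr(C)] stand-in for the crate's type, and on wasm32 the inner loop becomes 4-wide f32x4 multiplies instead of the scalar loop shown.

```rust
// Sketch of the reinterpret trick in apply_band_gain: view &mut [Complex32]
// as &mut [f32] of length 2N, then multiply every f32 lane by the band
// scalar b.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Complex32 { re: f32, im: f32 } // stand-in for the crate's Complex32

fn apply_band_gain_sketch(band: &mut [Complex32], b: f32) {
    // SAFETY: Complex32 is #[repr(C)] with two f32 fields, so a slice of N
    // Complex32 has the same layout as a slice of 2N f32.
    let flat: &mut [f32] = unsafe {
        core::slice::from_raw_parts_mut(band.as_mut_ptr() as *mut f32, band.len() * 2)
    };
    for x in flat.iter_mut() {
        *x *= b; // 4-wide f32x4_mul per iteration on wasm32, scalar elsewhere
    }
}
```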

apply_window_in_place's signature changed from the generic IntoIterator<Item=&'a f32> to &[f32]. The sole call site already passed a slice, so no call-site changes were needed.

The two frame_synthesis inline loops vectorised in commit 4 are interesting. LLVM auto-vectorises the analogous izip!() loops in frame_analysis (I left those untouched; a negative bench result confirmed explicit SIMD adds nothing there), but the nested zip().zip() iterator pattern in frame_synthesis defeats auto-vectorization. Same SIMD pattern (4-wide f32x4_add); biggest single-commit speedup of the bundle.

Why

wasm-objdump on a release wasm32 build (+simd128 enabled) shows the eight loop bodies emitted entirely in scalar f32.load / f32.mul / f32.add / f32.sqrt ops — auto-vec didn't fire. They're each called per-frame in df_process_frame, so the cost compounds.

The SIMD shape is the standard f32x4 reduction / element-wise pattern: chunked 4-wide v128_load + arithmetic + v128_store, with a scalar tail (0–3 trailing elements) handled at the end of each helper.
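That shape can be modelled as a scalar dot-product reduction (illustrative, not the PR's code; on wasm32 the four partial sums live in a single v128 accumulator with a horizontal sum at the end):

```rust
// The general shape used by the eight helpers, shown for a reduction:
// process 4 lanes per iteration (v128_load + f32x4 arithmetic on wasm32)
// and finish the 0-3 trailing elements with a scalar tail.
fn dot_4wide(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let n4 = a.len() / 4 * 4;
    let mut acc = [0.0f32; 4]; // models the f32x4 accumulator
    for (ca, cb) in a[..n4].chunks_exact(4).zip(b[..n4].chunks_exact(4)) {
        for lane in 0..4 {
            acc[lane] += ca[lane] * cb[lane]; // f32x4_mul + f32x4_add
        }
    }
    // Horizontal sum of the 4 lanes, then the scalar tail.
    let mut sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (x, y) in a[n4..].iter().zip(&b[n4..]) {
        sum += x * y;
    }
    sum
}
```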

For band_unit_norm*, I used (re²+im²).sqrt() directly instead of Complex32::norm()'s libm::hypotf. For DFN3's audio-spectrum magnitudes (no overflow/underflow regime), both produce identical bits, verified by the FNV-1a hash equality below.
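The distinction hypot exists for can be seen directly: the naive form overflows in the intermediate squares where hypot does not, while both agree on in-range magnitudes. An illustrative check, not a proof of bit-equality in general:

```rust
// Naive complex norm without libm hypot: fine for audio-spectrum-scale
// magnitudes, but re*re can overflow f32 for extreme inputs where
// hypot's rescaling keeps the result finite.
fn naive_norm(re: f32, im: f32) -> f32 {
    (re * re + im * im).sqrt()
}
```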

Test plan — quality gate

Bit-identical output verified across all 8 vectorisations via FNV-1a hash of the df_process_frame output stream over 3000 frames of deterministic seeded input. Same hash across every commit:

| Variant | FNV-1a hash |
| --- | --- |
| Rikorose main (baseline, no SIMD) | `53ae8dfc3595faf0` |
| `compute_band_corr` only (commit 1) | `53ae8dfc3595faf0` |
| 4 functions vectorised (commit 2) | `53ae8dfc3595faf0` |
| 6 functions vectorised (commit 3) | `53ae8dfc3595faf0` |
| 8 loops vectorised (commit 4) | `53ae8dfc3595faf0` |

The hash matched across multiple independent bench runs on Node v20.11.1 / V8.

  • Native build: cargo check -p deep_filter --features tract,default-model passes clean (only 4 pre-existing lifetime-style warnings).
  • WASM build: wasm-pack build libDF --target no-modules --release --features wasm succeeds. wasm-opt -Oz delta over baseline: +1489 bytes total for all 8 vectorisations (commit 4 is actually -24 bytes vs commit 3 because copy_from_slice emits less code than the explicit loop it replaces).
  • Bit-identical output on N=3000 deterministic frames, verified above.
  • Browser engines: not yet re-tested at the 8-loop commit (was tested for compute_band_corr alone on Chromium / WebKit / Firefox single + 4-thread). Happy to re-run if you'd like before merge — the change is wasm32-target-conditional, so engine differences should be minimal.

Speed (RTF)

Same-machine bench, Node v20.11.1, 3000 frames per run, multiple independent runs:

| Stage | Median RTF (over 5+ runs) | Δ vs Rikorose main |
| --- | --- | --- |
| Rikorose main (no SIMD) | ~0.0608 | (baseline) |
| `compute_band_corr` only (commit 1) | ~0.0598 | -1.6% |
| 4 functions (commit 2) | ~0.0598 | -1.7% |
| 6 functions (commit 3) | ~0.0594 | -2.3% |
| 8 loops (this PR head) | ~0.0570 | ~-6% |

The 4th commit (frame_synthesis loops) gives the biggest single speedup of the bundle. Variance: typical run-to-run jitter on a single machine is ~2-3%, so absolute deltas under 1% should be read as direction-only. Median is more reliable than mean here (one outlier run hit a CPU spike).

The frame_synthesis improvements come from real auto-vectorization gaps (LLVM didn't pick up the nested zip().zip() pattern), not from the marginal libDF helpers — the headline gain is concentrated in commit 4. The first 6 functions add a smaller but consistent ~1-2% on top.

Caveats

  • 4 commits — happy to split per-loop or squash if you prefer that shape for review.
  • The redirect of DFState::apply_mask from apply_interp_band_gain<Complex32> to apply_band_gain is a behavioural no-op (the two functions had identical bodies) but worth flagging as a code-organisation change.
  • band_unit_norm* uses (re²+im²).sqrt() instead of Complex32::norm(). Bit-equality verified empirically; the two diverge only on extreme inputs (near f32::MAX or denormal regime) which DFN3's signal path doesn't hit.
  • Skipped post_filter even though it's hot: it has scalar .sin() calls per element and SIMD trig requires polynomial approximations that wouldn't preserve bit-equality.
  • Skipped the frame_analysis windowing loops (different from the synthesis ones in this PR): empirical bench showed they're already auto-vectorised by LLVM, so explicit SIMD added function-call overhead without benefit. Negative result; not included.

czoli1976 and others added 2 commits May 3, 2026 16:20
Hot loop on the per-frame ERB feature path: dot-product over a band
of Complex32 against itself (or a reference). The wasm32 build with
`+simd128` was leaving this loop scalar — `wasm-objdump` shows zero
v128 ops for the function body in the production build.

Replace the inner accumulator with a 4-wide f32x4 reduction using
`core::arch::wasm32` intrinsics. Output is bit-exact identical
(FNV-1a 20ea4579c427f925 unchanged across Chromium / WebKit /
Firefox, single-threaded and 4-thread).

Same-machine focused bench, Chromium, 5-run alternated, 300 iter
× 20 frames per measurement (t-test):
  vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity)
  my_mt_1t:             3.748 -> 3.723 ms (-0.67%, t=2.22)
  my_mt_4t:             4.679 -> 4.646 ms (-0.71%, t=2.45)

Native builds use the existing scalar reduction via cfg gating;
no behaviour change off wasm32.
Adds f32x4 vectorization for three more hot DSP functions in the
df_process_frame inference path, on top of the compute_band_corr
work in this PR's first commit:

  * band_mean_norm_erb (called from feat_erb per frame): per-bin
    IIR mean-norm. State is per-bin (no recurrence between bins) so
    straightforward 4-wide SIMD over all ERB bins.

  * apply_band_gain (called from apply_mask post-network): Complex32
    x f32 scalar mul-in-place per ERB band. Reinterprets
    &mut [Complex32] as &mut [f32] of length 2N (Complex32 is
    #[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies.
    Also redirects DFState::apply_mask to call apply_band_gain (the
    Complex32 specialisation) instead of the generic
    apply_interp_band_gain<T>, since the existing apply_band_gain
    function is already structurally identical.

  * apply_window_in_place (called from frame_synthesis per frame):
    f32 mul-in-place. Signature changed from generic
    IntoIterator<Item=&'a f32> to &[f32] (the sole caller already
    passes &state.window which IS a slice). 4-wide SIMD multiplies.

Each function keeps the original scalar implementation as the
non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].

Bit-identical output verified: FNV-1a hash of df_process_frame
output stream over 3000 random frames matches the Rikorose main
baseline exactly across all 3 independent bench runs on
Node v20.11.1 / V8.

Wasm size delta vs baseline: +835 bytes total (compute_band_corr
+699; the 3 new helpers add net +136 bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize compute_band_corr inner loop perf(wasm): SIMD-vectorize 4 inference DSP hot loops May 3, 2026
Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx /
feat_cplx_t (called per frame inside df_process_frame).

The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...]
layout so we can compute the per-bin norm (sqrt(re^2 + im^2))
lane-wise. Strategy: load 4 Complex32 (8 f32) as 2 v128s, use
i32x4_shuffle to build pure-real and pure-imag vectors, compute
norm in 4-wide SIMD, update state, then divide xs by sqrt(state).

  * band_unit_norm (xs: &mut [Complex32]) — re-interleaves the
    per-bin sqrt(state) divisor via two i32x4_shuffles to match
    the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32)
    at a time.

  * band_unit_norm_t (xs: &[Complex32], out: &mut [f32]) — same
    norm computation but writes to o_re / o_im split halves of
    out (CONTIGUOUS), so no re-interleave step is needed for the
    divide.

Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm
hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow
regime), both produce identical bits — verified by FNV-1a hash of
df_process_frame output stream over N=3000 deterministic random
frames matching baseline exactly across 5 independent runs on
Node v20.11.1 / V8.

Wasm size delta: +678 bytes vs the 4-function bundle commit.
Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize 4 inference DSP hot loops perf(wasm): SIMD-vectorize 6 inference DSP hot loops May 3, 2026
Three more loops in frame_synthesis emit scalar code on wasm32
despite +simd128 (unlike the frame_analysis windowing loops which
LLVM auto-vec'd; something about the nested zip().zip() iterator
pattern in frame_synthesis vs the izip!() pattern in frame_analysis
defeats auto-vectorization).

Three changes:

  * out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output)
    — new f32_add_to(a, b, out) helper, three-slice element-wise
    add via 4-wide v128 + f32x4_add.

  * s_first[i] += xs_first[i] (overlap-add for next frame, in-place)
    — new f32_add_inplace(xs, ys) helper, two-slice element-wise
    in-place add.

  * s_second[i] = xs_second[i] (override left-shifted buffer)
    — replaced the explicit loop with copy_from_slice; the compiler
    likely emitted memcpy already, but the stdlib idiom is clearer
    and lets the optimiser pick the best implementation.
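The first two changes can be modelled in portable scalar Rust. This sketch mirrors the described f32_add_to shape (4 lanes per iteration, 0-3 element scalar tail); it is an illustration, not the PR's actual helper:

```rust
// Scalar model of an f32_add_to-style helper: out[i] = a[i] + b[i],
// three-slice element-wise add. On wasm32 each 4-lane iteration is two
// v128_loads, one f32x4_add, and one v128_store.
fn f32_add_to_sketch(a: &[f32], b: &[f32], out: &mut [f32]) {
    debug_assert!(a.len() == b.len() && a.len() == out.len());
    let n4 = a.len() / 4 * 4;
    for i in (0..n4).step_by(4) {
        for lane in 0..4 {
            out[i + lane] = a[i + lane] + b[i + lane]; // f32x4_add on wasm32
        }
    }
    for i in n4..a.len() {
        out[i] = a[i] + b[i]; // scalar tail, 0-3 elements
    }
}
```

The in-place variant (s_first[i] += xs_first[i]) follows the same shape with two slices instead of three.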

Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0
unchanged across N=3000 deterministic frames over 6 independent
bench runs.

Speed: median bundle_synth vs the previous 6-function bundle is
-1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to
-11% additional gain (those runs had background CPU activity that
hit the previous bundle harder). Real direction, modest absolute
gain, no quality cost.

Wasm size delta: -24 bytes vs previous bundle (copy_from_slice
emits less code than the explicit loop). Net total: +1489 bytes
over the no-SIMD baseline for all 8 vectorisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title perf(wasm): SIMD-vectorize 6 inference DSP hot loops perf(wasm): SIMD-vectorize 8 inference DSP hot loops May 4, 2026