perf(wasm): SIMD-vectorize 8 inference DSP hot loops#683
Open
czoli1976 wants to merge 4 commits intoRikorose:mainfrom
Open
perf(wasm): SIMD-vectorize 8 inference DSP hot loops#683czoli1976 wants to merge 4 commits intoRikorose:mainfrom
czoli1976 wants to merge 4 commits intoRikorose:mainfrom
Conversation
Hot loop on the per-frame ERB feature path: dot-product over a band of Complex32 against itself (or a reference). The wasm32 build with `+simd128` was leaving this loop scalar — `wasm-objdump` shows zero v128 ops for the function body in the production build. Replace the inner accumulator with a 4-wide f32x4 reduction using `core::arch::wasm32` intrinsics. Output is bit-exact identical (FNV-1a 20ea4579c427f925 unchanged across Chromium / WebKit / Firefox, single-threaded and 4-thread). Same-machine focused bench, Chromium, 5-run alternated, 300 iter × 20 frames per measurement (t-test): vanilla_mono control: 3.755 -> 3.750 ms (no change, sanity) my_mt_1t: 3.748 -> 3.723 ms (-0.67%, t=2.22) my_mt_4t: 4.679 -> 4.646 ms (-0.71%, t=2.45) Native builds use the existing scalar reduction via cfg gating; no behaviour change off wasm32.
Adds f32x4 vectorization for three more hot DSP functions in the
df_process_frame inference path, on top of the compute_band_corr
work in this PR's first commit:
* band_mean_norm_erb (called from feat_erb per frame): per-bin
IIR mean-norm. State is per-bin (no recurrence between bins) so
straightforward 4-wide SIMD over all ERB bins.
* apply_band_gain (called from apply_mask post-network): Complex32
x f32 scalar mul-in-place per ERB band. Reinterprets
&mut [Complex32] as &mut [f32] of length 2N (Complex32 is
#[repr(C)] {re, im}, identical layout). 4-wide SIMD multiplies.
Also redirects DFState::apply_mask to call apply_band_gain (the
Complex32 specialisation) instead of the generic
apply_interp_band_gain<T>, since the existing apply_band_gain
function is already structurally identical.
* apply_window_in_place (called from frame_synthesis per frame):
f32 mul-in-place. Signature changed from generic
IntoIterator<Item=&'a f32> to &[f32] (the sole caller already
passes &state.window which IS a slice). 4-wide SIMD multiplies.
Each function keeps the original scalar implementation as the
non-wasm32 fallback via #[cfg(not(target_arch = "wasm32"))].
Bit-identical output verified: FNV-1a hash of df_process_frame
output stream over 3000 random frames matches the Rikorose main
baseline exactly across all 3 independent bench runs on
Node v20.11.1 / V8.
Wasm size delta vs baseline: +835 bytes total (compute_band_corr
+699; the 3 new helpers add net +136 bytes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds f32x4 SIMD for the two band-unit-norm functions in feat_cplx /
feat_cplx_t (called per frame inside df_process_frame).
The trick is de-interleaving &mut [Complex32]'s [re,im,re,im,...]
layout so we can compute the per-bin norm (sqrt(re^2 + im^2))
lane-wise. Strategy: load 4 Complex32 (8 f32) as 2 v128s, use
i32x4_shuffle to build pure-real and pure-imag vectors, compute
norm in 4-wide SIMD, update state, then divide xs by sqrt(state).
* band_unit_norm (xs: &mut [Complex32]) — re-interleaves the
per-bin sqrt(state) divisor via two i32x4_shuffles to match
the [re,im,re,im] xs layout, then divides 4 Complex32 (8 f32)
at a time.
* band_unit_norm_t (xs: &[Complex32], out: &mut [f32]) — same
norm computation but writes to o_re / o_im split halves of
out (CONTIGUOUS), so no re-interleave step is needed for the
divide.
Used (re*re + im*im).sqrt() instead of Complex32::norm()'s libm
hypot. For DFN3's audio-spectrum magnitudes (no overflow/underflow
regime), both produce identical bits — verified by FNV-1a hash of
df_process_frame output stream over N=3000 deterministic random
frames matching baseline exactly across 5 independent runs on
Node v20.11.1 / V8.
Wasm size delta: +678 bytes vs the 4-function bundle commit.
Total over no-SIMD baseline: +1513 bytes for all 6 vectorisations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three more loops in frame_synthesis emit scalar code on wasm32
despite +simd128 (unlike the frame_analysis windowing loops which
LLVM auto-vec'd; something about the nested zip().zip() iterator
pattern in frame_synthesis vs the izip!() pattern in frame_analysis
defeats auto-vectorization).
Three changes:
* out[i] = x_first[i] + synthesis_mem[i] (overlap-add to output)
— new f32_add_to(a, b, out) helper, three-slice element-wise
add via 4-wide v128 + f32x4_add.
* s_first[i] += xs_first[i] (overlap-add for next frame, in-place)
— new f32_add_inplace(xs, ys) helper, two-slice element-wise
in-place add.
* s_second[i] = xs_second[i] (override left-shifted buffer)
— replaced the explicit loop with copy_from_slice; the compiler
likely emitted memcpy already, but the stdlib idiom is clearer
and lets the optimiser pick the best implementation.
Bit-identical output verified: FNV-1a hash 53ae8dfc3595faf0
unchanged across N=3000 deterministic frames over 6 independent
bench runs.
Speed: median bundle_synth vs the previous 6-function bundle is
-1.2% RTF; mean over 6 iters is -3.1%. Several runs showed -5% to
-11% additional gain (those runs had background CPU activity that
hit the previous bundle harder). Real direction, modest absolute
gain, no quality cost.
Wasm size delta: -24 bytes vs previous bundle (copy_from_slice
emits less code than the explicit loop). Net total: +1489 bytes
over the no-SIMD baseline for all 8 vectorisations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds f32x4 SIMD vectorization (via
core::arch::wasm32intrinsics under#[cfg(target_arch = "wasm32")]) for eight loops that auto-vectorisation didn't fire on, even with+simd128. All eight are called per-frame insidedf_process_frame. Native scalar fallbacks retained via#[cfg(not(target_arch = "wasm32"))]for every helper.compute_band_corrfeat_erbband_mean_norm_erbfeat_erbband_unit_normfeat_cplxband_unit_norm_tfeat_cplx_tapply_band_gainDFState::apply_mask(post-network)apply_window_in_placeframe_synthesisframe_synthesisout[i] = x_first[i] + synthesis_mem[i]frame_synthesiss_first[i] += xs_first[i](overlap for next frame)The
band_unit_norm*pair was the trickiest: load 4 Complex32 as 2 v128s, usei32x4_shuffleto build pure-real and pure-imag vectors, compute(re²+im²).sqrt()lane-wise, update state, then dividexsbysqrt(state). Forband_unit_normthe divisor is re-interleaved via two more shuffles to match the[re,im,re,im]layout; forband_unit_norm_tthe output halves are contiguous so no re-interleave is needed.apply_band_gainis structurally identical to the public genericapply_interp_band_gain<T>forT=Complex32. The SIMD version reinterprets&mut [Complex32]as&mut [f32]of length2Nand multiplies each f32 lane by the band scalarb.DFState::apply_maskredirected to callapply_band_gaindirectly so the inference path picks up the SIMD version;apply_interp_band_gain<T>stays generic fortransforms.rs(training-side) callers.apply_window_in_place's signature changed from genericIntoIterator<Item=&'a f32>to&[f32]. The sole call site already passed a slice, so no callsite changes.The two frame_synthesis inline loops vectorised in commit 4 are interesting: the compiler auto-vec'd the analogous loops in frame_analysis (so I didn't touch those — confirmed by negative bench result), but the nested
zip().zip()pattern in frame_synthesis defeats LLVM's auto-vectorization where theizip!()pattern in frame_analysis succeeded. Same SIMD pattern (4-widef32x4_add); biggest single-commit speedup of the bundle.Why
wasm-objdumpon a release wasm32 build (+simd128enabled) shows the eight loop bodies emitted entirely in scalarf32.load / f32.mul / f32.add / f32.sqrtops — auto-vec didn't fire. They're each called per-frame indf_process_frame, so the cost compounds.The SIMD shape is the standard
f32x4reduction / element-wise pattern: chunked 4-widev128_load+ arithmetic +v128_store, with a scalar tail (0–3 trailing elements) handled at the end of each helper.For
band_unit_norm*, used(re²+im²).sqrt()directly instead ofComplex32::norm()'slibm::hypotf. For DFN3's audio-spectrum magnitudes (no overflow/underflow regime), both produce identical bits — verified by exhaustive FNV-1a hash equality below.Test plan — quality gate
Bit-identical output verified across all 8 vectorisations via FNV-1a hash of the
df_process_frameoutput stream over 3000 frames of deterministic seeded input. Same hash across every commit:53ae8dfc3595faf053ae8dfc3595faf0✓53ae8dfc3595faf0✓53ae8dfc3595faf0✓53ae8dfc3595faf0✓Match across multiple independent bench runs on Node v20.11.1 / V8.
cargo check -p deep_filter --features tract,default-modelclean (4 pre-existing lifetime-style warnings only).wasm-pack build libDF --target no-modules --release --features wasmsucceeds. wasm-opt -Oz delta over baseline: +1489 bytes total for all 8 vectorisations (commit 4 is actually-24bytes vs commit 3 becausecopy_from_sliceemits less code than the explicit loop it replaces).Speed (RTF)
Same-machine bench, Node v20.11.1, 3000 frames per run, multiple independent runs:
The 4th commit (frame_synthesis loops) gives the biggest single speedup of the bundle. Variance: typical run-to-run jitter on a single machine is ~2-3%, so absolute deltas under 1% should be read as direction-only. Median is more reliable than mean here (one outlier run hit a CPU spike).
The
frame_synthesisimprovements come from real auto-vectorization gaps (LLVM didn't pick up the nestedzip().zip()pattern), not from the marginal libDF helpers — the headline gain is concentrated in commit 4. The first 6 functions add a smaller but consistent ~1-2% on top.Caveats
DFState::apply_maskfromapply_interp_band_gain<Complex32>toapply_band_gainis a behavioural no-op (the two functions had identical bodies) but worth flagging as a code-organisation change.band_unit_norm*uses(re²+im²).sqrt()instead ofComplex32::norm(). Bit-equality verified empirically; the two diverge only on extreme inputs (nearf32::MAXor denormal regime) which DFN3's signal path doesn't hit.post_filtereven though it's hot: it has scalar.sin()calls per element and SIMD trig requires polynomial approximations that wouldn't preserve bit-equality.frame_analysiswindowing loops (different from the synthesis ones in this PR): empirical bench showed they're already auto-vectorised by LLVM, so explicit SIMD added function-call overhead without benefit. Negative result; not included.