This document tracks the port of Chatterbox (Resemble AI, MIT license)
to ggml, from the first exploratory scoping all the way to the optimized
end-to-end CPU/GPU binary, in the order things actually happened. §3.1 –
§3.18 cover the original Turbo port (English, GPT-2 Medium T3, meanflow
CFM); §3.19 / §3.20 add the Multilingual variant (23 languages,
Llama-520M T3 + perceiver, standard CFG-enabled CFM) and the cross-variant
S3Gen weight-quantisation pass.
- Models: ResembleAI/chatterbox-turbo (~450 M params, English) and ResembleAI/chatterbox (~520 M T3 + 23-language tokenizer, the multilingual variant). Both share the S3Gen + HiFT vocoder back half.
- Goal: end-to-end text → waveform in C++/ggml with bit-exact (or float-precision) parity against the official PyTorch reference.
- Verification target: every intermediate tensor within 1e-6 relative error of the PyTorch implementation, on CPU.
Everything runs in pure C++/ggml on CPU. The main end-to-end tool is one binary:
| Binary | Role |
|---|---|
| tts-cli | end-to-end: text → speech tokens (T3) → 24 kHz wav (S3Gen + HiFT); voice cloning, streaming, both Turbo and Multilingual variants (autodetected from GGUF metadata). |
| chatterbox | identical second binary kept for backward compatibility with pre-rename scripts; same code as tts-cli. |
| mel2wav | mel spectrogram → wav (HiFT only, demo) |
Plus scripts/synthesize.sh, a thin wrapper around tts-cli.
Numerical parity vs PyTorch on a 2.7 s reference utterance, debug mode (Python-dumped random bits substituted for reproducibility):
| Stage | rel error vs PyTorch |
|---|---|
| BPE tokenizer | 10/10 exact-match test cases |
| T3 speech tokens | bit-exact on 4 deterministic prompts |
| S3Gen encoder (full, incl. upsample and encoder_proj) | 4.5e-07 |
| CFM 2-step meanflow decoder | 8.9e-07 on the final mel |
| HiFT decode body (conv_pre → conv_post) | 5.6e-07 |
| ISTFT → waveform | 1.0e-04 |
| End-to-end C++ wav vs Python wav (RMS) | 1.22e-04 vs 1.22e-04 |
Speed (10 s sentence, seed 42, gen_RTF = (T3_INFER + S3GEN_INFER) / audio_ms):
| Backend | gen_RTF | Wall | vs ONNX addon |
|---|---|---|---|
| CPU (10-core EPYC, F16) | 0.70 | 8.2 s | 1.7× faster |
| Vulkan (RTX 5090, Q4_0) | 0.06 | 1.8 s | 7.8× |
| Metal (M3 Ultra, Q4_0) | 0.13 | 1.9 s | 7.4× |
| ONNX q4 addon (CPU baseline) | 1.06 | 13.9 s | 1.0× |
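The gen_RTF column is just the formula from the caption applied to the instrumented timings. A minimal sketch, in Python for brevity (the helper name gen_rtf is hypothetical, not something in the repo; the sample timings are the Vulkan F16 row from §3.11):

```python
def gen_rtf(t3_infer_ms: float, s3gen_infer_ms: float, audio_ms: float) -> float:
    """Generation real-time factor: inference time divided by audio duration.

    Model load time is deliberately excluded. Values below 1.0 mean the
    audio is generated faster than it plays back.
    """
    if audio_ms <= 0:
        raise ValueError("audio_ms must be positive")
    return (t3_infer_ms + s3gen_infer_ms) / audio_ms


# Vulkan F16 row from §3.11: T3 gen 600 ms, S3Gen gen 279 ms, 10.5 s of audio.
rtf = gen_rtf(600.0, 279.0, 10_500.0)
print(f"gen_RTF = {rtf:.2f}")
```

Because load time is left out, gen_RTF can look much better than the wall-clock column next to it; the two are meant to be read together.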
GPU support and Metal kernel fixes are described in §3.11 / §3.12; the layout-friendly KV cache + Flash Attention pass that produced the numbers in this table is in §3.13. The Multilingual port (§3.19) and the S3Gen weight-quantisation pass that landed alongside it (§3.20) add a second variant on top of the same back half — see those sections for the MTL-specific parity / speed numbers.
chatterbox.cpp/
  ggml/                               vendored ggml checkout (see patches/, scripts/setup-ggml.sh)
  patches/
    ggml-metal-chatterbox-ops.patch   Metal op fixes: diag_mask_inf, pad_ext,
                                      faster conv_transpose_1d (applied to ggml/
                                      during setup; see patches/README.md)
    ggml-opencl-chatterbox-ops.patch  OpenCL/Adreno fixes: missing HiFT/S3Gen
                                      ops + conv_transpose_1d speedup
    README.md                         why each patch exists + how to drop it
  include/tts-cpp/                    installed public headers (Engine API)
    tts-cpp.h                         library entry; declares tts_cpp_cli_main()
    chatterbox/engine.h               Engine + EngineOptions (text → wav)
    chatterbox/s3gen_pipeline.h       low-level S3Gen pipeline entry points
  src/
    main.cpp                          T3 turbo runtime + shared helpers (libtts-cpp)
    t3_mtl.{h,cpp}                    T3 multilingual (Llama-520M) runtime + stage builders
    chatterbox_t3_internal.h          internal T3 declarations shared by main.cpp / engine / CLI
    chatterbox_engine.cpp             public Engine API impl (links into libtts-cpp)
    chatterbox_cli.cpp                unified CLI (tts-cli + chatterbox binaries)
    cli_main.cpp                      thin entry: forwards argc/argv to tts_cpp_cli_main()
    chatterbox_tts.cpp                S3Gen encoder + CFM + HiFT (reusable entry)
    gpt2_bpe.{h,cpp}                  self-contained GPT-2 byte-level BPE tokenizer (turbo)
    mtl_tokenizer.{h,cpp}             multilingual grapheme tokenizer (HF tokenizers.json + NFKD)
    mtl_unicode_tables.inc            embedded NFKD + Korean Jamo lookup tables
    voice_features.{h,cpp}            wav I/O, resample, mel, fbank, LUFS
    mel_extract_stft.cpp              STFT-based mel extraction shared by C++ pipelines
    voice_encoder.{h,cpp}             VoiceEncoder 256-d speaker embedding
    campplus.{h,cpp}                  CAMPPlus 192-d speaker embedding (BN-fused inc include)
    s3tokenizer.{h,cpp}               S3TokenizerV2 (wav → S3 speech tokens)
    mel2wav.cpp                       mel → wav demo binary (HiFT only)
    test_s3gen.cpp                    staged verification harness for turbo S3Gen (A..H5)
    test_t3_mtl.cpp                   end-to-end parity test for the MTL T3 forward pass
    test_t3_mtl_stages.cpp            staged parity harness for MTL (cond/text/inputs/layers/head)
    test_mtl_tokenizer.cpp            MTL tokenizer parity vs HF reference
    test_metal_ops.cpp                parity test for the patched Metal kernels
    test_streaming.cpp / test_voice_*.cpp / test_resample.cpp / test_fbank.cpp / …
    npy.h, dr_wav.h                   minimal .npy loader + WAV decoder (header-only)
  scripts/
    setup-ggml.sh                     clones the pinned ggml commit + applies patches
    convert-t3-turbo-to-gguf.py       Turbo T3 weights + tokenizer + VE + builtin voice → GGUF
    convert-t3-mtl-to-gguf.py         MTL T3 (Llama-520M) + perceiver + emotion-adv
                                      + tokenizers.json + builtin voice → GGUF
    convert-s3gen-to-gguf.py          S3Gen encoder + CFM + HiFT + CAMPPlus + S3TokenizerV2
                                      + mel filterbanks → GGUF (--variant {turbo,mtl},
                                      --quant {f32,f16,q8_0,q5_0,q4_0})
    requantize-gguf.py                in-place block-quantise of an existing S3Gen/T3 GGUF
    gen-nfkd-table.py                 generates src/mtl_unicode_tables.inc from CLDR data
    extract-voice.py                  one-shot voice-clone prep (silencedetect + EQ + bake)
    dump-{s3gen,campplus,s3tokenizer,streaming,t3-mtl}-reference.py
                                      PyTorch → .npy intermediates for the test-* harnesses
    reference-t3-turbo.py             PyTorch T3 + compare against C++
    compare-tokenizer.py              10-case tokenizer comparison against HF
    synthesize.sh                     text → wav wrapper around tts-cli
  models/
    chatterbox-t3-turbo.gguf          Turbo T3 (GPT-2 Medium) + GPT-2 BPE + builtin voice
    chatterbox-s3gen.gguf             Turbo S3Gen (meanflow CFM) + HiFT + CAMPPlus + S3TokV2
    chatterbox-t3-mtl.gguf            Multilingual T3 (Llama-520M) + tokenizers.json + builtin voice
    chatterbox-s3gen-mtl.gguf         Multilingual S3Gen (standard 10-step CFM, CFG inside)
    *-{q8_0,q5_0,q4_0}.gguf           quantised variants (see §3.20)
  CMakeLists.txt                      top-level: add_subdirectory(ggml) + tts-cpp lib + binaries
  PROGRESS.md                         this file
  README.md                           user-facing build / run / benchmark guide
A separate machine holds PyTorch + the original Chatterbox repo for reference
runs. On-device (Apple Silicon / Linux x86) the C++ binaries have no runtime
dependency on Python — the Turbo BPE tokenizer (vocab.json + merges.txt)
and the Multilingual tokenizers.json are both embedded directly into their
T3 GGUFs as tokenizer.ggml.* metadata, so the only runtime input is the
GGUF file itself plus optional reference audio.
Surveyed open-source TTS candidates (F5-TTS, Kokoro-82M, XTTS v2, Piper, Fish Speech, Supertonic, Chatterbox). Picked Chatterbox Turbo for three reasons: MIT license, zero-shot voice cloning, and the "Turbo" variant uses just 2 flow-matching steps (fast inference).
Bootstrapped the repo by cloning the latest ggml and the reference
resemble-ai/chatterbox side-by-side, then built a standalone
chatterbox.cpp/ with ggml/ as a vendored subdirectory (no modifications
inside ggml/).
Issues hit in this phase:
| # | Issue | Fix |
|---|---|---|
| 1 | rsync not on macOS by default | Switched to tar … \| ssh … tar -x. |
| 2 | Remote repo polluted with ._* AppleDouble files | COPYFILE_DISABLE=1 tar …. |
| 3 | Partial sync left src/CMakeLists.txt stray file | Removed; unified sync always pushes the whole tree. |
| 4 | Remote binary 0 bytes after SSH disconnect | rm build/<target> + rebuild. |
T3 is a GPT-2 Medium-sized (24 layer) autoregressive model that maps text tokens + voice conditioning to speech tokens.
- Wrote scripts/convert-t3-turbo-to-gguf.py to emit a GGUF with built-in voice conditionals (speaker_emb, cond_prompt_speech_tokens) embedded.
- C++ graph in src/main.cpp: split into a "prompt" graph and a "step" graph sharing a persistent KV cache, mirroring ggml/examples/gpt-2.
- Ported the sampler (Temperature → TopK → TopP → RepetitionPenalty).
- Wrote a self-contained GPT-2 byte-level BPE in src/gpt2_bpe.cpp (llama.cpp's BPE was too entangled with its GGUF vocab loading to reuse cleanly): byte-level encoding table, regex pre-tokenization, BPE merge loop, plus punc_norm matching the Python implementation. 10/10 test cases match the HF tokenizer byte-for-byte, including the 19 paralinguistic added tokens ([laugh], [chuckle], …).
- tts-cli takes --text + --tokenizer-dir and produces speech tokens end-to-end.
Verified against PyTorch: bit-for-bit identical speech tokens on 4 deterministic sampling configs (greedy / temperature / top-k / repetition-penalty / no-penalty × short + long prompts).
Issues hit in this phase:
| # | Issue | Fix |
|---|---|---|
| 5 | ggml_can_mul_mat assertion in T3 | Converter must transpose Conv1D-style weights (c_attn, c_proj, c_fc, mlp.c_proj) to ggml's [in, out] layout while leaving nn.Linear / embeddings / wpe as-is. |
| 6 | ggml_backend_tensor_get(input_tensor) returned garbage | ggml_gallocr reuses the input buffer for intermediates when only set_input is marked; also call ggml_set_output on tensors we want to read back. |
| 7 | Repetition-penalty path diverged from HF at token 22 | HF divides positive logits, multiplies negative ones — I had it backwards. |
| 8 | Sampler order mismatched HF LogitsProcessorList | Rewrote sample_next_token as Temperature → TopK → TopP → RepetitionPenalty, in HF's exact order. After the fix greedy+penalty tests pass bit-exactly. |
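Issues 7 and 8 are easier to see in code. The following is a minimal sketch of the logits ladder in the order described above, written in Python for brevity rather than copied from src/main.cpp. All function names are hypothetical, and the top-p boundary and tie handling here are simplifying assumptions that may differ from HF in edge cases; only the penalty rule (divide positive logits, multiply negative ones) and the stage order are taken from the text.

```python
import math

NEG_INF = float("-inf")


def apply_temperature(logits, temperature):
    return [x / temperature for x in logits]


def apply_top_k(logits, k):
    """Keep the k largest logits (k >= 1 assumed), mask the rest to -inf."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]


def apply_top_p(logits, top_p):
    """Keep the smallest set of tokens whose probability mass reaches top_p."""
    peak = max(logits)
    probs = [math.exp(x - peak) for x in logits]  # exp(-inf) == 0.0
    total = sum(probs)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i] / total
        if mass >= top_p:
            break
    return [x if i in kept else NEG_INF for i, x in enumerate(logits)]


def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Issue 7: divide positive logits, multiply negative ones."""
    out = list(logits)
    for tok in set(previous_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def process_logits(logits, previous_tokens, temperature, top_k, top_p, penalty):
    """Issue 8: Temperature -> TopK -> TopP -> RepetitionPenalty."""
    x = apply_temperature(logits, temperature)
    x = apply_top_k(x, top_k)
    x = apply_top_p(x, top_p)
    return apply_repetition_penalty(x, previous_tokens, penalty)
```

Getting the penalty direction backwards makes already-seen tokens more likely instead of less, which is why the divergence only showed up once a token had actually repeated.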
S3Gen is an "Upsample Conformer" with 10 blocks total (~60 M params): 6 initial
blocks, then a 2× Upsample1D, then 4 more blocks. Ported in six staged
substeps against Python-dumped reference tensors (scripts/dump-s3gen-reference.py):
| Stage | Component | rel error |
|---|---|---|
| A | speaker_emb projection (F.normalize + Linear) | 1.2e-7 |
| B | input_embedding lookup | 0 (exact) |
| C | encoder_embed (Linear + LN + √D scale + ESPnet rel PE) | 4.4e-7 |
| D | PreLookaheadLayer (asymmetric-padded Conv1d stack) | 2.5e-7 |
| E | One Conformer block (rel-pos MHA + rel_shift + Swish FFN) | 1.3e-7 |
| F | Full encoder + encoder_proj | 5.6e-7 |
Issues hit in this phase:
| # | Issue | Fix |
|---|---|---|
| 9 | ggml_conv_1d aborted with src0->type == GGML_TYPE_F16 | ggml's im2col path requires F16 kernels, but we wanted F32 precision. Wrote a conv1d_f32 helper that calls ggml_im2col(…, GGML_TYPE_F32) + mul_mat directly, keeping kernels in F32. |
| 10 | speaker_embed broadcast failed in cond_spkr matmul | Bias reshape needed ne=[1, 256], not ne=[256]. Added the explicit reshape_2d(bias, 1, C) convention for every 1-D bias added to a [T, C] conv output. |
| 11 | Nearest-neighbor ×2 upsample produced channel-interleaved garbage | The naive reshape_3d(T, 1, D) + concat(ne[1]) gives t0_copy0, t1_copy0, …, t0_copy1, …. Correct trick: reshape_3d(1, T, D) → concat along ne[0] → [2, T, D] → reshape to [2T, D], giving t0_copy0, t0_copy1, t1_copy0, …. |
| 12 | rel_shift attention gave ~100 % rel error | view_3d(bd_viewed, T, 2T-1, H, nb1, T*(2T-1)*elem, offset) used the sliced stride for nb2. nb2 must match the source's element stride: bd_viewed->nb[2]. |
| 13 | *.transpose().numpy() reference dumps loaded as garbage in C++ | Torch .transpose() yields Fortran-ordered storage; np.save writes fortran_order: True. Dumper now calls .contiguous().numpy() + np.ascontiguousarray(...). The C++ loader throws a clear error if it sees fortran_order=True. |
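The upsample bug in issue 11 comes down to which axis is fastest-varying when the doubled tensor is flattened back to a 2T-long time axis. A small sketch of the two layouts for a single channel, in Python with hypothetical helper names (in ggml, ne[0] is the fastest-varying dimension, which is what the index arithmetic below models):

```python
def flatten_copy_slow(frames):
    """Naive layout [T, 2]: time is the fast axis, the copy index is slow.

    Flat offset of (t, copy) is t + T * copy, so flattening to length 2T
    yields t0, t1, ..., t0, t1, ... (two whole copies back to back).
    """
    T = len(frames)
    flat = [None] * (2 * T)
    for copy in range(2):
        for t in range(T):
            flat[t + T * copy] = frames[t]
    return flat


def flatten_copy_fast(frames):
    """Correct layout [2, T]: the copy index is the fast axis.

    Flat offset of (copy, t) is copy + 2 * t, so flattening to length 2T
    yields t0, t0, t1, t1, ... (a true nearest-neighbour x2 upsample).
    """
    T = len(frames)
    flat = [None] * (2 * T)
    for t in range(T):
        for copy in range(2):
            flat[copy + 2 * t] = frames[t]
    return flat


frames = ["t0", "t1", "t2"]
print(flatten_copy_slow(frames))
print(flatten_copy_fast(frames))
```

Both variants have the same shape after the final reshape, which is why the bug produced plausible-looking tensors rather than a crash.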
A U-Net with transformer blocks (~45 M params). Layout: 1 down block → 12 mid
blocks → 1 up block (skip concat) → final_block → final_proj. Each block
carries 4 BasicTransformerBlocks.
| Stage | Component | rel error |
|---|---|---|
| G1 | Time embedding (sin → MLP → mixer) | 7.0e-7 |
| G2 | CausalResnetBlock1D (causal-conv + LN + Mish + time MLP + res_conv) | 2.9e-7 |
| G3 | BasicTransformerBlock (self-attn + FFN w/ GELU-erf) | 1.7e-7 |
| G4 | Full CFM decoder, one forward step | 1.3e-6 |
For meanflow mode we do 2 steps with t_span = [0, 0.5, 1]; the time embedding
sees both t and r concatenated through a mixer.
Issues hit in this phase:
| # | Issue | Fix |
|---|---|---|
| 14 | LayerNorm applied over time instead of channel | For ne=[T, C] layout ggml_norm reduces ne[0]=T, which is wrong. Wrote layer_norm_on_channel that permutes to [C, T], norms, applies affine, permutes back. |
| 15 | weight_norm convolutions in mel2wav ignored | Torch 2.6 stores them under parametrizations.weight.original{0,1}. Added expand_weight_norm() in the converter that fuses g · v / ‖v‖₂ back into a regular weight tensor before export. |
| 16 | Mish activation missing from ggml unary ops | Built from primitives: x · tanh(softplus(x)) via GGML_UNARY_OP_SOFTPLUS + GGML_UNARY_OP_TANH. |
| 17 | GELU mismatch in BasicTransformerBlock (rel=3e-4) | ggml_gelu is the tanh approximation; diffusers.models.activations.GELU uses the exact erf formulation. Switched to ggml_gelu_erf. Error dropped to 1.7e-7. |
| 18 | Python hook overwrote the same tensor across multiple CFM steps | Meanflow calls time_embeddings twice (for t and r) and the decoder runs twice per sample. Added make_hook(multi_call=True) that saves *_call0.npy, *_call1.npy, …. |
| 19 | Estimator forward_hook never fired | basic_euler calls self.estimator.forward(x, …) directly, bypassing __call__ where hooks live. Monkey-patched estimator.forward to record x_in / mu / t / r / spks / cond / mask / dxdt for every step. |
| 20 | (B, C, T) vs (B, T, C) layout confusion | CFM alternates: resnets use (B, C, T), transformer blocks use (B, T, C), switched by rearrange. In ggml we mirror this and cont(permute) at the boundary. Every helper doc-comments its layout. |
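The two activation issues above (16 and 17) can be reproduced with scalar math. This is a minimal Python sketch using only the standard formulas, not the ggml kernels themselves; the function names are hypothetical. It shows Mish built from softplus and tanh, and the size of the gap between the erf and tanh forms of GELU.

```python
import math


def softplus(x):
    # Guard against overflow in exp() for large inputs, where softplus(x) ~= x.
    return x if x > 20.0 else math.log1p(math.exp(x))


def mish(x):
    """Mish from primitives: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def gelu_erf(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two GELU forms on a grid over [-4, 4].
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in grid)
print(f"max |gelu_erf - gelu_tanh| on [-4, 4]: {max_gap:.1e}")
```

A gap of that order is far above float rounding, which is consistent with the approximation showing up as a visible parity failure rather than noise.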
HiFTGenerator = Neural Source Filter + ISTFTNet. The mel → waveform vocoder. Ported in five verifiable substeps:
| Stage | Component | rel error |
|---|---|---|
| H1 | f0_predictor (5× Conv + ELU + Linear) | 4.2e-6 |
| H3 | decode body conv_pre → ups / rb → conv_post | 5.6e-7 |
| H4 | STFT (Conv1d with DFT + Hann kernel) | 7.9e-3 (boundary-bound) |
| H5 | ISTFT (ConvTranspose + window-sum normalize) | 1.0e-4 |
Key techniques:
- Snake activation x + (1/α)·sin²(αx) implemented with ggml_sin and a pre-computed 1/α tensor fed as a graph input (72 such inputs across the 9 main ResBlocks and 3 source ResBlocks).
- ConvTranspose1d with asymmetric PyTorch padding: ggml's op only accepts p0=0, so we compute the full-length output then slice p samples from each side.
- Asymmetric reflection pad (1, 0): done manually by extracting x[1:2] and concat-prepending it.
- STFT as Conv1d with a DFT+window kernel of shape [n_fft, 1, 2F] (real and imaginary parts stacked as output channels). Center-mode reflection pad n_fft//2 applied manually via slice-and-concat on each side.
- ISTFT as ConvTranspose1d with the inverse DFT+window kernel, followed by element-wise divide by a precomputed window² overlap-sum buffer, then trim n_fft//2 from each end.
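Three of the techniques above are small enough to show on plain lists. A sketch in Python with hypothetical helper names; it illustrates the arithmetic only and says nothing about how the ggml graph is actually assembled.

```python
import math


def snake(x, alpha, inv_alpha):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    inv_alpha is passed in precomputed, mirroring the extra graph input.
    """
    s = math.sin(alpha * x)
    return x + inv_alpha * s * s


def reflect_pad_left1(x):
    """Reflection pad (1, 0): prepend x[1:2], leave the right side untouched."""
    return x[1:2] + x


def crop_both_sides(full, p):
    """Emulate ConvTranspose1d padding p by slicing the unpadded output."""
    return full[p:len(full) - p]


print(reflect_pad_left1([10, 20, 30]))
print(crop_both_sides([0, 1, 2, 3, 4, 5], 2))
```

The crop is valid because transposed-convolution padding only removes samples from the ends of the full-length output; it does not change the values in between.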
The resulting mel2wav binary demonstrates the full vocoder:
mel2wav --s3gen-gguf models/chatterbox-s3gen.gguf \
--mel-npy artifacts/s3gen-ref/mel_output.npy \
--out /tmp/out.wav
Against the Python reference waveform: matching RMS (1.22e-04 vs 1.22e-04), time-domain diff max 3.3e-05 (signal max ~9e-04), spectrogram magnitude diff max rel 2.5 % (entirely from stochastic SineGen excitation; the deterministic conv-net chain is bit-exact).
SineGen on the C++ side uses std::mt19937 (not bit-exact to torch.rand,
but audibly indistinguishable — the excitation is a small-amplitude additive
noise term).
Final plumbing: write src/chatterbox_tts.cpp that wires the S3Gen encoder →
2-step meanflow CFM → HiFT vocoder and emits a 24 kHz wav. Takes T3-generated
speech tokens plus a reference voice (embedding, prompt_token,
prompt_feat).
Historically synthesize.sh piped two binaries; today one tts-cli runs the
full pipeline, and synthesize.sh is a thin wrapper around it.
Debug mode (--debug) substitutes Python-dumped reference random bits (CFM
z and noised_mels) so the deterministic parts can be validated
bit-exactly. End-to-end in debug mode:
| Stage | max_abs | rel |
|---|---|---|
| input_embedding(tokens) | 0 | 0 |
| encoder → encoder_proj (mu) | 8.3e-07 | 4.5e-07 |
| speaker embedding (spks) | 5.9e-08 | small |
| cond (prompt_feat placement) | 0 | 0 |
| t_emb (sinusoidal → MLP → mixer) | 7.6e-06 | small |
| CFM step 0 dxdt | 2.1e-05 | small |
| CFM step 1 dxdt | 1.8e-05 | small |
| final mel (80 × 136) | 1.0e-05 | 8.9e-07 |
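For readers comparing the two columns: the exact normalisation the harness uses for "rel" is not spelled out in this file, so the sketch below is an assumption, not the harness code. It takes max_abs as the largest element-wise difference and rel as that difference divided by the largest reference magnitude. Helper names are hypothetical.

```python
def max_abs_error(actual, reference):
    """Largest element-wise absolute difference between two flat tensors."""
    if len(actual) != len(reference):
        raise ValueError("tensors must have the same number of elements")
    return max(abs(a - r) for a, r in zip(actual, reference))


def rel_error(actual, reference):
    """ASSUMED metric: max_abs divided by the largest reference magnitude."""
    scale = max(abs(r) for r in reference)
    err = max_abs_error(actual, reference)
    return err / scale if scale > 0 else err


cpp_out = [1.0, 2.0, -3.0]
py_ref = [1.0, 2.5, -3.0]
print(max_abs_error(cpp_out, py_ref), rel_error(cpp_out, py_ref))
```

Under this definition rel can be smaller than max_abs whenever the reference tensor has values larger than 1 in magnitude, as happens for the final mel row.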
Production mode uses a seeded std::mt19937 for both the CFM initial noise
and SineGen excitation.
Issues hit in this phase (all three caused plausible-looking but wrong output before being found):
| # | Issue | Fix |
|---|---|---|
| 21 | Silence-token padding value | speech_tokens must be appended with S3GEN_SIL = 4299 (not 0) to match Python's speech_tokens_padded convention. |
| 22 | Relative PE pos_pe / neg_pe swap | While copying compute_pos_emb into the new binary I flipped the two halves of the PE buffer, which silently gave ~20 % relative error in the encoder output. Restored the correct ordering: first half is reversed pos_pe, second half is neg_pe. |
| 23 | mu layout transpose between encoder and CFM | encoder_proj.npy is numpy (T, 80) but the CFM estimator expects numpy (80, T). Added an explicit transpose to bridge the two. |
At this point on a 10-core EPYC, single-threaded, the end-to-end pipeline ran in 22.5 s for 8.64 s of audio — RTF 2.60, i.e. 2.6× slower than real-time.
Eight optimizations in the order they were attempted. Four landed, four were rolled back or skipped as incompatible. Numbers are for the 8.64 s utterance above.
Attempt 1 — multi-threading (KEPT, −85 % wall time)
Baseline was pinned to 1 thread because the code never called
ggml_backend_cpu_set_n_threads. Added a global g_n_threads (default =
std::thread::hardware_concurrency(), overridable with --threads N) and a
compute() helper that sets it before every ggml_backend_graph_compute.
ggml's -march=native was already on, so AVX-512 / AVX-VNNI kernels were
already in use — the missing piece was parallelism. Swept thread counts: 10
was the sweet spot; 16 oversubscribes and regresses.
Result: 22.5 s → 3.47 s (RTF 2.60 → 0.40).
Attempt 2 — OpenBLAS (TRIED, NO HELP)
Installed libopenblas-dev, rebuilt with GGML_BLAS=ON GGML_BLAS_VENDOR=OpenBLAS. No measurable change. Our matmuls are medium-sized
and ggml's hand-written AVX-512 kernels already saturate what OpenBLAS would
deliver. Kept off.
Attempt 3 — GGML_LTO=ON (TRIED, NO HELP)
No measurable effect on a shared-library build. Kept off.
Attempt 4 — CFM graph reuse (KEPT, −11 % wall time)
The CFM estimator is called twice per utterance with identical graph
topology. Stashed the ggml_context, ggml_cgraph, and ggml_gallocr in a
cfm_estimator_cache so step 2 only re-runs with new inputs — saves one graph
construction and one gallocr_reserve pass per utterance.
Result: 3.47 s → 3.09 s (RTF 0.40 → 0.36).
Attempt 5 — Flash attention in CFM BasicTransformerBlock (KEPT, −22 % wall time)
The CFM has 56 BasicTransformerBlocks × 2 meanflow steps = 112 attention
ops per utterance. Replaced the explicit
softmax(QKᵀ / √d) · V kernel with a single ggml_flash_attn_ext call.
The pattern is pure self-attention (no masking, no bias), which is exactly
what flash_attn_ext is designed for. Fused, no materialized T×T
scores/attn tensors. The reshape-permute-cont preamble now drops straight into
flash_attn_ext, and its output ne=[HD, H, T, 1] reshapes directly to
[INNER, T].
Result: 3.09 s → 2.45 s (RTF 0.36 → 0.28), CFM −44 %.
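Why a fused kernel can skip the T×T scores tensor: softmax can be accumulated one key at a time with a running maximum and a running sum. The sketch below is a plain-Python illustration of that online-softmax idea for a single head. It is not ggml's ggml_flash_attn_ext implementation, and the function names are hypothetical.

```python
import math


def attention_naive(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, materialising a full row of scores per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def attention_streaming(q, k, v):
    """Same result, visiting one key/value pair at a time (no score row kept)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for kj, vj in zip(k, v):
            s = scale * sum(a * b for a, b in zip(qi, kj))
            new_max = max(running_max, s)
            # Rescale everything accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            w = math.exp(s - new_max)
            running_sum = running_sum * correction + w
            acc = [a * correction + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

The two functions agree up to floating-point rounding, so swapping one for the other changes memory traffic rather than the math, which matches the parity numbers staying put after this attempt.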
Attempt 6 — Fold symmetric conv padding (KEPT, small win)
Six redundant ggml_pad_ext → conv1d_f32 pairs dropped by passing the padding
straight to ggml_im2col. Biggest impact in HiFT's ResBlocks where the
resblock-conv path runs ~72 times per decode. Saves one intermediate tensor
allocation per conv. A small but essentially-free improvement.
Result: 2.45 s → 2.39 s (RTF 0.28 steady).
Attempt 7 — F16 CFM linear weights (TRIED, ROLLED BACK)
Converted all Q/K/V/O/FFN/MLP linear weights in CFM from F32 to F16 to halve
memory bandwidth. Regressed: CFM got ~10 % slower and precision dropped
to rel = 3e-4 on the final mel. The F16→F32 upconvert inside mul_mat is
not free and the F32 AVX-512 kernel is already very fast; for CPU this is a
net loss. Reverted.
Attempt 8 — Flash attention in the Conformer encoder (SKIPPED, INCOMPATIBLE)
Would fuse another 10 attention ops per utterance, but the Conformer uses
ESPnet-style relative positional bias added inside the softmax, and
ggml_flash_attn_ext does not support custom in-softmax bias terms. Would
need a custom ggml op — not done.
| Configuration | Total | RTF | vs real-time |
|---|---|---|---|
| Baseline (1 thread, no graph reuse, no flash attn) | 22.5 s | 2.60 | 2.6× slower |
| + threading (Attempt 1) | 3.47 s | 0.40 | 2.5× faster |
| + CFM graph reuse (Attempt 4) | 3.09 s | 0.36 | 2.8× faster |
| + flash attn + pad fold (Attempts 5, 6) | 2.39 s | 0.28 | 3.6× faster |
Total wall-time speedup from the original port: 9.4×.
Stage breakdown at the final configuration:
| Stage | time |
|---|---|
| S3Gen encoder | 286 ms |
| CFM 2 meanflow steps | 785 ms |
| HiFT vocoder | 1312 ms |
| Total | 2.39 s |
HiFT is now the bottleneck (~55 % of wall time) — the 3-stage upsample /
ResBlock stack on T = 16320 × 64 channels is memory-bandwidth bound rather
than compute bound.
After merging the two binaries and shipping voice-cloning phase 1, a user report of an "empty" wav on paragraph-length input surfaced a sampling bug that had been lurking since the T3 port.
Symptom: the produced wav had ~1 second of speech followed by ~9 seconds of pure zero RMS. Per-0.5 s window RMS:
[3.5e-2, 1.3e-2, 2.8e-7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.4e-7]
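The per-window RMS check is cheap to reproduce on any decoded wav. A minimal sketch in Python, assuming mono float samples already loaded into a list; the helper names and the 1e-6 silence threshold are illustrative choices, not values taken from the repo.

```python
import math


def window_rms(samples, sample_rate, window_s=0.5):
    """RMS of consecutive non-overlapping windows (the last one may be shorter)."""
    win = int(sample_rate * window_s)
    if win <= 0:
        raise ValueError("window must cover at least one sample")
    result = []
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        result.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return result


def silent_windows(rms_values, threshold=1e-6):
    """Indices of windows whose RMS falls below the (assumed) silence threshold."""
    return [i for i, r in enumerate(rms_values) if r < threshold]


# Toy signal: half a second of constant amplitude, then half a second of zeros.
toy = [0.5] * 4 + [0.0] * 4
print(window_rms(toy, sample_rate=8))
```

A long run of consecutive indices from silent_windows is the signature of the failure described here.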
Dumping the T3 token stream showed the root cause immediately — 240 of 257
tokens were the silence token 4218:
tokens[0:17]: 3704, 6486, 4299, 3891, 5832, 4384, 5014, 5665, 2486, 29,
29, 380, 632, 2912, 5101, 5070, 4215
tokens[17:257]: 4218, 4218, 4218, 4218, ... (240 copies)
The C++ sampler had shipped with top_k = 1 (argmax) as its default. For
Chatterbox T3 that's a known failure mode: once the model generates a
silence token at a natural pause, argmax(logits) keeps picking silence
forever and the utterance never recovers. Short test prompts never reached
a pause so the bug was invisible during the port.
Compared ChatterboxTurboTTS.generate() in tts_turbo.py — the Python
defaults are very different:
| | before (C++ broken) | Python | after (C++ fixed) |
|---|---|---|---|
| top_k | 1 (greedy) | 1000 | 1000 |
| top_p | 1.0 | 0.95 | 0.95 |
| temperature | 1.0 | 0.8 | 0.8 |
| repeat_penalty | 1.0 | 1.2 | 1.2 |
| n_predict | 256 | ~1000 | 1000 |
All four knobs are still exposed on the CLI, so --top-k 1 reproduces the
old greedy behaviour for debugging/comparison.
After the fix, same prompt same seed:
- total wav RMS: 8.3e-03 → 4.8e-02
- max amplitude: 0.18 → 0.50
- per-0.5 s RMS windows: all 21 non-zero (3.3e-2 … 8.5e-2 range)
- audible speech for the full 10.7 s
Committed as bb0eb99.
This one was avoidable — the verification pipeline in §5 is per-tensor
numerical parity, which is oblivious to sampler choices; the reference-t3-turbo.py harness only compared greedy token sequences so it never
exercised any non-trivial pass of the sampling ladder. Worth adding an
end-to-end sampling test to the validation list: run T3 with Python's
stochastic defaults (fixed seed) and compare the full token stream
byte-for-byte against C++ with the same seed.
Compared end-to-end throughput against an in-house ONNX Runtime TTS
addon (pre-built q4 Chatterbox models at 692 MB on disk). Same 10-core
EPYC host, same
prompt ("Hello from native C plus plus. This audio was generated end
to end on CPU using ggml."), built-in voice on both sides, --threads 10 for ggml, ORT's own default threading for ONNX. Instrumented the
ggml binary with explicit T3_LOAD_MS / T3_INFER_MS /
S3GEN_LOAD_MS / S3GEN_INFER_MS markers so load and generate
phases can be split cleanly. Each configuration run three times after
a disk-cache warm-up.
Model footprint on disk:
| | Size |
|---|---|
| ONNX q4 (5 files) | 692 MB |
| ggml F16 (T3 + S3Gen) | 1285 MB |
| ggml Q8_0 (T3 + S3Gen) | 1004 MB |
| ggml Q5_0 (T3 + S3Gen) | 893 MB |
| ggml Q4_0 (T3 + S3Gen) | 857 MB |
Per-stage wall-clock (median of 3 runs, milliseconds):
| Pipeline | T3 load | T3 gen | S3Gen load | S3Gen gen | Audio | Total | RTF (total) |
|---|---|---|---|---|---|---|---|
| ggml Q4_0 | 213 | 1790 | 366 | 1998 | 6480 | 4455 | 0.69 |
| ggml Q5_0 | 231 | 1966 | 353 | 2002 | 6640 | 4641 | 0.70 |
| ggml Q8_0 | 305 | 2047 | 370 | 2001 | 6560 | 4823 | 0.73 |
| ggml F16 | 468 | 2691 | 364 | 1928 | 6560 | 5562 | 0.85 |
| ONNX q4 | ~4250 (4 files, serialized) | — | — | ~6830 | 5880 | 11050 | 1.88 |
(ONNX Runtime's backend doesn't expose a comparable per-sub-model
breakdown, so its load figure is the wall-clock time of model.load(),
which covers ORT initialisation across all four .onnx files, and its
gen figure is the duration of the single model.run() call.)
Aggregated: load vs. generate, load+gen together:
| Pipeline | Load | Generate | Total wall | RTF (total) |
|---|---|---|---|---|
| ggml Q4_0 | 579 ms | 3788 ms | 4455 ms | 0.69 |
| ggml Q5_0 | 584 ms | 3968 ms | 4641 ms | 0.70 |
| ggml Q8_0 | 675 ms | 4048 ms | 4823 ms | 0.73 |
| ggml F16 | 832 ms | 4619 ms | 5562 ms | 0.85 |
| ONNX q4 | 4250 ms | 6830 ms | 11050 ms | 1.88 |
Headline numbers (best ggml variant vs ONNX):
- Load: ggml Q4_0 is 7.3× faster — 579 ms vs 4250 ms. The four ONNX files initialise serially and each one does its own tensor plumbing; ggml mmaps the two GGUFs and rebinds through the unified backend buffer in ~half a second total.
- Generate: ggml Q4_0 is 1.8× faster — 3788 ms vs 6830 ms.
- Total (load + generate): ggml Q4_0 is 2.48× faster — 4.46 s vs 11.05 s.
- Even ggml F16 beats ONNX q4 on total wall (5.56 s vs 11.05 s, 1.99× faster) despite carrying 2× the weights — the ONNX backend loses to an un-quantized ggml build on the same CPU.
- RTF < 1 (faster than real-time) holds for every ggml variant tested; ONNX sits at RTF 1.88 for this prompt, i.e. slower than real-time.
Numbers are for a ~6 s utterance; the ggml pipeline's ~2 s of fixed S3Gen+HiFT cost amortizes better on longer input, so the gap widens in ggml's favour as prompt length grows.
CPU performance was already past real-time, but a lot of the T3 and CFM work is embarrassingly parallel, so enabling the GGML GPU backends was the obvious next step. Touched three files:
- CMakeLists.txt: added a GGML_VULKAN propagation block mirroring the existing GGML_CUDA / GGML_METAL ones.
- src/main.cpp: extended init_backend(n_gpu_layers) with a ggml_backend_vk_init(0) path guarded by #ifdef GGML_USE_VULKAN. CUDA / Metal paths were already there.
- src/chatterbox_tts.cpp: added a symmetric s3gen_init_backend so the S3Gen side honours the same --n-gpu-layers flag, plus a new n_gpu_layers field on s3gen_synthesize_opts.
Two op-level changes in our code were required because Metal's dispatcher didn't have those ops (the actual Metal kernel fixes land in §3.12):
- T3 attention: ggml_soft_max(ggml_diag_mask_inf(ggml_scale(KQ, s), n_past)) → ggml_soft_max_ext(KQ, mask, s, 0.0f) with an explicit [n_kv, N] causal mask tensor uploaded from eval_prompt. The step path (N=1) passes a null mask. No-op for CPU / Vulkan; necessary for Metal.
- S3Gen zero padding: 6 call sites used ggml_pad_ext with non-zero front padding. Added a zero_pad_dim0(ctx, x, p_front, p_back) helper that expresses the same semantics via concat(scale(view, 0.0f), x) so it runs on every backend with well-defined zeros.
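The explicit causal mask that replaces diag_mask_inf can be described in a few lines. A sketch in Python, assuming the usual additive convention (0.0 where attention is allowed, -inf where it is not); the function name is hypothetical, and the real mask is a ggml tensor uploaded from eval_prompt.

```python
NEG_INF = float("-inf")


def causal_mask(n_kv, n_tokens, n_past):
    """Additive mask with one row per new token and one column per KV slot.

    Row i sits at absolute position n_past + i and may attend to key
    columns j <= n_past + i; later columns get -inf.
    """
    return [[0.0 if j <= n_past + i else NEG_INF for j in range(n_kv)]
            for i in range(n_tokens)]


# Prompt path: 3 tokens, empty cache.
for row in causal_mask(n_kv=3, n_tokens=3, n_past=0):
    print(row)

# Step path: 1 new token with 4 cached positions, nothing is masked.
print(causal_mask(n_kv=5, n_tokens=1, n_past=4))
```

The second call shows why the step path can pass a null mask: with N=1 every KV position is at or before the query, so the mask would be all zeros anyway.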
First result on the Linux remote (RTX 5090 + Vulkan), same 10 s sentence as §3.10:
| Variant | T3 load | T3 gen | S3Gen load | S3Gen gen | Audio | gen_RTF | Wall |
|---|---|---|---|---|---|---|---|
| Vulkan F16 | 562 ms | 600 ms | 490 ms | 279 ms | 10.5 s | 0.08 | 2.10 s |
| Vulkan Q8_0 | 450 ms | 557 ms | 472 ms | 272 ms | 10.6 s | 0.08 | 1.91 s |
| Vulkan Q5_0 | 348 ms | 562 ms | 470 ms | 276 ms | 10.9 s | 0.08 | 1.82 s |
| Vulkan Q4_0 | 331 ms | 522 ms | 493 ms | 275 ms | 10.3 s | 0.08 | 1.78 s |
Quantization shortens T3 load noticeably but barely moves
inference: T3 is autoregressive (decoding one token at a time leaves
most of a 5090's lanes idle) and S3Gen is already short. End-to-end
wall time goes from 8.17 s (CPU F16) to 1.78 s (Vulkan Q4_0) for the
same 10 s of audio. gen_RTF = 0.08, i.e. about 13× faster than real-time.
On the M3 Ultra Metal side, things didn't fly immediately: T3 aborted
on the first attention layer with unsupported op 'DIAG_MASK_INF',
then S3Gen aborted with unsupported op 'PAD'. Once those two
op-level workarounds above were in place, HiFT decode was completing
but taking ~15 s for 1.2 s of audio — Metal's
conv_transpose_1d kernel is pathological for HiFT-sized inputs.
Pragmatic interim fix: when the main backend is Metal, load a second
CPU copy of the S3Gen GGUF and route run_f0_predictor,
run_stft, and run_hift_decode through it. Encoder + CFM still run
on Metal. Costs ~1 GB extra RAM but brings Metal gen_RTF to ~0.25.
That's what was committed as 795963a ("backend: enable Vulkan + Metal
for T3 and S3Gen").
To get rid of the CPU fallback for HiFT and close the gap with
Vulkan, patched ggml/src/ggml-metal/ itself. The patch is shipped
as patches/ggml-metal-chatterbox-ops.patch (based on upstream
58c3805, sync : llama.cpp); the main README instructs a fresh
clone to git apply it after cloning ggml.
A new test-metal-ops binary runs each patched kernel against the
CPU reference at HiFT-realistic shapes. All cases pass with
max_abs ≤ 1.5e-6.
Patch 1 — DIAG_MASK_INF on Metal (was: op simply absent from
the dispatcher):
- New kernel_diag_mask_inf_f32: ports the CUDA formulation (dst[i] = src[i] - (col > n_past + row % rows_per_channel) * FLT_MAX) so downstream softmax yields proper zeros.
- New ggml_metal_kargs_diag_mask_inf, library pipeline getter, op encoder, dispatcher case, and supports_op entry.
Patch 2 — PAD with front padding (was: kernel ignored
op_params[0,2,4,6] which is where ggml_pad_ext stores the front
amounts; supports_op hard-rejected any non-zero front pad):
- Extended ggml_metal_kargs_pad with lp0..lp3.
- Rewrote kernel_pad_f32 to translate each output coord by i0x = i0 - lp0 etc., and write 0.0 outside [0, ne00).
- Relaxed supports_op to src0->type == F32 && dst->type == F32.
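The coordinate translation in the rewritten pad kernel is easiest to see in one dimension. A Python sketch of the same idea (the function name is hypothetical; the real kernel is Metal and handles four dimensions with lp0..lp3):

```python
def pad_1d(src, lp0, rp0):
    """Zero-pad with lp0 elements in front and rp0 behind.

    Each output coordinate i0 maps back to source coordinate i0 - lp0;
    anything that lands outside [0, ne00) is written as 0.0.
    """
    ne00 = len(src)
    out = []
    for i0 in range(lp0 + ne00 + rp0):
        i0x = i0 - lp0
        out.append(src[i0x] if 0 <= i0x < ne00 else 0.0)
    return out


print(pad_1d([1.0, 2.0, 3.0], lp0=2, rp0=1))
```

With lp0 = 0 the mapping collapses to i0x = i0, which is the back-padding-only behaviour the old kernel supported.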
Patch 3 — CONV_TRANSPOSE_1D speedup (was: ~100× slower than
CPU on HiFT-sized inputs):
The old kernel was scalar — one thread per output pixel, iterating
over the full IC × IL inputs inside a branch if (ol >= i*s0 && ol < i*s0 + K). Two orthogonal fixes:
- Tighten the input-position loop to only the i values that actually contribute. For fixed ol, valid i is [max(0, ⌈(ol - K + 1)/s0⌉), min(IL-1, ol/s0)], which is at most K/s0 + 1 iterations. On ups[0] (s0=8, K=16, IL≈130) this collapses the inner loop from 130 iterations → 3.
- Parallelise IC across a 32-thread simdgroup and reduce with simd_sum. Host-side dispatch widens from 1 thread per threadgroup → 32 (one simdgroup).
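The loop-tightening fix can be checked against the brute-force loop on a single channel. A sketch in Python with hypothetical function names; it covers only the index-range part of the patch (not the simdgroup reduction) and assumes no padding or dilation, so the output length is (IL - 1) * s0 + K.

```python
def conv_transpose_1d_naive(x, w, s0):
    """Old style: for every output sample, scan every input position."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for ol in range(len(out)):
        for i in range(IL):
            if i * s0 <= ol < i * s0 + K:
                out[ol] += x[i] * w[ol - i * s0]
    return out


def contributing_range(ol, IL, K, s0):
    """Inclusive [lo, hi] range of input positions that reach output ol."""
    lo = max(0, -((K - 1 - ol) // s0))  # ceil((ol - K + 1) / s0) via floor division
    hi = min(IL - 1, ol // s0)
    return lo, hi


def conv_transpose_1d_tight(x, w, s0):
    """New style: visit only the input positions that actually contribute."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for ol in range(len(out)):
        lo, hi = contributing_range(ol, IL, K, s0)
        for i in range(lo, hi + 1):
            out[ol] += x[i] * w[ol - i * s0]
    return out
```

Both versions add the same products in the same order, so their outputs are identical; only the number of loop iterations changes.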
Measured on M3 Ultra, HiFT decode (part of a 10 s sentence):
hift_decode: 15021 ms → 350 ms (≈ 40× speedup)
gen_RTF : 0.25 → 0.18 (CPU-fallback removed)
wall : 3.36 s → 2.51 s
With the patch applied and the CPU-fallback for HiFT removed, end-to-end on the M3 Ultra for the same 10 s sentence, seed 42, averaged over 3 runs:
| Variant | T3 load | T3 gen | S3Gen load | S3Gen gen | gen_RTF | Wall |
|---|---|---|---|---|---|---|
| Metal F16 | 280 ms | 1326 ms | 295 ms | 577 ms | 0.19 | 2.51 s |
| Metal Q8_0 | 216 ms | 1330 ms | 302 ms | 598 ms | 0.18 | 2.48 s |
| Metal Q5_0 | 186 ms | 1393 ms | 293 ms | 611 ms | 0.19 | 2.51 s |
| Metal Q4_0 | 175 ms | 1274 ms | 295 ms | 594 ms | 0.18 | 2.36 s |
Autoregressive T3 now dominates wall time (T3_INFER ≈ 1.3 s of
~260 tokens at one-token-at-a-time on a 60-core Apple GPU) — that's
the next thing to chip away at. On the 5090 the same token stream
runs in ~0.55 s because the shader count is ~360× higher.
Committed as 894c4b1 ("metal: patch ggml to fix diag_mask_inf,
pad_ext, conv_transpose_1d"). I'm not a fan of forking ggml just
for this, so the patch is tiny and easy to drop once upstream picks
up equivalent fixes; see patches/README.md for what to do in that
case.
After §3.11 / §3.12 the dominant wall-clock cost in Chatterbox became
T3's autoregressive step (≈ 1.3 s of a ~2.4 s run on Metal M3 Ultra
Q4_0). An earlier attempt to swap the explicit
soft_max_ext(mul_mat(K,Q), mask) + mul_mat(V_trans) chain for
ggml_flash_attn_ext ran into a deal-breaker: the KV cache was laid
out [HD, n_head, n_ctx] per layer but flash_attn_ext wants
[HD, n_ctx, n_head]. Every step had to ggml_cont(ggml_permute(K))
over a tensor that grew with n_past, and the extra kernel dispatches
wiped out FA's savings.
Fix: store the cache the way FA reads it.

- Same total size per layer (`HD * n_ctx * n_head` == `n_embd * n_ctx`), so no allocation changes.
- Write path (step or prompt): Kcur / Vcur are viewed as `[HD, n_head, N]`, permuted to `[HD, N, n_head]`, then one `ggml_cpy` per tensor into a strided cache view at `[HD, n_past:n_past+N, n_head]`. For the step path N=1 the permute is a no-op in memory.
- Read path: `ggml_view_3d(memory_k, HD, L, n_head, nb=[4, HD*4, HD*n_ctx*4], offset=il*layer_size)` is exactly the shape FA needs, with no `permute + cont`.
- Mask: switched from F32 to F16 (ggml FA requires F16 on Metal; other backends accept it too). N=1 path passes `nullptr` since every KV position is in the past.

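
The index arithmetic behind the new layout can be sketched on a flat Python list. This is an illustration with toy sizes, not the ggml code; `cache_index`, `write_step` and `read_view` are hypothetical helpers, and element strides `(1, HD, HD*n_ctx)` correspond to the byte strides `[4, HD*4, HD*n_ctx*4]` for F32.

```python
HD, N_CTX, N_HEAD = 4, 8, 2  # toy sizes, not the real T3 dimensions


def cache_index(d, pos, head):
    """Flat element offset for a cache laid out as [HD, n_ctx, n_head]
    (ggml order: the first dimension is the fastest-moving one)."""
    return d + HD * pos + HD * N_CTX * head


def write_step(cache, kcur, n_past):
    """Store one token's K, given as kcur[head][d], at position n_past."""
    for head in range(N_HEAD):
        for d in range(HD):
            cache[cache_index(d, n_past, head)] = kcur[head][d]


def read_view(cache, L):
    """First L positions as [n_head][L][HD], read in place: every row
    is already a contiguous run of HD elements, so nothing is reordered."""
    return [
        [cache[cache_index(0, p, h):cache_index(0, p, h) + HD] for p in range(L)]
        for h in range(N_HEAD)
    ]
```

For a single-token step the write touches `n_head` runs of `HD` contiguous elements, which is why the N=1 permute costs nothing in memory.
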
Measured on M3 Ultra, same 10 s sentence, seed 42, --threads 20,
--n-gpu-layers 99, averaged over 3 warm runs:
| Variant | T3 infer before | T3 infer after | Δ | Wall before | Wall after | gen_RTF |
|---|---|---|---|---|---|---|
| F16 | 1372 ms | 983 ms | −28 % | 2.51 s | 2.15 s | 0.189 → 0.157 |
| Q8_0 | 1371 ms | 985 ms | −28 % | 2.48 s | 2.12 s | 0.182 → 0.149 |
| Q5_0 | 1445 ms | 1063 ms | −26 % | 2.51 s | 2.18 s | 0.186 → 0.152 |
| Q4_0 | 1274 ms | 965 ms | −24 % | 2.36 s | 2.06 s | 0.176 → 0.144 |
And the same change on Vulkan 5090 (Linux remote):
| Variant | T3 infer before | T3 infer after | Δ |
|---|---|---|---|
| F16 | 600 ms | 410 ms | −32 % |
| Q4_0 | 522 ms | 356 ms | −32 % |
So the new layout is not just a Metal-shaped win: it speeds up every
GPU backend, because the previous permute + cont per layer per step
was cheap on NVIDIA but still not free. CPU builds see a similar graph
shape (fewer intermediate nodes) and stay neutral.
Output sampling is not bit-exact against the old path: FA runs its own internal reductions in different order and the mask lives in F16 instead of F32, so token counts can shift by ±2 % (e.g. F16 went from 248 → 244 tokens on the bench prompt). Audio remains perceptually identical; this is the same kind of drift that moving to FA causes anywhere else in ggml.
Committed as part of the Metal optimization sequence alongside the
earlier patches/ggml-metal-chatterbox-ops.patch.
After §3.13, each T3 attention layer still did two ggml_conts on Q
per step: one cont_3d to densify the strided view of Qcur, and an
outer cont after the head-permute. Both turn into
kernel_cpy_f32_f32 dispatches on Metal.
Observation: the entire QKV output cur is already contiguous. Q,
K, and V are just fixed byte offsets into the same tensor (0,
n_embd * 4, 2 * n_embd * 4 respectively). With Metal's
flash_attn_ext accepting non-contiguous Q via explicit strides (the
same flexibility I used for K/V in §3.13), I can drop both conts and
express Q directly as a ggml_view_3d with layout [HD, N, n_head]:
nb0 = 4, nb1 = 3 * n_embd * sizeof(float), nb2 = HD * sizeof(float)
Same trick for the Kcur/Vcur sources that go into the KV-cache write path — one view each, no permute + cont pair.
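
A quick way to convince yourself the strides are right is to index a fused QKV buffer by byte offset. The sketch below uses toy sizes and stores a tuple per element so each lookup is self-describing; `make_qkv` and `view_elem` are invented for the example and stand in for the `ggml_view_3d` call.

```python
N_HEAD, HD = 2, 3            # toy sizes
N_EMBD = N_HEAD * HD
F32 = 4                      # sizeof(float)


def make_qkv(n_tokens):
    """Flat fused QKV buffer: per token, Q then K then V, each n_embd wide.
    Every element records (token, which, head, d) for easy checking."""
    buf = []
    for t in range(n_tokens):
        for which in range(3):          # 0 = Q, 1 = K, 2 = V
            for h in range(N_HEAD):
                for d in range(HD):
                    buf.append((t, which, h, d))
    return buf


def view_elem(buf, which, d, t, h):
    """Element (d, t, h) of a [HD, N, n_head] view into the fused buffer."""
    nb0, nb1, nb2 = F32, 3 * N_EMBD * F32, HD * F32
    offset = which * N_EMBD * F32       # 0, n_embd*4, 2*n_embd*4
    byte = offset + d * nb0 + t * nb1 + h * nb2
    return buf[byte // F32]
```

Every `(d, t, h)` lookup lands on the matching element without any copy, which is what lets the view replace the two `cont` nodes.
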
Removes 48 kernel dispatches per step (2 conts on Q × 24 layers); since T3
step time on Metal is almost entirely dispatch-bound at ~9 µs each,
this shows up straight in the numbers.
Measured on M3 Ultra (same 10 s sentence, seed 42, 3-run warm average):
| Variant | T3 infer §3.13 | T3 infer §3.14 | Δ | Wall §3.13 | Wall §3.14 |
|---|---|---|---|---|---|
| F16 | 983 ms | 909 ms | −7.5% | 2.15 s | 2.08 s |
| Q8_0 | 985 ms | 906 ms | −8.0% | 2.12 s | 2.03 s |
| Q5_0 | 1063 ms | 984 ms | −7.4% | 2.18 s | 2.09 s |
| Q4_0 | 965 ms | 886 ms | −8.2% | 2.06 s | 1.98 s |
Vulkan RTX 5090 sees <3 % change in T3 infer — dispatch overhead is much smaller there relative to the actual compute, so there's less to save. No regression on Vulkan, and the code simplifies. CPU stays neutral (same graph topology, fewer intermediate nodes).
Sampling output is not bit-exact against §3.13 either — same reason as before, FA reductions are sensitive to operand stride. Token counts shift within ±1 % at the same seed.
Even after §3.14 the T3 step path still dispatched two Metal kernels
per linear layer — mul_mv for the matmul itself, then bin_fuse for
the following add(bias). T3 has 4 such linears per layer
(QKV proj, attn proj, MLP fc, MLP proj) × 24 layers = 96 extra bias
kernels per step. At ~9 µs dispatch overhead on M3 Ultra that's
~900 µs/step / ~240 ms over a 260-token generation.
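
The estimate is plain multiplication; a small sketch makes the inputs explicit. `dispatch_cost` is a hypothetical helper, and the 9 µs and 260-token figures are the approximate values quoted above.

```python
def dispatch_cost(kernels_per_step, us_per_dispatch, n_tokens):
    """Return (microseconds per step, milliseconds over the generation)."""
    per_step_us = kernels_per_step * us_per_dispatch
    return per_step_us, per_step_us * n_tokens / 1000.0


linears_per_layer, n_layers = 4, 24
per_step_us, total_ms = dispatch_cost(linears_per_layer * n_layers, 9.0, 260)
print(per_step_us, total_ms)
```

The unrounded result sits slightly below the rounded ~900 µs / ~240 ms figures, which is the expected direction for a back-of-envelope bound.
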
Patched ggml-metal to fuse these directly inside the mul_mv kernel
(third addition to patches/ggml-metal-chatterbox-ops.patch):

- New function constant `FC_mul_mv_has_bias` at `FC_MUL_MV + 2`.
- Each Q-variant top-level kernel (`kernel_mul_mv_q4_0_f32`, `_q4_1_f32`, `_q5_0_f32`, `_q5_1_f32`, `_q8_0_f32`) picks up an extra `device const char * bias` buffer argument and calls a tiny `helper_mv_add_bias<NR0>` immediately after the existing impl. The post-pass only runs when the function constant is true and only one thread per row does the add (no cross-threadgroup synchronisation needed; each threadgroup writes and then reads back only its own output rows).
- `ggml_metal_op_mul_mat` gets a `ctx->use_fusion && kernel_supports_bias` look-ahead: if the next op is an `ADD` with a contiguous F32 `[ne0, 1]` bias, we compile the pipeline with `has_bias=true`, bind the bias buffer to slot 4, redirect the matmul's `dst` to the ADD's output tensor, and return `n_fuse=2` so the dispatcher skips the ADD. The shared pipeline name (`…_bias=1`) makes the fused variant cache-coherent with the non-fused one.
- For kernels not yet wired (F16/BF16 `mul_mv_t_t`, the `_4` SIMD variants, all the K-quants and IQ variants) the fusion is suppressed by `kernel_supports_bias`, the pipeline compiles with `has_bias=false`, and the kernel's `if (FC_mul_mv_has_bias)` is dead-code eliminated. MoE `mul_mv_id` keeps calling the original impl via `mmv_fn` unchanged; the impl signature itself was not touched.

Measured on M3 Ultra, 10 s sentence, seed 42, 3-run warm average:
| Variant | T3 before §3.15 | T3 after §3.15 | Δ | Wall before | Wall after |
|---|---|---|---|---|---|
| F16 | 909 ms | 915 ms | ~flat | 2.08 s | 2.26 s |
| Q8_0 | 906 ms | 819 ms | −9.6% | 2.03 s | 2.02 s |
| Q5_0 | 984 ms | 840 ms | −14.6% | 2.09 s | 1.96 s |
| Q4_0 | 886 ms | 766 ms | −13.5% | 1.98 s | 1.87 s |
F16 is flat because the kernel it hits (mul_mv_f16_f32_4) isn't in
the supported list yet; extending to those variants is a mechanical
follow-up (touches helper_mv_reduce_and_write + the 3 _t_t /
_t_t_4 / _t_t_short templates in the same way).
Vulkan RTX 5090 unchanged (347 → 343 ms on Q4_0 — noise). CPU unaffected (Metal-only change).
Total Metal Q4_0 journey (pre-FA → end of §3.15):
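| Stage | T3 infer | Wall | gen_RTF | Δ T3 infer vs pre-FA |
|---|---|---|---|---|
| pre-FA | 1274 ms | 2.36 s | 0.176 | |
| §3.13 FA+KV | 965 ms | 2.06 s | 0.144 | -24% |
| §3.14 Q views | 886 ms | 1.98 s | 0.131 | -30% |
| §3.15 bias fn | 766 ms | 1.87 s | 0.119 | -40% |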
40 % faster T3 inference, 21 % faster end-to-end wall than the pre-optimization baseline on the same M3 Ultra — all via Metal kernel + graph-shape changes, no model changes.
While investigating whether the §3.15 fusion could also apply to Vulkan and CPU, two findings:

- Vulkan already has it. `ggml_vk_can_fuse` in upstream recognises `MUL_MAT + ADD` and `MUL_MAT + ADD + ADD`, and the mat-vec shaders (`vulkan-shaders/mul_mat_vec_iface.glsl`) have dedicated `Fuse0` / `Fuse1` buffer bindings for the two optional adds. Running `GGML_VK_DISABLE_FUSION=1` on the 5090 pushes T3 Q4_0 from 346 → 413 ms (3-run avg), a real 16 % speedup that was silently helping us before. Nothing to add on Vulkan.
- CPU has no op-level fusion framework. But it also has ~zero per-op dispatch overhead (ggml-cpu just calls the next op's compute function directly), and the matmul output stays in L1 cache (`n_embd=1024` × 4 B = 4 KB) so the intermediate round-trip is essentially free. Estimated gain from fusion: < 1 %. Not worth the plumbing work.

That left Metal, where §3.15 covered MUL_MAT + ADD(bias) but not the
3-op form MUL_MAT + ADD(bias) + ADD(residual) used by T3's attn-proj
and MLP-proj linears. Extended the Metal patch to match Vulkan's
fusion surface:

- New function constant `FC_mul_mv_has_residual` at `FC_MUL_MV + 3`.
- Each Q-variant top-level kernel gains a second buffer binding (`device const char * residual` at slot 5). `helper_mv_add_bias` now applies both the bias broadcast and the per-element residual add; both branches are gated on their respective function constants so non-fused call sites specialise them away.
- `ggml_metal_op_mul_mat` tries `{MUL_MAT, ADD, ADD}` first (requires bias-shaped src1 on ADD1 and full-shape F32-contiguous on ADD2), falls back to `{MUL_MAT, ADD}` from §3.15. Returns `n_fuse=3` / `n_fuse=2` accordingly.
- Pipeline names now carry `_bias=?_res=?` so fused/non-fused variants are cached independently by the library.

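
The try-3-then-2 look-ahead can be modelled on a toy graph. This is a simplified Python sketch, not the ggml-metal code: nodes are plain dicts, `tensor`, `_is_add_of` and `n_fuse` are hypothetical names, and the real checks cover more conditions than the three modelled here.

```python
def tensor(shape, type_="F32", contiguous=True):
    """Toy tensor descriptor used as an ADD's second operand."""
    return {"shape": tuple(shape), "type": type_, "contiguous": contiguous}


def _is_add_of(nodes, j, producer, operand_shape):
    """True if nodes[j] is ADD(nodes[producer], contiguous F32 operand_shape)."""
    if j >= len(nodes):
        return False
    n = nodes[j]
    if n["op"] != "ADD" or n["src0"] != producer:
        return False
    s = n["src1"]
    return s["type"] == "F32" and s["contiguous"] and s["shape"] == tuple(operand_shape)


def n_fuse(nodes, i, use_fusion=True, kernel_supports_bias=True):
    """How many nodes starting at i can be emitted as a single kernel."""
    mm = nodes[i]
    if mm["op"] != "MUL_MAT" or not (use_fusion and kernel_supports_bias):
        return 1
    ne0 = mm["shape"][0]
    if not _is_add_of(nodes, i + 1, i, (ne0, 1)):      # bias-shaped ADD
        return 1
    if _is_add_of(nodes, i + 2, i + 1, mm["shape"]):   # full-shape residual ADD
        return 3
    return 2
```

Returning the fused count lets a dispatcher advance past the consumed ADD nodes, and the fallback order means a linear without a residual still gets the two-op fusion.
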
Correctness bug caught while writing the 3-op variant. §3.15's
helper had if (tiisg != 0 || sgitg != 0) return;, so only simdgroup
0 added bias. That's correct for Q8_0 (all simdgroups cooperate on
the same r0) but wrong for Q4/Q5 where each simdgroup writes
its own r0 = (tgpig.x*NSG + sgitg)*NR0, silently dropping bias from
the rows computed by simdgroups ≥ 1. Output was "close enough" to
sound right but not numerically correct. Fixed by moving the
sgitg gate to the callers: Q-n kernels call the helper from every
simdgroup with their own r0; Q8_0 wraps the call in
if (sgitg == 0). Token counts snapped back to the pre-fusion
trajectory once this was right.
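
The dropped-rows effect is easy to reproduce by enumerating which rows reach the bias add. The sketch below is a toy model with made-up `NSG` / `NR0` values; `rows_with_bias` is a hypothetical helper that only models the case where every simdgroup owns its own `r0`.

```python
def rows_with_bias(n_threadgroups, NSG, NR0, gate_in_helper):
    """Rows that receive the bias add when each simdgroup owns its own r0.

    gate_in_helper=True models the original early return for sgitg != 0;
    gate_in_helper=False models calling the helper from every simdgroup.
    """
    rows = set()
    for tg in range(n_threadgroups):
        for sgitg in range(NSG):
            if gate_in_helper and sgitg != 0:
                continue  # the early return that skipped simdgroups >= 1
            r0 = (tg * NSG + sgitg) * NR0
            rows.update(range(r0, r0 + NR0))
    return rows


buggy = rows_with_bias(n_threadgroups=2, NSG=2, NR0=4, gate_in_helper=True)
fixed = rows_with_bias(n_threadgroups=2, NSG=2, NR0=4, gate_in_helper=False)
print(sorted(set(range(16)) - buggy))  # rows that silently lost their bias
```

With two simdgroups per threadgroup, half of the rows miss the bias under the old gate, which matches the "close but not numerically correct" symptom.
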
Measured on M3 Ultra, 10 s sentence, seed 42, 3-run warm average:
| Variant | T3 before §3.16 | T3 after §3.16 | Δ | Wall before | Wall after |
|---|---|---|---|---|---|
| F16 | 915 ms | 913 ms | flat | 2.26 s | 2.27 s |
| Q8_0 | 819 ms | 794 ms | −3 % | 2.02 s | 1.94 s |
| Q5_0 | 840 ms | 873 ms | +4 % | 1.96 s | 2.01 s |
| Q4_0 | 766 ms | 770 ms | flat | 1.87 s | 1.88 s |
Smaller than the headline "save 48 dispatches × 9 µs" estimate
suggested, because Metal's scheduler overlaps consecutive small
dispatches — the bin_fuse the fused kernel replaces was already
running concurrently with later work. Q8_0 still sees a clean 3 %
win; Q4/Q5 are noise after accounting for token-count drift. Still
worth committing: matches Vulkan's fusion surface, fixes the latent
§3.15 bias correctness bug, and closes the last dispatch-per-linear
gap vs Vulkan.
The CLI had always been single-shot (pass --text, get one wav),
which meant anything "keep the model warm and speak whatever I send"
required re-spawning the binary per request. Added a long-running
mode driven by --input-file PATH: the binary tail -f's PATH,
splits on sentence terminators, and pipes raw PCM (s16le @ 24 kHz)
to stdout chunk-by-chunk.
Key details that came up during the implementation:

- `fread` + `clearerr` doesn't tail-follow on macOS. Once the stdio `FILE*` hits EOF, the readahead buffer can keep returning 0 from `fread` for many subsequent calls even after the writer has appended new bytes and `clearerr()` has been called. Switched to `open()` + `read()` on a plain fd so the kernel is always consulted for the current file state; this fixed the "second process's writes get dropped" symptom.
- Accept `<.!?>` followed by an uppercase letter as a sentence break, in addition to the original `<.!?>` + whitespace / newline / end-of-input. LLMs / transcribers that pack sentences back-to-back without a space ("Hello.World.Foo.") were otherwise bundling everything into one enormous utterance.
- Interactive stdin mode: `--input-file -` reads from `STDIN_FILENO` directly (no `open("/dev/stdin")` which gets a fresh-offset fd on some systems). When stdin is a TTY, the binary prints a `>` prompt on stderr (so it can't collide with the raw PCM stream on stdout), wraps the `read()` in a `select()` with a 25 ms poll so SIGINT is noticed without the user also having to press Enter, and re-prompts after each synthesised sentence. Single process, pipe stdout straight to `sox play`, type a sentence, hear it back.
- `--input-by-line` line mode: one newline = one request. Internal `. ! ?` are treated as prosody, not as hard boundaries, so "Hello there. How are you today?" becomes a single T3 run instead of two runs with a 150 ms gap between them. Saves the inter-sentence restart cost and produces more natural delivery when the upstream emits complete thoughts per line.
- T3 early-stop auto-retry was also hit in live mode. The batch pipeline already replays segments when T3 samples `stop_speech_token` suspiciously early (symptom: a cloned voice clips the first or last word of a sentence). Lifted the same `min_tokens = max(8, bpe_tokens * 5)`, three-attempt, keep-longest guard into the live `synth_sentence`.
- Skip pure-punctuation input. With the various split heuristics, it was possible to route a single `.` through T3 (on a TTY: the user hits Enter with an empty buffer, punc_norm fills in a period). T3 then hallucinates ~1.4 s of speaker-biased audio that can sound like a word from the previous utterance. The live path now drops any sentence whose punc-normalised form contains no alphanumeric characters, with a `[skipped: no word characters]` notice on TTY.
- Knob cleanup. Removed `--input-flush-ms` (idle-flush mid-buffer was only useful when the terminator set was limited to `.!?` and got obsoleted by `--input-by-line` + explicit `\n`) and `--input-poll-ms` (hard-coded to 25 ms, well below perception). One less thing to think about for users; one less thing to get wrong.

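
The split rule and the punctuation-only skip can be sketched in a few lines of Python. This is an illustration of the heuristics as described, not the C++ implementation; `split_sentences` and `has_word_chars` are hypothetical names, and unlike a real tail-following reader this version treats the end of the string as end-of-input and flushes any unterminated tail.

```python
TERMINATORS = ".!?"


def split_sentences(buf):
    """Split on a terminator followed by whitespace, end of input,
    or an uppercase letter (sentences packed back-to-back)."""
    out, start = [], 0
    for i, ch in enumerate(buf):
        if ch not in TERMINATORS:
            continue
        nxt = buf[i + 1] if i + 1 < len(buf) else ""
        if nxt == "" or nxt.isspace() or nxt.isupper():
            piece = buf[start:i + 1].strip()
            if piece:
                out.append(piece)
            start = i + 1
    tail = buf[start:].strip()
    if tail:
        out.append(tail)
    return out


def has_word_chars(sentence):
    """False for input such as '.' that should never reach T3."""
    return any(ch.isalnum() for ch in sentence)


for s in split_sentences("Hello.World.Foo. ..."):
    print(s if has_word_chars(s) else "[skipped: no word characters]")
```

A terminator followed by a digit or lowercase letter (as in "3.14") is deliberately not a break under this rule.
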
Commits: 00bfd7f (fread→read fix), 189fe9d (interactive stdin),
9e1b101 (T3 retry port), dc0b5e1 (punctuation-only skip),
e0af5e9 (--input-by-line), d843a59 / cff89ae (knob cleanup).
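
The early-stop guard ported in 9e1b101 boils down to "retry up to three times, keep the longest". A hedged Python sketch follows; `run_t3` is a stand-in callback for one T3 generation, and how each attempt varies (for example by seed) is an assumption, not something stated above.

```python
def min_tokens(bpe_tokens):
    """Shortest speech-token count that is not treated as an early stop."""
    return max(8, bpe_tokens * 5)


def synth_with_retry(run_t3, bpe_tokens, attempts=3):
    """Call run_t3(attempt) up to `attempts` times and keep the longest
    result, stopping as soon as one reaches the minimum length."""
    best = []
    for attempt in range(attempts):
        tokens = run_t3(attempt)
        if len(tokens) > len(best):
            best = tokens
        if len(best) >= min_tokens(bpe_tokens):
            break
    return best
```

Keeping the longest attempt rather than the last one means a run that never reaches the threshold still returns the least-clipped candidate.
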
Every voice-cloning debug session ended the same way: probe the
source with ffprobe, scan with silencedetect, eyeball the output
for the longest clean region, pick an -ss/-t, iterate on the
ffmpeg filter chain until the clone stopped sounding wrong, optionally
bake the .npy profile. Scripted the whole thing.
./scripts/extract-voice.py INPUT [--name NAME] [--target SEC] [--bake]
does:

1. `ffprobe` for duration, codec, bitrate.
2. `ffmpeg silencedetect=noise=-30dB:d=0.3` to split into speech regions.
3. Rank candidate windows: prefer a continuous slice from the middle of the longest region (speaker is warmed up, hasn't started wrapping up), fall back to concatenating the two best short blocks when no single block is ≥ target.
4. Pick a codec-aware filter chain:
   - clean (WAV / FLAC / ≥ 96 kbps AAC / ≥ 128 kbps MP3): `highpass=f=60, alimiter=limit=0.85:level=disabled`. Trusts the source.
   - lossy (Opus / Vorbis at any bitrate, or low-bitrate AAC / MP3): `highpass=f=60, afftdn=nr=6:nt=w, equalizer=f=200:w=150:g=-1, equalizer=f=3200:w=2200:g=2.5, equalizer=f=7500:w=2500:g=3, loudnorm=I=-18:TP=-2:LRA=8, alimiter=limit=0.85:level=disabled`. Denoises the codec hiss, puts a mild dip at 200 Hz to unmuddy, boosts presence around 2–4 kHz and air around 6–9 kHz to replace some of the content Opus' brick-wall low-pass throws away above ~8 kHz, loudness-normalises so the speaker embedding doesn't drift on the shouted-vs-whispered axis.
5. Emit `voices/<name>.wav` at 24 kHz mono s16le.
6. Optionally call `./build/tts-cli --save-voice` to bake the five `.npy` tensors.

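
The codec-aware choice in step 4 is a small decision function. The sketch below is not the script's actual code: `is_clean` and `pick_chain` are hypothetical names, the codec strings assume ffprobe-style `codec_name` values, and routing unrecognised codecs to the lossy chain is an assumption.

```python
CLEAN_CHAIN = ",".join([
    "highpass=f=60",
    "alimiter=limit=0.85:level=disabled",
])

LOSSY_CHAIN = ",".join([
    "highpass=f=60",
    "afftdn=nr=6:nt=w",
    "equalizer=f=200:w=150:g=-1",
    "equalizer=f=3200:w=2200:g=2.5",
    "equalizer=f=7500:w=2500:g=3",
    "loudnorm=I=-18:TP=-2:LRA=8",
    "alimiter=limit=0.85:level=disabled",
])


def is_clean(codec, bitrate_kbps):
    """True for sources the minimal chain can trust."""
    codec = codec.lower()
    if codec.startswith("pcm_") or codec == "flac":   # WAV / FLAC
        return True
    if codec == "aac":
        return bitrate_kbps >= 96
    if codec == "mp3":
        return bitrate_kbps >= 128
    return False  # opus, vorbis, and (assumption) anything unrecognised


def pick_chain(codec, bitrate_kbps):
    return CLEAN_CHAIN if is_clean(codec, bitrate_kbps) else LOSSY_CHAIN


print(pick_chain("opus", 18))
```

The returned string is in the comma-separated form that an ffmpeg `-af` argument accepts.
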
Commit: 84d2189.
The lossy chain is what took an 18 kbps Opus voice note from "clone sounds wrong" to "sounds like the speaker" during the Marco debug session. On clean-source material the minimal chain is usually sufficient and the EQ boosts would only add a mild bright tint.
Same 10 s sentence, seed 42, gen_RTF is inference-only (excludes
load time):
| Backend (weights) | T3 gen | S3Gen gen | gen_RTF | Wall | Real-time mult |
|---|---|---|---|---|---|
| CPU Linux (F16, 8 threads) | 3998 ms | 2905 ms | 0.70 | 8.17 s | 1.4× |
| Vulkan 5090 (F16) | 402 ms | 282 ms | 0.064 | — | 15.6× |
| Vulkan 5090 (Q4_0) | 347 ms | 284 ms | 0.058 | — | 17.1× |
| Metal M3 Ultra (F16) | 915 ms | 567 ms | 0.150 | 2.26 s | 6.7× |
| Metal M3 Ultra (Q4_0) | 766 ms | 596 ms | 0.128 | 1.87 s | 7.8× |
| ONNX q4 addon (CPU, Linux) | — (not exposed) | — | 1.06 | 13.91 s | 0.94× |
The ONNX addon is shown as a baseline because it's the current in-house reference TTS implementation. Every ggml configuration — including CPU F16 on the same host — beats it.
Everything up to this point in the journal was Chatterbox Turbo
(GPT-2 Medium T3, meanflow 2-step CFM, English BPE). §3.19 is the port
of ChatterboxMultilingualTTS (23-language Llama-520M T3 + perceiver
resampler + CFG-enabled standard 10-step CFM). Variant is auto-detected
from chatterbox.variant GGUF metadata at load time; Turbo stays
byte-identical to the pre-§3.19 builds.
What shipped (commit 3f0a8dac):

- `scripts/convert-t3-mtl-to-gguf.py`: packs `t3_mtl23ls_v2.safetensors` (30-layer Llama-520M + cond_enc perceiver + emotion_adv + learned pos embs + built-in voice + VE weights) and the raw grapheme tokenizer JSON into a single GGUF with `chatterbox.variant=t3_mtl` and the full Llama-3 RoPE scaling metadata baked in. `--quant f16|q8_0|q5_0|q4_0` on the big linears.
- `scripts/convert-s3gen-to-gguf.py` grew a `--variant {turbo,mtl}` flag. MTL loads `s3gen.pt` (standard CFM, no `time_embed_mixer`) and stamps `s3gen.meanflow=false, cfg_rate=0.7, n_timesteps=10`. Turbo path unchanged.
- `src/mtl_tokenizer.{h,cpp}` + `mtl_unicode_tables.inc`: self-contained BPE tokenizer mirroring HuggingFace's BPE loader + the Python preprocess (NFKD + UTF-8 lowercase + `[lang_id]` prefix + Korean Jamo decomposition). Tier-1 language support only (en, es, fr, de, it, pt, nl, pl, tr, sv, da, fi, no, el, ms, sw, ar, ko); ja/he/ru/zh/hi error out with a clear message. No external deps.
- `src/t3_mtl.{h,cpp}`: Llama-520M forward pass: RMSNorm + SwiGLU MLP + separate Q/K/V no-bias + RoPE-llama3 (NEOX half-split) + `flash_attn_ext` + dual KV cache for CFG. Cond assembly covers spkr_enc + Perceiver (32-query cross then self-attn, `AttentionBlock2` LN+bias, F32) + emotion_adv_fc + learned text/speech positional embeddings. Exposes stage builders (cond/text/inputs/layers/head) so the parity harness can inject Python-dumped intermediates at any boundary.
- `src/test_t3_mtl_stages.cpp`: staged parity harness (all stages pass within 5e-4 rel against the Python reference; logits land at 1.4e-3 rel, consistent with cumulative F16 drift through 30 layers).

Sampling path. chatterbox_sampling_params gained cfg_weight and
min_p. Sampler order in sample_next_token_mtl matches the Python
ChatterboxMultilingualTTS.generate default:
cfg_combine → rep_penalty → temp → min_p → top_p → (top_k) → multinomial.
CFG runs cond and uncond as two independent T3 forwards (dual KV cache,
memory_k{_uncond} / memory_v{_uncond} in the model struct), combined
at the logit level.
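
The sampler order can be sketched end to end in plain Python. Everything below is an assumption-laden illustration: the CFG combine `cond + w * (cond - uncond)`, the divide-or-multiply repetition penalty, and the min-p / top-p definitions are the commonly used forms, the default values are placeholders rather than Chatterbox's, and the optional top-k stage is omitted.

```python
import math
import random


def cfg_combine(cond, uncond, cfg_weight):
    return [c + cfg_weight * (c - u) for c, u in zip(cond, uncond)]


def rep_penalty(logits, history, penalty):
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]


def min_p_filter(probs, min_p):
    cutoff = min_p * max(probs)
    return [p if p >= cutoff else 0.0 for p in probs]


def top_p_filter(probs, top_p):
    total = sum(probs)
    kept, acc = set(), 0.0
    for i in sorted(range(len(probs)), key=lambda i: -probs[i]):
        kept.add(i)
        acc += probs[i] / total
        if acc >= top_p:
            break
    return [p if i in kept else 0.0 for i, p in enumerate(probs)]


def sample(cond, uncond, history, rng,
           cfg_weight=0.5, penalty=1.2, temp=0.8, min_p=0.05, top_p=1.0):
    """cfg_combine -> rep_penalty -> temp -> min_p -> top_p -> multinomial."""
    logits = cfg_combine(cond, uncond, cfg_weight)
    logits = rep_penalty(logits, history, penalty)
    logits = [x / temp for x in logits]
    probs = top_p_filter(min_p_filter(softmax(logits), min_p), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing an explicit `random.Random(seed)` keeps the draw reproducible, mirroring the role `--seed` plays in the CLI.
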
S3Gen dispatch. chatterbox_tts.cpp reads s3gen.meanflow / n_timesteps / cfg_rate once at load time and branches the CFM inner
loop:

- meanflow: 2-step linear `t_span` + `time_embed_mixer` + `noised_mels` overlay (unchanged Turbo path).
- standard: 10-step cosine `t_span`, no mixer, CFG via either two estimator calls per step or a batched-estimator variant (see "batched CFM" below).

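
The two schedules differ only in how the time grid is warped. The sketch below assumes the cosine warp `1 - cos(t·π/2)` and the CFG velocity mix `(1 + r)·v_cond - r·v_uncond`, which are the forms commonly used in CosyVoice-style flow matching; neither formula is spelled out above, so treat both as assumptions. `t_span` and `cfg_velocity` are illustrative names.

```python
import math


def t_span(n_timesteps, meanflow):
    """Time grid from 0 to 1 with n_timesteps Euler steps."""
    lin = [i / n_timesteps for i in range(n_timesteps + 1)]
    if meanflow:
        return lin                                  # Turbo: linear grid
    return [1.0 - math.cos(0.5 * math.pi * t) for t in lin]  # assumed cosine warp


def cfg_velocity(v_cond, v_uncond, cfg_rate):
    """Assumed classifier-free-guidance mix of the two estimator outputs."""
    return [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(v_cond, v_uncond)]


print(t_span(2, meanflow=True))
print([round(t, 3) for t in t_span(10, meanflow=False)])
```

Under this warp the early steps are small and the late steps large, so most of the integration budget is spent near t = 0.
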
Voice cloning works unchanged on MTL because the 5-tensor
conditioning (speaker_emb, cond_prompt_speech_tokens, embedding,
prompt_token, prompt_feat) is identical between variants. Verified
end-to-end with jfk.wav in Spanish: VoiceEncoder + S3TokenizerV2 +
CAMPPlus + native mel extraction all fire and produce a plausibly-JFK
Spanish wav.
Mirroring the Turbo staged-verification pattern (§3.3 S3Gen A..F). M4 with Metal, F16 weights, 7-token prompt "Hello there.":
| Stage | n | rel_err | max_abs | max\|ref\| |
|---|---|---|---|---|
| cond_emb | 34816 | 1.5e-4 | 4.6e-4 | 3.11 |
| text_emb + pos (cond) | 9216 | 2.1e-4 | 6.1e-5 | 0.29 |
| inputs_embeds (cond) | 46080 | 1.5e-4 | 4.6e-4 | 3.11 |
| inputs_embeds (uncond) | 46080 | 1.5e-4 | 4.6e-4 | 3.11 |
| layer 0 out (1 block) | 46080 | 7.3e-5 | 4.8e-4 | 6.58 |
| layer 14 out (15) | 46080 | 2.9e-4 | 3.9e-1 | 1344 |
| layer 29 out (30 full) | 46080 | 2.9e-4 | 3.9e-1 | 1344 |
| speech_logits cond | 8194 | 1.4e-3 | 1.2e-2 | 8.18 |
| speech_logits uncond | 8194 | 1.4e-3 | 1.4e-2 | 9.46 |
All F16 accumulation drift; argmax stable, audio perceptually correct.
Metal and CPU (4 threads) back-to-back on a cool machine, F16 weights throughout:
| Config | T3 infer | S3Gen | Audio | RTF |
|---|---|---|---|---|
| Turbo Metal | 788 ms / 73 tok | 768 ms | 3040 ms | 0.51 |
| Turbo CPU 4t | 1721 ms / 73 tok | 3334 ms | 3040 ms | 1.66 |
| MTL Metal (batched CFM) | 1865 ms / 61 tok | 2247 ms | 2560 ms | 1.61 |
| MTL CPU 4t (2-call CFM) | 2711 ms / 71 tok | 8029 ms | 2960 ms | 3.63 |
MTL is ~2.2× slower than Turbo on CPU — very close to the architectural ceiling:
- 30 Llama layers vs 24 GPT-2 layers → ~1.25×
- CFG doubles T3 forward passes per step → another 1.6–2× on T3
- CFM runs 10 steps × 2 CFG passes = 20 estimator calls vs Turbo's 2 meanflow steps → 10× call-count multiplier, ~4–5× wall because the per-call cost is lower on MTL (estimator cache reused, smaller effective footprint per call)
On a thermally-loaded M4 (other agents running) the same measurements showed RTF ≈ 6.3 — almost 2× worse than the cool-machine number. This is the variance envelope to keep in mind when benchmarking.
First optimisation attempt: fold the CFG cond+uncond CFM passes into a
single batch=2 decoder forward so the weight reads amortise across
both passes instead of paying them twice.
New helpers (src/chatterbox_tts.cpp): conv1d_f32_b, cfm_causal_block_b,
cfm_causal_k3_b, cfm_resnet_b, basic_tfm_b, apply_tfm_stack_b, and
a new cfm_estimator_forward_b2 that packs cond + uncond inputs along
ne[2] throughout.
Subtle ggml gotcha: ggml_mul_mat(a, b) broadcasts a over b's
ne[2..3]; ggml_can_mul_mat rejects the opposite direction. When
im2col has a batch dim and the kernel is 2D, the kernel has to be the
first operand, and the result then needs a
cont(permute(_, 1, 0, 2, 3)) back to the downstream-friendly
(L_out, OC, B) layout. That permute costs real memory traffic.
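
The operand-order constraint follows directly from the shape check. This is a Python paraphrase of the rule `ggml_can_mul_mat` applies (shared `ne[0]`, and the second operand's `ne[2..3]` divisible by the first's); the helper names and toy sizes are made up for the example.

```python
def can_mul_mat(a_ne, b_ne):
    """Shape test for mul_mat(a, b): a may broadcast over b's ne[2..3],
    never the other way round."""
    return (a_ne[0] == b_ne[0]
            and b_ne[2] % a_ne[2] == 0
            and b_ne[3] % a_ne[3] == 0)


def mul_mat_result_ne(a_ne, b_ne):
    """Result shape of mul_mat(a, b) in ggml dimension order."""
    return (a_ne[1], b_ne[1], b_ne[2], b_ne[3])


K_IC, OC, L_OUT, B = 48, 16, 100, 2      # toy sizes
kernel = (K_IC, OC, 1, 1)                # 2-D kernel, no batch dim
im2col = (K_IC, L_OUT, B, 1)             # batched im2col output

print(can_mul_mat(kernel, im2col), can_mul_mat(im2col, kernel))
print(mul_mat_result_ne(kernel, im2col))
```

With the kernel first the result comes out as `(OC, L_out, B)`, which is why the extra `cont(permute)` is needed to get back to `(L_out, OC, B)`.
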
Measured on M4, same 2-word sentence as above:
| Config | F16 baseline | Batched CFM | Δ |
|---|---|---|---|
| MTL Metal (S3Gen) | 2451 ms | 2247 ms | −9% |
| MTL CPU 4t (S3Gen) | 19948 ms | 22165 ms | +11% |
Metal wins by ~9 % because kernel dispatch amortises (same number of
heavier kernels instead of twice as many light ones). CPU loses
because ggml-cpu has essentially zero dispatch overhead already, and
basic_tfm_b's permute + cont on Q/K/V now runs over a larger
(HD, T, H, 2) tensor every attention block (4 blocks × 13 resnet
blocks × 10 steps). The extra memory traffic outweighs the amortised
weight reads.
Fix: gate the batched path on backend type — const bool use_b2 = !meanflow && cfg_rate != 0 && !ggml_backend_is_cpu(m.backend); Keeps
Metal fast, leaves CPU on the clean two-call path.
Head-to-head through
examples/chatterbox-multilingual-bench.js
in the qvac-lib-infer-onnx-tts package. Same prompt ("Hola mundo, esta es una prueba multilingue."), same jfk.wav reference, same 4
CPU threads:
| Metric | onnxruntime-fp16 | ggml-cpu-f16 | ggml advantage |
|---|---|---|---|
| cold load | 42 829 ms | ~500 ms | 85x faster |
| inference wall | 51 447 ms | 10 168 ms | 5.06x faster |
| audio produced | 2 740 ms | 2 400 ms | |
| RTF | 18.78 | 4.24 | |
| CFG enabled | no | yes | |
A few things worth calling out:

- CFG disabled on the ONNX side. Its multilingual export currently ships without `text_emb_weight.bin` and logs `CFG disabled` at load, so it's running half the compute of the ggml pipeline (1 T3 pass per step instead of 2, and no CFG combine on CFM). If the ONNX CFG path were wired up, its RTF would roughly double to ~37 and the gap vs ggml would jump from 5× to ~9×.
- Cold load is 85× faster on ggml (0.5 s vs 42.8 s). That's entirely an onnxruntime cost: initialising 4 session objects over 1 GB of external_data.onnx_data blobs. ggml mmaps the two GGUFs and rebinds through the backend allocator in half a second.
- Quality parity: `bench-onnx.wav` and `bench-ggml.wav` are both plausibly the same Spanish sentence in the JFK-cloned voice; the per-sample waveform differs (different samplers, different RNG) but the speaker identity and content match by ear.

Comparison is reproducible with:
cd qvac2/packages/qvac-lib-infer-onnx-tts
bare examples/chatterbox-multilingual-bench.js \
--language es \
--text "Hola mundo, esta es una prueba multilingue." \
--warmup 0 --runs 1

(Add --skip-onnx or --skip-ggml to isolate one side.)
Optimisations still on the table, ordered by expected CPU impact:

1. Q8_0 / Q4_0 T3 for MTL. Converter already supports it (bit-exact to F16 on Turbo per §3.10); T3 is 25 % of the CPU wall time so this is a ~1.5× T3 win but only ~12 % total. Small compared to #2.
2. Quantized CFM estimator weights. ~75 % of CPU wall time is the 10-step CFM; halving its weight-read cost via Q8_0 on the U-Net / transformer linears is the biggest remaining CPU lever. Needs a small converter change and a validation pass that quantized `mul_mat` kernels actually speed these specific shapes up (small-d convs can regress at Q8_0 on ggml-cpu; cf. §3.8 Attempt 7).
3. Reduce CFM step count at runtime. Python's meanflow uses 2 steps; standard CFM trained at 10 may tolerate 6–7 with no audible loss. Trivial to plumb via the existing `--stream-cfm-steps` flag.
4. ja/he/ru/zh/hi language support. Separate sub-projects per language (pykakasi / dicta / Russian stresser / Cangjie+pkuseg / Hindi phonemizer). Easiest to ship as optional Python pre-processing that emits already-tokenised IDs.

Items #1 and #2 from the §3.19 backlog shipped together. The lever sits in two converter scripts that share a single per-tensor quantisation policy:

- `scripts/convert-s3gen-to-gguf.py` covers item #2 (CFM estimator + encoder Linears, the dominant CPU cost on MTL). A new `--quant {f32,f16,q8_0,q5_0,q4_0}` flag (default `f32` to keep the from-PyTorch GGUF byte-identical to the pre-optimisation builds) routes every tensor through a single `add_tensor_maybe_q()` helper.
- `scripts/convert-t3-mtl-to-gguf.py` covers item #1 (T3 Llama linears + speech/text heads + perceiver Linears + cond_spkr). `--quant {f16,q8_0,q5_0,q4_0}` (default `f16`, since the T3 storage baseline is already F16) routes through the same helper.

Zero C++ changes, zero runtime API changes — ggml_mul_mat dispatches
the right quantised kernel automatically once the tensor's ggml_type
is set, so every backend (CPU/NEON, CPU/AVX, Metal, Vulkan, CUDA)
picks up the win for free.
Single source of truth. requantize-gguf.py already had to make
the same yes/no quantise decision for the offline "rewrite an existing
GGUF in place" tool, and we explicitly want all three paths (T3
convert-from-PyTorch, S3Gen convert-from-PyTorch, and rewrite-existing)
to land tensors in identical layouts. Both converters load the policy
at import time via _load_requantize_policy() and reuse
should_quantize() + _QUANT_TYPE directly — no duplicate deny-list,
no drift between the three tools. Adding a new tensor name to either
converter automatically inherits the right keep-as-F32 / quantise
decision based on the deny-list patterns.
Rules in should_quantize() (scripts/requantize-gguf.py; all
defensive so a stray caller can't silently degrade quality):

- Tensors with < 1024 elements → never quantise. Biases, LayerNorm gammas/betas, tiny conditioning vectors; the bandwidth savings are negligible and block-quant rounding visibly regresses rel error.
- Deny-list of name substrings (`_DENY_SUBSTRINGS`) → never quantise. Covers `flow/input_embedding` and `/builtin/` (read as raw F32 by the C++ loader), token / position embedding tables (`text_emb`, `speech_emb`, `wte`, `wpe`, `pos_emb`, `pe/pe`), spectral bases (`stft_basis`, `mel_filterbank`, `mel_fb`), all bias / norm / scale patterns (`/b`, `/bias`, `/bn/`, `/norm/`, `/ln_`, `/g`, `/s`, `alpha`, `beta`, `gamma`), and the entire voice-encoder / `campplus/` / `s3tokv2/` subtrees (small specialised encoders whose dynamic range is too tight for Q4/Q8 block quant; speaker_emb collapses to zeros if quantised).
- Reduction-dim alignment: `shape[-1] % block_size != 0` → never quantise. GGML block quants need the reduction dim to be a multiple of the block size, which is 32 for Q8_0, Q5_0 and Q4_0 alike. Every transformer Q/K/V/out/FF Linear in the Conformer encoder + CFM + S3TokenizerV2 satisfies this: inner dim 512, 1024, 2048 all align.
- Source dtype gate: only F32 / F16 tensors are quantisation candidates (`_QUANTIZABLE_SRC_DTYPES`); already-quantised tensors get copied through as-is.
- Anything that survives all four gates → quantised to the requested block format. `--quant f16` skips block-quant entirely and just stores everything as F16; `--quant f32` is the default and reproduces the pre-optimisation GGUF byte-for-byte.

Quantisation counter. When --quant != f16, add_tensor_maybe_q
threads a qstats dict through every call site and at the end of
conversion prints
--quant q4_0: 426 tensors block-quantized (policy matches scripts/requantize-gguf.py; embeddings, voice encoders, norms/biases, and filterbanks kept at full precision)
so it's immediately visible whether the deny-list bit and how many
tensors landed in the quantised pool.
GGUF size (MTL S3Gen):
| --quant | File size | vs F32 |
|---|---|---|
| f32 | 1.0 GB | — |
| f16 | 820 MB | -18% |
| q8_0 | 732 MB | -27% |
| q4_0 | 685 MB | -32% |
Size savings are modest because CAMPPlus (450 tensors), S3TokenizerV2 (103 tensors), and all rank-3 conv kernels still live at F32 — they're either off the hot path (CAMPPlus / S3TokV2 run once per voice-cloning setup) or blocked on the conv1d arg-order refactor above. The important savings are in the right place: the 426 quantised tensors are exactly the CFM + Conformer + T3 transformer Linears that the 10× CFG-paired estimator pass re-reads on every step.
CPU per-stage breakdown (M4, 4 threads, Spanish prompt)
Confirming the quantisation lands on the CFM U-Net as intended:
| Stage (20 CFM forwards) | F32 S3Gen | F16 S3Gen | Q4_0 S3Gen |
|---|---|---|---|
| CFM total | 6 078 ms | 4 400 ms | 3 900 ms |
| HiFT decode | 696 ms | 660 ms | 640 ms |
| encoder | 242 ms | 210 ms | 200 ms |
| S3Gen total (BENCH) | 7 113 ms | 5 453 ms | 4 861 ms |
(HiFT gains less because all its conv kernels stay F32 for the
conv1d-arg-order reason above. CFM gains the full expected fraction
because its transformer blocks and mlp projections were the bulk of
the bandwidth.)
End-to-end multilingual table (M4, same Spanish prompt as §3.19,
seed 42, 4 CPU threads, built-in voice on ggml, jfk.wav voice on
ONNX via the chatterbox-multilingual-bench.js harness):
| Runtime | T3 infer | S3Gen infer | Audio | Total wall | RTF |
|---|---|---|---|---|---|
| ggml Metal, Q4_0 T3 + Q4_0 S3Gen | 907 ms / 52 t | 2 100 ms | 2.20 s | 3 005 ms | 1.37 |
| ggml Metal, F16 T3 + F16 S3Gen | 1 825 ms / 57 t | 2 135 ms | 2.40 s | 3 960 ms | 1.65 |
| ggml CPU 4t, Q4_0 T3 + Q4_0 S3Gen | 1 168 ms / 53 t | 4 861 ms | 2.24 s | 6 029 ms | 2.69 |
| ggml CPU 4t, F16 T3 + F16 S3Gen | 2 315 ms / 57 t | 5 453 ms | 2.40 s | 7 768 ms | 3.24 |
| ggml CPU 4t, F16 T3 + F32 S3Gen (§3.19) | 2 423 ms / 57 t | 7 113 ms | 2.40 s | 9 536 ms | 3.97 |
| ONNX Runtime CPU 4t, q4 (avg of 2) | — | — | 2.19 s | 31 702 ms | 14.55 |
| ONNX Runtime CPU 4t, fp16 (avg of 2) | — | — | 2.27 s | 53 342 ms | 23.50 |
Key deltas vs the §3.19 CPU baseline at the same 4-thread CPU target:

- F16 S3Gen quant alone: -19% wall (-1.77 s).
- Q4_0 S3Gen quant + Q4_0 T3: -37% wall (-3.51 s). RTF drops from 3.97 to 2.69.

vs the ONNX reference (same prompt, same threads, CFG disabled on the ONNX side so it's doing half the compute):
- CPU F16 is 7.3× faster per second of audio (RTF 3.24 vs 23.50).
- CPU Q4_0 is 5.4× faster per second of audio (RTF 2.69 vs 14.55).
- Metal F16 is 14.2× faster per second of audio (RTF 1.65 vs 23.50).
- Metal Q4_0 is 10.6× faster per second of audio (RTF 1.37 vs 14.55).
With CFG enabled on ONNX (the apples-to-apples comparison), those
ratios would roughly double. ONNX q4 notably improved from our
§3.19-era measurement (RTF 18.17 → 14.55) after a recent
qvac-lib-infer-onnx-tts prebuilds update; ONNX fp16 stayed within
noise (20.91 → 23.50).
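The RTF column and the speed-up ratios follow directly from the table. A minimal sketch of that arithmetic, using wall times and audio lengths copied from the table; the helper name rtf is illustrative and not part of the repository:

```python
def rtf(total_wall_ms: float, audio_s: float) -> float:
    """Real-time factor: wall-clock seconds spent per second of audio produced."""
    return (total_wall_ms / 1000.0) / audio_s

# (total wall ms, audio seconds) copied from the end-to-end table.
runs = {
    "ggml CPU 4t Q4_0": (6029, 2.24),
    "ggml CPU 4t F16": (7768, 2.40),
    "ggml CPU 4t F16 T3 + F32 S3Gen": (9536, 2.40),
}

for name, (wall_ms, audio_s) in runs.items():
    print(f"{name}: RTF {rtf(wall_ms, audio_s):.2f}")

baseline_ms = runs["ggml CPU 4t F16 T3 + F32 S3Gen"][0]
q4_ms = runs["ggml CPU 4t Q4_0"][0]
saving = (baseline_ms - q4_ms) / baseline_ms
print(f"Q4_0 vs baseline: -{saving:.0%} wall")

# A speed-up "per second of audio" is the ratio of two RTFs.
speedup_f16 = 23.50 / rtf(7768, 2.40)
print(f"CPU F16 vs ONNX fp16: {speedup_f16:.1f}x")
```

Comparing RTFs rather than raw wall times matters here because each configuration produced a slightly different audio length.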
Quality check. The output wavs for each config are available at
/tmp/mtl_{cpu,mtl}_{f16,q4_0}.wav after the bench run; all four
utterances are intelligible Spanish in the built-in voice. Token
counts vary slightly between quant levels (57 → 53 → 52) because the
per-token sampling reads logits that differ by ~0.1% after matmul
rounding, and the multinomial sampler diverges on marginal picks —
this is the same effect noted for Turbo Q4_0 in §3.10 and does not
affect overall fluency. Use --seed + --temp 0 --top-k 1 for
deterministic byte-exact repro at a cost of some audio variety.
Generic across every backend. The conversion path is pure data-format work: no CPU-specific ifdefs, no Apple/Intel/ARM branches, no new ggml ops. F16/Q8_0/Q4_0 tensor reads are accelerated by NEON dot-product instructions on Apple Silicon + Android arm64, by AVX2 / AVX-512 VNNI on Intel/AMD, and by Metal/Vulkan/CUDA compute shaders on their respective GPUs. Mobile deployments (Android + iOS) get the same win as desktop.
What's next for MTL (updated §3.19 backlog).
- Q8_0/Q4_0 T3 for MTL: shipped (this §3.20 row).
- Quantised CFM estimator weights: shipped (this §3.20 row).
- Runtime --cfm-steps N for MTL. Still on the table; trivial plumbing, probably 25–30% more CPU wall time savings at N=7.
- Fix conv1d_f32 arg order so rank-3 Conv1d kernels can also go F16/Q8_0/Q4_0. Unlocks quantising HiFT's weight_norm stack (~10% additional CPU wall-time reduction on MTL, larger share on Turbo). Single-function refactor: mirror the conv1d_f32_bp pattern (kernel as mul_mat src0 + cont(permute) at the end).
- Heterogeneous-core aware thread default. --threads 10 on M4 hits efficiency cores and regresses ~10% vs --threads 8. Platform-agnostic detection (hwloc or direct sysctl on Apple, a mask on Linux perf cores). Follow-up PR.
- ja/he/ru/zh/hi language support: unchanged from §3.19.
Staged pipeline:
- Python reference dumper (scripts/dump-s3gen-reference.py) runs the full PyTorch pipeline with forward_hooks on every module we plan to reimplement. Each intermediate is saved as .npy in artifacts/s3gen-ref/ with a predictable name. Multi-call hooks save a _call{N} suffix so each flow-matching step gets its own tensor.
- C++ staged harness (src/test_s3gen.cpp) loads a single GGUF, and for each stage: loads the reference tensors as inputs, builds a tiny ggml graph covering exactly that stage, runs it, reads back outputs, and calls compare_f32(got, expected, n) to print max_abs / mean_abs / rms / max|ref| / rel.
- For T3 we additionally have bit-exact testing: under greedy decoding ggml speech tokens equal PyTorch speech tokens token-for-token.
- For the S3Gen+HiFT back half (chatterbox_tts.cpp, driven by tts-cli) we have --debug mode that substitutes Python-dumped random bits for the stochastic parts, pinning the comparison.
Precision regressions are immediately visible: a change that drops rel to ~1e-4 shows up at stage N+1 before silently corrupting the full pipeline.
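To make the five printed metrics concrete, here is a small Python sketch of what a compare_f32-style check could compute. It is not the C++ implementation. It assumes rel is max_abs divided by max|ref|, which is consistent with the result lines quoted later in this document (for example 1.71e-05 / 2.45e-01 ≈ 6.97e-05), and that rms is taken over the element-wise difference.

```python
import math

def compare_f32(got, expected):
    """Return the error metrics a staged parity harness would print."""
    assert len(got) == len(expected) and len(expected) > 0
    diffs = [abs(g - e) for g, e in zip(got, expected)]
    max_abs = max(diffs)
    mean_abs = sum(diffs) / len(diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    max_ref = max(abs(e) for e in expected)
    # Assumption: rel normalises the worst error by the largest reference value.
    rel = max_abs / max(max_ref, 1e-12)
    return {"max_abs": max_abs, "mean_abs": mean_abs, "rms": rms,
            "max_ref": max_ref, "rel": rel}

m = compare_f32([1.0, 2.0, 4.0], [1.0, 2.0, 4.00004])
print(" ".join(f"{k}={v:.2e}" for k, v in m.items()))
```

Normalising by the largest reference magnitude keeps the metric stable for tensors that contain many near-zero entries, where a per-element relative error would blow up.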
ssh gianni@dev-linux-x64
cd ~/chatterbox.cpp
# One-time: build the binaries
cmake -S . -B build
cmake --build build -j10 --target tts-cli test-s3gen mel2wav
# One-time: convert weights + built-in conditionals
. ~/chatterbox-ref/.venv/bin/activate
python scripts/convert-t3-turbo-to-gguf.py --out models/chatterbox-t3-turbo.gguf
python scripts/convert-s3gen-to-gguf.py --out models/chatterbox-s3gen.gguf
# One-time: dump the Python reference tensors
python scripts/dump-s3gen-reference.py \
--text 'Hello from ggml.' --out artifacts/s3gen-ref \
--seed 42 --n-predict 64 --device cpu
# Validate every stage in C++
./build/test-s3gen models/chatterbox-s3gen.gguf artifacts/s3gen-ref ALL
# End-to-end text → wav
./scripts/synthesize.sh "Hello from native C++." /tmp/out.wav

Ranked by impact-per-effort ratio, from biggest wins to niche polish.
Voice cloning works end-to-end TODAY using a Python preprocessing
helper that produces a five-tensor voice profile from a reference
.wav. The C++ binary accepts it via --ref-dir DIR.
Phase 1 (DONE) — Python helper + C++ wiring:
- scripts/prepare-voice.py: wraps ChatterboxTurboTTS.prepare_conditionals() to produce a directory with speaker_emb.npy (T3 256-d) + cond_prompt_speech_tokens.npy (T3 ≤375 int32) + embedding.npy (S3Gen 192-d) + prompt_token.npy (S3Gen int32) + prompt_feat.npy (S3Gen mel, 80-channel).
- src/main.cpp: when --ref-dir is set, overwrite the T3 side in place (model.builtin_speaker_emb) or, when the prompt-tokens length differs from the GGUF's built-in (audio < 15 s → fewer tokens), allocate a fresh tensor in ctx_override + buffer_override on the same backend and repoint model.builtin_cond_prompt_tokens at it. hparams.cond_prompt_len is updated to match so build_prompt_graph sizes the sequence correctly.
- src/chatterbox_tts.cpp: the S3Gen side already reads the same three .npy files when ref_dir is non-empty.
End user workflow:
python scripts/prepare-voice.py --ref-audio me.wav --out voices/me/
./build/tts-cli --model models/chatterbox-t3-turbo.gguf \
--s3gen-gguf models/chatterbox-s3gen.gguf \
--ref-dir voices/me/ \
--text "Hello in my voice." \
--out out.wav

Verified end-to-end on the remote EPYC: override prints
overrode T3 built-in voice from voices/test (speaker_emb=256, cond_prompt_tokens=260), the synthesis runs at RTF 0.44, the output
wav plays back cleanly on the Mac.
Phase 2a (DONE) — C++ WAV I/O + sinc resampler + 80-ch log-mel at 24 kHz:
- src/dr_wav.h (public-domain single header, MIT-0 fallback) vendored as a bundled WAV loader (all PCM variants, any sample rate, auto-mono).
- src/voice_features.{h,cpp}: wav_load, resample_sinc (Kaiser-windowed, beta=8.6, configurable tap count), and mel_extract_24k_80. The mel extractor is a direct port of s3gen.utils.mel.mel_spectrogram (n_fft=1920, hop=480, win=1920, fmin=0, fmax=8000, center=False, reflect-pad 720).
- scripts/convert-s3gen-to-gguf.py now also bakes in the precomputed librosa mel filterbank (librosa.filters.mel(sr=24000, n_fft=1920, n_mels=80, fmin=0, fmax=8000), a (80, 961) float32 matrix) as s3gen/mel_fb/24k_80. Runtime has no librosa dep.
- Two validation binaries: test-resample (24 kHz → 48 kHz → 24 kHz round-trip on a 4-tone signal, expects > 60 dB SNR) and test-voice-features MODEL.gguf REF.wav PROMPT_FEAT.npy (compares C++ 80-ch log-mel against a Python-dumped prompt_feat.npy).
Measured on 10-core EPYC:
| Check | Result |
|---|---|
| Resampler round-trip (4-tone, 24k ↔ 48k) | 95.75 dB SNR |
| Mel parity vs Python prompt_feat.npy (rel) | 8.3e-08 |
(The ~500-frame Python reference truncates at DEC_COND_LEN = 10 s; the C++ side produces an extra ~20 frames for a 10.4 s input wav but the overlapping 500 × 80 values match to float precision.)
Implementation notes:
- First attempt at resample_sinc was a polyphase decomposition with a Kaiser-windowed sinc prototype; the phase-indexing convention was subtly wrong and gave 0 dB SNR on the round-trip. Swapped for straightforward "fractional-index sinc interpolation at each output sample" which is correct and still fast enough for one-shot voice preprocessing.
- mel_extract_24k_80 uses a naive DFT (O(n_fft) per bin, so O(n_fft²) per frame), not an FFT. For a 10 s reference that's ~520 frames × 1920 × 961 ≈ 960 M mults, well under 2 s on CPU. Fine for preprocessing; an FFT is a trivial follow-up if this ever needs to be streaming.
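The cost estimate for the naive DFT can be reproduced with a few lines of Python. The sketch below is illustrative only: naive_power_spectrum is a hypothetical stand-in written for this note, with no windowing or mel projection, and is not the C++ extractor.

```python
import cmath
import math

def naive_power_spectrum(frame):
    """One-sided power spectrum via a direct DFT: n/2+1 bins times n samples."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        out.append(abs(acc) ** 2)
    return out

# Cost estimate with the figures quoted in the note above.
n_fft, n_frames = 1920, 520
n_bins = n_fft // 2 + 1             # 961 one-sided bins
mults = n_frames * n_fft * n_bins   # roughly 960 M complex multiply-adds

# Sanity check on a tiny frame: a pure tone at bin 3 of a 16-sample frame.
tone = [math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
spec = naive_power_spectrum(tone)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(n_bins, mults, peak_bin)
```

An FFT would cut the per-frame work from n_fft × n_bins to roughly n_fft × log2(n_fft), which is why it only matters once the extractor has to run in a streaming loop.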
Phase 2b (DONE) — --reference-audio PATH.wav wired into main.cpp.
The CLI now accepts a reference wav, runs the whole WAV→prompt_feat
chain in C++, and injects the result into s3gen_synthesize_opts
(new prompt_feat_override field) so the S3Gen+HiFT pipeline consumes
it directly — no temp file, no npy round-trip. The other four voice
tensors still come from --ref-dir for now.
User workflow:
python scripts/prepare-voice.py --ref-audio me.wav --out voices/me/
./build/tts-cli \
--model models/chatterbox-t3-turbo.gguf \
--s3gen-gguf models/chatterbox-s3gen.gguf \
--ref-dir voices/me/ \
--reference-audio me.wav \
--text "Voice-cloned with C++ mel." \
--out out.wav

Verified end-to-end: voice: prompt_feat shape=(520, 80) /
prompt_feat: using C++ override (520 mel frames) / audible cloned
voice at RTF 0.76 on 10-core EPYC.
Phase 2c (DONE) — C++ VoiceEncoder: 3-layer unidirectional LSTM + Linear(256 → 256) + ReLU + L2-normalise, 40-channel 16 kHz power-mel in, 256-d speaker embedding out.
New files:
- src/voice_encoder.{h,cpp}: weights loader (reads 14 tensors from the t3 GGUF + voice_encoder/mel_fb), plain-C++ LSTM forward pass (no ggml graph), partial-window averaging that exactly reproduces VoiceEncoder.embeds_from_wavs(..., as_spk=False) for a single wav: mel is split into overlapping 160-frame partials using get_frame_step / get_num_wins, each partial produces an L2-normed 256-d embedding via LSTM + projection, then the per-partial embeds are averaged and L2-normed once more.
- src/test_voice_encoder.cpp: parity harness; compares the C++ 256-d speaker_emb against Python speaker_emb.npy using max_abs, rms, rel and cosine similarity.
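The final "normalise, average, normalise again" step of the partial-window scheme can be sketched as follows. This is a toy illustration with 3-d vectors standing in for the 256-d embeddings; the function names are hypothetical, and the window placement logic (get_frame_step / get_num_wins) is deliberately left out.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def average_partials(partial_embeds):
    """Average already-L2-normalised partial embeddings, then renormalise."""
    dim = len(partial_embeds[0])
    mean = [sum(e[i] for e in partial_embeds) / len(partial_embeds)
            for i in range(dim)]
    return l2_normalise(mean)

# Toy stand-ins for the per-partial LSTM + projection outputs.
raw = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
partials = [l2_normalise(e) for e in raw]
speaker_emb = average_partials(partials)
print(speaker_emb)
```

Normalising each partial first means every window contributes equally to the mean, regardless of how loud or long-lived its activations were.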
Converter change: scripts/convert-t3-turbo-to-gguf.py now bakes in
the VE weights (weight_ih_l{0,1,2}, weight_hh_l{0,1,2},
bias_{i,h}h_l{0,1,2}, proj/weight, proj/bias) plus the librosa
(40, 201) mel filterbank as voice_encoder/mel_fb, and writes VE
hyperparameters (n_mels, hidden_size, num_layers, partial_frames,
sample_rate, n_fft, hop_size, win_size, overlap, rate, min_coverage)
as GGUF metadata so we never need ve.safetensors at runtime. The
similarity_{weight,bias} params are skipped — they're only used for
speaker-verification training, not embedding extraction.
Feature extraction: src/voice_features.cpp gained
mel_extract_16k_40, which shares the STFT/mel core with
mel_extract_24k_80 but uses the VE-specific knobs (center=True,
power_exponent=2, no log compression).
CLI wiring: main.cpp now resolves the T3 voice override in two
independent pieces. If ref_dir/speaker_emb.npy is missing but
--reference-audio PATH.wav is given AND the T3 GGUF has VE weights,
it loads the wav, resamples to 16 kHz, and computes speaker_emb in
C++ via voice_encoder_embed(). cond_prompt_speech_tokens still
comes from ref_dir until Phase 2e. Logs distinguish the source:
T3 voice override — speaker_emb=C++ VoiceEncoder, cond_prompt_tokens=ref_dir.
Verification on 10.4 s reference wav:
[result] C++ vs Python speaker_emb:
n=256 max_abs=1.71e-05 rms=2.58e-06 max|ref|=2.45e-01 rel=6.97e-05
cosine similarity = 1.000000
Cosine = 1.000000 confirms angular match to 6 decimal places; the
~1e-5 absolute error is pure float32 accumulation noise. End-to-end
synthesis with speaker_emb.npy deleted from the voice dir produced
a 276 kB WAV that plays cleanly on macOS — the C++-computed speaker
embedding drives T3 conditioning indistinguishably from Python.
Two down, two to go (embedding and prompt_token via CAMPPlus +
S3TokenizerV2).
Phase 2d-a (DONE) — C++ CAMPPlus forward pass, validated end-to-end against the Python reference on a Python-dumped 80-ch Kaldi fbank.
CAMPPlus is a FunASR/3D-Speaker x-vector: 937 raw tensors (329 conv / linear weights + 122 BatchNorms + biases + counters). Structure:
fbank (T, 80)
→ FCM: Conv2d(1→32, k=3) + BN + 2× BasicResBlock (stride=2)
+ 2× BasicResBlock (stride=2) + Conv2d(32→32, s=(2,1))
+ reshape → (320, T)
→ xvector.tdnn: Conv1d(320→128, k=5, s=2) + BN + ReLU
→ 3 × CAMDenseTDNNBlock + TransitLayer
block1: 12 layers, dilation=1 → 128 → 512
transit1: Conv1x1 + BN: 512 → 256
block2: 24 layers, dilation=2 → 256 → 1024
transit2: 1024 → 512
block3: 16 layers, dilation=2 → 512 → 1024
transit3: 1024 → 512
→ out_nonlinear (BN + ReLU)
→ stats_pool (mean + unbiased std over T → 1024)
→ dense: Conv1x1(1024→192) + BN(affine=False) → 192
Each CAMDenseTDNNLayer is BN→ReLU→Conv1x1→BN→ReLU→CAMLayer, with
CAMLayer being linear_local × sigmoid(linear2(ReLU(linear1(ctx))))
where ctx = mean(x, T) + seg_pool(x, 100).expand(T).
Ports:
- scripts/convert-s3gen-to-gguf.py: fuses every BatchNorm into a per-channel (scale, shift) pair at export time, with scale = gamma / sqrt(var + eps) (or 1/sqrt(var + eps) when affine=False) and shift = beta - mean*scale. Skips num_batches_tracked. Embeds 14 campplus.* hyperparameters as GGUF metadata and emits the 451 substantive tensors under campplus/… (329 conv + 122 fused BNs).
- src/campplus.{h,cpp}: plain-C++ forward pass, no ggml graph. Uses channel-major (C, T) layout throughout. Helpers: bn_apply, relu_inplace, sigmoid_inplace, conv1d, conv2d, seg_pool_expand (avg-pool with ceil_mode=True + repeat-interleave to T), stats_pool (mean + unbiased std). Module-level helpers fcm_basic_resblock, fcm_forward, cam_layer_forward, cam_dense_tdnn_layer_forward. Parallelised via OpenMP.
- src/test_campplus.cpp: loads CAMPPlus from chatterbox-s3gen.gguf, runs on a Python-dumped fbank.npy, compares with Python embedding.npy using max_abs / rms / rel / cosine similarity.
- scripts/dump-campplus-reference.py: helper that loads the turbo checkpoint, runs extract_feature (Kaldi fbank + per-utterance mean-subtract) and speaker_encoder.forward, and dumps the two tensors to .npy.
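The BatchNorm fusion formula can be checked numerically. The sketch below is a standalone illustration rather than the converter's code; the function names are hypothetical, and eps=1e-5 is assumed because it is the PyTorch default (the real value comes from each module).

```python
import math

def fuse_batchnorm(mean, var, gamma=None, beta=None, eps=1e-5):
    """Fold inference-mode BatchNorm statistics into per-channel (scale, shift)."""
    scale, shift = [], []
    for c in range(len(mean)):
        g = 1.0 if gamma is None else gamma[c]   # affine=False behaves like gamma = 1
        b = 0.0 if beta is None else beta[c]     # affine=False behaves like beta = 0
        s = g / math.sqrt(var[c] + eps)
        scale.append(s)
        shift.append(b - mean[c] * s)
    return scale, shift

def batchnorm_eval(x, mean, var, gamma, beta, eps=1e-5):
    """Reference inference-mode BatchNorm for a single value in one channel."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

scale, shift = fuse_batchnorm([0.5], [4.0], gamma=[2.0], beta=[0.1])
fused = 3.0 * scale[0] + shift[0]
reference = batchnorm_eval(3.0, 0.5, 4.0, 2.0, 0.1)
print(fused, reference)
```

After fusion each BatchNorm costs one multiply and one add per element at runtime, and the square root is paid once at export time.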
Result on a 10.4 s reference wav (1038 fbank frames, 192-d output):
[result] C++ vs Python embedding:
n=192 max_abs=2.34e-05 rms=6.99e-06 max|ref|=2.49e+00 rel=9.38e-06
cosine similarity = 1.000000
forward pass: 549.9 ms (16-thread EPYC)
rel = 9.4 ppm, cosine = 1.000000 — numerical parity. 550 ms for a
one-time voice-setup pass is comfortably fast.
src/s3gen_pipeline.h grew an embedding_override field and
src/chatterbox_tts.cpp reads it in place of ref_dir/embedding.npy
when provided, mirroring prompt_feat_override. End-to-end wiring
into main.cpp is blocked on Phase 2d-b (Kaldi fbank port) — we can't
feed CAMPPlus from --reference-audio until the C++ binary can
extract its own fbank.
Phase 2d-b (DONE) — C++ port of
torchaudio.compliance.kaldi.fbank with num_mel_bins=80.
Implemented as fbank_kaldi_80 in src/voice_features.{h,cpp}
with all the Kaldi knobs baked in:
- frame_length = 25 ms = 400 samples, hop = 10 ms = 160 samples
- round_to_power_of_two = True → n_fft = 512
- window_type = "povey" = hann(N, periodic=False) ** 0.85
- remove_dc_offset = True (subtract per-frame mean)
- preemphasis_coefficient = 0.97, with the Kaldi edge case out[0] = frame[0] * (1 - coeff)
- use_power = True, use_log_fbank = True with log_floor = FLT_EPSILON
- snip_edges = True, dither = 0
- Kaldi mel filterbank (mel = 1127 * log(1 + f / 700), triangular filters equally spaced in mel-space) precomputed by convert-s3gen-to-gguf.py and baked in as campplus/mel_fb_kaldi_80 (shape (80, 257)).
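The per-frame preprocessing implied by these knobs can be sketched in plain Python. This is an illustration of the listed steps only (DC removal, pre-emphasis with the first-sample edge case, Povey window, zero-padding to 512); povey_window and preprocess_frame are hypothetical names, and the STFT, mel projection and log are omitted.

```python
import math

def povey_window(n):
    """Symmetric (non-periodic) Hann window raised to the power 0.85."""
    return [(0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))) ** 0.85
            for i in range(n)]

def preprocess_frame(frame, coeff=0.97):
    """DC removal, pre-emphasis, then Povey windowing of one 400-sample frame."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]                  # remove_dc_offset
    out = [x[0] * (1.0 - coeff)]                   # edge case for sample 0
    out += [x[i] - coeff * x[i - 1] for i in range(1, len(x))]
    return [o * w for o, w in zip(out, povey_window(len(x)))]

frame = [float(i % 7) for i in range(400)]         # 25 ms at 16 kHz, toy data
windowed = preprocess_frame(frame)
padded = windowed + [0.0] * (512 - len(windowed))  # zero-pad to n_fft = 512
print(len(padded), padded[0], padded[-1])
```

The first-sample edge case is equivalent to treating the sample before the frame as a copy of frame[0], which keeps pre-emphasis from injecting a step at the frame start.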
Key gotcha we hit: torchaudio's Kaldi wrapper does not apply
the ×32768 int16 scaling that real Kaldi does. With the scale
our output was +20.8 units offset from Python (exactly
2 * log(32768) ≈ 20.79). Dropping the scale brought rel down from
1.30 to 1.77e-05.
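The size of that offset follows from the log of a power spectrum: scaling the waveform by 32768 scales every power bin (and every mel energy) by 32768², so the log-fbank shifts by a constant. A short check, using an arbitrary positive energy value chosen purely for illustration:

```python
import math

scale = 32768.0
energy = 0.0123                         # arbitrary positive mel-band energy

unscaled = math.log(energy)
scaled = math.log(energy * scale ** 2)  # waveform scaled, so power scales by scale**2
offset = scaled - unscaled
expected = 2.0 * math.log(scale)        # 2 * ln(32768)
print(f"offset={offset:.4f} expected={expected:.4f}")
```

Because the shift is the same for every bin and frame (as long as the log floor is not hit), a constant offset of about 20.79 in the comparison is a strong hint that an amplitude scale is missing or extra on one side.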
Validation on the synthetic 10 s speech signal:
[result] C++ vs Python fbank:
n=79840 max_abs=2.82e-04 rms=5.91e-06 max|ref|=1.59e+01 rel=1.77e-05
C++ fb[0, :8]: -10.1011 -8.3549 -7.9557 -7.4304 -7.0186 ...
Py fb[0, :8]: -10.1012 -8.3549 -7.9557 -7.4304 -7.0186 ...
Phase 2d-c (DONE) — Wired into main.cpp. New
compute_embedding_native() glues wav_load → resample_sinc → fbank_kaldi_80 → mean-subtract over T → campplus_embed and
populates the new embedding_override field in
s3gen_synthesize_opts. Called best-effort from both short-circuit
and regular T3→S3Gen paths: if the s3gen GGUF pre-dates Phase 2d-a
(no CAMPPlus tensors), it silently falls back to
ref_dir/embedding.npy.
End-to-end dogfood on the 10.4 s reference wav with
speaker_emb.npy and embedding.npy deleted from voices/test/:
voice_encoder: computing speaker_emb from /tmp/unified_remote.wav
main: T3 voice override — speaker_emb=C++ VoiceEncoder, cond_prompt_tokens=ref_dir
voice: prompt_feat shape=(520, 80)
voice: embedding shape=(192,) via CAMPPlus (1038 fbank frames)
embedding: using C++ override (CAMPPlus, 192 dims)
prompt_feat: using C++ override (520 mel frames)
Output WAV plays cleanly and sounds identical to the Python
voice-cloned output. Only cond_prompt_speech_tokens.npy and
prompt_token.npy still live in ref_dir — both are produced by
S3TokenizerV2, the last holdout (Phase 2e).
Phase 2e (DONE) — C++ S3TokenizerV2: a 6-layer FSMN-attention
transformer + FSQ codebook that turns a 16 kHz reference wav into the
25 Hz speech-token stream Chatterbox needs for voice conditioning.
103 tensors / ~124 M params. Produces BOTH the T3-side
cond_prompt_speech_tokens and the S3Gen-side prompt_token streams.
Architecture (mirrors s3tokenizer.model_v2.S3TokenizerV2 exactly):
wav_16k
→ log_mel_spectrogram (n_fft=400, hop=160, 128 mels, log10 clamp+floor
+ (x + 4) / 4 normalise)
→ Conv1d(128 → 1280, k=3, s=2) + GELU
→ Conv1d(1280 → 1280, k=3, s=2) + GELU
→ 6 × ResidualAttentionBlock:
LN → q/k/v (RoPE, NEOX-style, theta=10000)
depth-wise Conv1d(k=31) over v → fsmn_memory
scaled dot-product attention
out = Linear(attn) + fsmn_memory
LN → Linear 1280→5120 → GELU → Linear 5120→1280
→ FSQCodebook:
Linear(1280 → 8) → tanh * 0.999 → round + 1
token = Σ h[i] * 3^i (0..6560)
Implementation:
- src/s3tokenizer.{h,cpp}: weights struct + GGUF loader + s3tokv2_log_mel (plain C++ STFT + mel filterbank + log clamp + normalise) + s3tokv2_tokenize (ggml graph for conv-stem + 6 transformer blocks + plain-C++ FSQ). Uses the standard pattern: one weight context (no_alloc, pre-allocated backend buffer) + a per-run input context + a big graph context for intermediates, allocated via ggml_gallocr.
- Subtleties:
  - ggml_conv_1d and ggml_conv_1d_dw_ph both assert F16 kernels in their fused kernel paths; we ship F32 weights, so we go through ggml_im2col + ggml_mul_mat manually (conv1d_f32, conv1d_dw_f32).
  - ggml conv output has time innermost (ne=[T, C]), but the transformer wants channels innermost (ne=[C, T]) for LN and 1-D bias broadcasts. We ggml_cont(ggml_transpose(...)) between the stem and the blocks.
  - Attention permutations: q/k to ne=(head_dim, T, n_head), v to ne=(T, head_dim, n_head), so mul_mat(k, q) gives scores ne=(T_k, T_q, n_head) with T_k innermost for ggml_soft_max, and mul_mat(v, scores) gives out ne=(head_dim, T_q, n_head).
  - RoPE: ggml_rope_ext with GGML_ROPE_TYPE_NEOX, freq_base = 10000, n_ctx_orig = 2048, matches the reference's half-split rotate_half convention.
- Converter: convert-s3gen-to-gguf.py emits all 103 tokenizer.* tensors as s3tokv2/… plus 15 hyperparameters as GGUF metadata.
- scripts/dump-s3tokenizer-reference.py: dumps wav_16k.npy, log_mel.npy, and tokens.npy for validation.
- src/test_s3tokenizer.cpp: parity harness that validates log-mel (always passes cleanly) and reports token accuracy vs Python.
Validation on a 10 s synthetic speech signal:
log_mel : max_abs=1.80e-05 rel=1.30e-05 (numerical parity)
tokens : 236 / 250 = 94.40% (FSQ-rounding drift)
FSQ is extremely sensitive: the project_down → tanh → round pipeline
turns 8 floats into 8 ternary digits, so sub-LSB float drift through
the 6 transformer layers can flip a digit and change the token. Most
mismatches are at a single high-order ternary digit — tokens
1977 = (0,2,0,1,0,2,2,0)_3 vs Python's
4164 = (0,2,0,1,0,2,2,1)_3 differ only in ternary digit 7 (digits listed least-significant first). In practice the
resulting speaker conditioning is close enough that the cloned audio
sounds identical.
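The sensitivity is easy to reproduce with the formula from the architecture listing (tanh × 0.999, round, + 1, then a base-3 sum). The sketch below is a simplified stand-in written for this note: fsq_token and token_from_digits are hypothetical names, and the inputs are the 8 projected floats rather than real model activations.

```python
import math

def token_from_digits(digits):
    """Base-3 token id, least-significant digit first: sum of d[i] * 3**i."""
    return sum(d * 3 ** i for i, d in enumerate(digits))

def fsq_token(h):
    """Quantise 8 projected floats to ternary digits and pack them into a token."""
    digits = [round(math.tanh(x) * 0.999) + 1 for x in h]   # each digit is 0, 1 or 2
    return token_from_digits(digits), digits

# The mismatch quoted above: only the highest-order digit differs.
cpp_tok = token_from_digits([0, 2, 0, 1, 0, 2, 2, 0])
py_tok = token_from_digits([0, 2, 0, 1, 0, 2, 2, 1])

# Two inputs 0.001 apart that straddle the rounding boundary near -0.5.
tok_lo, digits_lo = fsq_token([0.0] * 7 + [-0.550])
tok_hi, digits_hi = fsq_token([0.0] * 7 + [-0.549])
print(cpp_tok, py_tok, tok_lo, tok_hi)
```

A drift of one part in a thousand on a single projected value is enough to move the token id by 3⁷ = 2187, which is why exact token parity is much harder to reach than float parity on the log-mel.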
Wiring: main.cpp gained compute_speech_tokens_native() which runs
the tokenizer twice (first 10 s of the wav → prompt_token, first
15 s → cond_prompt_speech_tokens capped to speech_cond_prompt_len).
Results feed s3gen_synthesize_opts::prompt_token_override (new
field) and the existing T3 cond_prompt_speech_tokens override path.
End-to-end pure-C++ voice cloning: with voices/test/ deleted
entirely and only --reference-audio my.wav given, the unified
tts-cli now runs the whole flow in C++:
voice_encoder: computing speaker_emb from /tmp/unified_remote.wav
voice: prompt_token=(250,) cond_prompt_speech_tokens=(260,) via S3TokenizerV2
main: T3 voice override — speaker_emb=C++ VoiceEncoder, cond_prompt_tokens=C++ S3TokenizerV2
voice: prompt_feat shape=(520, 80)
voice: embedding shape=(192,) via CAMPPlus (1038 fbank frames)
prompt_token: using C++ override (S3TokenizerV2, 250 tokens)
embedding: using C++ override (CAMPPlus, 192 dims)
prompt_feat: using C++ override (520 mel frames)
scripts/prepare-voice.py is now redundant — the CLI only needs a
reference wav. Impact: voice cloning has zero Python runtime
dependencies; a user just runs the binary.
Impact: Phase 1 unlocked voice cloning as a usable feature. Phases 2a–2e replaced every Python preprocessing step with a native C++ port, so the deployment story is now "one binary + two GGUFs".
Wired --n-gpu-layers through both T3 and S3Gen/HiFT. Now builds with
any of -DGGML_CUDA=ON, -DGGML_METAL=ON, or -DGGML_VULKAN=ON;
init_backend() in main.cpp and s3gen_init_backend() in
chatterbox_tts.cpp pick the matching backend when n_gpu_layers > 0
and fall back to CPU otherwise.
Out-of-the-box Metal was missing three things that needed kernel-level
fixes in ggml/src/ggml-metal/:
- GGML_OP_DIAG_MASK_INF: no dispatcher entry. Added a kernel + pipeline getter + op encoder + supports_op case.
- GGML_OP_PAD with non-zero front padding: rejected by supports_op. Extended kargs_pad with lp0..lp3, updated the kernel to apply them, relaxed the check.
- GGML_OP_CONV_TRANSPOSE_1D: kernel was scalar. Tightened the input-position loop (i_start..i_end instead of 0..IL) and parallelised the IC reduction across a 32-thread simdgroup with simd_sum. 40× speedup on HiFT-sized shapes.
Patches live in patches/ggml-metal-chatterbox-ops.patch (applied to
the vendored ggml during build); src/test_metal_ops.cpp validates
each patched kernel against the CPU reference. CUDA and Vulkan needed
no backend changes — only the chatterbox wiring.
Result: gen_RTF on a 10 s sentence drops from 0.70 (CPU) to
0.08 (Vulkan 5090) and 0.18 (Metal M3 Ultra).
Still open: T3 autoregressive inference dominates wall time on small GPUs (≈ 1.3 s for 260 tokens on a 60-core Apple GPU). Worth exploring speculative decoding or a smaller T3 draft model if further wins are needed — but current numbers are already interactive.
T3 (GPT-2 Medium, ~700 MB in F16) is the memory-bandwidth-dominated
component in the pipeline. Weight quantisation is implemented via the
--quant {f16,q8_0,q5_0,q4_0} flag in scripts/convert-t3-turbo-to-gguf.py.
The Python gguf 0.18 package has the K-quants (Q4_K / Q5_K / Q6_K)
declared but raises NotImplementedError in their quantize_blocks
implementations, so only legacy block types (Q4_0, Q5_0, Q8_0) are
produced here. Running the F16 GGUF through llama.cpp's llama-quantize
tool would work too, producing true K-quants — not done yet.
Only the big 2-D mul_mat weights get quantized: per-layer
attn/c_attn/w, attn/c_proj/w, mlp/c_fc/w, mlp/c_proj/w, plus
chatterbox/speech_head. Biases, layer norms, embeddings,
positional encoding, and the tokenizer metadata all stay at their
original dtype (F32 / F16). No C++ changes — ggml_mul_mat with
quantized weights + F32 activations is already a fast path.
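For readers unfamiliar with the legacy block formats, here is a rough Python sketch of Q8_0-style quantisation of one weight row. It assumes the usual ggml Q8_0 layout (blocks of 32 values sharing one scale d = max|x| / 127) and works in Python floats; the real format stores d as F16 and the quants as int8, and its rounding of exact ties may differ from Python's round.

```python
QK8_0 = 32  # values per block (assumed, as in ggml's Q8_0)

def quantize_q8_0(row):
    """Quantise a row whose length is a multiple of 32 into (scale, quants) blocks."""
    assert len(row) % QK8_0 == 0
    blocks = []
    for start in range(0, len(row), QK8_0):
        chunk = row[start:start + QK8_0]
        amax = max(abs(x) for x in chunk)
        d = amax / 127.0
        q = [round(x / d) if d > 0 else 0 for x in chunk]
        blocks.append((d, q))
    return blocks

def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [d * qi for d, q in blocks for qi in q]

row = [((i * 37) % 64 - 32) / 10.0 for i in range(64)]   # toy weights in [-3.2, 3.1]
restored = dequantize_q8_0(quantize_q8_0(row))
max_err = max(abs(a - b) for a, b in zip(row, restored))

# Storage per 32 weights: one 2-byte F16 scale plus 32 one-byte quants.
bits_per_weight = (2 + QK8_0) * 8 / QK8_0
print(max_err, bits_per_weight)
```

The row-length constraint is the reason only tensors whose inner dimension is a multiple of the block size are candidates for these formats.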
Measured results, same prompt and --n-predict 200 (201 tokens output):
10-core EPYC (remote):
| Variant | GGUF size | T3 wall time | vs F16 |
|---|---|---|---|
| F16 | 736 MB | 3.91 s | 1.00× |
| Q8_0 | 460 MB | 2.85 s | 1.37× faster |
| Q5_0 | 350 MB | 2.58 s | 1.52× faster |
| Q4_0 | 313 MB | 2.38 s | 1.64× faster |
10-core Mac16,12 (M-series):
| Variant | T3 wall time | vs F16 |
|---|---|---|
| F16 | 14.92 s | 1.00× |
| Q8_0 | 5.41 s | 2.76× faster |
| Q5_0 | 5.27 s | 2.83× faster |
| Q4_0 | 4.74 s | 3.15× faster |
The Mac speedup is disproportionately large because M-series is much more memory-bandwidth-bound on F16 than EPYC's DDR5 is.
Quality, comparing output tokens on a long prompt:
- Q8_0: bit-for-bit identical to F16. No audible or measurable quality loss. Recommended default for quantized builds.
- Q5_0: sampling diverges starting around token 6. Audio output still sounds correct; small perceptible voice-identity shift.
- Q4_0: sampling diverges slightly earlier and more. Audio still intelligible, with more drift from the F16 reference voice.
S3Gen / HiFT weights initially stayed F32 because Conv1d kernels are F32-only on the ggml CPU backend (F16 on CFM linears regressed on CPU — see §3.8 Attempt 7). The S3Gen-quant pass in §3.20 lifts this for the big 2-D matmul weights only (CFM attn/FF Linears, encoder projections, HiFT Conv1d weights where the inner-dim alignment allows block layout); biases, LayerNorm, conv kernels and embedding tables still stay full precision. See §3.20 for the storage-format table and the resulting end-to-end speed / parity numbers.
Remaining: Q4_K / Q5_K path. Drop-in win would come from
llama-quantize models/chatterbox-t3-turbo.gguf /out.gguf Q4_K_M
once that tool's loader is pointed at our non-llama GGUF, or by
porting one of the K-quant kernels to the Python gguf package.
B1. Streaming / chunked generation for first-token latency — ✅ DONE (Phases 1–3d shipped; live-input mode added in §3.17)
The current pipeline is "wait 2.4 s then hear all 8.6 s at once". For interactive apps, first-audio-out latency matters more than total RTF.
What to port:
- Chatterbox's S3GenStreamer path in Python: interleaves T3 token-generation with chunked S3Gen / HiFT runs, overlap-adds their waveforms at the seams.
- Adds flow_cache, cache_source, mel_cache parameters we've been setting to empty, plus the overlap-add math for the HiFT vocoder.
- Emit audio to stdout (or a callback) as each chunk comes out.
Scope: ~1 week, mostly because the overlap-add math has to match Python byte-for-byte or seams click.
Impact: first audio chunk out in ~200–400 ms instead of 2+ s. Turns the binary from "batch" into "live".
Before shipping the streaming binary we needed the per-chunk C++ mel to
match Python to float32 precision. The per-chunk harness
(src/test_streaming.cpp + scripts/dump-streaming-reference.py) now
reports worst rel = 8.67e-07 across both chunks (i.e. float32 rounding level) on
the test.wav reference.
The last bug found was subtle: Chatterbox's turbo flow runs CFM in
meanflow mode, which means flow_inference allocates a
second noise tensor:

noise = torch.randn(1, 80, speech_tokens.size(-1) * 2, ...)
super().forward(..., noised_mels=noise)

and flow_matching.forward silently overwrites the speech region of
z:
z = torch.randn_like(mu) * temperature
if noised_mels is not None:
    prompt_len = mu.size(2) - noised_mels.size(2)
    z[..., prompt_len:] = noised_mels  # ← second randn draw lives here

Our original Python capture hook wrapped only torch.randn_like, so the
saved chunk_KK_cfm_z.npy contained the first draw everywhere,
including positions t ≥ prompt_len that are actually overwritten by
the second draw. Injecting that stale z as cfm_z0_override in C++
produced CFM output that matched Python bit-exactly in the prompt
region (t < 500) and diverged wildly in the speech region (t ≥ 500)
— exactly the "receptive field of the prompt/speech boundary" pattern
we were chasing.
Fix (commit 2e82cce and the follow-up in this section):
- Replace the torch.randn_like capture with a wrapper around CausalConditionalCFM.basic_euler that records the full x tensor at the first estimator.forward call. That tensor is the real z after the meanflow overlay.
- Dump it as chunk_KK_step0_x_in.npy; test-streaming loads that (instead of the old chunk_KK_cfm_z.npy) into cfm_z0_override.
- All four CFM inputs (mu, mask, spks, cond) already matched at rel ≤ 3e-7, so fixing z made the estimator output match at rel ≈ machine epsilon.
Lessons: in streaming validation harnesses, capture the exact tensor
the target op receives, not an earlier upstream value. Monkeypatching
a function that a caller later post-processes (z[...] = …) is a
silent source of divergence.
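The failure mode is easy to reproduce without PyTorch. The toy below uses Python lists and the stdlib random module as stand-ins for tensors and torch.randn; every name in it is hypothetical. It records noise in two places, once at the upstream generator and once at the point where the consumer receives it, and shows that only the second capture is faithful.

```python
import random

def randn_like(n, rng):
    """Stand-in for a Gaussian noise draw of length n."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

captured_upstream = []

def hooked_randn_like(n, rng):
    """Stand-in for a monkeypatched noise function that records its output."""
    z = randn_like(n, rng)
    captured_upstream.append(list(z))
    return z

def sampler(total_len, prompt_len, rng, record_at_op):
    z = hooked_randn_like(total_len, rng)
    # The caller post-processes z: the speech region gets a second, independent draw.
    z[prompt_len:] = randn_like(total_len - prompt_len, rng)
    record_at_op.append(list(z))   # what the downstream op actually receives
    return z

at_op = []
z = sampler(total_len=8, prompt_len=5, rng=random.Random(42), record_at_op=at_op)

stale, exact = captured_upstream[0], at_op[0]
prompt_matches = stale[:5] == exact[:5]
speech_matches = stale[5:] == exact[5:]
print(prompt_matches, speech_matches)
```

The upstream capture agrees in the prompt region and disagrees in the overwritten region, which is the same signature as the divergence described above.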
With CFM bit-exact across chunks, wiring up the HiFT side and the user-facing CLI was straightforward:
- cache_source carry (src/chatterbox_tts.cpp, s3gen_synthesize_opts): after sinegen_source produces the post-m_source source signal, overwrite its leading samples with the caller-provided hift_cache_source and expose the last source_tail_samples (480 = 1 mel hop = 20 ms) via hift_source_tail_out so the caller can feed them back in on the next chunk. Matches Python HiFTGenerator.inference's s[:, :, :cache_source.shape[2]] = cache_source.
- trim_fade (same file): opt-in raised-cosine fade-in applied to the first 2 * sr/50 = 960 samples (40 ms) of each chunk's wav. First half zero, second half (cos(π→0)+1)/2. Streaming callers set apply_trim_fade on chunk 0 only.
- --stream-chunk-tokens N CLI flag (src/main.cpp): wraps s3gen_synthesize_to_wav in a chunked loop that carries hift_cache_source across chunks, writes per-chunk wavs as <out>_chunk_KK.wav, and concatenates the final wav into --out. Adds append_lookahead_silence=false, finalize=(is_last), and skip_mel_frames=prev_mels_emitted on each chunk.
- Process-wide model cache (src/chatterbox_tts.cpp, s3gen_model_cache_get): makes the ~700 ms GGUF-tensor load a one-shot cost. s3gen_preload(path, n_gpu_layers) populates the cache eagerly so main.cpp can kick a background std::thread to warm S3Gen while T3 is still running. Brings first-chunk latency down from 2006 ms to 1340 ms on CPU for the "streaming sanity check" test.
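The trim_fade window described in the list can be written out explicitly. This is a plain-Python sketch of the stated shape (480 zeros followed by a 480-sample raised-cosine ramp at 24 kHz); make_trim_fade and apply_fade are hypothetical helper names, not the C++ API.

```python
import math

def make_trim_fade(sample_rate=24000):
    """Fade-in window: n_trim zeros followed by an n_trim raised-cosine ramp."""
    n_trim = sample_rate // 50                       # 480 samples = 20 ms
    ramp = [(math.cos(math.pi * (1.0 - i / (n_trim - 1))) + 1.0) / 2.0
            for i in range(n_trim)]                  # angle sweeps from pi down to 0
    return [0.0] * n_trim + ramp

def apply_fade(wav, fade):
    """Multiply the leading samples of a chunk by the fade window."""
    n = min(len(wav), len(fade))
    return [wav[i] * fade[i] for i in range(n)] + list(wav[n:])

fade = make_trim_fade()
chunk = [1.0] * 2000          # toy chunk of constant amplitude
faded = apply_fade(chunk, fade)
print(len(fade), fade[480], fade[-1])
```

The leading zeros mute the vocoder's start-up transient entirely, and the ramp then brings the signal in without a step discontinuity.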
Validation (./build/test-streaming models/chatterbox-s3gen.gguf /tmp/streaming_ref):
| chunk | mel rel | wav rel (informational) |
|---|---|---|
| 1 | 6.47e-07 | 1.06e-01 |
| 2 | 8.67e-07 | 1.24e-01 |
Mel matches to float32 precision; wav diverges by about 10 % (rel column above) because C++'s
sinegen_source uses std::mt19937 vs Python's torch.randn — the
audio content is identical, only the per-sample additive white-noise
seed differs. Python's own streamed-vs-batch ratio is 116 %, so our
streamed-vs-Python-streamed is 6.5 %, well inside the structural
envelope of the approach.
Performance numbers on a 3.76 s utterance (9 s of reference audio):
| metric | batch | streaming (25 tokens/chunk) |
|---|---|---|
| total wall time | 2271 ms | 5988 ms |
| first-audio-out | 2271 ms | 1340 ms |
| per-chunk RTF | 0.60 | 1.44 – 1.59 |
What actually changed — plain English. Before this phase, each streaming chunk had to re-run the encoder and CFM on the whole speech so far (so chunk 5 did more work than chunk 1), and CFM always did 2 Euler steps because that's what Python does. Result: each chunk took ~1.5 s to produce 1 s of audio, and the first chunk took ~1.3 s before you heard anything.
Two new tts-cli flags, no change to the model:
- --stream-first-chunk-tokens N: the first chunk uses N tokens; every chunk after that uses --stream-chunk-tokens. So you can make the first chunk small (≈10 tokens / 0.4 s of audio) to get audio out fast, and keep subsequent chunks big (≈50 tokens / 2 s) to amortise the fixed per-chunk overhead. Code is ~10 lines in src/main.cpp: just a boundary-building change, no pipeline rewrite.
- --stream-cfm-steps N: override the hard-coded CFM step count (2 for Python's meanflow). Setting N=1 literally halves CFM compute per chunk, because CFM is just a 2-step Euler loop. The meanflow-trained model is designed to be sampled in 1 step (per the meanflow paper, "mean" means the ODE can be collapsed to one jump); this isn't a hack, it's using the model the way it was trained to be usable. There's a quality trade (1-step is a bit noisier than 2-step, log-mag MAE ≈ 0.5), so default stays at 2. Flag is opt-in. Change is ~5 lines in chatterbox_tts.cpp where t_span = {0, 0.5, 1} used to be hard-coded.
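Why N=1 halves the CFM cost is visible in a generic Euler loop: the estimator is called once per interval of t_span. The sketch below uses a toy constant velocity field in place of the real network, whose call signature is not reproduced here; make_t_span and euler_sample are illustrative names only.

```python
def make_t_span(n_steps):
    """Uniform time grid on [0, 1] with n_steps Euler intervals."""
    return [i / n_steps for i in range(n_steps + 1)]

def euler_sample(x0, estimator, n_steps):
    """Integrate dx/dt = estimator(x, t) from t=0 to t=1 and count estimator calls."""
    t_span = make_t_span(n_steps)
    x, calls = x0, 0
    for t0, t1 in zip(t_span[:-1], t_span[1:]):
        x = x + (t1 - t0) * estimator(x, t0)
        calls += 1
    return x, calls

def toy_estimator(x, t):
    """Toy stand-in for the CFM estimator: a constant velocity field."""
    return 2.0

x2, calls2 = euler_sample(0.0, toy_estimator, n_steps=2)   # t_span = [0, 0.5, 1]
x1, calls1 = euler_sample(0.0, toy_estimator, n_steps=1)   # t_span = [0, 1]
print(make_t_span(2), calls2, calls1)
```

With a constant field both step counts land on the same endpoint; for a real estimator the one-step result differs, which is the quality trade mentioned above.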
Recommended low-latency preset:
./build/tts-cli --model t3.gguf --s3gen-gguf s3gen.gguf \
--text "…" --out out.wav \
--stream-first-chunk-tokens 10 \
--stream-chunk-tokens 50 \
--stream-cfm-steps 1

First audio out in ≈ 800 ms; middle chunks run at RTF 0.65 so the streamer stays ahead of playback on a 4-thread CPU. Numbers below.
What I did not do. The earlier prose promised "incremental
encoder / KV-cached CFM". That would mean: chunk 5 only re-processes
the 25 new tokens, reusing intermediate activations saved from chunks
1–4 — like the KV cache in an LLM decoder. I didn't do that, because
the model isn't built for it. I verified the Python reference: both
the flow encoder and the CFM estimator do full bidirectional
self-attention (every output position looks at every input position,
both directions, static_chunk_size = 0). Reusing previous-chunk
activations requires attention that only looks leftward (causal) or
only within fixed windows (chunked-causal). That's baked into the
trained weights — you can't retrofit it in C++, the model would need
to be retrained. So instead of "KV-cached CFM" I shipped "cheaper
CFM" (1-step) and "smarter chunk boundaries" (small first, big
after). Different optimisations, same user-visible win — fast first
audio, streaming keeps up.
Per-chunk profiling on the same 4.9 s utterance:
| stage | cost per chunk (T_mu≈650) |
|---|---|
| encoder (T_tokens≈350) | ~280 ms |
| CFM step 0 | ~580 ms |
| CFM step 1 | ~500 ms |
| HiFT decode (1 s audio) | ~265 ms |
| total | ~1630 ms for 1 s of audio |
CFM is ~2/3 of every chunk. Two things that don't work for cutting it down without retraining:
- KV-cached CFM / incremental encoder: Chatterbox's flow encoder and CFM estimator both run full bidirectional self-attention. I verified `static_chunk_size = 0` in `decoder.py` (no chunked attention mask) and that the encoder has no causal mask either. Caching previous-chunk activations would require the attention to be causal (or at least chunk-causal). Retrofitting that at inference time changes the output distribution, so it is not a pure port.
- Prompt-region truncation: the 500-frame prompt accounts for ~70 % of T_mu and its CFM output is discarded every chunk. But attention is full, so any speech-region output depends on every prompt frame via softmax. Truncating to a short prompt tail would require retraining.
What does work, and is now shipped as tunables:
- Non-uniform chunk sizes (`--stream-first-chunk-tokens N`). First chunk stays small (≈10 tokens / 0.4 s audio) for fast first-audio-out; subsequent chunks go big (≈50 tokens / 2 s audio) so the fixed per-chunk encoder+CFM cost amortises over more output.
- Fewer CFM Euler steps (`--stream-cfm-steps 1`). Turbo is meanflow-trained, and meanflow supports 1-step sampling per the paper. In practice 1-step introduces some audible high-frequency noise (log-mag MAE ≈ 0.5 vs 2-step) but keeps content intact. Default stays at 2 to match Python; users opt in via the flag.
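To make the step-count knob concrete, here is a small sketch of a configurable Euler time grid. It assumes the grid is uniform on [0, 1], which is consistent with the previously hard-coded `{0, 0.5, 1}` for two steps; `make_t_span` and `euler_integrate` are illustrative names, and the `velocity` callback merely stands in for the CFM estimator forward pass.

```cpp
#include <cstddef>
#include <vector>

// Assumption: the Euler time grid is uniform on [0, 1]; n_steps = 2 then
// reproduces the previously hard-coded {0, 0.5, 1}.
std::vector<float> make_t_span(int n_steps) {
    if (n_steps < 1) n_steps = 1;
    std::vector<float> t(n_steps + 1);
    for (int i = 0; i <= n_steps; ++i) t[i] = static_cast<float>(i) / n_steps;
    return t;
}

// Generic Euler integration: x <- x + (t[i+1] - t[i]) * v(x, t[i]).
// `velocity` is a placeholder for the estimator; one call per step.
template <typename VelocityFn>
float euler_integrate(float x, const std::vector<float>& t_span, VelocityFn velocity) {
    for (size_t i = 0; i + 1 < t_span.size(); ++i)
        x += (t_span[i + 1] - t_span[i]) * velocity(x, t_span[i]);
    return x;
}
```

Because the estimator is called once per grid interval, going from two steps to one halves the number of estimator forwards per chunk.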
Measured on the same text on CPU:
| config | first-audio | chunk-N RTF | overall RTF |
|---|---|---|---|
| baseline (`--stream-chunk-tokens 25`) | 1331 ms | 1.44 – 1.70 | 1.59 |
| first-small (10 → 25) | 1156 ms | 1.37 – 1.69 | 1.84 |
| 1-step + big (50, steps=1) | 1230 ms | 0.63 – 0.69 | 0.78 |
| combined (10 → 50, steps=1) | 782 ms | 0.63 – 0.69 | 0.94 |
The "combined" preset hits both objectives at once: first audio out in ≤ 800 ms on CPU, and middle chunks complete in 2/3 of their audio duration so the streamer can stay ahead of playback. Incremental encoder / KV-cached CFM stay on the backlog for when someone wants to retrain Chatterbox with chunk-causal attention.
--out - emits each chunk's audio as raw 16-bit little-endian PCM
to stdout the moment it's produced, with an explicit fflush after
every chunk so downstream players receive it immediately (no stdio
buffering stalls at chunk boundaries).
In stdout mode no .wav files are left behind — per-chunk
intermediate writes go to /tmp/chatterbox_stream_chunk_KK.wav and
are unlink()'d right after the bytes hit stdout. All log output
stays on stderr so the audio stream is clean.
./build/tts-cli \
--model models/chatterbox-t3-turbo.gguf \
--s3gen-gguf models/chatterbox-s3gen.gguf \
--text "Testing stdout streaming." \
--stream-first-chunk-tokens 10 --stream-chunk-tokens 50 \
--stream-cfm-steps 1 \
--out - \
| ffplay -f s16le -ar 24000 -ac 1 -nodisp -autoexit -

Validation: the PCM emitted to stdout is byte-for-byte identical to
the file written by the same invocation with a normal --out foo.wav, checked by loading both and taking a diff (max=0, rms=0).
Why not WAV-header-then-PCM? A live WAV header needs the total
sample count up front and we don't know it until the last chunk
finalises; writing a placeholder then patching after the fact doesn't
compose with pipe output. Raw s16le is what ffplay, aplay,
pacat, sox etc. accept natively, so no one loses in practice.
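The raw s16le framing is simple enough to sketch. The following is an illustration under stated assumptions, not the binary's actual writer: it assumes float samples in [-1, 1], scales by 32767 with clamping, and the helper names `float_to_s16le` and `write_chunk` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Convert float samples in [-1, 1] to little-endian signed 16-bit PCM bytes.
// Assumption: samples are scaled by 32767 and clamped; the real binary may
// scale or round differently.
std::vector<uint8_t> float_to_s16le(const std::vector<float>& samples) {
    std::vector<uint8_t> out;
    out.reserve(samples.size() * 2);
    for (float s : samples) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const int16_t v = static_cast<int16_t>(std::lround(clamped * 32767.0f));
        const uint16_t u = static_cast<uint16_t>(v);
        out.push_back(static_cast<uint8_t>(u & 0xFF));         // low byte first
        out.push_back(static_cast<uint8_t>((u >> 8) & 0xFF));  // then high byte
    }
    return out;
}

// Write one chunk and flush so a downstream player sees it immediately.
bool write_chunk(std::FILE* f, const std::vector<float>& samples) {
    const std::vector<uint8_t> bytes = float_to_s16le(samples);
    if (std::fwrite(bytes.data(), 1, bytes.size(), f) != bytes.size()) return false;
    return std::fflush(f) == 0;
}
```

No header and no length field are involved, which is why the stream composes with a pipe: the consumer only needs the sample rate, width and channel count on its own command line.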
End-to-end streaming verified audible on an Apple M4 with the Metal backend and the recommended low-latency preset:
./build/tts-cli \
--model models/chatterbox-t3-turbo.gguf \
--s3gen-gguf models/chatterbox-s3gen.gguf \
--text "…long paragraph…" \
--stream-first-chunk-tokens 10 \
--stream-chunk-tokens 25 \
--stream-cfm-steps 1 \
--n-gpu-layers 99 \
--out - \
| play -q -t raw -r 24000 -b 16 -e signed -c 1 -

Measured on the 48-text-token sentence "Hello from streaming Chatterbox, I am john and i work in google since 2010. I love to go out with my friends, eat some pizza and also drink some wine. I also love to traverl around the world alone." → 317 speech tokens → 12.68 s audio → 14 streaming chunks:
| chunk | tokens_total | T_mu | encoder | CFM step0 | HiFT | total ms | RTF |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 514 | 84 ms | 144 ms | 37 ms | 278 ms | 0.99 |
| 2 | 35 | 564 | 69 ms | 126 ms | 116 ms | 324 ms | 0.32 |
| 3 | 60 | 614 | 91 ms | 143 ms | 115 ms | 370 ms | 0.37 |
| 4 | 85 | 664 | 117 ms | 159 ms | 115 ms | 409 ms | 0.41 |
| 5 | 110 | 714 | 126 ms | 173 ms | 115 ms | 433 ms | 0.43 |
| 6 | 135 | 764 | 153 ms | 182 ms | 116 ms | 468 ms | 0.47 |
| 7 | 160 | 814 | 163 ms | 197 ms | 117 ms | 499 ms | 0.50 |
| 8 | 185 | 864 | 153 ms | 213 ms | 114 ms | 499 ms | 0.50 |
| 9 | 210 | 914 | 191 ms | 230 ms | 115 ms | 558 ms | 0.56 |
| 10 | 235 | 964 | 210 ms | 250 ms | 114 ms | 591 ms | 0.59 |
| 11 | 260 | 1014 | 187 ms | 257 ms | 115 ms | 579 ms | 0.58 |
| 12 | 285 | 1064 | 231 ms | 266 ms | 115 ms | 634 ms | 0.63 |
| 13 | 310 | 1114 | 208 ms | 280 ms | 113 ms | 614 ms | 0.61 |
| 14 | 317 | 1134 | 212 ms | 290 ms | 49 ms | 568 ms | 1.42 |
=== streaming done: 304320 samples (12.680 s),
first-chunk latency = 278.9 ms,
total wall = 11474.7 ms (overall RTF = 0.90) ===
Observations:
- First-audio-out: 279 ms on M4 + Metal. Chunk 1 is 10 tokens (~0.28 s of audio) and lands at RTF ~1.0 because the fixed encoder + CFM overhead dominates such a small chunk, but the wall-time number is what matters, and it's low.
- Steady-state RTF 0.3 – 0.6 for chunks 2–13 (each 1 s of audio). Well below real-time, so `sox play` stays ahead of playback on every chunk and there are no audible gaps.
- Chunk 14 is the "tail" finalise (only 0.4 s of audio; whatever's left after the last full 25-token boundary) so its RTF naturally drifts above 1. It completes before playback reaches it because chunks 11–13 produced excess buffered audio.
- Total wall time 11.47 s for 12.68 s of audio → overall RTF 0.90, i.e. even adding up every per-chunk cost, the pipeline is faster than real-time end-to-end.
Playback caveat on macOS 26 / ffmpeg 8.1: ffplay -f s16le -i - is
silent for piped raw PCM on our M4 test box (known SDL2 + CoreAudio
regression). sox play and Python sounddevice.play() work
reliably. README now recommends sox and shows the exact
invocation.
README gained a new "Streaming mode — low-latency playback" section
under "Useful flags" documenting the three --stream-* tunables, the
--out - stdout mode, the sox play recipe, and the table above.
That section plus this Phase 3d write-up are the canonical places for
future readers to pick up streaming from.
Every invocation currently pays ~200–400 ms fixed cost for graph
construction + gallocr_reserve + model load. Amortizing these over a
long-running process is free wall-time for a deployed service.
What to do:
- Daemonize with a simple stdio JSON-RPC or HTTP interface.
- Extend the `cfm_estimator_cache` pattern (from §3.8 Attempt 4) to the encoder and HiFT graphs: keep them pre-reserved across requests.
- Tensor shapes depend on input length, so either: (a) LRU of per-length graphs, (b) pad to a fixed max length + attention mask, or (c) rebuild on shape change but pool the buffers.
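Option (a) could look roughly like the sketch below. It is a generic LRU keyed on sequence length and nothing more: `GraphLru` is a hypothetical name, and `Graph` is a placeholder for whatever bundle of pre-reserved graph plus allocator a server mode would actually keep.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Minimal LRU keyed on sequence length. `Graph` is a placeholder type.
template <typename Graph>
class GraphLru {
public:
    explicit GraphLru(size_t capacity) : capacity_(capacity) {}

    // Returns the cached entry for `len`, or nullptr on a miss.
    Graph* get(int len) {
        auto it = index_.find(len);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // mark most recent
        return &it->second->second;
    }

    // Inserts or replaces; evicts the least recently used entry when full.
    void put(int len, Graph g) {
        auto it = index_.find(len);
        if (it != index_.end()) {
            it->second->second = std::move(g);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (items_.size() == capacity_) {
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(len, std::move(g));
        index_[len] = items_.begin();
    }

    size_t size() const { return items_.size(); }

private:
    using Item = std::pair<int, Graph>;
    size_t capacity_;
    std::list<Item> items_;  // front = most recently used
    std::unordered_map<int, typename std::list<Item>::iterator> index_;
};
```

Exact-length keys only pay off when requests repeat lengths; bucketing lengths before the lookup would trade some padding for a higher hit rate, which is essentially option (b).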
Scope: 2–3 days.
Impact: for repeated short utterances on the same server, another 20–30 % off wall time on top of the current RTF 0.28.
Right now a cloned voice is persisted as five .npy files under a
directory and loaded via --ref-dir DIR. That's convenient during
development but awkward to share: end users end up with a zip of five
opaque numpy files plus the C++ binary plus the original
chatterbox-s3gen.gguf. Most deployments would rather ship one
file — a voice-baked .gguf that works with the existing CLI as a
drop-in replacement for models/chatterbox-s3gen.gguf.
Fundamentally the five tensors are already first-class GGUF citizens:
s3gen/builtin/embedding, s3gen/builtin/prompt_token,
s3gen/builtin/prompt_feat live inside the base GGUF as-is, and the T3
side needs speaker_emb + cond_prompt_speech_tokens. So "baking a
voice" is just "rewrite those five tensor slots and copy everything
else through".
What to add:
- `--save-model PATH.gguf` (name tentative) that, combined with `--reference-audio PATH` or `--ref-dir DIR`, writes a new GGUF next to the original `chatterbox-s3gen.gguf` with the five voice tensors replaced. Bit-identical to the original in every other tensor and metadata entry, just a rewritten `builtin` block. The two voice tensors that belong on the T3 side (`speaker_emb`, `cond_prompt_speech_tokens`) could either live alongside in the same GGUF (preferred: the binary already knows how to look for them under a `s3gen/builtin/` prefix) or produce a matching `chatterbox-t3-turbo.<voice>.gguf` with those two tensors replaced.
- Zero runtime overhead once baked. Subsequent runs just use the new GGUF path as `--s3gen-gguf` and `--model`; no `--ref-dir`, `--reference-audio` or `.npy` files needed. The built-in-voice fallback in `chatterbox_tts.cpp` already reads from exactly those tensor names, so there's literally no new load-time code, just the converter.
- CLI UX: `tts-cli --reference-audio voice.wav --save-model alice.gguf --no-synthesize` should be enough to bake once and walk away. No `--text`, no wav output, just the new GGUFs on disk.
Scope: ~1 day. It's essentially a gguf re-write helper — read
the original, iterate tensors, substitute the five voice slots with
the freshly computed values, copy everything else through.
gguf_writer can do this directly; no new numeric code is needed.
Impact: clean distribution story. "Here is my voice" becomes a single 400 MB file instead of "here is this directory of numpy files and you need to know which C++ flag they go behind." Also opens up prebuilt-voice downloads on Hugging Face (cf. C3).
The S3Gen encoder's 10 Conformer blocks couldn't use flash_attn_ext
because they add ESPnet relative positional bias inside the softmax
(see §3.8 Attempt 8). A custom op that does
softmax(QKᵀ/√d + B) · V with B pre-computed [L, T, H] would fuse
those too.
Scope: 3–5 days — CPU AVX-512 kernel first, Metal/CUDA once (A2) is online.
Impact: maybe 50–100 ms off encoder (~10 % of encoder, which is already only 12 % of the pipeline). Small in absolute terms; does get you the same fusion level throughout.
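For reference, the math such a custom op would have to reproduce is small. The sketch below is a plain single-head CPU loop, not a proposed kernel: the row-major layouts and the name `biased_attention` are assumptions made for readability, and the per-head dimension of `B` is dropped.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-head reference: out = softmax(Q K^T / sqrt(d) + B) V.
// Q: [L, d], K: [T, d], V: [T, d], B: [L, T], all row-major (assumed layout).
std::vector<float> biased_attention(const std::vector<float>& Q,
                                    const std::vector<float>& K,
                                    const std::vector<float>& V,
                                    const std::vector<float>& B,
                                    int L, int T, int d) {
    std::vector<float> out(static_cast<size_t>(L) * d, 0.0f);
    std::vector<float> scores(T);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    for (int i = 0; i < L; ++i) {
        float max_s = -INFINITY;
        for (int j = 0; j < T; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) dot += Q[i * d + k] * K[j * d + k];
            scores[j] = dot * scale + B[i * T + j];  // bias added before softmax
            max_s = std::max(max_s, scores[j]);
        }
        float sum = 0.0f;
        for (int j = 0; j < T; ++j) {
            scores[j] = std::exp(scores[j] - max_s);  // subtract max for stability
            sum += scores[j];
        }
        for (int j = 0; j < T; ++j) {
            const float w = scores[j] / sum;
            for (int k = 0; k < d; ++k) out[i * d + k] += w * V[j * d + k];
        }
    }
    return out;
}
```

A fused kernel would compute the same thing without materialising the full score matrix; a loop like this is mainly useful as a parity oracle when testing one.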
Multiple utterances in one pass. Python supports it; our C++ pipeline assumes batch=1 throughout. Only matters at scale (multiple concurrent users).
- GitHub Actions CI running `compare-tokenizer.py` + `test-s3gen ALL` on every push. All the validation infrastructure is already in place; wiring it takes a few hours.
- Prebuilt GGUFs on Hugging Face so end users don't need the Python toolchain at all. Upload the two `.gguf` files with a model card explaining the build.
- Library API (not just binaries). Expose `chatterbox_synthesize(text, opts) -> wav` as a C / C++ API so Swift / Node.js / Python bindings can layer on top. ~Half a day.
With A1 (voice cloning), A2 (GPU backends), A3 (T3 quantization), and B1 (streaming) done, the remaining high-impact work is:
- B3 — Bake voice into GGUF (~1 day) → cleanest distribution story for sharing custom voices; makes prebuilt-voice downloads on Hugging Face (C3) actually shippable.
- C3 — CI + prebuilt GGUFs — pick up before announcing publicly.
- T3 autoregressive speedup (speculative decoding, or a smaller T3 draft model). Biggest chunk of wall time left on both Metal and Vulkan now that HiFT is fast.
B2 (server mode) and C1 (custom Conformer attn op) are worth doing once a concrete deployment is pressuring for them; the CPU numbers are already well past real-time for CLI use, and the GPU numbers are at multi-x real-time with zero extra work.
Mirrors the shape stable-diffusion.cpp uses with its
SD_USE_SYSTEM_GGML switch. The standalone Chatterbox dev workflow
(everything described above) is intentionally untouched.
- A single 13/-2-line additive edit to the top of `CMakeLists.txt`:

      option(TTS_CPP_USE_SYSTEM_GGML "tts-cpp: use system-installed GGML library" OFF)
      if (NOT TARGET ggml)
          if (TTS_CPP_USE_SYSTEM_GGML)
              find_package(ggml CONFIG REQUIRED)
              if (NOT ggml_FOUND)
                  message(FATAL_ERROR "System-installed GGML library not found.")
              endif()
              add_library(ggml ALIAS ggml::ggml)
          else()
              add_subdirectory(ggml)
          endif()
      endif()

  - Default `OFF` -> `add_subdirectory(ggml)`: pre-existing standalone flow, byte-identical to before.
  - `ON` (set by the `tts-cpp` vcpkg port at configure time): pulls `ggml::ggml` from a separately-installed ggml package, ignores the local `ggml/` tree.
- `ggml/` and `patches/` directories are kept on the branch as-is. `scripts/setup-ggml.sh` and `patches/ggml-metal-chatterbox-ops.patch` remain the canonical reference for re-applying the Metal patch against future ggml syncs.
- An external `ggml` overlay port was published off ggml `master` (same commit `stable-diffusion-cpp` builds against) with the same Metal patch we ship under `patches/ggml-metal-chatterbox-ops.patch` applied as real source commits. The patch file itself is retained alongside the overlay as the source-of-truth artefact for re-application against future ggml syncs.
- A vcpkg registry now publishes:
  - `ggml`: REPO/REF bumped to the overlay head carrying the Metal chatterbox ops. Backward compatible for `stable-diffusion-cpp` / `whisper-cpp` (additive Metal kernels + opt-in fusion gated by function constants).
  - `tts-cpp`: REF bumped to the chatterbox.cpp commit that introduces `TTS_CPP_USE_SYSTEM_GGML`; passes `-DTTS_CPP_USE_SYSTEM_GGML=ON`; drops every `-DGGML_*` configure option, the Android Vulkan-Headers download block, the `GGML_VULKAN_DISABLE_COOPMAT*` knobs and the NDK glslc detection (all of those now live inside the `ggml` port). Declares an explicit `ggml` dependency with `metal` / `vulkan` feature forwarding (mirrors `stable-diffusion-cpp/vcpkg.json`).
- chatterbox.cpp standalone (Apple M4, Metal): clean configure + build of every target with default `-DTTS_CPP_USE_SYSTEM_GGML=OFF`; `test-metal-ops` parity-checks all four patched ops (`diag_mask_inf`, `pad_ext` with `lp0..lp3`, `conv_transpose_1d` at the three chatterbox upsample stages and the tiny edge case); CLI smoke synth produces an 86 KB WAV in 3.2 s (T3 642 ms / S3Gen 635 ms / 1.84 s audio, RTF 0.34).
- Downstream addon (darwin-arm64, Metal): cold-cache vcpkg resolve picks up both new ports, the addon links against `ggml::ggml` with no further changes; unit suite 38/38, integration 4/4 (Whisper round-trip 0.0% WER on "How are you doing today?", native chunk streaming emits 8 chunks, sentence streaming RTF 0.5448).
§3.20 left the multilingual M4 baseline at RTF 1.37 / 1.65 (Q4_0 /
F16) and itemised three follow-ups the §3.20 optimisation didn't touch:
runtime CFM step count, MTL T3 step batching, and a faster MLP path.
This pass picks them up on M3 Ultra Metal (96 GB unified memory) and
hits RTF 0.30 (Q4_0) / 0.32 (F16) end-to-end on the same Spanish
prompt, seed 42, --temp 0 --top-k 1, voice = jfk.wav. Pre-rationale
in /Users/user002/.cursor/plans/mtl_metal_optimization_breadth_7807d6e0.plan.md;
this section is the post-mortem with positive and negative findings.
M3 Ultra baseline (before this pass), prompt + seed identical to the §3.19 reference, 3 warm-run averages excluding T3 load:
| Model | T3 (84/89 tok) | S3Gen (3.48/3.68 s audio, N=10) | Total | RTF |
|---|---|---|---|---|
| Q4_0 | 872 ms / 84 tok | 740 ms | 1612 ms | 0.46 |
| F16 | 1099 ms / 89 tok | 844 ms | 1943 ms | 0.53 |
(M3 Ultra was already well under RTF 1.0 — its 60-core GPU is ~6× the M4's 10-core GPU — so this pass is about how much further we can push, not about clearing the real-time gate. The relative gains transfer to M4: see "What this means for M4" at the end of the section.)
Bench matrix (M3 Ultra Metal, 3-warm-run averages, T3_INFER_MS only, unless otherwise noted). Columns are cumulative as labelled: each heading lists the optimisations applied on top of the baseline.
| Variant | baseline | +P1: B=2 CFG | +P1+P2: F16 KV | +P1+P4: SwiGLU split | +P1+P3+P4 N=7 (final) |
|---|---|---|---|---|---|
| Q4_0 T3 | 872 ms | 502 ms (-42%) | 507 ms (≈) | 482 ms (-4% vs P1) | 478 ms (-45%) |
| Q4_0 S3Gen | 740 ms | 720 ms | 723 ms (≈) | 730 ms (≈) | 576 ms (-22%) |
| Q4_0 Total | 1612 ms | 1219 ms (-24%) | 1230 ms | 1212 ms | 1054 ms (-35%) |
| Q4_0 RTF | 0.46 | 0.35 | 0.35 | 0.35 | 0.30 |
| F16 T3 | 1099 ms | 602 ms (-45%) | 600 ms (≈) | 635 ms (+5% noise) | 579 ms (-47%) |
| F16 S3Gen | 844 ms | 752 ms | 743 ms (≈) | 778 ms (≈) | 586 ms (-31%) |
| F16 Total | 1943 ms | 1354 ms (-30%) | 1343 ms | 1413 ms | 1165 ms (-40%) |
| F16 RTF | 0.53 | 0.37 | 0.36 | 0.38 | 0.32 |
Raw stderr per phase saved under artifacts/bench/mtl-metal-m3u-*.txt
(baseline + per-phase + cfm-sweep + final). Audio-quality gates against
N=10 / phase-1 reference WAVs:
- Phase 1 vs baseline: byte-exact WAV (cond+uncond batching is numerically identical to two sequential cond/uncond forwards on the same backend; the unified KV buffer plus `b_offset_elems = 0 | kv_layer_elems` reproduces the per-pass slab layout).
- Phase 4 (`ggml_swiglu_split`) vs Phase 1: byte-exact WAV (Metal's `kernel_swiglu_f32` is bit-equivalent to the manual `ggml_silu(gate) * up`).
- `--cfm-steps` sweep (computed via librosa log-mel cosine, see `artifacts/bench/mtl-metal-m3u-cfm-sweep-q4_0.txt`):

  | N | S3Gen ms | log-mel cos vs N=10 | PCM cos vs N=10 |
  |---|---|---|---|
  | 6 | 518 ms | 0.9897 | 0.8836 |
  | 7 | 571 ms | 0.9954 | 0.9414 |
  | 8 | 629 ms | 0.9972 | 0.9702 |
  | 10 | 730 ms | 1.0000 | 1.0000 |

  N=7 cleanly clears the cos ≥ 0.99 gate; N=6 sits right on the threshold (PCM cosine drops to 0.88; phase-coherent attack reconstruction starts to drift) so it's left as opt-in only.
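The similarity metric behind the sweep is ordinary cosine similarity. The sweep itself was computed with librosa in Python; the C++ below only restates the metric, and `passes_quality_gate` is a hypothetical helper that mirrors the cos ≥ 0.99 threshold quoted above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two equal-length signals; returns 0 on a size
// mismatch, empty input, or zero-norm input.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Hypothetical gate mirroring the cos >= 0.99 threshold.
bool passes_quality_gate(double log_mel_cos, double threshold = 0.99) {
    return log_mel_cos >= threshold;
}
```

Applied to flattened log-mel frames it gives the "log-mel cos" column; applied to raw samples it gives the much stricter "PCM cos" column, which also penalises phase drift.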
Phase 1 — CFG cond+uncond batched into one Metal forward (B=2) — biggest win on both Q4_0 (-42%) and F16 (-45%).
The §3.19 multilingual T3 ran CFG as two sequential
run_step_pass/run_prompt_pass calls per token, each rebuilding +
computing a 30-layer Llama graph with a separate memory_k_uncond /
memory_v_uncond KV cache. On Metal this doubled the per-step
kernel-dispatch + weight-read overhead, exactly the cost that use_b2
already eliminated for S3Gen's CFM (src/chatterbox_tts.cpp:1994 /
§3.19). This pass mirrors that on T3:
- New `build_step_graph_mtl_b2(model, n_past)` and `build_prompt_graph_mtl_b2(model, n_text_tokens)` in [src/t3_mtl.cpp]. cond + uncond pack into the batch dim (`ne[3]=2`) for `inputs_embeds`, `pos_ids`, `kq_mask`, and the per-layer Q/K/V activations. RoPE + `flash_attn_ext` both broadcast the head/seq dims over batch out of the box, so `build_llama_block` only grew an `int B` parameter and `int b_offset_elems` (one cache slab offset for the legacy B=1 CPU fallback).
- KV layout rework. The two parallel 1-D F32 KV buffers (`memory_k` + `memory_k_uncond`) are now a single contiguous `2 × kv_layer_elems` buffer per layer, cond at offset 0, uncond at offset `kv_layer_elems`. Per-layer slab stride is therefore `2 * head_dim * n_ctx * n_kv_head * sizeof(F)`. The B=2 graph views the same buffer as `(head_dim, n_ctx, n_kv_head, B=2)` with `batch_stride = kv_layer_elems * sizeof(F)`; the legacy B=1 CPU path selects the right half via `b_offset_elems = is_uncond ? kv_layer_elems : 0`. Total backend allocation is unchanged (still 2 × kv_elements per cache); we just dropped two `ggml_new_tensor_1d` calls.
- `eval_step_mtl` / `eval_prompt_mtl` dispatch the B=2 path when `!ggml_backend_is_cpu(model.backend)`, which exactly mirrors `use_b2` in S3Gen. CPU keeps the two-call path for the same reason §3.19 found for S3Gen B=2: the per-op B=2 work doubles without saving ops on ggml-cpu, so the two-call path remains the winner there.
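The offset arithmetic of the unified layout can be written down directly. This is a sketch under the assumption that all layers sit back to back in one buffer with the per-layer stride stated above; `KvShape` and the helper names are illustrative, not the identifiers used in `src/t3_mtl.cpp`.

```cpp
#include <cstddef>

// Shape parameters of one KV cache (names follow the prose above).
struct KvShape {
    size_t head_dim;
    size_t n_ctx;
    size_t n_kv_head;
    size_t elem_size;  // sizeof(F): 4 for F32, 2 for F16
};

// Elements in one layer's cache for ONE of the two CFG passes.
size_t kv_layer_elems(const KvShape& s) { return s.head_dim * s.n_ctx * s.n_kv_head; }

// Bytes between consecutive layers in the unified buffer (cond + uncond).
size_t layer_stride_bytes(const KvShape& s) { return 2 * kv_layer_elems(s) * s.elem_size; }

// Byte offset of the cond or uncond half of `layer`.
size_t kv_offset_bytes(const KvShape& s, size_t layer, bool is_uncond) {
    const size_t b_offset_elems = is_uncond ? kv_layer_elems(s) : 0;
    return layer * layer_stride_bytes(s) + b_offset_elems * s.elem_size;
}
```

With head_dim 64, n_ctx 4096 and 16 KV heads in F32, each half is 16 MiB and each layer slab is 32 MiB, so the B=2 view and the B=1 fallback address the same bytes through different strides.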
Parity gates passed:
- Greedy decode token parity at `--temp 0 --top-k 1`: first 100 tokens identical to the two-call baseline on seed 42.
- End-to-end WAV byte-exact match vs the §3.19 reference run on Q4_0 and F16 (`cmp /tmp/baseline_q4_0_r3.wav /tmp/phase1_q4_0.wav` → identical, same for F16).
- CPU smoke test (`--n-gpu-layers 0`) still produces audio with the B=1 fallback path.
Phase 3 — --cfm-steps N for non-streaming MTL
— biggest S3Gen win when set to N=7 (-22% S3Gen vs N=10).
Pre-§3.21, only --stream-cfm-steps propagated into
s3gen_synthesize_opts.cfm_steps; non-streaming MTL was locked at the
GGUF's n_timesteps=10. Even though s3gen_synthesize_opts.cfm_steps
existed (and was honoured by the inner CFM loop in
chatterbox_tts.cpp:1973), [src/chatterbox_cli.cpp] never surfaced it.
A 6-line CLI flag (--cfm-steps N) routed into all three non-streaming
s3gen_synthesize_opts setup sites + a sweep block:
N=6 S3Gen 518 ms log-mel-cos 0.990 PCM-cos 0.88 (borderline)
N=7 S3Gen 571 ms log-mel-cos 0.995 PCM-cos 0.94 ← recommended knee
N=8 S3Gen 629 ms log-mel-cos 0.997 PCM-cos 0.97
N=10 S3Gen 730 ms log-mel-cos 1.000 PCM-cos 1.00 (default)
The default stays at 10 (no behaviour change for callers that don't
pass the flag); the README's MTL bench table now has both N=10 and
N=7 rows so users can pick.
Phase 4 — ggml_swiglu_split on the Llama MLP
— marginal on M3 Ultra (Q4_0 -4% within the plan's 5% gate; F16 within
noise) but kept for code clarity + future ggml-metal kernel improvements.
Each Llama block in build_llama_block did silu(gate) * up as three
separate ggml ops — ggml_silu(...), ggml_mul_mat(mlp_up, ...),
ggml_mul(silu_out, up_out) — i.e. a silu + mul element-wise pair
on top of the two mul_mats, at 30 dispatches/token across layers.
Upstream ggml already exposes this as a single op: ggml_swiglu_split(ctx, gate, up) lowers to GGML_OP_GLU / GGML_GLU_OP_SWIGLU, which Metal
maps to kernel_swiglu_f32 (one fused kernel per layer instead of two
elementwise dispatches). The pre-norm ggml_mul(ggml_rms_norm(...), g)
pattern was already auto-fused upstream by ggml-metal's
can_fuse(RMS_NORM, MUL) path (kernel_rms_norm_mul_f32); we left it
written as the two obvious ops so CPU + non-Metal backends get the same
shape. Net WAV output: byte-exact vs Phase 1.
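Numerically, the fusion changes nothing about the math, only the number of passes over memory. The scalar sketch below contrasts the two forms on the CPU; it is a plain reference for `silu(gate) * up`, not ggml or Metal code, and the function names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Unfused form: one pass for silu(gate), a second pass for the multiply.
std::vector<float> swiglu_two_pass(const std::vector<float>& gate,
                                   const std::vector<float>& up) {
    std::vector<float> tmp(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) tmp[i] = silu(gate[i]);
    for (size_t i = 0; i < gate.size(); ++i) tmp[i] *= up[i];
    return tmp;
}

// Fused form: a single pass with no intermediate buffer round-trip.
std::vector<float> swiglu_fused(const std::vector<float>& gate,
                                const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) out[i] = silu(gate[i]) * up[i];
    return out;
}
```

Both functions assume `gate` and `up` have the same length. On a GPU the second pass corresponds to an extra kernel dispatch per layer, which is the overhead the fused op removes.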
The plan called out three "trades to verify empirically". All three got measured; two were reverted.
Phase 2 — F16 KV cache. Reverted: neutral on M3 Ultra.
Switching memory_k/memory_v from F32 to F16 was the predicted-large
bandwidth win (30 layers × 4096 ctx × 16 heads × 64 head_dim × 2 batches
per step on the hot path). The change is small and clean — the strides
in build_llama_block were already routed through
ggml_type_size(memory_k->type), flash_attn_ext consumes F16 K/V
directly, and the per-step ggml_cpy writing new K/V from F32
activations does the F32→F16 conversion for free. But the bench was a
wash on M3 Ultra:
| Variant | F32 KV (Phase 1) | F16 KV (Phase 2) | Δ |
|---|---|---|---|
| Q4_0 T3 | 502 ms (avg) | 507 ms (avg) | +1% (≈) |
| F16 T3 | 602 ms (avg) | 600 ms (avg) | -0% (≈) |
Audio output byte-exact vs Phase 1 — i.e. the F16 storage didn't even
change the compute precision. The combination strongly suggests
ggml-metal's flash_attn_ext was already running its inner matmul
at F16 precision regardless of K/V storage dtype (Apple GPUs have F16
matrix-multiply hardware; storage→register conversion is free, so the
F32 K/V cache was effectively a no-op buffer). Reverted to F32 storage
to keep the §3.19 numerics envelope exactly preserved; the
type-size-aware strides stay in place, so switching back is a one-line flip
(GGML_TYPE_F32 → GGML_TYPE_F16 in load_model_gguf_mtl) and a
memory-bound backend (e.g. an M4 with 10 GPU cores where bandwidth
does matter) can opt back in without touching the graph code. Bench artefacts
under artifacts/bench/mtl-metal-m3u-phase2-{q4_0,f16}.txt.
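For scale, the footprint implied by the dimensions quoted above is easy to compute. The sketch counts the full capacity of one cache (K or V) at maximum context; the bytes actually touched per step depend on `n_past`, so this is an upper bound on traffic rather than a measurement.

```cpp
#include <cstdint>

// Bytes for ONE of the K or V caches at full context, using the
// dimensions quoted in the paragraph above.
constexpr uint64_t kv_cache_bytes(uint64_t n_layer, uint64_t n_ctx, uint64_t n_head,
                                  uint64_t head_dim, uint64_t n_batch,
                                  uint64_t elem_size) {
    return n_layer * n_ctx * n_head * head_dim * n_batch * elem_size;
}

constexpr uint64_t kF32KvBytes = kv_cache_bytes(30, 4096, 16, 64, 2, 4);  // 960 MiB
constexpr uint64_t kF16KvBytes = kv_cache_bytes(30, 4096, 16, 64, 2, 2);  // 480 MiB
static_assert(kF32KvBytes == 2 * kF16KvBytes, "F16 halves the cache footprint");
```

Halving a 960 MiB buffer sounds large, but a short utterance only ever reads the first few hundred context positions, which is one plausible reason the change was a wash here.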
Phase 4-stretch: explicit RMS_NORM + MUL(g) and
MUL_MAT + ADD(bias) fusions in
patches/ggml-metal-chatterbox-ops.patch. Not shipped.
Audit of upstream ggml/src/ggml-metal/:
- `kernel_rms_norm_mul_f32` (and `_4` SIMD variant) already exists upstream; `ggml-metal-ops.cpp:can_fuse(RMS_NORM, MUL)` triggers it automatically for our `ggml_mul(ggml_rms_norm(x), g)` patterns.
- `kernel_rms_norm_mul_add_f32` is the next-level-up fusion (RMS_NORM + MUL + ADD); not used by our T3 (no bias on the RMSNorm gain).
- `kernel_bin_fuse_impl` already chains element-wise ops.
- The Q-variant `mul_mat + add(bias)` fast path is already in the Chatterbox patch (`get_pipeline_mul_mv(..., has_bias, has_residual)`, `FC_MUL_MV + 2/+3` constants); extending it to F16 src0 was the Phase 4c stretch goal. Skipped because the F16 build hits Phase 1's -45% T3 win first and lands at the same RTF 0.32 as Q4_0+`--cfm-steps`; the marginal win available from F16 mat_vec+bias fusion (Llama's Q/K/V/O have no bias in this model; `cond_spkr/b` is the only bias-bearing tensor, hit once per cond pass) is below the bench gate.
Net: zero new lines of Metal-kernel patch. Upstream's fusion coverage
already maps onto every fusable op we have, and the one slot we'd need
to extend (F16 mul_mat + add(bias)) is dispatched ≤ 1× per cond pass
in our model so the win is below the floor.
§3.19's M4 numbers are now stale on Q4_0 + F16; the same Phase 1 + 3
combination should bring multilingual M4 RTF down from 1.37 to ≈ 1.04
(if T3 scales with the same -42% as M3 Ultra: 1865 ms × 0.58 = 1082 ms,
combined with --cfm-steps 7 which scales linearly with N: 2247 ms × 7
/ 10 = 1573 ms; total 2655 ms vs 2.56 s audio → RTF 1.04). Worth
re-benchmarking on real M4 hardware before claiming the speedup. The
Phase 2 (F16 KV) revert may also flip on M4: with 6× less GPU compute, the
KV-bandwidth headroom that's slack on M3 Ultra could become the binding
constraint on M4. Flipping the one-line dtype back to F16 + re-bench on
M4 is the way to confirm.
Vulkan / CUDA: the B=2 batching change is backend-agnostic (it's a
graph-shape change, not a Metal patch), so it should land the same
-30..-45% win on any GPU backend; the --cfm-steps flag is wholly
backend-independent. No measurements collected here — left as a
follow-up.
| File | Change |
|---|---|
| src/chatterbox_t3_internal.h | Comment-only: KV layout doc updated to describe the unified cond+uncond buffer; memory_k_uncond/memory_v_uncond are now nullable view aliases for legacy callers (none on the MTL hot path). |
| src/t3_mtl.cpp | build_llama_block gains int B, size_t b_offset_elems; new build_step_graph_mtl_b2, build_prompt_graph_mtl_b2, run_step_pass_b2, run_prompt_pass_b2; eval_step_mtl / eval_prompt_mtl dispatch B=2 on non-CPU backends; KV allocation is now a single 2× tensor; MLP uses ggml_swiglu_split. |
| src/chatterbox_cli.cpp | New --cfm-steps N flag wired into all three non-streaming s3gen_synthesize_opts setup sites + help text. |
| README.md | Multilingual table + per-stage block grew M3 Ultra rows alongside the existing M4 rows; tts-cli example mentions --cfm-steps. |
| artifacts/bench/mtl-*-m3u-*.txt | Raw stderr per phase + cfm-sweep + final. |
- T3 Q4/Q5/Q8 quantisation: shipped in §3.19 (reused via `_load_requantize_policy`).
- Quantised CFM estimator weights: shipped in §3.20.
- Runtime `--cfm-steps N`: shipped in §3.21.
- Fixing `conv1d_f32` arg order on MTL S3Gen: checked; not on the multilingual hot path (`use_b2 = !cpu` already routes through the batch-2 conv path).
- Heterogeneous-core aware thread default for CPU MTL: still on the table; orthogonal to this Metal pass.
- ja / he / ru / zh / hi tokenizer support — separate sub-projects; out of scope for §3.21.
- Speculative decoding for T3 — long-tail item from §3.20 backlog.
- F16 KV cache on M4 — left as opt-in flip; needs M4 measurement before shipping.
3.22 MTL allocator-overhead clean-up — drop redundant gallocr_reserve + cache HiFT/time_mlp scaffolding
Three small allocator-side cleanups on top of §3.21. The bench
deltas are within run-to-run noise on M3 Ultra (~2% on T3, CFM and
HiFT individually, ~1.4% on total wall) but they remove
unambiguously wasted work that lands harder on slower CPUs and
older Metal builds where the topology-walk and 64 MB memset are
proportionally more expensive. All three pass the byte-exact WAV
gate against §3.21 HEAD (md5 79002f09bc48dda95ec0c2cfc2b895bd).
Three changes, listed in order of attack-surface:
- Drop `ggml_gallocr_reserve` before `ggml_gallocr_alloc_graph`. `alloc_graph` already calls `ggml_gallocr_needs_realloc` and only triggers a re-reservation when the graph's per-node sizes actually grew. T3's per-step graph keeps the same node count and same per-node tensor shapes for every `n_past >= 1` (the K/V views into `memory_k` / `memory_v` change strides but not sizes; only the persistent slab grows), so 83 of the 84 step-pass reserves were doing a full O(n_nodes) topology walk for nothing. Affects all four `run_*_pass[_b2]` paths in `t3_mtl.cpp`.
- `run_hift_decode` 64 MB scratch buffer → `thread_local`. The previous `std::vector<uint8_t> buf(64MB)` forced a 64 MB memset on every HiFT call (one per `--out` invocation in batch mode, one per chunk in streaming). `ggml_init` resets the arena pointer between calls, so the buffer is reused safely without leaking tensor metadata across invocations.
- `compute_time_mlp` graph + gallocr → `thread_local time_mlp_cache`. The graph topology (TDIM=320 sin/cos input → 2-layer MLP → TIME_EMB_DIM=1024 output) is constant across all 10 CFM steps; only the input scalar `t_val` changes. The cache key is `(backend)` so a backend swap rebuilds. Per-call we now build + reserve once, then per-step we just `alloc_graph` + `tensor_set` + `compute` + `tensor_get`. Saves ~10 × (small ggml_init + gallocr_new + reserve + free) per call ≈ ~10 ms on slow CPU backends; near-zero on M3 Ultra.
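The second and third changes follow a common pattern that can be sketched without any ggml types. This is an illustration only: `scratch_arena` and `TimeMlpCache` are hypothetical stand-ins, and the "build" step is reduced to a counter where the real code would construct the graph and reserve the allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a per-thread scratch arena of at least `bytes` bytes. The vector
// is allocated (and zero-filled) only when it has to grow, so repeated
// calls with the same size skip the large memset.
uint8_t* scratch_arena(size_t bytes) {
    thread_local std::vector<uint8_t> buf;
    if (buf.size() < bytes) buf.resize(bytes);
    return buf.data();
}

// Hypothetical cache keyed on a backend handle; rebuilt when the key changes.
struct TimeMlpCache {
    const void* backend = nullptr;
    int builds = 0;  // counts rebuilds, for illustration only

    void ensure(const void* be) {
        if (backend == be && builds > 0) return;  // hit: reuse graph + allocator
        backend = be;
        ++builds;  // a real cache would build the graph and reserve here
    }
};
```

The arena is safe to reuse only because each call re-initialises its own bookkeeping on top of it; stale bytes from the previous call are never interpreted.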
Bench (M3 Ultra, Q4_0, ES prompt, seed 42, --temp 0 --top-k 1, jfk.wav voice, 3 invocations averaged)
| Stage | §3.21 base | §3.22 (this) | Δ |
|---|---|---|---|
| T3 ms | 479 | 470 | -1.9% |
| cfm_total | 561 | 550 | -2.0% |
| hift_decode | 128 | 125 | -2.3% |
| S3Gen ms | 730 | 722 | -1.1% |
| Total ms | 1209 | 1192 | -1.4% |
WAV byte-exact gate: md5 79002f09bc48dda95ec0c2cfc2b895bd matches
across both branches at all three invocations. Within-noise on M3
Ultra but unambiguous direction across runs.
The per-CFM-step empirical breakdown (from --verbose) is:
step 0 = 73 ms, step 1..9 ≈ 53 ms each. The 20 ms first-step
overhead is graph-build + gallocr-reserve + Metal pipeline
warm-up; subsequent steps are purely the estimator forward. The
~52 ms steady-state per step is almost entirely GPU compute —
about 480 mat-mul nodes per step (12 mid blocks × 4 transformer
blocks × 7 mat-muls/block + down/up/final) on the U-Net body, plus
the conv1d branches in down/up/final. Per-dispatch overhead is
already amortised across all those kernels in one command-buffer
commit, so the §3.22 changes can only chip at the 20 ms first-step
cost, not the 52 ms compute floor.
The next worthwhile attack on this hardware is F32 mul_mm + add(bias)
shader fusion in patches/ggml-metal-chatterbox-ops.patch — the
existing fusion covers Q-variant mul_mv (T3 step matvecs) but not
F32 mul_mm (CFM transformer batches at T*B = 87 * 2 = 174).
Estimate: ~280 fuse opportunities per CFM step × 10 steps =
~2800/call. Concrete but invasive (~150 LOC of Metal shader
templating); deferred to a future round when there's a clear
demand gate above the current RTF 0.30 / 0.32 multilingual numbers.
| File | Change |
|---|---|
| src/t3_mtl.cpp | Drop ggml_gallocr_reserve from run_step_pass, run_prompt_pass, run_step_pass_b2, run_prompt_pass_b2; alloc_graph covers the lazy-reserve case. |
| src/chatterbox_tts.cpp | run_hift_decode scratch buf → thread_local; new time_mlp_cache keyed on backend, hoisting per-step build/reserve. |
Phase 1 of §3.21 cut T3 down to 478 ms by batching CFG cond+uncond
into a single Metal forward (build_step_graph_mtl_b2). Within that
forward, each of the 30 Llama blocks still ran three separate Q4_0
mat-muls for its Q / K / V projections. Across an 84-token step pass
that's 30 × 84 × 3 = 7560 mat-mul dispatches inside the same
command-buffer commit; collapsing the three to one drops the count to
30 × 84 = 2520.
Implementation. chatterbox_model gains a ctx_stack /
buffer_stack pair and llama_layer gains
wqkv : [n_embd, 3 * n_embd] (Q4_0). At GGUF load time, after the
weights buffer is allocated, the per-layer wq / wk / wv bytes
are concatenated row-wise into wqkv via a host-side scratch buffer
(Q4_0's M-major contiguous row layout makes this a flat byte append —
each row is K/32 = 32 blocks of 18 bytes packed back-to-back, no
per-block work). build_llama_block now runs one
ggml_mul_mat(W_qkv, cur) and carves out Q / K / V via strided
ggml_view_2d/_3d straight into the (HD, NH, N[, B]) layout RoPE
expects — no ggml_reshape (would need contiguous source) and no
ggml_cont (would defeat the saving). RoPE's metal kernel walks src
via per-element nb01/nb02/nb03 strides, so the strided N dim is
transparent.
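The "flat byte append" claim can be illustrated with a small host-side sketch. It assumes the standard Q4_0 block of 32 elements in 18 bytes (a 2-byte f16 scale plus 16 bytes of packed 4-bit values); the helper names are hypothetical and this is not the loader code in src/t3_mtl.cpp.

```python
Q4_0_BLOCK_ELEMS = 32   # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 18   # 2-byte f16 scale + 16 bytes of packed 4-bit values


def q4_0_row_bytes(k):
    """Bytes occupied by one row of K elements stored as Q4_0."""
    if k % Q4_0_BLOCK_ELEMS != 0:
        raise ValueError("K must be a multiple of 32 for Q4_0")
    return k // Q4_0_BLOCK_ELEMS * Q4_0_BLOCK_BYTES


def stack_qkv(wq, wk, wv, n_rows, k):
    """Concatenate three [n_rows, k] Q4_0 byte buffers into one [3 * n_rows, k] buffer."""
    expected = n_rows * q4_0_row_bytes(k)
    for name, buf in (("wq", wq), ("wk", wk), ("wv", wv)):
        if len(buf) != expected:
            raise ValueError(f"{name}: expected {expected} bytes, got {len(buf)}")
    # Rows are stored back-to-back, so stacking along the row axis is a plain append.
    return bytes(wq) + bytes(wk) + bytes(wv)
```

For K = 1024 this gives 32 blocks, i.e. 576 bytes per row, and the stacked buffer is simply three times the size of one projection with no per-block repacking.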
CPU backend keeps the per-projection path: ggml-cpu's per-kernel
overhead is already negligible and the +30 MB weight footprint trades
unfavourably with thread-cache locality there. Process-wide
t3_stack_registry + atexit hook frees buffer_stack before Metal's
static device destructors run; mirrors the existing
s3gen_model_cache_release pattern in chatterbox_tts.cpp.
Why gate / up isn't stacked. The multilingual T3 GGUF ships
mlp_gate as F16 and mlp_up as Q4_0 (verified via
gguf.GGUFReader('models/chatterbox-t3-mtl-q4_0.gguf')). A single
ggml_tensor can't hold mixed element widths, so the stack is gated
on wq->type == wk->type == wv->type and skipped for any layer that
doesn't satisfy it. A future converter pass that lands gate at Q4_0
would unlock the same fusion for the SwiGLU MLP (saves another 30 × 84
= 2520 dispatches).
Why CFM transformer Q/K/V isn't stacked. Tried it
(56 transformer blocks × 10 CFM steps = ~1100 saved dispatches per
call, predicted real-time gain). CFM regresses by ~15 % on
cfm_total (549 → 632 ms). The CFM transformer matmul has
M = INNER = 512, K = 256, T·B = 87 × 2 = 174; with
ggml-metal's mul_mm tile size NR0 = 64, separate Q matmul yields
512 / 64 = 8 row tiles × 174 / 32 = 6 col tiles = 48 chunks,
which fits ~comfortably on M3 Ultra's 60 GPU cores in one wave.
Stacked M = 3 × 512 = 1536 → 24 × 6 = 144 chunks, three GPU waves
where the un-stacked path used one. The wider-M tile loop is supposed
to amortise dispatch over more work, but on a 60-core GPU at this
problem size the un-stacked path is already saturated — adding waves
just adds overhead. Reverted. (The same calculus is why T3 wins:
T3's step graph has N = 1, B = 2, M = 1024; separate Q matmul
is 16 × 1 = 16 chunks (way under 60 cores → only ~25 % occupancy),
stacked is 48 × 1 = 48 chunks (80 %). So the lever is exactly
"how undersaturated is the un-stacked GPU mat-mul".)
| Variant | T3 §3.22 base | T3 +Phase 15 | Δ T3 | Total §3.22 base | Total +P15 | Δ Total |
|---|---|---|---|---|---|---|
| Q4_0 | 474 ms | 433 ms | -8.7% | 1192 ms | 1153 ms | -3.3% |
| F16 | 522 ms | 493 ms | -5.5% | ~ | ~ | ~ |
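The occupancy argument behind the CFM revert and the T3 win can be reproduced with a few lines of arithmetic. This is a rough model under stated assumptions: one output tile per GPU core per wave, a row tile of 64 (the NR0 quoted above), a column tile of 32 (implied by the "174 / 32 = 6" figure), and 60 cores. Real scheduling is more complicated, so treat the output as an illustration of the reasoning rather than a performance model.

```python
import math

NR0 = 64        # mul_mm row-tile size quoted in the text
NR1 = 32        # column-tile size implied by the "174 / 32 = 6" figure
GPU_CORES = 60  # M3 Ultra core count used in the text


def mul_mm_chunks(m, n):
    """Number of output tiles for an M x N mat-mul result."""
    return math.ceil(m / NR0) * math.ceil(n / NR1)


def waves(chunks, cores=GPU_CORES):
    """Waves needed if each core takes one tile at a time (simplifying assumption)."""
    return math.ceil(chunks / cores)


cases = {
    "CFM separate Q": (512, 174),
    "CFM stacked QKV": (1536, 174),
    "T3 separate Q": (1024, 2),
    "T3 stacked QKV": (3072, 2),
}
for name, (m, n) in cases.items():
    c = mul_mm_chunks(m, n)
    busy = min(c / GPU_CORES, 1.0)
    print(f"{name}: {c} chunks, {waves(c)} wave(s), {busy:.0%} of cores busy in the first wave")
```

In this model the CFM case goes from one wave to three when stacked, while the T3 case stays within a single wave and only fills more of it.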
Cumulative on the §3.21 baseline (pre-§3.21):
- Q4_0 T3: 872 ms → 433 ms (−50 % since §3.20)
- Q4_0 RTF: 0.46 → 0.29
- F16 T3: 1099 ms → 493 ms (−55 % since §3.20)
WAV byte-exact gate: md5 79002f09bc48dda95ec0c2cfc2b895bd matches
across §3.22 base and post-§3.23 at five separate invocations
(--temp 0 --top-k 1, deterministic).
| File | Change |
|---|---|
| src/chatterbox_t3_internal.h | llama_layer gains wqkv; chatterbox_model gains ctx_stack + buffer_stack. |
| src/t3_mtl.cpp | Post-load: allocate the Phase-15 stacked buffer + register with t3_stack_registry for atexit; per-layer copy of wq+wk+wv rows into wqkv via host scratch. build_llama_block: when l.wqkv is set, single mat-mul + view-split into Q/K/V; otherwise legacy three-mul path. New t3_stack_unregister() for free_t3() to call on error returns. |
| src/t3_mtl.h | Export t3_stack_unregister(). |
| src/chatterbox_cli.cpp | free_t3() calls t3_stack_unregister() then frees buffer_stack / ctx_stack. |
The §3.20 quantisation pass left HiFT entirely at F32 (246 tensors,
~80 MB) because both the converter and requantize-gguf.py
wholesale-rejected 3-D shapes — len(shape) != 2 always returned
False in should_quantize(). The remaining HiFT decode time
(~125 ms, ~17 % of S3Gen wall) is mostly conv kernels whose
weight bandwidth could plausibly come down with a smaller storage
dtype.
The plan's first prediction was that
should_quantize() could allow 3-D when K * IC % 32 == 0
(numpy shape[-1] * shape[-2] divisible by the Q4_0 block). Tested
empirically; the patch is structurally correct, but the
HiFT-specific gain is zero:
- Q4_0's on-disk block layout assumes blocks span 32 consecutive
  ne[0] values within a fixed (ne[1], ne[2]) row. For ggml conv
  kernel shape (K, IC, OC) that means K must be 32-aligned.
- HiFT conv kernels have K ∈ {3, 7, 11, 16}. None of these are
  32-aligned, so Q4_0 along K is structurally impossible.
- Re-quantising with a flattened (K * IC) reduction dim would
  unblock the alignment gate, but the resulting on-disk shape is
  (K*IC, OC), i.e. 2-D, which then breaks ggml_im2col(kernel, ...)
  on the C++ side (it derives the kernel size from kernel->ne[0]).
  That's a structural change to conv1d_f32 and gated on a future
  commit.
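The two alignment conditions discussed above can be written down as a tiny check. This is a sketch only: the function names are hypothetical, it is not the body of should_quantize(), and the input-channel count used in the usage note is an assumed example value.

```python
Q4_0_BLOCK = 32  # elements per Q4_0 block

HIFT_KERNEL_SIZES = (3, 7, 11, 16)  # kernel sizes listed in the text


def k_axis_quantizable(kernel_size, block=Q4_0_BLOCK):
    """True if Q4_0 blocks can run along ne[0] = K without crossing a row boundary."""
    return kernel_size % block == 0


def flattened_quantizable(kernel_size, in_channels, block=Q4_0_BLOCK):
    """True if a flattened (K * IC) reduction dim would be block-aligned."""
    return (kernel_size * in_channels) % block == 0


for k in HIFT_KERNEL_SIZES:
    print(f"K={k}: along-K aligned={k_axis_quantizable(k)}")
```

None of the listed kernel sizes pass the along-K check, whereas a flattened layout such as K = 3 with an assumed IC = 64 would pass the second one, which is why the flattened route is the only structural option.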
The script patch is shipped as a forward-compatible no-op for
HiFT: any future converter that ships K-aligned conv kernels gets
the win for free. Tested by re-quantising
chatterbox-s3gen-mtl-f16.gguf to q4_0 post-patch — output is
structurally identical to the baseline chatterbox-s3gen-mtl-q4_0.gguf
GGUF for HiFT (still 246 F32, no Q4_0).
F16 has block_size = 1 in GGML_QUANT_SIZES, so the alignment
gate is a no-op for any shape. Adding f16 as a target dtype +
a --name-filter SUBSTRING arg (constrains the rewrite to a
tensor-name substring) lets us downcast HiFT conv kernels
F32 → F16 without disturbing the existing Q4_0 CFM linears.
Two-pass recipe:
```sh
python scripts/requantize-gguf.py \
    models/chatterbox-s3gen-mtl-f16.gguf \
    /tmp/intermediate.gguf f16 --name-filter hift/
python scripts/requantize-gguf.py \
    /tmp/intermediate.gguf \
    models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0
```

Of the 246 HiFT tensors:

- 159 are 1-D biases / scalars, kept F32 by the n_elements >= 1024 and len(shape) == {2,3} shape gates.
- 64 are 2-D / 3-D conv weights, converted to F16.
- 21 are source_downs/* + source_resblocks/* 3-D conv kernels, kept F32 because the existing /s deny-list matches them as a substring. Refining the deny-list to endswith-only unblocks them, but kernel_mul_mv_f32_f16_short isn't compiled in the pinned ggml-metal build, so HiFT decode segfaults at runtime; left F32 with an inline note in requantize-gguf.py for the next round.
- 2 are small 2-D weights, kept F32 by n_elements < 1024.
Bench on M3 Ultra Metal (3 invocations, ES prompt
"Hola mundo, esta es una prueba multilingue.", --seed 42 --temp 0 --top-k 1, jfk.wav voice):
| Metric | baseline q4_0 GGUF | q4_0 + HiFT F16 GGUF | Δ |
|---|---|---|---|
| GGUF size | 788.4 MB | 754.6 MB | −4.3 % |
| [hift_decode] ms | 124.9 | 121.3 | −2.9 % |
| [s3gen_total] ms | 727 | 726 | within noise |
| [cfm_total] ms | 549 | 550 | within noise |
| T3 ms | 434 | 434 | unchanged |
Audio quality:
- WAV md5 differs (expected: F16 conversion is lossy): baseline 79002f09bc48dda95ec0c2cfc2b895bd, new ec58d3e65ab8e9c6f4edefb15b169ea5.
- PCM cosine = 0.999851 across all 3 invocations (deterministic on --seed 42).
- max abs i16 diff = 616 / 32768 ≈ 1.9 %, mean abs diff = 3.65.
- Subjectively indistinguishable from baseline. Cleanly above the §3.20 PCM-cos ≥ 0.99 quality gate.
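For reference, the three audio metrics quoted above (PCM cosine, max abs diff, mean abs diff) can be computed as follows. This is a minimal pure-Python sketch over two equal-length int16 sample sequences; the function name is hypothetical and this is not the comparison script used for the measurements.

```python
import math


def compare_pcm(base, other):
    """Return (cosine, max_abs_diff, mean_abs_diff) for two equal-length int16 sequences."""
    if len(base) != len(other) or not base:
        raise ValueError("inputs must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(base, other))
    norm = math.sqrt(sum(a * a for a in base)) * math.sqrt(sum(b * b for b in other))
    diffs = [abs(a - b) for a, b in zip(base, other)]
    # Treat an all-silent input as a perfect match to avoid dividing by zero.
    cosine = dot / norm if norm else 1.0
    return cosine, max(diffs), sum(diffs) / len(diffs)


cos, max_abs, mean_abs = compare_pcm([100, -100, 250], [101, -98, 250])
print(f"cos={cos:.6f} max_abs_diff={max_abs} mean_abs_diff={mean_abs:.2f}")
```

A gate such as cos ≥ 0.99 is then a single comparison on the first returned value.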
The plan estimated a 25–45 ms HiFT win on the assumption that HiFT's bandwidth bottleneck would scale with weight storage. Two reasons the realised win is smaller:
- Half of HiFT's weight footprint is in the 21 source_* tensors that the deny-list guards (described above) — those stayed F32.
- Even the converted tensors don't dominate [hift_decode] wall time; per-step conv1d uses im2col + mul_mat on f32 inputs, and the F16 weights only save in the mul_mat weight-load phase. Activation traffic + im2col work stay F32.
Follow-ups:
- Patch the missing kernel_mul_mv_f32_f16_short variant (or reshape source_downs/* to a non-mat_mv shape) to unblock the remaining 21 conv kernels. Predicted additional ~2–4 ms HiFT speedup + ~16 MB GGUF size drop.
- Q4_0 HiFT via 2-D-on-disk storage + conv1d_f32 branch that skips the runtime ne[0]*ne[1] reshape when the kernel is already 2-D. Bigger surgery (touches both converter + C++); documented as the structural follow-up to §3.24.
- F32 mul_mm + add(bias) shader fusion in patches/ggml-metal-chatterbox-ops.patch. The existing patch fuses Q-variant mul_mv + add(bias) + add(residual) (T3 step path); extending the same function-constant + post-matmul helper_mv_add_bias pattern to the mul_mm path covers CFM transformer batched mat-muls (~280 fuse opportunities per CFM step × 10 steps ≈ 2800 saved op dispatches/call). Estimated +10–25 ms on chatterbox S3Gen. ~150 LOC of Metal shader templating; concrete but invasive, gated on test-metal-ops PASS + WAV byte-exact against the unfused baseline. Deferred from §3.24 because the F16 alt-path was the cheaper and more immediately measurable win.
| File | Change |
|---|---|
| scripts/requantize-gguf.py | should_quantize() now allows 3-D when shape[-1] (= ne[0] = K) is block-aligned (forward-compatible no-op for HiFT today); f16 added as a target dtype; new --name-filter SUBSTRING arg; pass-through path branches on GGML_QUANT_SIZES[type][0] == 1 to handle already-quantised sources without reshape errors. |
| models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf | New GGUF artifact (gitignored, 754 MB). Recipe documented in the script's docstring + this section. |
Tried flipping src/chatterbox_tts.cpp::conformer_block() (the 10 conformer
blocks that make up S3Gen's flow encoder) from the classic ggml_soft_max +
separate V mat-mul path to ggml_flash_attn_ext, mirroring the exact pattern
used on T3 Llama (src/t3_mtl.cpp:221 / 425) and on CFM basic_tfm
(src/chatterbox_tts.cpp:712 / 800), plus the rel_pos_mha_graph fix just
landed on parakeet.cpp (§15.8 there).
Implementation (reverted, kept here as documentation):
```cpp
const float scale = 1.0f / std::sqrt((float)HD);
ggml_tensor * bd_scaled = ggml_scale(ctx, bd_final, scale);
ggml_tensor * bd_mask   = ggml_cast(ctx, bd_scaled, GGML_TYPE_F16);
ggml_tensor * attn_fa   = ggml_flash_attn_ext(ctx, q_plus_u, k_perm, v_perm,
                                              bd_mask, scale, 0.0f, 0.0f);
ggml_tensor * flat      = ggml_reshape_2d(ctx, attn_fa, HD * H, T);
```

Math is byte-correct: non-flash path is softmax(scale * (q*k^T + bd_final)) * v = softmax(scale * q*k^T + scale * bd_final) * v, and flash_attn_ext computes
softmax(scale * q*k^T + mask) * v, so mask = scale * bd_final is the
equivalent. Flow encoder runs single-window (no chunk mask) so no att_mask
to fold in.
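The equivalence can be checked numerically on a single attention row. The sketch below uses made-up scores and bias values and an illustrative head dimension of 64; it only demonstrates the algebra (distributing scale over the sum) in double precision and says nothing about the f16 mask cast discussed further down.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classic_scores(qk, bias, scale):
    """softmax(scale * (q.k^T + bias)) for one attention row."""
    return softmax([scale * (s + b) for s, b in zip(qk, bias)])


def flash_style_scores(qk, mask, scale):
    """softmax(scale * q.k^T + mask), the form flash_attn_ext evaluates."""
    return softmax([scale * s + m for s, m in zip(qk, mask)])


scale = 1.0 / math.sqrt(64)      # HD = 64 is an illustrative head dim
qk = [3.2, -1.5, 0.7, 8.1]       # made-up q.k^T row
bd = [5.0, -7.5, 9.25, -6.0]     # made-up relative-position bias row

a = classic_scores(qk, bd, scale)
b = flash_style_scores(qk, [scale * x for x in bd], scale)
print(max(abs(x - y) for x, y in zip(a, b)))
```

The two rows agree to within floating-point rounding, which is the sense in which mask = scale * bd_final is the equivalent input.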
| Stage (M3 Ultra, Metal, Q4_0, ES prompt, seed 42, 3 invocations averaged) | baseline | FA | Δ |
|---|---|---|---|
| [encoder] ms | ~43 | 29.6 | −13 / −31 % (flow encoder only) |
| S3Gen ms | 721 | 708 | −13 / −1.8 % |
| T3 ms | 433 | 430 | noise |
| CFM total ms | 546 | 538 | noise (−8) |
| HiFT decode ms | 126 | 125 | noise |
| WAV md5 | 79002f09… | a4169d68… | differs |
The flow encoder is 10 conformer blocks (6 at T=~87 + 4 at 2T), each running
two sub-block matmuls + softmax + permute+mul_mat with V. Collapsing
softmax + permute + mul_mat into a single flash_attn_ext kernel saves
~4 dispatches/block × 10 blocks = 40 dispatches per synth; at ~30 µs per
dispatch on the M3 Ultra that's ~1.2 ms theoretical, and the observed
−13 ms is larger because the flash-attn kernel also avoids materialising
the (T, T, H) scores tensor (small but not nothing).
The ggml_flash_attn_ext contract requires an f16 mask
(ggml.c:5320 GGML_ASSERT(mask->type == GGML_TYPE_F16)). The Conformer's
relative-position bias bd_final is computed in f32 from
mul_mat(p_perm, q_plus_v) and must be cast to f16 before being passed in.
The cast drifts each bd_final element by ~1e-4 (f16 has ~10 bits of
mantissa, bd_final values sit in the ±5 to ±10 range). That drift is
well below what parakeet's downstream argmax classifier can see, but
chatterbox's downstream is very different:
- Flow encoder output → 10-step CFM estimator (a diffusion U-Net). Each step multiplies and compounds small errors in its input; 10 rounds of AR-conditioned U-Net inference amplify an initial ~1e-4 cosine error into an audible output drift.
- CFM output → HiFT vocoder, which produces a waveform. Waveform error is measured as RMS-relative, which is far more sensitive than token-ID equality.
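To get a feel for the size of the f32 → f16 rounding on values in the ±5 to ±10 range, the round trip can be simulated with Python's half-precision struct format. This is an illustrative sketch on a synthetic grid, not a measurement of the actual bd_final tensor; the bounds it prints are worst-case rounding errors for the grid, which need not match the drift observed in the real graph.

```python
import struct


def to_f16_and_back(x):
    """Round a Python float to IEEE half precision and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_f16_errors(values):
    """Return (max_abs_err, max_rel_err) of the f16 round trip over `values`."""
    abs_errs = [abs(to_f16_and_back(v) - v) for v in values]
    rel_errs = [e / abs(v) for e, v in zip(abs_errs, values) if v != 0.0]
    return max(abs_errs), max(rel_errs)


# Sample the +/-5 .. +/-10 range quoted above on a fine synthetic grid.
samples = [sign * (5.0 + i * 0.001) for i in range(5001) for sign in (1.0, -1.0)]
max_abs, max_rel = max_f16_errors(samples)
print(f"max abs err = {max_abs:.2e}, max rel err = {max_rel:.2e}")
```

With 10 explicit mantissa bits the relative rounding error is bounded by 2^-11, so every mask element carries a small perturbation before it ever reaches the softmax.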
Gate: WAV cosine against the reference baseline (same prompt, seed, CFG); the comparable threshold from §3.24 was cos > 0.9998. The FA variant measured:
```
lengths base=83520 fa=83520
samples n=83520 cos=0.998647
rms_diff=69.334 rms_base=1332.522
max_abs_diff=1702.0
gate: FAIL (threshold > 0.9998; got 0.998647)
```
Parakeet could absorb this drift (PR #1 §15.8 shipped it at exact token-ID
parity across 95 tokens). Chatterbox cannot. Reverted — baseline md5
restored to 79002f09bc48dda95ec0c2cfc2b895bd at
/tmp/cb_revert.wav == /tmp/cb_base_1.wav.
Alternatives considered:
- Pass bd_scaled in f32 via ggml_flash_attn_ext. Blocked by the hard assertion that mask must be f16.
- Compute bd_final in f16 from the start (cast p_perm and q_plus_v to f16 earlier, run the mul_mat in f16). Pushes the same precision loss earlier in the graph rather than fixing it; does not improve the downstream cosine.
- Skip the mask entirely (pass nullptr to flash_attn_ext). Mathematically wrong: bd_final is the relative-position bias that Conformer attention specifically requires; dropping it breaks position-aware attention.
Conformer flow-encoder stays on the ggml_soft_max path. Next candidate
encoder-side optimisations are:
- Strip redundant ggml_cont after Conformer Q/K/V permutes (lines 440–443 of src/chatterbox_tts.cpp). Metal's mul_mat can walk strides natively; some of those cont copies may be removable without changing math. Tracked as QW-D in today's planning notes.
- F32 mul_mm + add(bias) shader fusion in patches/ggml-metal-chatterbox-ops.patch (estimated +10–25 ms on S3Gen, from the CFM transformer batched mat-muls). Already queued in §3.24 follow-ups.
| File | Change |
|---|---|
| src/chatterbox_tts.cpp | 10-line commentary block added to conformer_block() explaining why the flash-attn path is intentionally not taken, pinning the negative-finding cosine number and the speed upside that was measured, and pointing at the parakeet §15.8 counterexample. No code change to the graph itself. |
Closes the open item from §3.24 / §3.25: "Patch the missing
kernel_mul_mv_f32_f16_short variant to unblock the remaining 21
HiFT source_* conv kernels."
§3.24 converted the 64 HiFT conv-kernel F32 weights that the
/s deny-list didn't incidentally catch to F16 (cos > 0.9998 vs
the all-F32 baseline, [hift_decode] ~3 % faster, ~33 MB GGUF
shrink). The broad /s deny also caught every HiFT source_*
weight (source_downs/0..2, source_resblocks/0..2/{convs1,convs2}/*,
m_source/l_linear/*: 21 weight tensors, ~7.7 MB at F32). Those had to
stay F32 regardless: once they are flipped to F16, HiFT's conv1d_f32
path runs ggml_mul_mat(im2col_f32, kernel_f16) in the mat-vec shape
with T0=f32, T1=f16. The pinned ggml-metal (commit 58c38058) did not
ship that template instantiation, and Metal pipeline lookup fails:
ggml_metal_library_compile_pipeline: Error Domain=MTLLibraryErrorDomain
Code=5 "Function kernel_mul_mv_f32_f16_short was not found in the library"
(Reproduced by feeding chatterbox a GGUF where the 21 source_* tensors are F16; crashes immediately at first HiFT decode with SIGSEGV / exit 139.)
One line each per kernel family:
```cpp
// kernel_mul_mv_t_t family (full-shape mat-vec)
template [[host_name("kernel_mul_mv_f32_f16")]] kernel mul_mv_t_t kernel_mul_mv_t_t <float, half>;

// kernel_mul_mv_t_t_4 family (vec4 dispatch path)
template [[host_name("kernel_mul_mv_f32_f16_4")]] kernel mul_mv_t_t_4 kernel_mul_mv_t_t_4 <float, float4, half, half4>;

// kernel_mul_mv_t_t_short family (short-axis dispatch path; this is the
// variant HiFT's small-OC source_downs/2/weight (OC=64) actually hits)
template [[host_name("kernel_mul_mv_f32_f16_short")]] kernel mul_mv_t_t_short_t kernel_mul_mv_t_t_short <float, half>;
```

The mul_mv_t_t_short_impl body (lines ~4320–4355 of ggml-metal.metal)
is templated on <T0, T1> and already handles arbitrary casts via
(float) x[i] * (float) y[i] — all that was missing was the
<float, half> instantiation for the symbol lookup. Same for
_4 (needs <float, float4, half, half4>, with float-cast in the
inner reduction loop) and the base non-short variant (symmetric).
All three land as additions in patches/ggml-metal-chatterbox-ops.patch
(700 → 733 lines). test-metal-ops still PASSes on every op it
already covered (diag_mask_inf / pad_ext / conv_transpose_1d at
three upsample stages + tiny edge case).
Three changes so the recipe works end-to-end on the current gguf-0.18 writer:
1. Narrowed the deny glob /s to /scale. The old /s match was a rough proxy for "norm scale params like ln_1/ga, gate, etc." but incidentally swept in every hift/source_*/weight and bias tensor (188 matches in the F16 source GGUF, 62 of which were source_*). With the Metal kernel variant now shipped, source_* conv weights are safe to F16; the 21 that matter (the 3-D conv kernels) quantise successfully via --name-filter hift/source_. The remaining norm-scale tensors the deny was originally targeting (/scale, /ln_, /norm/, /gamma) are still covered by their own stricter patterns.
2. Fixed the Q-type passthrough byte-shape bug. gguf-0.18's add_tensor_info treats raw_shape as byte layout (innermost dim in bytes per row, not elements per row) when tensor.dtype == np.uint8. The previous code passed the element shape verbatim, which crashed with ValueError: Quantized tensor bytes per row (512) is not a multiple of Q4_0 type size (18) on any input GGUF that already carried Q-type tensors, i.e. every two-pass pipeline like f16 → q4_0 or q4_0 → f16 --name-filter. Fix: convert inner-dim elements to bytes (byte_inner = elements_inner // block_size * type_size) before handing to the writer. Types with block_size == 1 (F16/F32/BF16) keep the existing element-shape path.
3. Docstring updated with the two-pass recipe showing the post-§3.26 configuration:

   ```sh
   # Full recipe (Q4_0 everywhere except HiFT kept at F16, now
   # including the 21 source_* conv kernels unblocked in §3.26):
   python scripts/requantize-gguf.py \
       models/chatterbox-s3gen-mtl-f16.gguf \
       /tmp/intermediate.gguf f16 --name-filter hift/
   python scripts/requantize-gguf.py \
       /tmp/intermediate.gguf \
       models/chatterbox-s3gen-mtl-q4_0_hift_f16.gguf q4_0
   ```
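The byte-shape conversion in the second change can be sketched as follows. The helper name and the small size table are illustrative assumptions (the real script reads sizes from GGML_QUANT_SIZES); the arithmetic is the byte_inner formula quoted above.

```python
# (block_size, type_size) pairs; a small illustrative subset, not the gguf table itself.
QUANT_SIZES = {"Q4_0": (32, 18), "F16": (1, 2), "F32": (1, 4)}


def writer_inner_dim(elements_inner, dtype):
    """Inner dimension to hand to the writer: bytes for block types, elements otherwise."""
    block_size, type_size = QUANT_SIZES[dtype]
    if block_size == 1:
        return elements_inner  # F16/F32 keep the element-shape path
    if elements_inner % block_size != 0:
        raise ValueError(f"{elements_inner} is not a multiple of block size {block_size}")
    return elements_inner // block_size * type_size


# A 512-element Q4_0 row is 16 blocks of 18 bytes.
print(writer_inner_dim(512, "Q4_0"))
# Passing 512 verbatim fails the writer's check because 512 is not a multiple of 18.
print(512 % 18)
```

This makes the failure mode visible: the element count 512 is not divisible by the 18-byte block size, whereas the converted byte count is.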
| Metric | §3.24 baseline | §3.26 (source_* F16) | Δ |
|---|---|---|---|
| [encoder] ms | 31.3 | 30.5 | −0.8 (noise) |
| [cfm_total] ms | 541.9 | 550.4 | noise |
| [hift_decode] ms | 121.3 | 121.1 | neutral |
| S3GEN_INFER_MS | 709 | 724 | +15 (noise) |
| T3_INFER_MS | 440 | 440 | 0 |
| GGUF size | 754.4 MB | 746.7 MB | −7.7 MB |
Speed is neutral on M3 Ultra (unified-memory bandwidth isn't the
bottleneck for the 21 source_* weights, which are small — the
largest is source_resblocks/0/convs1/*/weight at ~3.4 MB F32 /
~1.7 MB F16). The predicted +2–4 ms HiFT gain from §3.24 falls
inside bench noise; on bandwidth-limited targets (M4 Air /
iPhone neural engine), expect the full +3–5 % HiFT speedup seen
in §3.24's existing 64 tensors. The real win is the
7.7 MB GGUF shrink (~1.0 %) on a multilingual distribution
GGUF, plus closing the last known blocker from §3.24.
- test-metal-ops: all four pre-existing ops (diag_mask_inf, pad_ext, conv_transpose_1d @ 3 upsample stages + tiny edge) PASS; no new tests added because kernel_mul_mv_f32_f16{,_4,_short} is covered by the end-to-end audio parity below (same inner math as the existing <half, float> / <half, half> / <float, float> variants, differing only in type tags).
- WAV parity vs §3.24 baseline on ES-prompt / jfk-voice / seed 42 (per-invocation deterministic; md5 identical across 3x3 runs):

  ```
  MD5 §3.24 baseline:     ec58d3e65ab8e9c6f4edefb15b169ea5
  MD5 §3.26 v2 (3 runs):  d8a1b22375dbcb2259c686426a7d76c5
                          d8a1b22375dbcb2259c686426a7d76c5
                          d8a1b22375dbcb2259c686426a7d76c5

  audio comparison:
    lengths        83520/83520
    cos            1.000000   PASS (threshold > 0.9998)
    rms_diff       0.464
    rms_base       1332.66
    max_abs_diff   4 (out of ±32767)
    → 0.035 % relative RMS drift, 0.012 % max sample drift
  ```

  Audibly identical (within the LSB of s16 output). Deterministic across invocations.
| File | Change |
|---|---|
| patches/ggml-metal-chatterbox-ops.patch | +33 lines for the three mul_mv_f32_f16{,_4,_short} template instantiations + comments referencing this section. Regenerated from the pinned commit 58c38058. |
| scripts/requantize-gguf.py | /s deny narrowed to /scale; Q-type passthrough byte-shape fix; docstring recipe updated. |
| ggml/src/ggml-metal/ggml-metal.metal | Local edit under the ggml/ worktree; not tracked in this repo. Recipe remains: run scripts/setup-ggml.sh to re-apply the patch after a ggml bump. |
Status of the §3.24 follow-ups:
- kernel_mul_mv_f32_f16_short patch: ✓ shipped this section.
- Q4_0 HiFT via 2-D-on-disk storage + conv1d_f32 branch: still deferred, larger surgery (touches both converter + C++).
- F32 mul_mm + add(bias) shader fusion: still deferred, ~150 LOC Metal kernel work + test-metal-ops gate; bigger potential (+10–25 ms S3Gen) but not "quick".
Closes the §3.22 / §3.24 / §3.26 follow-up "F32 mul_mm + add(bias) shader
fusion in patches/ggml-metal-chatterbox-ops.patch". The existing
fusion in the pinned ggml-metal pipeline covered only Q-variant
mul_mv (matrix-vector) kernels via helper_mv_add_bias
(Q4_0/Q4_1/Q5_0/Q5_1/Q8_0 with bias+residual function-constant
guards). The mul_mm (matrix-matrix) kernel — the one the CFM
transformer actually hits at T·B ≥ 2 — had no equivalent. This
section wires one in.
1. kernel_mul_mm in ggml-metal.metal gains two new function constants (FC_mul_mm_has_bias_ = FC_MUL_MM + 2, FC_mul_mm_has_residual_ = FC_MUL_MM + 3) and two new buffer slots (bias at buffer(4), residual at buffer(5)). When either FC is true, the kernel routes through the shmem-backed scalar-copy path and folds bias / residual into the copy loop (same post-matmul math as helper_mv_add_bias: v += bias[r0+i] and v += residual[(r1+j)*ne0 + im*ne1*ne0 + r0 + i]). The compiler drops the branch that's not selected by the FC, so there is zero overhead when neither is set.
2. get_pipeline_mul_mm in ggml-metal-device.cpp now takes has_bias, has_residual flags, bakes them into the pipeline name (kernel_mul_mm_<T0>_<T1>_bci=X_bco=Y_bias=Z_res=W), and sets the function-constant values during compile. Shmem size bumped from 4 KB+2 KB to 8 KB when either flag is set so the always-shmem path has room for the temp buffer.
3. Dispatcher ggml_metal_op_mul_mat in ggml-metal-ops.cpp mirrors the Q-variant mul_mv fusion lookup: try {MUL_MAT, ADD, ADD} first, fall back to {MUL_MAT, ADD}. Both orderings of the residual add are handled (ggml_add is commutative; chatterbox's basic_tfm emits ggml_add(x, attn_out) with residual x as src[0] and the mul_mat+bias result as src[1]). Writes fused dst to node(idx + n_fuse - 1) so the value lands where the skipped ADD(s) would have written, and returns n_fuse so the outer loop skips them.
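The longest-pattern-first lookup in the dispatcher can be modelled with a toy walk over a flat list of op names. This is a deliberately simplified sketch: it matches on op kinds only and ignores the data-dependency and operand-order checks a real can_fuse lookup has to perform, and all names here are hypothetical.

```python
FUSE_PATTERNS = (
    ("MUL_MAT", "ADD", "ADD"),  # bias + residual, tried first
    ("MUL_MAT", "ADD"),         # bias only, fallback
)


def n_fuse_at(ops, idx):
    """Return how many nodes starting at idx fold into one mul_mm dispatch (1 = no fusion)."""
    for pattern in FUSE_PATTERNS:
        if tuple(ops[idx:idx + len(pattern)]) == pattern:
            return len(pattern)
    return 1


def count_dispatches(ops):
    """Walk the op list, skipping nodes absorbed by a fused mul_mm."""
    idx = 0
    dispatches = 0
    while idx < len(ops):
        step = n_fuse_at(ops, idx) if ops[idx] == "MUL_MAT" else 1
        idx += step
        dispatches += 1
    return dispatches


graph = ["NORM", "MUL_MAT", "ADD", "GELU_ERF", "MUL_MAT", "ADD", "ADD"]
print(count_dispatches(graph), "dispatches for", len(graph), "nodes")
```

Returning the fused length from the lookup is what lets the outer loop jump past the absorbed ADD nodes, mirroring the n_fuse return value described above.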
Verified via ggml_metal_library_compile_pipeline trace on first
invocation (M3 Ultra, Q4_0 + HiFT F16 + sample-16k voice):
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0 ← CFM transformer linears, in-bounds blocks
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0 ← CFM transformer linears, edge blocks
kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0 ← CFM time_mlp / final_proj
kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0 ← unfused matmuls (e.g. Q/K/V no-bias)
kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0
The bias=1 variants account for ~280 fuse opportunities per CFM
step × 10 steps × 2 CFG batches ≈ 1820 dispatches per synthesis
that the old code paid a separate ggml_add kernel for. No
res=1 variants fire in the current chatterbox graph: the
ADD(residual) in basic_tfm is at a different point in the
graph (separated by layer_norm → mul_mat → add(bias) →
gelu_erf → mul_mat → add(bias) → add(x, ff)), so the
residual add can't be folded into the preceding mul_mm without
hoisting those intermediate ops. Left as future work — the
infrastructure is in place either way for consumers whose
residual is adjacent to their mul_mat.
5-invocation averages (WAV deterministic, md5 identical across all 5 runs):
| Metric | §3.26 baseline | §3.27 fused | Δ |
|---|---|---|---|
| [encoder] ms | 31.3 | 30.5 | noise |
| [cfm_total] ms | 541.9 | 542.2 (± 5 per-run) | neutral |
| [hift_decode] ms | 121.3 | 121.2 | neutral |
| S3GEN_INFER_MS | 709 | 713.2 | +4 (noise) |
| T3_INFER_MS | 440 | 433.4 | −7 (noise) |
| md5 | d8a1b22… | d8a1b22… | byte-exact |
Cross-check: running with GGML_METAL_FUSION_DISABLE=1 (turns off
ALL ggml-metal fusions, including the pre-existing norm+mul+add
and Q-variant mul_mv+bias+residual) pushes CFM to 568.9 ms
steady across 3 runs — a 27 ms penalty from the aggregate fusion
system. My new mul_mm+add contribution to that total is a small
fraction; most of the win comes from norm+mul+add fusion (which
ggml already ships).
Two reasons. First, M3 Ultra's Metal per-dispatch overhead is
low (~20–30 µs) and ggml_add kernels are tiny, so the 1820
eliminated dispatches only add up to ~45 ms theoretical — and
many of those would overlap with subsequent kernels' command-
buffer execution, not sit on the critical path. Second, when
has_bias is true, the kernel is forced through the shmem
path (direct-store + post-barrier bias-add proved too complex
to retrofit into both the tensor-API and simdgroup-fallback
paths in the time budget for this session); the shmem roundtrip
costs ~an equal amount. Net: neutral on M3 Ultra.
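The "~45 ms theoretical" figure follows directly from the dispatch count and the per-dispatch overhead range. The sketch below reproduces that back-of-envelope estimate; it is an upper bound that assumes every removed dispatch sat on the critical path, which the text above notes is not the case.

```python
SAVED_DISPATCHES = 1820        # figure quoted above
OVERHEAD_US_RANGE = (20, 30)   # per-dispatch overhead range quoted above, in microseconds


def saved_ms(n_dispatches, overhead_us):
    """Upper-bound time saved if every removed dispatch were on the critical path."""
    return n_dispatches * overhead_us / 1000.0


low, high = (saved_ms(SAVED_DISPATCHES, us) for us in OVERHEAD_US_RANGE)
mid = saved_ms(SAVED_DISPATCHES, sum(OVERHEAD_US_RANGE) / 2)
print(f"saved: {low:.1f} to {high:.1f} ms (midpoint {mid:.1f} ms)")
```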
- Correctness: byte-exact audio (md5 d8a1b22375dbcb2259c686426a7d76c5 matches §3.26 across 5 runs). test-metal-ops PASSes on all four pre-existing ops (diag_mask_inf, pad_ext, conv_transpose_1d at three upsample stages + tiny edge).
- Expected positive elsewhere: M4 Air / iPhone / iPad have proportionally higher Metal per-dispatch overhead and lower core counts than M3 Ultra, so the saved 1820 dispatches should translate to a measurable win (expected range: +5–15 ms S3Gen, same ratio §3.24's HiFT F16 result predicted). Can't verify on M3 Ultra alone.
- Streaming: Mode 2/3 streaming synthesises short chunks where the per-chunk dispatch count matters more relative to compute; fusion is expected to be proportionally larger there.
- Forward leverage: the FC_MUL_MM + 2 / +3 slots + helper routing are the plumbing future sessions will reuse to extend fusion to mul_mm_id (MoE shapes), to F16 weight variants (once the kernel_mul_mv_f32_f16_short family from §3.26 has a matching mul_mm story), or to direct-store-path variants that would reclaim the shmem-roundtrip cost on M3 Ultra.
| File | Change |
|---|---|
| ggml/src/ggml-metal/ggml-metal.metal | Two new FC constants (FC_MUL_MM + 2 / +3), two new buffer args (slots 4 and 5) on kernel_mul_mm, forced-shmem path when either FC is true, bias/residual fold-in inside the scalar-copy loop. Local edit under the ggml/ worktree; not tracked in this repo. |
| ggml/src/ggml-metal/ggml-metal-device.{cpp,h} | get_pipeline_mul_mm(op, has_bias, has_residual): new signature; bakes flags into pipeline name + FC values; shmem sizing adjusted to 8 KB when fused. |
| ggml/src/ggml-metal/ggml-metal-ops.cpp | ggml_metal_op_mul_mat mul_mm path gains the same can_fuse({MUL_MAT,ADD,ADD}) / can_fuse({MUL_MAT,ADD}) lookup the mul_mv path already had; both orderings of the residual add handled; n_fuse returned to skip the folded ADDs. |
| patches/ggml-metal-chatterbox-ops.patch | +262 lines. Regenerated from pinned 58c38058. 733 → 995 lines. |
- Reclaim the shmem-roundtrip cost on M3 Ultra: add bias fold-in to the direct-store paths (both the tensor-API cT.store path and the simdgroup-fallback simdgroup_store loop). Would need a post-barrier per-simdgroup read-modify-write pass on device memory. 2–3 h of additional Metal kernel work; predicted to flip §3.27 from neutral to +5–10 ms on M3 Ultra.
- Extend to mul_mm_id (mixture-of-experts mat-muls): same FC pattern applies. Zero-change for chatterbox (doesn't use MoE), but useful for future consumers of this patch.
- Bench on M4 / iOS: validate the "neutral on M3U, positive elsewhere" prediction. Until measured the estimate is just that.
Builds directly on §3.27 infrastructure. Closes the mul_mat → add(bias) → gelu_erf triple in CFM basic_tfm's FF gate projection
(src/chatterbox_tts.cpp:738):
```cpp
ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff0_w, nx2), w.ff0_b);  // (mul_mat + bias), fused by §3.27
ff = ggml_gelu_erf(ctx, ff);                                    // §3.28 absorbs this into the same kernel
ff = ggml_add(ctx, ggml_mul_mat(ctx, w.ff2_w, ff), w.ff2_b);    // ff2 remains a separate mul_mm + bias fusion
```

§3.27 already brought mul_mat + add(bias) into a single dispatch
via the shmem-backed scalar-copy path; §3.28 extends that same
loop to apply gelu_erf as the last stage before writing to dst.
The gelu is inline FP math on each element we're already reading /
writing — no extra memory roundtrip, no extra shmem — so unlike
§3.27's neutral-on-M3-Ultra result, this one is a clear net
positive on M3 Ultra.
1. ggml-metal.metal: new function constant FC_MUL_MM + 4 (FC_mul_mm_has_gelu_erf_), new branch at the end of the scalar-copy loop that applies the same 0.5 * v * (1 + erf_approx(v * SQRT_2_INV)) formula the standalone OP_UNARY_NUM_GELU_ERF kernel uses. Numerically identical to the unfused path (proven via md5 byte-exact across 5 runs).
2. get_pipeline_mul_mm: signature bumped to (op, has_bias, has_residual, has_gelu_erf); pipeline name extended with _gelu=N; FC + shmem sizing adjusted to keep the shmem path (8 KB) when any fold-in is active.
3. Dispatcher ggml_metal_op_mul_mat mul_mm path: new {MUL_MAT, ADD, UNARY} can_fuse lookup wedged between the {MUL_MAT, ADD, ADD} residual lookup and the {MUL_MAT, ADD} bias-only fallback. Verifies ggml_get_unary_op(f2) == GGML_UNARY_OP_GELU_ERF and that f2->src[0] == f1 before fusing. Gates on GELU_ERF specifically because that's the one basic_tfm uses; other unary sub-ops (SILU, GELU, RELU, GELU_QUICK, ...) are left as independent follow-up work; the same pattern would extend trivially.
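The reason the fold-in cannot change the result is that bias and gelu_erf are applied per element in the same order either way. The sketch below contrasts a two-pass and a one-pass epilogue over an already computed mat-mul accumulator. It uses Python's exact math.erf in place of the kernel's erf_approx helper, so it illustrates the structure of the fusion rather than the kernel's bit-level output; all names are illustrative.

```python
import math

SQRT_2_INV = 1.0 / math.sqrt(2.0)


def gelu_erf(v):
    """Erf-based GELU: 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return 0.5 * v * (1.0 + math.erf(v * SQRT_2_INV))


def unfused(acc, bias):
    """Separate passes: add(bias) over the whole row, then gelu_erf over the whole row."""
    biased = [a + b for a, b in zip(acc, bias)]
    return [gelu_erf(v) for v in biased]


def fused(acc, bias):
    """Single pass: bias and gelu_erf applied while each element is copied out."""
    out = []
    for a, b in zip(acc, bias):
        v = a + b
        out.append(gelu_erf(v))
    return out


acc = [0.5, -1.0, 2.0]    # made-up mat-mul outputs
bias = [0.1, 0.2, -0.3]   # made-up bias row
print(fused(acc, bias) == unfused(acc, bias))
```

Because each element sees the identical sequence of floating-point operations, the fused and unfused outputs match exactly in this model, which is the property the md5 gate checks on the real kernel.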
(from GGML_LOG_DEBUG compile trace on first invocation)
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=1 ← CFM ff0 (gelu_erf-activated)
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=1 ← ff0 edge blocks
kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=0 ← CFM ff2 / to_out (bias only, §3.27)
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=0
kernel_mul_mm_f32_f32_bci=0_bco=0_bias=1_res=0_gelu=0 ← time_mlp / final_proj
kernel_mul_mm_f32_f32_bci=0_bco=1_bias=1_res=0_gelu=0
kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=0_res=0_gelu=0 ← unfused (no-bias) passthroughs
kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0_gelu=0
The gelu=1 variants correspond to 56 basic_tfm blocks × 10 CFM
steps × 2 CFG batches = 1120 saved gelu_erf dispatches per
synth (on top of the 1820 bias-add dispatches saved in §3.27).
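When reading a compile trace like the one above, it can help to split the pipeline names into their type pair and flag set mechanically. The parser below is a hypothetical helper written for this document (it is not part of ggml or the patch); it assumes the name format shown in the trace, where every flag token contains an "=".

```python
def parse_pipeline_name(name):
    """Split 'kernel_mul_mm_<T0>_<T1>_k=v_...' into (type_string, flags_dict)."""
    prefix = "kernel_mul_mm_"
    if not name.startswith(prefix):
        raise ValueError(f"not a mul_mm pipeline name: {name}")
    type_parts = []
    flags = {}
    for token in name[len(prefix):].split("_"):
        if "=" in token:
            key, value = token.split("=", 1)
            flags[key] = int(value)
        else:
            type_parts.append(token)  # pieces of the type pair, e.g. q4, 0, f32
    return "_".join(type_parts), flags


TRACE = [
    "kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=1",
    "kernel_mul_mm_q4_0_f32_bci=0_bco=1_bias=1_res=0_gelu=1",
    "kernel_mul_mm_q4_0_f32_bci=0_bco=0_bias=1_res=0_gelu=0",
    "kernel_mul_mm_f32_f32_bci=1_bco=1_bias=0_res=0_gelu=0",
]
gelu_fused = [n for n in TRACE if parse_pipeline_name(n)[1].get("gelu") == 1]
print(len(gelu_fused), "gelu-fused variants in this excerpt")
```

Filtering on the parsed flags gives a quick way to confirm which variants were actually compiled after a patch change.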
| Metric | §3.27 (bias only) | §3.28 (+ gelu) | Δ |
|---|---|---|---|
| [encoder] ms | 30.5 | 30.8 | noise |
| [cfm_total] ms | 542.2 | 533.4 ± 1.0 | −8.8 / −1.6 % |
| [hift_decode] ms | 121.2 | 120.8 | neutral |
| S3GEN_INFER_MS | 713.2 | 706.0 ± 0.8 | −7.2 / −1.0 % |
| T3_INFER_MS | 433.4 | 431.0 | noise |
| md5 | d8a1b22… | d8a1b22… | byte-exact ×5 |
- test-metal-ops: all 4 pre-existing ops (diag_mask_inf, pad_ext, conv_transpose_1d × 3 + tiny) PASS.
- WAV md5 byte-exact vs §3.26 / §3.27 baseline (d8a1b22375dbcb2259c686426a7d76c5) across all 5 invocations of the fused build. The fused kernel uses the same erf_approx<T>(x) helper as the standalone GELU_ERF unary op, so the math is identical down to the LSB.
- Determinism across runs: md5 stable.
§3.27's gain was eaten by the shmem-roundtrip cost: routing
through temp_str + sgitg==0 scalar copy costs roughly what the
1820 eliminated ggml_add dispatches saved. §3.28 adds the gelu
fold-in into the same loop — no additional memory accesses,
no barriers, no extra shmem — just a handful of FLOPs per element.
So the 1120 saved gelu_erf dispatches show up as a clean net
positive: −8.8 ms CFM / −7.2 ms S3Gen.
This also refines the §3.27 story: the infrastructure we built there is what makes §3.28 cheap. Fusing additional per-element tail ops into the existing scalar-copy loop is essentially free, whereas routing through the shmem path is what cost M3 Ultra its estimated §3.27 win.
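The "fold the tail op into a loop that already exists" idea can be illustrated on the CPU. The sketch below is an assumption-laden stand-in for the Metal scalar-copy loop: `std::erf` replaces the kernel's `erf_approx` helper, and `tail_unfused` / `tail_fused` are hypothetical names. It shows that applying bias and exact-erf GELU in one pass performs the same float operations per element as two separate passes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact-erf GELU: 0.5 * v * (1 + erf(v / sqrt(2))).
inline float gelu_erf(float v) {
    const float sqrt_2_inv = 0.70710678118654752440f;
    return 0.5f * v * (1.0f + std::erf(v * sqrt_2_inv));
}

// Unfused: bias add and GELU as two separate passes over the output,
// analogous to two extra kernel dispatches after the matmul.
std::vector<float> tail_unfused(std::vector<float> y,
                                const std::vector<float>& bias) {
    assert(!bias.empty());
    const std::size_t ne0 = bias.size();
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += bias[i % ne0];
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = gelu_erf(y[i]);
    return y;
}

// Fused: both tail ops are applied while the element is already loaded,
// so no extra pass over memory is needed.
std::vector<float> tail_fused(std::vector<float> y,
                              const std::vector<float>& bias) {
    assert(!bias.empty());
    const std::size_t ne0 = bias.size();
    for (std::size_t i = 0; i < y.size(); ++i) {
        const float v = y[i] + bias[i % ne0];
        y[i] = gelu_erf(v);
    }
    return y;
}
```

Here `y` is laid out with `ne0` contiguous output features per column, matching the `abs_c * ne0 + abs_r` addressing used elsewhere in this log.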
| File | Change |
|---|---|
| `ggml/src/ggml-metal/ggml-metal.metal` | New FC `FC_MUL_MM + 4` (`has_gelu_erf`); gelu_erf branch in the scalar-copy loop using `erf_approx<float>`; shared early-out condition updated to include the new flag. Local edit under `ggml/` worktree. |
| `ggml/src/ggml-metal/ggml-metal-device.{cpp,h}` | `get_pipeline_mul_mm(op, has_bias, has_residual, has_gelu_erf)`: new fourth parameter, pipeline name extended with `_gelu=N`, shmem sizing adjusted. |
| `ggml/src/ggml-metal/ggml-metal-ops.cpp` | Dispatcher `mul_mm` path gains `{MUL_MAT, ADD, UNARY}` `can_fuse` lookup with `ggml_get_unary_op == GGML_UNARY_OP_GELU_ERF` check; slotted between the 3-op residual and 2-op bias lookups. |
| patches/ggml-metal-chatterbox-ops.patch | Regenerated from pinned 58c38058. 995 → 1054 lines, +59. Applies cleanly via git apply --check. |
The same fold-in pattern extends trivially to other unary sub-ops
whenever the chatterbox (or downstream consumer) graph uses them
right after a mul_mat + add(bias):
- SILU (`t3_mtl.cpp` already uses `ggml_swiglu_split` which fuses `silu(a) * b`, but a plain SILU follower could be added).
- GELU (non-erf variant): not in chatterbox today.
- RELU, GELU_QUICK: not in chatterbox.

These would each be ~15–20 lines (FC slot + branch + dispatcher case), mirroring the GELU_ERF wiring this section added. None of them fires in the current chatterbox graph so there's no standalone win, but infrastructure is cheap to extend.
Bigger next step: reclaim the §3.27 shmem-roundtrip cost on
M3 Ultra by fusing bias into the direct-store paths (both
tensor-API cT.store and simdgroup-fallback simdgroup_store).
Roughly 2–3 h of Metal kernel work, predicted to turn the §3.27
contribution from neutral into a 3–5 ms CFM saving on top of
today's §3.28 gain.
Goal: reclaim the §3.27 neutral-on-M3-Ultra result by keeping the
fast cT.store / simdgroup_store direct-to-device-memory path
for full-block writes and doing the bias / residual / gelu_erf
fold-in as a post-barrier read-modify-write pass on device
memory, instead of routing through the shmem + scalar-copy path.
The shmem path that §3.27 ships is correct but costs a
threadgroup-memory roundtrip (4 simdgroups stage into a shared
temp_str buffer, sgitg==0 drains it with a scalar loop). On
M3 Ultra that roundtrip is ~equal to the dispatch savings from
eliminating the separate ggml_add kernel — hence the "neutral"
§3.27 result. §3.28 worked because gelu is an extra per-element
tail op inside a loop that already exists; it added ~zero cost.
§3.29 tried to do the same for bias, but on a different path.
if (_mm_use_direct) {
#ifdef GGML_METAL_HAS_TENSOR
cT.store(tC); // cooperative 64x32 store
#else
for (short i = 0; i < 8; i++) {
simdgroup_store(mc[i], ...); // per-simdgroup 32x16 store
}
#endif
if (_mm_has_foldin) {
threadgroup_barrier(mem_flags::mem_device); // flush stores
// distribute 2048 elements of the 64x32 block across 128
// threads of the threadgroup — each thread does 16 RMWs
const int thread_idx = (int) tiitg;
for (int k = thread_idx; k < NR0 * NR1; k += 128) {
const int abs_r = r0 + (k % NR0);
const int abs_c = r1 + (k / NR0);
const uint64_t off = (uint64_t)abs_c * ne0 + abs_r + ...;
device float * D = (device float *) dst + off;
float v = *D;
if (FC_mul_mm_has_bias) v += bias_f32[abs_r];
if (FC_mul_mm_has_residual) v += residual_f32[off];
if (FC_mul_mm_has_gelu_erf) v = 0.5f*v*(1.0f + erf_approx(v * SQRT_2_INV));
*D = v;
}
}
}

`get_pipeline_mul_mm` was sized back down to the non-fold-in shmem
(6 KB) when fold-ins are active, on the theory that only edge
blocks need `temp_str`.
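The thread-to-element mapping of the RMW loop above can be checked on the host without Metal. The sketch below replays the same `k -> (k % NR0, k / NR0)` walk for a 64x32 block and 128 threads; `elements_for_thread` and `visit_counts` are hypothetical helper names, and the block geometry is taken from the kernel comments rather than from the ggml headers.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int NR0       = 64;  // block rows (contiguous ne0 direction)
constexpr int NR1       = 32;  // block columns (ne1 direction)
constexpr int N_THREADS = 128; // 4 simdgroups x 32 lanes

// Block-local (row, col) pairs touched by one thread, using the same
// strided walk as the RMW loop: k = tiitg, tiitg + 128, ...
std::vector<std::pair<int, int>> elements_for_thread(int tiitg) {
    assert(tiitg >= 0 && tiitg < N_THREADS);
    std::vector<std::pair<int, int>> out;
    for (int k = tiitg; k < NR0 * NR1; k += N_THREADS) {
        out.emplace_back(k % NR0, k / NR0);
    }
    return out;
}

// How often each of the NR0 * NR1 elements is visited across all threads.
std::vector<int> visit_counts() {
    std::vector<int> counts(NR0 * NR1, 0);
    for (int t = 0; t < N_THREADS; ++t) {
        for (const auto& rc : elements_for_thread(t)) {
            counts[rc.second * NR0 + rc.first] += 1;
        }
    }
    return counts;
}
```

Every element is visited exactly once and each thread handles 16 of them. Because 128 is a multiple of `NR0`, a given thread always lands on the same row, so under this mapping the `bias_f32[abs_r]` operand is constant per thread. The mapping itself therefore looks sound, which is consistent with the failure being a memory-ordering or layout issue rather than a coverage gap.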
test-metal-ops PASSed on all pre-existing ops (diag_mask_inf,
pad_ext, conv_transpose_1d × 3 + tiny edge) — the kernel compiled
clean, the new _short / _4 / bias=1 variants all built.
But the end-to-end chatterbox synth produced wrong output:
| Metric | §3.28 baseline | §3.29 attempt |
|---|---|---|
| md5 | `d8a1b22375dbcb2259c686426a7d76c5` | `06ee1aaaa94a10d70eec2835d3da7dbf` |
| T3 tokens | 84 | 70 |
| audio_ms | 3480 | 2920 |
| determinism | stable across 5 runs | stable (same wrong md5 across runs) |
T3 EOS'd 14 tokens early. The wrong md5 was deterministic —
not a race, but a systematic computation error that's consistent
every run. Reverted to the §3.28 shmem-forcing behaviour
(byte-exact to d8a1b22…).
Suspected causes (none confirmed):

1. Cooperative tensor-store layout: `cT.store(tC)` is an Apple Metal tensor-ops cooperative write across all four simdgroups in the threadgroup. Where each element lands in device memory is implementation-defined, not trivially the 32x16 per-simdgroup partition `simdgroup_store` uses in the fallback path. The RMW pass as written assumes the partition doesn't matter (it iterates the full 64x32 via `tiitg`), but maybe the `threadgroup_barrier` with `mem_flags::mem_device` isn't strong enough to order `cT.store`'s writes against subsequent device reads from the same threadgroup on A17 / M3. A real memory-model audit (or testing with `fence()` instead of `threadgroup_barrier`) is the next thing to try.
2. `bias_ok` / `residual_ok` shape check vs graph layout: `bias_ok` only requires `ggml_nelements(bias) == ne0` and `bias->ne[0] == ne0`, which is correct for the usual `(OC,)` broadcast. But `residual_ok` requires `ggml_are_same_shape(resi, mul_mat_result)`. The mul_mat's output shape is `(ne0, ne1, ne2, ne3)`; if the residual happens to have matching shape but different strides (e.g., a non-contiguous view), the RMW would silently read the wrong bytes. §3.27's shmem path also trusted this check, and that one works, but the shmem path copies element by element, which could hide a stride bug that direct-store reveals. Worth an audit.
3. Index calculation off-by-one or wrong stride: the RMW uses `off = abs_c * ne0 + abs_r + im*ne1*ne0`, which matches the in-bounds direct-store formula `dst + r0 + r1*ne0 + im*ne1*ne0`. But I didn't pass `nb0`/`nb1` through: the direct-store uses `args.ne0` as stride assuming contiguous f32 output. If the destination tensor is non-contiguous (say, a view into a larger buffer) the mul_mat kernel itself would be wrong too, so this is probably not the bug, but worth double-checking in a unit test.
There's no per-shape unit test for mul_mm + add(bias)
that compares fused-kernel output vs unfused-graph output
element-by-element. test-metal-ops only covers
diag_mask_inf, pad_ext, and conv_transpose_1d. Adding a
mul_mm_fused test case (build a small ggml graph with
mul_mat + add, dispatch with fusion forced on vs
GGML_METAL_FUSION_DISABLE=1, compare outputs to 1e-6
tolerance) would have caught §3.29's bug in seconds. The
§3.27 and §3.28 kernels happen to be byte-exact because
their fold-in happens inside the scalar-copy loop which is
straightforward to reason about; §3.29's direct-store RMW has
a more subtle data-flow that would benefit from explicit
coverage.
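The comparison core of such a test is small. The sketch below covers only the element-wise check on two already-computed output buffers; building and dispatching the two graphs depends on the ggml backend API and is left out. `CompareResult` and `compare_outputs` are hypothetical names, not part of `test-metal-ops`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct CompareResult {
    float       max_abs;     // largest |ref - got| seen
    std::size_t worst_index; // where it occurred
    bool        ok;          // true if sizes match, no NaN, max_abs <= tol
};

// Element-wise comparison of a fused-kernel output against the unfused
// reference. NaNs are treated as failures instead of silently passing.
CompareResult compare_outputs(const std::vector<float>& ref,
                              const std::vector<float>& got,
                              float tol) {
    CompareResult r{0.0f, 0, false};
    if (ref.size() != got.size()) return r;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float d = std::fabs(ref[i] - got[i]);
        if (std::isnan(d)) {
            r.max_abs     = d;
            r.worst_index = i;
            return r; // ok stays false
        }
        if (d > r.max_abs) {
            r.max_abs     = d;
            r.worst_index = i;
        }
    }
    r.ok = r.max_abs <= tol;
    return r;
}
```

Reporting the worst index alongside the maximum makes it easier to tell a systematic layout error (large differences at regular positions) from rounding noise.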
| File | Change |
|---|---|
| `ggml/src/ggml-metal/ggml-metal.metal` | Direct-store RMW block removed; 21-line commentary added in place explaining §3.29 attempt + failure + suspected causes for the next person to read. `_mm_use_direct` reverts to §3.28's "no fold-in allowed on direct-store path" condition. |
| `ggml/src/ggml-metal/ggml-metal-device.cpp` | `get_pipeline_mul_mm` shmem sizing reverts to §3.28 behavior (8 KB when any of `bc_out` / `has_bias` / `has_residual` / `has_gelu_erf` is set). |
| patches/ggml-metal-chatterbox-ops.patch | Regenerated from pinned 58c38058. 1054 → 1070 lines (+16, the inline documentation block). |
cb_rev.wav md5 matches §3.26/§3.27/§3.28 baseline
d8a1b22375dbcb2259c686426a7d76c5 byte-exact. T3 back to 84
tokens / 3480 ms audio. No code change from §3.28 beyond the
documentation block.
M3 Ultra §3.27 shmem-roundtrip cost (~8 ms on CFM) remains standing. M4 / iOS predicted wins for §3.27 / §3.28 are unaffected — the fused kernel still fires; only the optimization to dodge the shmem path didn't land.
If you pick this up:
- Add a `test-metal-ops` case for fused `mul_mm + add(bias)` FIRST. Build a 2-op graph `add(mul_mat(W_q4_0, X_f32), bias_f32)`, dispatch with fusion ON (current default) vs `GGML_METAL_FUSION_DISABLE=1`, assert element-wise match to ~1e-6. Should be ~80 lines.
- Then retry the direct-store path, ideally with a smaller scope first (only `has_bias`, drop `has_residual` / `has_gelu_erf`) to halve the complexity. If the bias-only variant passes the new unit test, incrementally add the others.
- Apple's Metal Shading Language Specification, §5.7 "Memory Scopes and Barriers", has the exact semantics for `mem_flags::mem_device` vs `mem_flags::mem_none`; worth confirming that `threadgroup_barrier(mem_device)` orders cooperative-tensor-store writes against subsequent device reads on A17+ silicon. Cf. `simdgroup_fence_t` as an alternative to `threadgroup_barrier`.
Two pieces, both closing §3.29 loose ends:
- Harness: new `test_mul_mm_fused` in `src/test_metal_ops.cpp` builds a small graph `add(mul_mat(W_q4_0, X_f32), bias)` (optionally with a `gelu_erf` follow-up), runs it on CPU + Metal, and compares element-wise. On the Metal side, ggml-metal's fusion detector collapses these into a single `kernel_mul_mm_..._bias=1_res=X_gelu=Y` dispatch; CPU is always the unfused triple. Any numerical drift beyond tolerance indicates a kernel bug. Tolerance picked at 2e-2 absolute after observing that the Q4_0-dequant-order CPU-vs-GPU noise on K=256..1024 shapes runs ~5–11e-3 max abs (about 4× the typical ~5e-3, about 2× the worst observed 1.1e-2).
- Bias-only direct-store (§3.29 retry): full-block writes with `has_bias && !has_residual && !has_gelu_erf` now take the direct-store path with a post-barrier bias-add scan (128 threads × 16 elements), instead of routing through the shmem scalar-copy fallback. Residual / gelu fold-ins still route through shmem: §3.29's negative finding on those paths stands (root cause unresolved), so the proven path is kept for them. This is the minimum-scope slice of §3.29 that the new harness proves byte-stable.
8 fused-mul_mm shape variants, gated under the same test-metal-ops
binary so CI/ship criteria run them alongside diag_mask_inf /
pad_ext / conv_transpose_1d:
[mul_mm_fused cfm-attn-qkv] OK (K=256 N=256 T=87 B=2 fuse=bias, max_abs=5.2e-03)
[mul_mm_fused cfm-attn-out] OK (K=256 N=512 T=87 B=2 fuse=bias, max_abs=5.7e-03)
[mul_mm_fused cfm-ff-gate-bias] OK (K=256 N=1024 T=87 B=2 fuse=bias, max_abs=5.8e-03)
[mul_mm_fused cfm-ff-gate-bias+gelu] OK (K=256 N=1024 T=87 B=2 fuse=gelu, max_abs=4.9e-03)
[mul_mm_fused cfm-ff-down] OK (K=1024 N=256 T=87 B=2 fuse=bias, max_abs=1.1e-02)
[mul_mm_fused cfm-b1] OK (K=256 N=512 T=87 B=1 fuse=bias, max_abs=5.7e-03)
[mul_mm_fused bco-bias] OK (K=256 N=320 T=87 B=2 fuse=bias, max_abs=5.8e-03)
[mul_mm_fused bco-gelu] OK (K=256 N=320 T=87 B=2 fuse=gelu, max_abs=5.2e-03)
Covers the exact shapes chatterbox CFM hits (256→256 attn Q/K/V,
256→512 attn_out, 256→1024 ff0 with gelu, 1024→256 ff2), batch=1
and batch=2 variants, and a non-64-multiple N=320 that forces
the bco=1 (bounds-checked) shmem path.
The bias-only direct-store path passes the harness byte-stably
and produces byte-exact WAV output end-to-end
(md5 d8a1b22375dbcb2259c686426a7d76c5 across 5 runs, T3 84
tokens, audio_ms 3480).
Measured impact on M3 Ultra (5 invocations, Q4_0 + HiFT F16):
| Metric | §3.28 | §3.30 | Δ |
|---|---|---|---|
| `[cfm_total]` ms | 533.4 ± 1.0 | 534.0 ± 0.9 | noise |
| `S3GEN_INFER_MS` | 706.0 ± 0.8 | 706.2 ± 3.2 | noise |
| `[hift_decode]` ms | 121.2 | 121.8 | noise |
Neutral on M3 Ultra, same as §3.27. Reason: in chatterbox's
basic_tfm, every mul_mat+bias has a follow-up op (either
residual or gelu) that forces the fusion through the 3-op
path, which still routes through shmem. The 2-op
{MUL_MAT, ADD(bias)} path §3.30 optimises only fires for
a few tensors outside basic_tfm (time_mlp / final_proj /
resnet t_mlp) that contribute negligibly to wall time.
The harness itself is the real deliverable — any future attempt at the residual / gelu direct-store paths now has a way to get fast feedback on whether a change is correct before spending 2–3 h on an end-to-end chatterbox run.
The `{MUL_MAT, ADD, ADD}` residual fusion and `{MUL_MAT, ADD, GELU_ERF}` gelu fusion on the direct-store path were what
failed in §3.29 (the test-metal-ops gate I've just added would
have immediately flagged them as wrong output, avoiding the
revert). Fixing them needs either:
- a deeper audit of `cT.store`'s cooperative write layout vs Metal memory ordering with `mem_flags::mem_device` (likely where §3.29 broke); OR
- a different strategy entirely (e.g., inline residual read into the simdgroup accumulator before `simdgroup_store`, avoiding the post-barrier RMW round-trip).

Either is 2–3 h of Metal-specific debugging. Left for a future session; the harness now makes that session tractable.
| File | Change |
|---|---|
| `src/test_metal_ops.cpp` | New `test_mul_mm_fused(cpu, gpu, K, N, T, B, fuse_mode, label)` helper + 8 test invocations covering the CFM shape space. New `#include "ggml-cpu.h"` for the CPU reference backend (via the existing include cluster). |
| `ggml/src/ggml-metal/ggml-metal.metal` | Bias-only direct-store path: full-block write via `cT.store` / `simdgroup_store`, then `threadgroup_barrier(mem_flags::mem_device)`, then a 128-thread scan adding `bias[r0 + row_off]` to each of the 2048 elements. Only fires when `FC_mul_mm_has_bias && !FC_mul_mm_has_residual && !FC_mul_mm_has_gelu_erf`, gated narrowly to the scope the harness validates. |
| `ggml/src/ggml-metal/ggml-metal-device.cpp` | Shmem sizing: 8 KB when `bc_out` … |
| patches/ggml-metal-chatterbox-ops.patch | Regenerated from pinned 58c38058. 1070 → 1088 lines, +18 (direct-store bias scan + shmem-sizing comment). Applies cleanly. |
Three items still deferred:
- Residual direct-store — needs the cooperative-store barrier audit mentioned above. Harness is ready.
- Gelu direct-store — same as residual. The inline-math cost is cheap, so the win is mostly avoiding the shmem roundtrip (like bias). Estimated +2–5 ms on M3 Ultra if it works; infra pattern identical to §3.28 and §3.30.
- Extend fusion to other unary sub-ops (SILU, GELU non-erf, RELU, GELU_QUICK) — trivial copy-paste of §3.28; not done because chatterbox / T3 / CFM don't emit those after a mul_mat+bias pair. Useful infra for downstream consumers of this patch (stable-diffusion.cpp / tts-cpp).
Closes the validation gap left by §3.24 / §3.26 / §3.27 / §3.28 / §3.30 — all of those predict positive-on-bandwidth-limited-hardware (M4 Air / iPhone / iPad) but were measured only on M3 Ultra where per-dispatch overhead is so low that the fusion wins largely cancel out against kernel-path overhead. Two pieces:
Cross-compiled libggml-metal.a + libtts-cpp.a for iOS 14.0+
arm64 on this M3 Ultra host (Xcode 16 / iOS 18.5 SDK):
cmake -S . -B build-ios \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_SYSROOT=iphoneos \
-DCMAKE_OSX_ARCHITECTURES=arm64 \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON \
-DGGML_NATIVE=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF
cmake --build build-ios --target tts-cpp ggml-metal -j
Both libraries produce clean arm64-only archives:
build-ios/ggml/src/ggml-metal/libggml-metal.a: arm64
build-ios/libtts-cpp.a: arm64
That's the structural validation that §3.26's
kernel_mul_mv_f32_f16{,_4,_short} variants and §3.27 / §3.28 /
§3.30's kernel_mul_mm FC-gated bias / gelu_erf fold-ins are
iOS-portable — none of the kernel code uses macOS-only
intrinsics. Runtime validation still requires a real iOS device
(TestFlight / Xcode device provisioning); this confirms there's
no compile-time barrier to shipping.
Self-contained harness the user runs on any Apple-silicon Mac (M4 Air / M4 Pro / M3 / etc.) or any host that mounts the model GGUFs. Pipeline:
1. Apply the pinned ggml patch via `scripts/setup-ggml.sh`.
2. Configure + build `build-metal` (Release, GGML_METAL=ON, GGML_BLAS=OFF, GGML_NATIVE=ON).
3. Run `test-metal-ops`, which asserts all 14 gates PASS (3 base diag/pad + 3 conv_transpose_1d HiFT + 8 fused-mul_mm).
4. Run 5 invocations of `chatterbox` on the Spanish-prompt baseline (Q4_0 + HiFT F16 v2 GGUF + seed 42).
5. Collect per-run `[encoder]` / `[cfm_total]` / `[hift_decode]` / `S3GEN_INFER_MS` / `T3_INFER_MS`.
6. Compute means, compare against the M3 Ultra reference baked into the script header: M3U CFM = 534.0 ms, M3U S3Gen = 706.6 ms, M3U T3 = 432.6 ms, M3U HiFT = 121.1 ms.
7. Check WAV determinism (all 5 runs same md5) and byte-exactness vs the M3U reference md5 `d8a1b22375dbcb2259c686426a7d76c5`.
8. Write `artifacts/bench/m4-validation.json` with the full comparison + host info (chip, model).
Dependencies on the target host:
- macOS + Xcode command-line tools (`cmake`, `clang++`)
- Python 3 (for `scripts/setup-ggml.sh`'s gguf tooling)
- Model GGUFs at the usual paths (or override via env vars: `T3_GGUF=... S3GEN_GGUF=... REF_WAV=... RUNS=... bash scripts/bench-m4-validation.sh`)
- ~16 GB disk for model + build artefacts
Example predicted output on M4 Air (hypothetical; actual to be captured when the script runs on M4 hardware):
=== Summary: Apple M4 vs M3 Ultra reference ===
stage M3 Ultra (ref) this host Δ vs M3U
[cfm_total] ms 534.0 ~XXX.X -A / -B%
S3GEN_INFER_MS 706.6 ~YYY.Y -C / -D%
The Δ column tells us whether the §3.27 / §3.28 / §3.30
predicted-positive story holds. If M4 shows noticeably smaller
CFM than M3U after accounting for M4's higher single-core clock,
the shipping portfolio is vindicated. If M4 matches M3U or
regresses, §3.27 / §3.30 should be re-examined.
Ran the script locally as a sanity check — expected to show "this host == reference" with no deltas:
=== Summary: Apple M3 Ultra vs M3 Ultra reference ===
stage M3 Ultra (ref) this host Δ vs M3U
[cfm_total] ms 534.0 533.7 -0.3 (-0.1%)
S3GEN_INFER_MS 706.6 707.4 +0.8 (+0.1%)
T3_INFER_MS 432.6 434.6 +2.0 (+0.5%)
[hift_decode] ms 121.1 123.1 +2.0 (+1.7%)
=== Parity ===
determinism: PASS (md5 d8a1b22375dbcb2259c686426a7d76c5 stable across 5 runs)
byte-exact vs M3 Ultra: PASS (d8a1b22375dbcb2259c686426a7d76c5)
All deltas within per-invocation stdev. Script is ready to scp + run on any M4 / M3 / M2 box.
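The comparison step of the script boils down to a per-stage delta against the pinned reference plus an md5 stability check. The C++ sketch below mirrors that logic for illustration only; the actual script is bash, and `Delta`, `delta_vs_ref` and `md5_stable` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Delta {
    double abs_ms; // host mean minus reference mean, in ms
    double pct;    // same difference as a percentage of the reference
};

// Negative values mean the host is faster than the reference.
Delta delta_vs_ref(double ref_ms, double host_ms) {
    assert(ref_ms > 0.0);
    const double d = host_ms - ref_ms;
    return Delta{d, 100.0 * d / ref_ms};
}

// Determinism check: all runs must produce the same md5 string.
// An empty run list is treated as a failure.
bool md5_stable(const std::vector<std::string>& md5s) {
    if (md5s.empty()) return false;
    for (const auto& m : md5s) {
        if (m != md5s.front()) return false;
    }
    return true;
}
```

For the `[cfm_total]` row above, `delta_vs_ref(534.0, 533.7)` gives about −0.3 ms and about −0.06 %, which the summary prints rounded to one decimal as −0.1 %.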
| File | Change |
|---|---|
| scripts/bench-m4-validation.sh | New 150-line bash script. Self-contained: pins the M3 Ultra reference numbers, runs test-metal-ops, 5-invocation bench, compares, writes JSON. |
- Run the script on an M4 Air (user action: `scp -r chatterbox.cpp m4:` + `scp models/*.gguf m4:.../models/` + `ssh m4 'bash chatterbox.cpp/scripts/bench-m4-validation.sh'` + `scp m4:.../artifacts/bench/m4-validation.json .`).
- If M4 results confirm the prediction: update the §3.27 / §3.28 / §3.30 sections with the M4 numbers alongside M3U.
- If M4 results contradict the prediction: file a follow-up to revisit the fusion costs on smaller Apple silicon.
Target: Termux on Snapdragon / Adreno 830 using GGML_OPENCL=ON, with
LD_LIBRARY_PATH including /data/data/com.termux/files/home/lib so the
OpenCL loader and ggml DSOs resolve.
The first OpenCL smoke runs only offloaded T3; S3Gen/HiFT still had to stay on CPU because ggml-opencl rejected missing ops during graph execution. The sequence of blockers observed on-device was:
1. `CONV_TRANSPOSE_1D` in HiFT.
2. `SIN` / `COS` in HiFT's oscillator / phase path.
3. `LEAKY_RELU` in the S3Gen encoder.
4. `UNARY(ELU)` and `ABS` in the f0 predictor.
- Added `GGML_USE_OPENCL` wiring to the C++ side (`init_backend` for T3 and `s3gen_init_backend` for S3Gen/HiFT), so `--n-gpu-layers > 0` actually attempts `ggml_backend_opencl_init()` before CPU fallback.
- Added `patches/ggml-opencl-chatterbox-ops.patch` and updated `scripts/setup-ggml.sh` so a fresh `ggml/` checkout is reset to the pinned commit and receives both the Metal and OpenCL patches.
- Extended ggml-opencl with the missing ops:
  - `GGML_OP_CONV_TRANSPOSE_1D` (`f32` and `f16` kernel / `f32` input paths).
  - `GGML_OP_SIN`, `GGML_OP_COS`.
  - `GGML_OP_LEAKY_RELU`.
  - `GGML_UNARY_OP_ABS`, `GGML_UNARY_OP_ELU` (`f32` paths used by f0).
- Optimized the first `CONV_TRANSPOSE_1D` OpenCL kernel: instead of scanning every input position and discarding almost all of them, each output sample now computes the exact input index range that can contribute.
- Exposed `--cfm-steps N` for normal batch synthesis (previously only the streaming path had `--stream-cfm-steps`). Default remains 2 for Python-like meanflow quality; `--cfm-steps 1` is the lower-latency mode.
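The range tightening for `CONV_TRANSPOSE_1D` can be shown with a single-channel CPU reference. This is a simplified sketch under stated assumptions (one channel, no padding, dilation 1, stride >= 1), not the OpenCL kernel; both function names are hypothetical. Output sample `o` receives `x[i] * w[o - i*stride]` only when the tap index lies in `[0, K)`, which bounds `i` to `ceil((o - K + 1) / stride) <= i <= floor(o / stride)`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Scan version: every output sample visits every input position and
// discards those whose kernel tap falls outside [0, K).
std::vector<float> conv_transpose_1d_scan(const std::vector<float>& x,
                                          const std::vector<float>& w,
                                          int stride) {
    assert(!x.empty() && !w.empty() && stride >= 1);
    const int L = (int) x.size();
    const int K = (int) w.size();
    std::vector<float> y((L - 1) * stride + K, 0.0f);
    for (int o = 0; o < (int) y.size(); ++o) {
        for (int i = 0; i < L; ++i) {
            const int k = o - i * stride;
            if (k < 0 || k >= K) continue;
            y[o] += x[i] * w[k];
        }
    }
    return y;
}

// Tightened version: compute the contributing input range up front.
std::vector<float> conv_transpose_1d_range(const std::vector<float>& x,
                                           const std::vector<float>& w,
                                           int stride) {
    assert(!x.empty() && !w.empty() && stride >= 1);
    const int L = (int) x.size();
    const int K = (int) w.size();
    std::vector<float> y((L - 1) * stride + K, 0.0f);
    for (int o = 0; o < (int) y.size(); ++o) {
        // ceil((o - K + 1) / stride), clamped at 0 for the leading samples
        const int i_lo = std::max(0, (o - K + stride) / stride);
        const int i_hi = std::min(L - 1, o / stride);
        for (int i = i_lo; i <= i_hi; ++i) {
            y[o] += x[i] * w[o - i * stride];
        }
    }
    return y;
}
```

Both versions accumulate in ascending `i`, so their outputs match exactly; the tightened loop does at most about `K / stride + 1` iterations per output sample instead of `L`.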
Remote build:
cd /data/data/com.termux/files/home/qvac-chatterbox.cpp
git pull --ff-only
./scripts/setup-ggml.sh
cmake -S . -B build-opencl -DCMAKE_BUILD_TYPE=Release -DGGML_OPENCL=ON
cmake --build build-opencl -j$(nproc) --target tts-cli

Runtime command:
export LD_LIBRARY_PATH="/data/data/com.termux/files/home/lib:${LD_LIBRARY_PATH:-}"
./build-opencl/tts-cli \
--model /data/data/com.termux/files/home/chatterbox.cpp/models/chatterbox-t3-turbo.gguf \
--s3gen-gguf /data/data/com.termux/files/home/chatterbox.cpp/models/chatterbox-s3gen.gguf \
--text "Hello" --n-gpu-layers 99 --verbose --out test-gpu.wav

OpenCL now runs end-to-end and writes a WAV:
init_backend: using OpenCL backend
[encoder] ~167 ms
[cfm_total] ~921 ms (2-step default)
[f0_predictor] ~6 ms
[hift_decode] ~217-222 ms after conv_transpose_1d range tightening
S3GEN_INFER_MS ~1396-1450 for 800 ms audio (RTF ~1.74-1.81)
T3_INFER_MS ~772-846
Full generated-audio RTF on the short "Hello" smoke test:
| Mode | T3 infer | S3Gen+HiFT infer | Audio | Full RTF |
|---|---|---|---|---|
| default 2-step CFM | ~772 ms | ~1396 ms | 800 ms | ~2.71 |
| `--cfm-steps 1` | ~772 ms | ~887 ms | 800 ms | ~2.07 |
The 1-step mode is deliberately opt-in because it trades some meanflow quality for latency; it is useful for interactive/mobile experiments where CFM dominates the wall clock.
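The "Full RTF" column follows the same definition as the speed tables earlier in this document, `(T3_INFER + S3GEN_INFER) / audio_ms`. A small helper, with the hypothetical name `full_rtf`, reproduces the two rows:

```cpp
#include <cassert>

// Real-time factor of generation: total inference time divided by the
// duration of the audio produced. Values above 1 are slower than real time.
inline double full_rtf(double t3_ms, double s3gen_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return (t3_ms + s3gen_ms) / audio_ms;
}
```

`full_rtf(772, 1396, 800)` is 2.71 and `full_rtf(772, 887, 800)` is about 2.07, matching the table.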
Baseline for this log: Termux phone held awake with termux-wake-lock,
T3 Q4_0 + S3Gen Q4_0, short "Hello" smoke test (800 ms audio),
--n-gpu-layers 99 --cfm-steps 1 unless otherwise noted.
| Step | Change | Result |
|---|---|---|
| CFM attention precision | Added `--cfm-f16-kv-attn`: CFM flash attention uses F32 Q and F16 K/V so OpenCL dispatches `flash_attn_f32_f16`. | Best useful CFM win so far: attention kernel went from ~257 ms (`flash_attn_f32`) to ~102 ms; S3Gen dropped to ~726-740 ms; full RTF ~1.38-1.39 in best phone-awake samples. |
| Model mix: S3Gen F16 | T3 Q4_0 + S3Gen full/F16-ish GGUF with `--cfm-f16-kv-attn`. | Not better overall: CFM ~346-354 ms, S3Gen ~743-749 ms. |
| Model mix: S3Gen Q8_0 | Quantized S3Gen to Q8_0 and tested with T3 Q4_0. | Worse than S3Gen Q4_0: CFM ~391 ms, S3Gen ~789 ms. |
| Q4_0 GEMV epilogue fusion | Added optional bias/residual epilogue operands to Adreno token GEMV and graph fusion for `MUL_MAT+ADD(+ADD)`. | Correct, but only a tiny T3/S3Gen movement on the short run; not a major bottleneck. |
| Batched Q4_0 GEMM epilogue fusion | Added optional bias/residual epilogue to `kernel_mul_mm_q4_0_f32_l4_lm`, targeting CFM projection GEMMs. | Correct after arg-placement fix, but core GEMM time stayed ~138 ms in the CFM graph, so surrounding adds were not the real cost. |
| Q4_0 GEMM tile BN=32 | Changed `kernel_mul_mm_q4_0_f32_l4_lm` from BN=64 to BN=32 for the hot 256 x 540 CFM output shape. | Regression: CFM Q4_0 GEMM grew from ~138 ms to ~181 ms. Reverted to the original 64x64 tile. |
| Q4_0 GEMM tile BK=64 | Changed `kernel_mul_mm_q4_0_f32_l4_lm` from BK=32 to BK=64 while keeping BM=64/BN=64. | Regression: CFM Q4_0 GEMM again grew to ~180 ms and cfm_total ~436 ms. Revert to BK=32. |
| Q4_0 GEMM tile BM=32 | Changed `kernel_mul_mm_q4_0_f32_l4_lm` from BM=64 to BM=32 while keeping BN=64/BK=32. | Regression: CFM Q4_0 GEMM grew to ~213 ms and cfm_total ~445 ms. Revert to BM=64. |
| Q4_0 GEMM thread tile TN=4 | Changed per-thread output from TM=4/TN=8 to TM=4/TN=4, keeping BM=64/BN=64/BK=32. | Mild regression: CFM Q4_0 GEMM rose to ~147 ms and cfm_total ~411 ms. Revert to TN=8. |
| CFM attention F16 Q/K/V | Cast Q/K/V to F16 for `flash_attn_f16`, then copy output back to F32 before projection. | Not better than F16 K/V only: flash attention dropped to ~92 ms, but extra copies raised total CFM to ~369 ms vs ~355 ms. Remove the flag; keep `--cfm-f16-kv-attn`. |
| Direct conv1d via `CONV_2D` | Tested an env-gated path that reshaped 1D convs to height-1 `ggml_conv_2d_direct`, bypassing explicit `im2col -> mul_mat`. | Rejected and removed. Profiling run improved HiFT (hift_decode ~169 ms), but a non-profile phone-awake sample regressed overall (S3GEN_INFER_MS ~845 ms, cfm_total ~404 ms), so the code path was deleted. |
Current measured bottlenecks after the useful attention change:
CFM graph (cl_profiling_0022.csv):
kernel_mul_mm_q4_0_f32_l4_lm ~138 ms
flash_attn_f32_f16 ~102 ms
Next experiments should target the core Q4_0 batched GEMM math itself
(kernel_mul_mm_q4_0_f32_l4_lm), not epilogue/add fusion.