Chatterbox vulkan on multilingual #5
Open
Zbig9000 wants to merge 1 commit into
Ports the Vulkan-side optimizations from
chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan
(branched from upstream/main) onto the multilingual_merged base.
Two ggml-vulkan patches + chatterbox-source host-side optimizations
that benefit BOTH the Turbo (meanflow) and the multilingual (standard
CFM with CFG) variants. No GGUF format change.
Headline (RTX 5090, regress-tight aggregate, n=75 chunks, Turbo model
since multilingual model files weren't available locally):
metric | mtl_merged base | + this PR | Δ
S3GEN_INFER | 76.6 ms | 65.4 ms | -11.2 ms (-14.6 %)
cfm_total | 40.3 ms | 28.7 ms | -11.6 ms (-28.8 %)
encoder | 19.9 ms | 20.7 ms | noise
hift_decode | 10.9 ms | 11.6 ms | noise
cfm_total ranges fully separated on n=120 samples
(base [38.3, 42.8] vs final [27.1, 30.1]). All F32 invariants
(round-1 single-shot, round-2 multi-synth ident, round-3 multi-synth
varied) stay bit-exact; F16 invariant requires the C1 follow-up
(opt-in CHATTERBOX_F16_CFM env var, not in this commit).
ggml-vulkan patches (apply via setup-ggml.sh, inert without
GGML_VULKAN=ON):
- patches/ggml-vulkan-pipeline-cache.patch (+199 lines)
Persistent VkPipelineCache across processes, keyed by
<vendorID>-<deviceID>-<driverVersion>. Recovers ~91 % of the
cold-to-warm gap on the first warm run. Disabled by setting
GGML_VK_PIPELINE_CACHE_DIR="". Entire bench: 2.69 s -> 0.25 s
fresh-process wall on RTX 5090 with cache populated.
- patches/ggml-vulkan-eager-cache-save.patch (+104 lines)
Write back the pipeline cache after every ggml_vk_load_shaders
compile batch (crash-safety against SIGKILL/abort losing freshly
compiled pipelines). Stacks on the first patch.
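
For illustration, the cache key and directory lookup described above could be derived roughly as follows. This is a sketch, not the patch itself: the helper names pipeline_cache_key, pipeline_cache_dir and pipeline_cache_path are hypothetical, decimal formatting of the three IDs and the file naming are assumptions, and the fallback order is the one given in the PR description ($GGML_VK_PIPELINE_CACHE_DIR, then $XDG_CACHE_HOME/ggml/vulkan, then $HOME/.cache/ggml/vulkan).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: builds the "<vendorID>-<deviceID>-<driverVersion>" key.
// Decimal formatting is an assumption; the real patch may format the fields differently.
std::string pipeline_cache_key(uint32_t vendor_id, uint32_t device_id, uint32_t driver_version) {
    return std::to_string(vendor_id) + "-" + std::to_string(device_id) + "-" +
           std::to_string(driver_version);
}

// Hypothetical helper: resolves the cache directory from environment values that the
// caller has already read with std::getenv (nullptr means "unset").
// Returns an empty string when caching is disabled or no location can be determined.
std::string pipeline_cache_dir(const char * cache_dir_env,   // GGML_VK_PIPELINE_CACHE_DIR
                               const char * xdg_cache_home,  // XDG_CACHE_HOME
                               const char * home) {          // HOME
    if (cache_dir_env != nullptr) {
        // Set-but-empty disables the cache; any other value is used as the directory.
        return std::string(cache_dir_env);
    }
    if (xdg_cache_home != nullptr && xdg_cache_home[0] != '\0') {
        return std::string(xdg_cache_home) + "/ggml/vulkan";
    }
    if (home != nullptr && home[0] != '\0') {
        return std::string(home) + "/.cache/ggml/vulkan";
    }
    return std::string();
}

// Hypothetical helper: joins directory and key. The file naming is an assumption.
std::string pipeline_cache_path(const std::string & dir, const std::string & key) {
    assert(!dir.empty() && !key.empty());
    return dir + "/" + key;
}
```

A caller would pass the three std::getenv results and skip loading or saving the cache blob whenever the returned directory is empty.
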
Chatterbox-source host-side optimizations (src/chatterbox_tts.cpp,
+~250 lines):
1. Persistent CFM estimator graph cache (~ -10 ms / chunk).
cfm_estimator_cache was local-scope in s3gen_synthesize_to_wav()
-- every synth call paid the full graph rebuild cost. Refactored
to follow the same explicit-destroy() global-lifetime pattern as
the existing thread_local time_mlp_cache. Both the batch=1
(Turbo) and batch=2 (multilingual CFG) paths reuse the same
cache; the cache.b2 flag triggers a rebuild when mode changes.
Cache cleared in s3gen_model_cache_release BEFORE the backend is
freed (Vulkan/Metal device-teardown ordering). Also cleared on
s3gen_model_cache_get cache-miss (backend swap).
2. Time-embedding result memoisation (~ -4 ms / inf).
Both Turbo (t_span = [0, 0.5, 1]) and multilingual (cosine-
scheduled, default 10 steps) produce the same set of t-values
across all subsequent synth calls. Added two-layer cache:
- g_time_mlp_results: keyed by uint32_t bitcast of t_val
- g_time_emb_results: keyed by uint64_t = (kt << 32) | kr
(Turbo only; multilingual skips the mixer)
compute_time_mlp_cached + compute_time_emb_cached wrappers.
6 graph submissions / inference -> 0 after first inference for
Turbo; 9-19 -> 0 for multilingual (10-step).
3. CPU mirror cache for large per-synth weight downloads (~ -1 ms
/ inf). flow/input_embedding (~13.4 MB Turbo / ~28 MB MTL) +
flow/spk_embed_affine/{w,b} were re-downloaded GPU->CPU on every
synth call. New cached_cpu_weights_f32(t) helper +
g_weight_cpu_mirror map (keyed by ggml_tensor *).
4. Three HiFT cont sites removed (perf-neutral, code quality).
conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp
permute -- all bit-exact-preserving (consumers accept strided
sources: ggml_add for bias, ggml_clamp element-wise, ggml_mul_mat
src1 for f32 matmul).
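
The result memoisation in item 2 can be sketched in isolation. The block below is a minimal illustration, not the code in src/chatterbox_tts.cpp: time_result_cache, float_bits, get_mlp and get_emb are hypothetical names, and the compute callback stands in for a real graph submission. It shows the two key shapes named above (uint32_t bitcast of t, and (kt << 32) | kr) and the explicit clear that has to run before the model or backend is released.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <unordered_map>
#include <vector>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits for the bitcast key");

// Bit pattern of a float, used as an exact-match key (no epsilon comparison).
inline uint32_t float_bits(float v) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));  // well-defined way to reinterpret the bytes
    return bits;
}

// Hypothetical stand-in for the g_time_mlp_results / g_time_emb_results pair.
struct time_result_cache {
    using compute_fn = std::function<std::vector<float>()>;

    std::unordered_map<uint32_t, std::vector<float>> mlp;  // key: bits of t_val
    std::unordered_map<uint64_t, std::vector<float>> emb;  // key: (kt << 32) | kr

    // Returns the cached time-MLP output for t, running `compute` only on a miss.
    const std::vector<float> & get_mlp(float t, const compute_fn & compute) {
        const uint32_t key = float_bits(t);
        auto it = mlp.find(key);
        if (it == mlp.end()) {
            it = mlp.emplace(key, compute()).first;
            assert(!it->second.empty());  // an empty embedding would indicate a caller bug
        }
        return it->second;
    }

    // Same idea for the (t, r) mixer output, with both bit patterns packed into one key.
    const std::vector<float> & get_emb(float t, float r, const compute_fn & compute) {
        const uint64_t key = (static_cast<uint64_t>(float_bits(t)) << 32) | float_bits(r);
        auto it = emb.find(key);
        if (it == emb.end()) {
            it = emb.emplace(key, compute()).first;
            assert(!it->second.empty());
        }
        return it->second;
    }

    // Call this from the same place that destroys the graph cache.
    void clear() {
        mlp.clear();
        emb.clear();
    }
};
```

Because the key is the raw bit pattern, 0.0f and -0.0f land in different entries; that is harmless here as long as the same literal t-values are passed on every call.
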
Test infrastructure (scripts/dump-s3gen-reference.py +65 lines,
src/test_s3gen.cpp +6 lines):
- G2 dump-script gap closure: cfm_concat / cfm_h_conv / cfm_h_ln /
hift_s_stft .npy files now produced. Plus ggml_set_output(xc)
in stage_G2 so the gallocator preserves the diagnostic
intermediate (was returning garbage because xc's slot got reused
by downstream intermediates after the conv1d consumer completed).
- regress-tensor-compare.sh now runs end-to-end through
G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT
(PyTorch FFT vs hand-built DFT, expected), max <= 4.7e-5
everywhere else; final waveform max_abs = 8.20e-08.
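
To make the two figures above easier to interpret, here is one plausible way to compute them. This is a sketch under assumptions: compare_tensors and compare_stats are hypothetical names, and normalising the largest absolute difference by the reference peak is only one common definition of "max relative error"; the actual script may define it differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct compare_stats {
    double max_abs;  // largest |got[i] - ref[i]|
    double max_rel;  // max_abs divided by the largest |ref[i]| (assumed definition)
};

// Compares a C++ tensor dump against a reference dump of the same length.
compare_stats compare_tensors(const std::vector<float> & got, const std::vector<float> & ref) {
    assert(got.size() == ref.size());
    double max_abs  = 0.0;
    double ref_peak = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double diff = std::fabs(static_cast<double>(got[i]) - static_cast<double>(ref[i]));
        max_abs  = std::max(max_abs, diff);
        ref_peak = std::max(ref_peak, std::fabs(static_cast<double>(ref[i])));
    }
    // An all-zero reference has no scale to normalise by, so fall back to the absolute error.
    const double max_rel = ref_peak > 0.0 ? max_abs / ref_peak : max_abs;
    return compare_stats{max_abs, max_rel};
}
```
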
Negative result documented (inline comments + FINDINGS doc): tried
skip-upload of mu/spks/cond across cfm_steps within one synthesize
call. Broke F32 single-shot. Root cause: ggml's gallocator REUSES
input-tensor buffer slots once their consumers complete. Skip-upload
only works for inputs referenced THROUGHOUT the graph (encoder
pos_emb pattern works, CFM mu/spks/cond pattern doesn't).
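
The failure mode is easy to reproduce in miniature. The block below is a toy model and explicitly not ggml's allocator: run_toy_graph and the two-slot pool are hypothetical, and the arithmetic is arbitrary. It only shows why skipping the upload breaks once an input's slot has been recycled for a later intermediate.

```cpp
#include <cassert>
#include <vector>

// Toy model only: this is NOT ggml's allocator, just the reuse pattern in miniature.
// pool[0] holds the input `mu` and is later recycled for the intermediate `h`;
// pool[1] holds the intermediate `a`. The pool persists across runs, like a cached graph.
float run_toy_graph(std::vector<float> & pool, float mu, bool upload_input) {
    assert(pool.size() >= 2);
    if (upload_input) {
        pool[0] = mu;          // stands in for ggml_backend_tensor_set
    }
    pool[1] = pool[0] * 2.0f;  // a = mu * 2  (last consumer of mu)
    pool[0] = pool[1] + 1.0f;  // h = a + 1   (written into mu's recycled slot)
    return pool[0] * 3.0f;     // out = h * 3
}
```

With mu = 1, every run that uploads the input returns 9. A run that skips the upload reads the previous run's `h` (3) in place of `mu` and returns 21 instead, which is the same kind of silent divergence seen on the F32 single-shot check.
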
Deferred follow-ups (in OPTIMIZATION_PLAN_NEXT.md):
- C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM env var).
Saves ~125 MB device memory + helps bandwidth-bound mobile.
Needs new MD5 baselines.
- Round-4/6 QKV fusion: multilingual_merged uses zero-cont strided
Q/K/V views (Metal-tuned). Our Vulkan fused mul_mat would need
careful integration to compose with that approach.
- HiFT decoder graph caching: multilingual_merged HiFT rebuilds
every chunk (no g_hift_cache equivalent). Same persistent-cache
pattern as round-HIFT could apply.
- Multilingual model file regression: multilingual model not
available locally; only Turbo F32 invariants verified bit-exact
on multilingual_merged base.
Files: src/chatterbox_tts.cpp +252 / -19
src/test_s3gen.cpp +6
scripts/dump-s3gen-reference.py +65
scripts/setup-ggml.sh +20 / -8
patches/ggml-vulkan-pipeline-cache.patch +199 (NEW)
patches/ggml-vulkan-eager-cache-save.patch +104 (NEW)
patches/README.md +13 / -8
CHANGELOG.md +603 (NEW)
Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether pushed a commit to ogad-tether/chatterbox.cpp that referenced this pull request May 6, 2026
…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip
after the origin/main merge. Three are real bugs (CFG, top_k, engine
crash on MTL GGUFs); one is a perf regression with audible behaviour
on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly
computes (1+cfg)*cond - cfg*uncond, but the else branch only computed
the conditional pass and silently dropped CFG on every non-Metal
backend. Restores the §3.19 (3f0a8da) behaviour: when
!meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU
backend where the b2 path was disabled), run cond + uncond
back-to-back on the same B=1 graph (cfm_estimator_cache key
(T, b2=false) reuses the cached graph across both calls) and combine
via the standard CFG mix. Smoke-tested on CPU (--n-gpu-layers 0):
runs cleanly, S3Gen wall-clock doubles vs meanflow as expected
(12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after
nth_element(begin, begin+k, end, greater) the (k+1)-th largest sits at
idx[k] and positions [0, k) hold the top-k UNORDERED. The previous
code took cut = l[idx[k-1]] which is some arbitrary top-k element
(often not the smallest), making cut too large and the `x < cut`
filter then erased legitimate top-k logits. Fix: partition to
begin+(k-1) so idx[k-1] is the k-th largest exactly. Mostly masked by
the default top_k=1000 vs an 8194-vocab where the threshold falls into
the noise floor; the bug bites at small top_k (e.g. greedy --top-k 1
where the wrong cut could pessimise tie handling). The Turbo
sample_next_token_ex in src/main.cpp uses a different (correct)
approach via tmp[k] + per-element rescan for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL GGUFs
into load_model_gguf_mtl (populates layers_mtl, leaves layers empty),
but synthesize() unconditionally calls eval_prompt ->
build_prompt_graph -> build_transformer_core, which iterates
model.layers[il] -- empty std::vector, UB or crash. Add a clean
rejection guard right after the load: if
model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and throw a
clear error pointing the user at the CLI / internal eval_*_mtl
helpers. Wiring MTL through the public Engine API (extend
EngineOptions with language / cfg_weight / min_p / exaggeration,
branch synthesize() on variant) is left as a follow-up; this just
stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the
0cad44d merge commit said the 5x speech-tokens-per-BPE-token floor
(calibrated for English Turbo / GPT-2 BPE) should be gated to non-MTL
because MTL's Llama tokenizer has a ~1.7x ratio. The gating wasn't
actually in the code -- a clean stop-token termination on a short MTL
segment looked "implausible" and triggered up to 3 spurious retries.
`plausible = is_mtl || (int)generated.size() >= min_tokens;` restores
the intent. The 3x-repeated-token early-stop above still guards MTL's
catastrophic case. Measured on M4 Metal with the ES reference prompt +
jfk/gianni voice: T3 wall time drops from ~3.9 s (4 attempts) to
~0.93 s (1 attempt) -- ~4x speedup just from removing the wasted
retries. WAV md5 stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early
type-equality guard implies wq/wk/wv have identical sizes today, but
max over all three so a future shape divergence (e.g. an MTL variant
with non-square Q/K/V) can't silently truncate a per-layer copy via
undersized scratch. No behaviour change today; defensive only.

Validation (Apple M4, Metal, Release):
- cmake --build: clean, no warnings, all targets link.
- test-metal-ops: 14/14 PASS, 0 FAIL.
- End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy):
md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the pre-fix
baseline. T3 wall time ~3.9s -> ~0.9s (fix GustavoA1604#4).
- CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es): completes
cleanly, S3Gen ~12s for 12 CFM steps × 2 forward calls
(cond + uncond), produces valid 1.1s WAV.

Issues GustavoA1604#5 (redundant peek+open in load_model_gguf),
GustavoA1604#7 (/g deny-list breadth in requantize-gguf.py),
GustavoA1604#8 (forward-hook idiom in dump-t3-mtl-reference.py), and
#9 (CMake duplicate cli_main.cpp build) are tracked but intentionally
not folded in here -- the reviewer flagged them as cosmetic / trivial
/ fine.

Co-authored-by: Cursor <cursoragent@cursor.com>
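
The top-k fix in item 2 comes down to where std::nth_element is asked to partition. The block below is a simplified sketch, not the code in src/t3_mtl.cpp: top_k_filter is a hypothetical name, and it partitions a copy of the logit values rather than an index array. Partitioning at position k-1 with std::greater guarantees that element k-1 is exactly the k-th largest value, which is the property the previous begin+k partition did not give.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Keeps the k largest logits (plus any ties with the k-th largest) and masks
// everything else to -infinity. Simplified: works on a copy of the values.
void top_k_filter(std::vector<float> & logits, std::size_t k) {
    assert(k >= 1 && k <= logits.size());
    std::vector<float> scratch(logits);
    // After this call scratch[k-1] holds the k-th largest value; elements before it
    // are >= that value and elements after it are <= that value, in unspecified order.
    std::nth_element(scratch.begin(), scratch.begin() + (k - 1), scratch.end(),
                     std::greater<float>());
    const float cut = scratch[k - 1];
    for (float & x : logits) {
        if (x < cut) {
            x = -std::numeric_limits<float>::infinity();
        }
    }
}
```

Reading position k-1 after partitioning at begin+k would instead return an arbitrary member of the top-k set, so the cut could sit above legitimate candidates.
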
Vulkan-side optimization work for chatterbox.cpp on the multilingual_merged
base. Two ggml-vulkan patches + four host-side optimizations in
src/chatterbox_tts.cpp that benefit BOTH the Turbo (meanflow) and the multilingual (standard CFM with CFG) variants. All bit-exact on F32
across NVIDIA + AMD/RADV.

- patches/ggml-vulkan-pipeline-cache.patch: VkPipelineCache keyed by <vendorID>-<deviceID>-<driverVersion>. Disabled via empty GGML_VK_PIPELINE_CACHE_DIR.
- patches/ggml-vulkan-eager-cache-save.patch: pipeline cache written back after every ggml_vk_load_shaders compile batch.
- cfm_estimator_cache was local-scope in synthesize(); promoted to global with explicit destroy() lifetime tied to s3gen_model_cache_release.
- flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual) + speaker affine matrices were re-downloaded every synth.
- conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute.
- regress-tensor-compare.sh now runs end-to-end through G2/G3/G4/H1/H3/H4/H5.

Headline numbers — RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325, Turbo model, regress-tight aggregate, n=75 chunks

cfm_total ranges fully separated on n=120 total samples (base [38.3, 42.8] vs final [27.1, 30.1]): real signal, not noise.

Cold-start (round-1 patch)
Round-1 alone recovers 91 % of the cold→fully-warm gap with only
ggml's cache populated — the headline win for Mesa / Adreno / Mali
where there's no per-driver shader cache to fall back on.
Bit-exactness
454b4cc14538e8ef917930b110d1e504
4c83f367e6ca2b02fefbd480519ea3f6
9252253ee532cb7928639a0f644a25da
713fe5aed997002379a12383d3795584
a84623b784b5e47dc95f62229773e81b
694410826f1025e9888d1029a5cf2bc0

Multilingual model files (chatterbox-s3gen.gguf for the MTL variant, MTL T3 GGUFs) were not available locally. The optimizations apply at the host-side / cache-management layer and are model-agnostic by construction; they should continue to be bit-exact for multilingual too.

What this PR does
Part 1 — ggml-vulkan patches (no chatterbox-source dependency)
Two opt-in patches applied via scripts/setup-ggml.sh, completely inert when configuring without -DGGML_VULKAN=ON:

- patches/ggml-vulkan-pipeline-cache.patch (199 lines): persistent VkPipelineCache across processes, keyed by <vendorID>-<deviceID>-<driverVersion>. Resolved from $GGML_VK_PIPELINE_CACHE_DIR → $XDG_CACHE_HOME/ggml/vulkan → $HOME/.cache/ggml/vulkan. Disabled by setting the env var to the empty string (byte-identical to upstream).
- patches/ggml-vulkan-eager-cache-save.patch (104 lines): write back the pipeline-cache blob after every compiles.wait() batch in ggml_vk_load_shaders (crash-safety against SIGKILL/abort losing freshly compiled pipelines). Tracks pipeline_cache_last_size so warm-cache hits skip the disk write.

Part 2 — Persistent CFM estimator graph cache (the headline)
multilingual_merged's cfm_estimator_cache was local-scope in synthesize(), so every synth call paid the full graph rebuild cost (~5500-node CFM graph build + gallocr_reserve allocates the device-side buffer pool, ~10 ms wall on RTX 5090 with the 64 MB buf).

Refactored to follow the same explicit-destroy() global-lifetime pattern as the existing thread_local time_mlp_cache (which already documents the same Vulkan/Metal device-teardown ordering constraint).

Both cfm_estimator_forward (batch=1, Turbo) and cfm_estimator_forward_b2 (batch=2 CFG, multilingual) use the same cache object; the existing (cache.T != T) || (cache.b2 != current_b2) rebuild logic handles mode switches correctly.

Cache is destroyed in s3gen_model_cache_release BEFORE ggml_backend_free (Vulkan gallocr_free against a dangling vk_device would assert) and on s3gen_model_cache_get cache-miss (backend swap). Same constraint already documented for thread_local time_mlp_cache.

Part 3 — Time-embedding result memoisation
multilingual_merged has thread_local time_mlp_cache (graph cached) but no result cache. A two-layer result cache plugs in transparently: g_time_mlp_results, keyed by the uint32_t bitcast of t_val, and g_time_emb_results, keyed by (kt << 32) | kr (Turbo only).

Turbo (t_span = [0, 0.5, 1]): compute_time_mlp(0.5) is called twice per inference (as r in step 0, as t in step 1). After warm-up: 6 graph submissions / inference → 0.

Each compute_time_mlp graph has 3 dispatches (~18 µs GPU compute) but the wall-clock cost is ~700 µs due to fixed cmd-buffer + queue-submit + sync + tensor_get overhead; the per-graph fixed cost is 30× actual compute. Memoisation saves the full submit cost.

Caches cleared in g_cfm_estimator_cache_destroy alongside the graph cache (this also handles a future CHATTERBOX_F16_CFM opt-in mode flip: model reload → cache clear).

Float keys use bitcast → uint32_t so IEEE equality matches the literal const-folded values from t_span[i].

Part 4 — CPU mirror cache for large per-synth weight downloads
synthesize() reads three large model tensors via ggml_backend_tensor_get on every call:

- flow/input_embedding
- flow/spk_embed_affine/w
- flow/spk_embed_affine/b

On a GPU backend each is a real device→host transfer plus sync (~600-1000 µs / synth on RTX 5090 for input_embedding). These weights are CONSTANT for the model lifetime, so cache them.

Three call-site swaps in synthesize(). Cleared in g_cfm_estimator_cache_destroy because the ggml_tensor * keys belong to the soon-to-be-freed model context.

Part 5 — Three HiFT cont sites removed
Round-AUDIT-style cleanup applied to the HiFT decoder:

removed cont | consumer (accepts strided source) | note
conv_transpose_1d_f32 exit cont | ggml_add(x, reshape_2d(bias)) | pre_lookahead exit.
y_trim exit cont | ggml_clamp (element-wise) → output |
f0_predictor xp permute cont | ggml_mul_mat src1 |

Perf-neutral on RTX 5090 (HiFT-section CONT contribution is 0.13 % of HiFT runtime per the perf logger). Code-quality + future-proofing wins, same character as the upstream's earlier cont-removal work.
Part 6 — G2 dump-script gap closure (test-infra)
regress-tensor-compare.sh was previously aborting at stage G2 with "cannot open cfm_concat.npy". Four files added to scripts/dump-s3gen-reference.py:

- cfm_concat.npy: ConditionalDecoder.forward.
- cfm_h_conv.npy: block1.block[0] (CausalConv1d).
- cfm_h_ln.npy: block1.block[3] (Transpose back to (B, C, T) after LayerNorm).
- hift_s_stft.npy: hift._stft followed by cat([real, imag], dim=1).

Plus a one-line C++ fix in test_s3gen.cpp's stage_G2: add ggml_set_output(xc) so the gallocator preserves the diagnostic intermediate (was returning garbage because xc's slot got reused by downstream intermediates after the conv1d consumer completed).

Full pipeline now runs end-to-end through G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max ≤ 4.7e-5 everywhere else, final waveform max_abs = 8.20e-08.
Negative result documented
Tried adding last_mu_ptr / last_spks_ptr / last_cond_ptr tracking to cfm_estimator_cache to skip the redundant ggml_backend_tensor_set for mu/spks/cond on the 2nd CFM step within one synthesize() call (those inputs are constant across cfm_steps).

F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete.

Skip-upload only works for inputs referenced THROUGHOUT the graph (like pos_emb in encoder, which feeds every conformer block). CFM mu/spks/cond are referenced only at the start, so they're recyclable immediately.

Reverted. General rule for ggml's gallocator: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes. Detailed analysis in FINDINGS_ROUND_HIFT.md §2-bis.4.

Why a fresh squashed commit instead of rebase
The original optimization work was 8 commits on chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan (branched from upstream/main). git rebase upstream/multilingual_merged hit conflicts immediately at commit 2 (e3d5707, round-2 stage-graph caches) because multilingual_merged already added similar caches in different form (thread_local time_mlp_cache, cfm_estimator_cache with b2 flag, etc.). Subsequent commits would have stacked conflicts.

A clean linear port re-applies only the optimizations that still provide measurable value on top of the multilingual_merged base, with comments explaining the fit on the new code structure. Detailed analysis in FINDINGS_ROUND_MULTILINGUAL_PORT.md.

The original main-base branch is preserved at the backup-pre-multilingual-rebase tag for reference.

Deferred follow-ups (separate PRs)
- C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM env var): load_s3gen_gguf uses ggml_dup_tensor + ggml_backend_alloc_ctx_tensors (different from main); needs adapting our F16 conversion path. ~100 lines + new MD5 baselines (NVIDIA + AMD, F32 + F16).
- Round-4/6 QKV fusion: multilingual_merged uses zero-cont strided Q/K/V views (849507a). Composing with our batched matmul approach is non-trivial. Pick one approach + bench on Vulkan.
- HiFT decoder graph caching: run_hift_decode allocates ggml_gallocr_t + ggml_context * fresh on every call. Same persistent-cache pattern as round-HIFT could apply.

How was it tested