Chatterbox vulkan on multilingual by Zbig9000 · Pull Request #5 · GustavoA1604/chatterbox.cpp

Zbig9000 · 2026-05-04T18:34:50Z

Vulkan-side optimization work for chatterbox.cpp on the multilingual_merged
base. Two ggml-vulkan patches + four host-side optimizations in
src/chatterbox_tts.cpp that benefit BOTH the Turbo (meanflow) and the
multilingual (standard CFM with CFG) variants. All bit-exact on F32
across NVIDIA + AMD/RADV.

Change	Bit-exact F32 NV+AMD	Net RTX 5090 perf	Notes
`patches/ggml-vulkan-pipeline-cache.patch`	✓	−2.44 s cold-start	Persistent `VkPipelineCache` keyed by `<vendorID>-<deviceID>-<driverVersion>`. Disabled via empty `GGML_VK_PIPELINE_CACHE_DIR`.
`patches/ggml-vulkan-eager-cache-save.patch`	✓	crash-safety	Write back the pipeline cache after every `ggml_vk_load_shaders` compile batch.
Persistent CFM estimator graph cache	✓	~−10 ms / chunk	`cfm_estimator_cache` was local-scope in `synthesize()`; promoted to global with explicit `destroy()` lifetime tied to `s3gen_model_cache_release`.
Time-embedding result memoisation	✓	~−4 ms / inf	Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only). Eliminates 6 graph submissions / inf for Turbo, 9-19 for multilingual.
CPU mirror cache for large per-synth weight downloads	✓	~−1 ms / inf	`flow/input_embedding` (~13.4 MB Turbo / ~28 MB multilingual) + speaker affine matrices were re-downloaded every synth.
3 HiFT cont sites removed	✓	noise (code-quality)	`conv_transpose_1d_f32` exit, ISTFT `y_trim` exit, `f0_predictor` `xp` permute.
G2 dump-script gap closure	n/a (test-infra)	n/a	`regress-tensor-compare.sh` now runs end-to-end through G2/G3/G4/H1/H3/H4/H5.

Headline numbers — RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325, Turbo model, regress-tight aggregate, n=75 chunks

Metric	upstream/multilingual_merged	this PR	Δ
S3GEN_INFER	76.6 ms	65.4 ms	−11.2 ms (−14.6 %)
cfm_total	40.3 ms	28.7 ms	−11.6 ms (−28.8 %)
encoder	19.9 ms	20.7 ms	noise
hift_decode	10.9 ms	11.6 ms	noise

cfm_total ranges fully separated on n=120 total samples
(base [38.3, 42.8] vs final [27.1, 30.1]) — real signal, not noise.

Cold-start (round-1 patch)

Scenario (fresh process, RTX 5090)	T3 (ms)	S3Gen (ms)	Wall (ms)
Both caches cold (fresh machine / Mesa)	947	1741	2 688
ggml cache warm, NVIDIA cache cold	80	166	246
Both caches warm (steady state)	69	154	223

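Round-1 alone cuts fresh-process wall time by about 91 % (2 688 ms → 246 ms)
with only ggml's cache populated, which closes roughly 99 % of the
cold→fully-warm gap (2 442 ms of 2 465 ms). This is the headline win for
Mesa / Adreno / Mali, where there's no per-driver shader cache to fall back on.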

Bit-exactness

3 RTX 5090 F32 invariants PASS (round-1 single-shot, round-2 multi-synth identical, round-3 multi-synth varied):
- 454b4cc14538e8ef917930b110d1e504
- 4c83f367e6ca2b02fefbd480519ea3f6
- 9252253ee532cb7928639a0f644a25da
3 AMD/RADV F32 invariants PASS (locked AMD MD5s):
- 713fe5aed997002379a12383d3795584
- a84623b784b5e47dc95f62229773e81b
- 694410826f1025e9888d1029a5cf2bc0
F16 invariants are NOT verified — C1 (F16 CFM matmul weights opt-in env var) is deferred (see §"Deferred follow-ups" below).
Multilingual model bit-exactness was NOT verified because the multilingual model files (chatterbox-s3gen.gguf for the MTL variant, MTL T3 GGUFs) were not available locally. The optimizations apply at the host-side / cache-management layer and are model-agnostic by construction; they should continue to be bit-exact for multilingual too.

What this PR does

Part 1 — ggml-vulkan patches (no chatterbox-source dependency)

Two opt-in patches applied via scripts/setup-ggml.sh, completely inert when configuring without -DGGML_VULKAN=ON:

patches/ggml-vulkan-pipeline-cache.patch (199 lines): persistent VkPipelineCache across processes, keyed by <vendorID>-<deviceID>-<driverVersion>. Resolved from $GGML_VK_PIPELINE_CACHE_DIR → $XDG_CACHE_HOME/ggml/vulkan → $HOME/.cache/ggml/vulkan. Disabled by setting the env var to the empty string (byte-identical to upstream).
patches/ggml-vulkan-eager-cache-save.patch (104 lines): write back the pipeline-cache blob after every compiles.wait() batch in ggml_vk_load_shaders (crash-safety against SIGKILL/abort losing freshly compiled pipelines). Tracks pipeline_cache_last_size so warm-cache hits skip the disk write.
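
The lookup order above can be sketched as follows. This is a minimal illustration, not the patch itself: the helper names `resolve_pipeline_cache_dir` and `pipeline_cache_key` are hypothetical, the decimal formatting of the key is an assumption, and treating an empty `XDG_CACHE_HOME` or `HOME` as unset is also an assumption.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the cache directory in the order described above.
// An empty return value means "persistent cache disabled / no location found".
std::string resolve_pipeline_cache_dir() {
    const char * dir = std::getenv("GGML_VK_PIPELINE_CACHE_DIR");
    if (dir != nullptr) {
        // Set but empty disables the cache; otherwise the value wins outright.
        return std::string(dir);
    }
    const char * xdg = std::getenv("XDG_CACHE_HOME");
    if (xdg != nullptr && *xdg != '\0') {
        return std::string(xdg) + "/ggml/vulkan";
    }
    const char * home = std::getenv("HOME");
    if (home != nullptr && *home != '\0') {
        return std::string(home) + "/.cache/ggml/vulkan";
    }
    return std::string();
}

// Hypothetical helper: build the "<vendorID>-<deviceID>-<driverVersion>" key.
// Decimal formatting is an assumption; the patch may format the fields differently.
std::string pipeline_cache_key(uint32_t vendor_id, uint32_t device_id, uint32_t driver_version) {
    return std::to_string(vendor_id) + "-" + std::to_string(device_id) + "-" +
           std::to_string(driver_version);
}
```

Keying on the driver version means a driver upgrade simply starts a new cache file rather than feeding a stale blob to the new driver.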

Part 2 — Persistent CFM estimator graph cache (the headline)

multilingual_merged's cfm_estimator_cache was local-scope in synthesize(), so every synth call paid the full rebuild cost: building the ~5500-node CFM graph, plus a gallocr_reserve that allocates the device-side buffer pool (~10 ms wall on RTX 5090 with the 64 MB buffer).

Refactored to follow the same explicit-destroy() global-lifetime pattern as the existing thread_local time_mlp_cache (which already documents the same Vulkan/Metal device-teardown ordering constraint):

// before (multilingual_merged) — local-scope, every synth pays the rebuild
cfm_estimator_cache cfm_cache;

// after (this PR) — global with explicit destroy() in s3gen_model_cache_release
cfm_estimator_cache & cfm_cache = g_cfm_estimator_cache;

Both cfm_estimator_forward (batch=1, Turbo) and cfm_estimator_forward_b2 (batch=2 CFG, multilingual) use the same cache object — the existing (cache.T != T) || (cache.b2 != current_b2) rebuild logic handles mode switches correctly.

Cache is destroyed in s3gen_model_cache_release BEFORE ggml_backend_free (Vulkan gallocr_free against a dangling vk_device would assert) and on s3gen_model_cache_get cache-miss (backend swap). Same constraint already documented for thread_local time_mlp_cache.
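
The lifetime pattern can be illustrated with a toy model. Everything below is a hypothetical stand-in (`toy_graph_cache`, `toy_synthesize`, `toy_model_release`); it only mirrors the rebuild-on-key-change rule and the destroy-before-backend-free ordering described above, and contains no ggml calls.

```cpp
// Toy stand-in for the cached graph + allocator state (assumption: the real
// cache holds ggml objects; here only the rebuild key is modelled).
struct toy_graph_cache {
    int  T        = -1;     // sequence length the cached graph was built for
    bool b2       = false;  // true when built for the batch=2 CFG path
    bool built    = false;
    int  rebuilds = 0;      // instrumentation only, to make the behaviour visible

    // Rebuild only when nothing is cached or the (T, b2) key changed.
    void ensure(int new_T, bool new_b2) {
        if (!built || T != new_T || b2 != new_b2) {
            T = new_T;
            b2 = new_b2;
            built = true;
            ++rebuilds;
        }
    }

    // Explicit teardown: must run while the backend/device is still alive.
    void destroy() {
        built = false;
        T = -1;
        b2 = false;
    }
};

// Global lifetime: survives across synth calls instead of dying at scope exit.
toy_graph_cache g_toy_cache;

void toy_synthesize(int T, bool use_b2) {
    toy_graph_cache & cache = g_toy_cache;
    cache.ensure(T, use_b2);
    // ... run the cached graph here ...
}

void toy_model_release() {
    g_toy_cache.destroy();   // first: drop everything that references the device
    // ... only then would the backend be freed ...
}
```

With this shape, two consecutive calls with the same `(T, b2)` trigger one build, and a Turbo → CFG switch triggers exactly one more.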

Part 3 — Time-embedding result memoisation

multilingual_merged has a thread_local time_mlp_cache that caches the graph, but nothing caches the results. This PR adds a two-layer result cache that plugs in transparently:

static std::unordered_map<uint32_t, std::vector<float>> g_time_mlp_results;  // both variants
static std::unordered_map<uint64_t, std::vector<float>> g_time_emb_results;  // Turbo only

static std::vector<float> compute_time_mlp_cached(const model_ctx & m, float t_val);
static std::vector<float> compute_time_emb_cached(const model_ctx & m, float t_val, float r_val);

Turbo (meanflow, t_span = [0, 0.5, 1]): compute_time_mlp(0.5) is called twice per inference (as r in step 0, as t in step 1). After warm-up: 6 graph submissions / inference → 0.
Multilingual (cosine-scheduled, default 10 steps): 10 distinct t-values, all repeat across every subsequent synth. After warm-up: 9-19 graph submissions / inference → 0.

Each compute_time_mlp graph has 3 dispatches (~18 µs GPU compute), but the wall-clock cost is ~700 µs due to fixed cmd-buffer + queue-submit + sync + tensor_get overhead, so the fixed per-graph cost is more than 30× the actual compute (700 / 18 ≈ 39×). Memoisation saves the full submit cost.

Caches cleared in g_cfm_estimator_cache_destroy alongside the graph cache (this also handles a future CHATTERBOX_F16_CFM opt-in mode flip — model reload → cache clear).

Float keys are bitcast to uint32_t, so lookups compare exact bit patterns. This matches reliably because the t-values come from the same const-folded t_span[i] literals on every call.
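
A minimal sketch of the keying scheme, assuming a `memcpy`-based bitcast and the `(kt << 32) | kr` pair key mentioned in the commit message. The names `float_bits`, `pair_key`, `toy_time_mlp` and `toy_time_mlp_cached` are hypothetical, and `toy_time_mlp` merely stands in for the real graph submission.

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Bitcast float -> uint32_t without violating aliasing rules.
uint32_t float_bits(float v) {
    uint32_t u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}

// Pair key for (t, r): high 32 bits hold t, low 32 bits hold r.
uint64_t pair_key(float t, float r) {
    return (static_cast<uint64_t>(float_bits(t)) << 32) | float_bits(r);
}

int g_toy_submissions = 0;  // counts how often the "expensive" path runs

// Hypothetical stand-in for one graph submission.
std::vector<float> toy_time_mlp(float t) {
    ++g_toy_submissions;
    return { t, t * t };
}

std::unordered_map<uint32_t, std::vector<float>> g_toy_results;

// Memoised wrapper: computes once per distinct bit pattern of t.
const std::vector<float> & toy_time_mlp_cached(float t) {
    const uint32_t key = float_bits(t);
    auto it = g_toy_results.find(key);
    if (it == g_toy_results.end()) {
        it = g_toy_results.emplace(key, toy_time_mlp(t)).first;
    }
    return it->second;
}
```

One consequence of bit-pattern keys is that `0.0f` and `-0.0f` map to different entries even though they compare equal as floats; that is harmless here as long as the schedule always produces the same literals.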

Part 4 — CPU mirror cache for large per-synth weight downloads

synthesize() reads three large model tensors via ggml_backend_tensor_get on every call:

Tensor	Turbo size	Multilingual size
`flow/input_embedding`	13.4 MB	~28 MB
`flow/spk_embed_affine/w`	60 KB	60 KB
`flow/spk_embed_affine/b`	320 B	320 B

On a GPU backend each is a real device→host transfer plus sync (~600-1000 µs / synth on RTX 5090 for input_embedding). These weights are CONSTANT for the model lifetime — cache them.

static std::unordered_map<const ggml_tensor *, std::vector<float>> g_weight_cpu_mirror;

static const float * cached_cpu_weights_f32(const ggml_tensor * t);

Three call-site swaps in synthesize(). Cleared in g_cfm_estimator_cache_destroy because the ggml_tensor * keys belong to the soon-to-be-freed model context.
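
The mirror can be pictured with a toy tensor type. This is a sketch under stated assumptions: `toy_tensor`, `toy_download`, `toy_cached_cpu_weights` and `toy_mirror_clear` are hypothetical, and `toy_download` stands in for the device→host read.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device-resident weight tensor.
struct toy_tensor {
    std::vector<float> device_data;
};

int g_toy_downloads = 0;  // counts simulated device->host transfers

// Stand-in for the real device->host copy.
void toy_download(const toy_tensor * t, float * dst) {
    ++g_toy_downloads;
    for (std::size_t i = 0; i < t->device_data.size(); ++i) {
        dst[i] = t->device_data[i];
    }
}

// Host-side mirror keyed by tensor pointer.
std::unordered_map<const toy_tensor *, std::vector<float>> g_toy_mirror;

// Returns a host pointer to the weights, downloading only on first use.
const float * toy_cached_cpu_weights(const toy_tensor * t) {
    auto it = g_toy_mirror.find(t);
    if (it == g_toy_mirror.end()) {
        std::vector<float> host(t->device_data.size());
        toy_download(t, host.data());
        it = g_toy_mirror.emplace(t, std::move(host)).first;
    }
    return it->second.data();
}

// Must run before the tensors are freed, because the keys are raw pointers.
void toy_mirror_clear() {
    g_toy_mirror.clear();
}
```

Pointer keys are cheap but only valid while the model context lives: a freed tensor's address could be reused by a new model, which is why the clear has to happen together with the graph-cache teardown.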

Part 5 — Three HiFT cont sites removed

Round-AUDIT-style cleanup applied to the HiFT decoder:

Site	Calls/inf	Consumer	Why safe
`conv_transpose_1d_f32` exit cont	3	`ggml_add(x, reshape_2d(bias))`	Same strided-tolerant pattern as round-AUDIT's `pre_lookahead` exit.
ISTFT `y_trim` exit cont	1	`ggml_clamp` (element-wise) → output	Clamp's output is fresh contiguous; tensor_get reads from contig.
`f0_predictor` `xp` permute cont	1	`ggml_mul_mat` src1	Vulkan/Metal/CUDA mul_mat shaders accept strided src1 for f32 matmul.

Perf-neutral on RTX 5090 (HiFT-section CONT contribution is 0.13 % of HiFT runtime per the perf logger). Code-quality + future-proofing wins, same character as the upstream's earlier cont-removal work.

Part 6 — G2 dump-script gap closure (test-infra)

regress-tensor-compare.sh previously aborted at stage G2 with "cannot open cfm_concat.npy". Four files were added to scripts/dump-s3gen-reference.py:

File	Stage	Why missing
`cfm_concat.npy`	G2	Concat happens inside `ConditionalDecoder.forward`.
`cfm_h_conv.npy`	G2	Output of `block1.block[0]` (CausalConv1d).
`cfm_h_ln.npy`	G2	Output of `block1.block[3]` (Transpose back to (B, C, T) after LayerNorm).
`hift_s_stft.npy`	H3, H4	Output of `hift._stft` followed by `cat([real, imag], dim=1)`.

Plus a one-line C++ fix in test_s3gen.cpp's stage_G2: add ggml_set_output(xc) so the gallocator preserves the diagnostic intermediate (was returning garbage because xc's slot got reused by downstream intermediates after the conv1d consumer completed).

Full pipeline now runs end-to-end through G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max ≤ 4.7e-5 everywhere else, final waveform max_abs = 8.20e-08.

Negative result documented

Tried adding last_mu_ptr / last_spks_ptr / last_cond_ptr tracking to cfm_estimator_cache to skip the redundant ggml_backend_tensor_set for mu/spks/cond on the 2nd CFM step within one synthesize() call (those inputs are constant across cfm_steps).

F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete. In CFM:

xc = ggml_concat(x_in, mu_in, spks_bc, cond_in);  // <-- last use of mu/spks/cond
// rest of the graph operates on `xc`; mu/spks/cond's slots are now free for
// the gallocator to reuse for downstream intermediates.

Skip-upload only works for inputs referenced THROUGHOUT the graph (like pos_emb in encoder, which feeds every conformer block). CFM mu/spks/cond are referenced only at the start, so they're recyclable immediately.

Reverted. General rule for ggml's gallocator: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes. Detailed analysis in FINDINGS_ROUND_HIFT.md §2-bis.4.
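
The failure mode can be reproduced in a toy model with two buffer slots. This is an illustration only, not ggml's allocator: `toy_graph` and `toy_run_steps` are hypothetical, and the slot assignment is hard-coded to mimic an input whose slot is recycled right after its last consumer.

```cpp
// Toy graph with two buffer slots. slot[0] holds the input `mu` at the start
// of a run, and is reused for an intermediate once mu's last consumer is done.
struct toy_graph {
    float slot[2] = {0.0f, 0.0f};

    void upload_mu(float mu) { slot[0] = mu; }

    float compute() {
        slot[1] = slot[0] * 2.0f;  // xc = f(mu): last use of mu
        slot[0] = slot[1] + 1.0f;  // intermediate h overwrites mu's slot
        return slot[0];
    }
};

// Runs `steps` evaluations with a constant mu. When skip_redundant_upload is
// true, mu is uploaded only before the first step (the reverted optimisation).
float toy_run_steps(float mu, int steps, bool skip_redundant_upload) {
    toy_graph g;
    float out = 0.0f;
    for (int i = 0; i < steps; ++i) {
        if (i == 0 || !skip_redundant_upload) {
            g.upload_mu(mu);
        }
        out = g.compute();
    }
    return out;
}
```

With `mu = 3` and two steps, re-uploading every step yields 7 both times, while skipping the second upload makes step 2 read the stale intermediate (7) instead of `mu` and return 15.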

Why a fresh squashed commit instead of rebase

The original optimization work was 8 commits on chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan (branched from upstream/main). git rebase upstream/multilingual_merged hit conflicts immediately at commit 2 (e3d5707, round-2 stage-graph caches) because multilingual_merged already added similar caches in different form (thread_local time_mlp_cache, cfm_estimator_cache with b2 flag, etc.). Subsequent commits would have stacked conflicts.

A clean linear port re-applies only the optimizations that still provide measurable value on top of the multilingual_merged base, with comments explaining the fit on the new code structure. Detailed analysis in FINDINGS_ROUND_MULTILINGUAL_PORT.md.

The original main-base branch is preserved at the backup-pre-multilingual-rebase tag for reference.

Deferred follow-ups (separate PRs)

Candidate	RTX 5090 estimated win	Why deferred
C1 — F16 CFM matmul weights (opt-in `CHATTERBOX_F16_CFM` env var)	~125 MB device memory + bandwidth-bound mobile win	multilingual_merged's `load_s3gen_gguf` uses `ggml_dup_tensor + ggml_backend_alloc_ctx_tensors` (different from main); needs adapting our F16 conversion path. ~100 lines + new MD5 baselines (NVIDIA + AMD, F32 + F16).
Round-4 / 6 Q/K/V batched matmul fusion	~1.3 ms RTX 5090 + larger on bandwidth-starved targets	multilingual_merged uses zero-cont strided Q/K/V views (their `849507a`). Composing with our batched matmul approach is non-trivial. Pick one approach + bench on Vulkan.
HiFT decoder graph caching	~5-10 ms / chunk on multilingual variant	multilingual_merged's `run_hift_decode` allocates `ggml_gallocr_t + ggml_context *` fresh on every call. Same persistent-cache pattern as round-HIFT could apply.
Multilingual model regression	n/a — verification	Multilingual model files were not available locally; only Turbo F32 invariants verified bit-exact on multilingual_merged base. Optimizations are model-agnostic by construction; explicit verification is a follow-up.
Mobile validation (Adreno / Mali / Apple)	n/a — hardware-bound	Biggest remaining evidence gap. AMD/RADV proxy refuted the mobile-bandwidth projection on rounds 2/3/5/6/C1 of the original work; real mobile runs would either confirm or force revision.

How was it tested

cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp

# Apply patches (Metal + OpenCL + 2 Vulkan)
bash scripts/setup-ggml.sh

# Build
cmake -S . -B build-mtl -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-mtl -j$(nproc) --target tts-cli test-s3gen

# 1. RTX 5090 — F32 invariants
bash ../bench-logs-vk-c1/regress-c1.sh build-mtl 1
# Expected: round-1/2/3 F32 invariants PASS (3/3).
# F16 invariants will FAIL — C1 not in this PR.

# 2. AMD/RADV — F32 invariants
VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
    bash ../bench-logs-vk-amd/regress-amd.sh build-mtl 1
# Expected: AMD F32 invariants PASS (3/3).

# 3. Aggregate perf — RTX 5090, 5 iters × 16 chunks = 80 chunks per build
bash ../bench-logs-vk-round3/regress-tight.sh build-mtl mtl-final 5
# Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
# vs upstream baseline ~77 ms / ~40 ms.

# 4. Tensor-level Python ↔ C++ stage compare (test-infra unblocked by G2 fix)
bash ../bench-logs-vk-c1/regress-tensor-compare.sh
# Expected: end-to-end through G2/G3/G4/H1/H3/H4/H5; max rel err ≤ 4.7e-5
# (except STFT 7.9e-3 — PyTorch FFT vs DFT-via-conv1d).

# 5. Cold-start measurement (round-1 ggml-vulkan patch)
rm -rf ~/.cache/ggml/vulkan ~/.cache/nvidia
./build-mtl/chatterbox …  # first run: ~2.7 s cold
./build-mtl/chatterbox …  # second run: ~250 ms (ggml cache warm)

Ports the Vulkan-side optimizations from chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan (branched from upstream/main) onto the multilingual_merged base. Two ggml-vulkan patches + chatterbox-source host-side optimizations that benefit BOTH the Turbo (meanflow) and the multilingual (standard CFM with CFG) variants. No GGUF format change.

Headline (RTX 5090, regress-tight aggregate, n=75 chunks, Turbo model since multilingual model files weren't available locally):

metric | mtl_merged base | + this PR | Δ
S3GEN_INFER | 76.6 ms | 65.4 ms | -11.2 ms (-14.6 %)
cfm_total | 40.3 ms | 28.7 ms | -11.6 ms (-28.8 %)
encoder | 19.9 ms | 20.7 ms | noise
hift_decode | 10.9 ms | 11.6 ms | noise

cfm_total ranges fully separated on n=120 samples (base [38.3, 42.8] vs final [27.1, 30.1]).

All F32 invariants (round-1 single-shot, round-2 multi-synth ident, round-3 multi-synth varied) stay bit-exact; F16 invariant requires the C1 follow-up (opt-in CHATTERBOX_F16_CFM env var, not in this commit).

ggml-vulkan patches (apply via setup-ggml.sh, inert without GGML_VULKAN=ON):

- patches/ggml-vulkan-pipeline-cache.patch (+199 lines)
  Persistent VkPipelineCache across processes, keyed by <vendorID>-<deviceID>-<driverVersion>. Cuts fresh-process wall time by ~91 % on the first warm run. Disabled by setting GGML_VK_PIPELINE_CACHE_DIR="". Entire bench: 2.69 s -> 0.25 s fresh-process wall on RTX 5090 with cache populated.
- patches/ggml-vulkan-eager-cache-save.patch (+104 lines)
  Write back the pipeline cache after every ggml_vk_load_shaders compile batch (crash-safety against SIGKILL/abort losing freshly compiled pipelines). Stacks on the first patch.

Chatterbox-source host-side optimizations (src/chatterbox_tts.cpp, +~250 lines):

1. Persistent CFM estimator graph cache (~ -10 ms / chunk). cfm_estimator_cache was local-scope in s3gen_synthesize_to_wav() -- every synth call paid the full graph rebuild cost. Refactored to follow the same explicit-destroy() global-lifetime pattern as the existing thread_local time_mlp_cache. Both the batch=1 (Turbo) and batch=2 (multilingual CFG) paths reuse the same cache; the cache.b2 flag triggers a rebuild when mode changes. Cache cleared in s3gen_model_cache_release BEFORE the backend is freed (Vulkan/Metal device-teardown ordering). Also cleared on s3gen_model_cache_get cache-miss (backend swap).

2. Time-embedding result memoisation (~ -4 ms / inf). Both Turbo (t_span = [0, 0.5, 1]) and multilingual (cosine-scheduled, default 10 steps) produce the same set of t-values across all subsequent synth calls. Added two-layer cache:
   - g_time_mlp_results: keyed by uint32_t bitcast of t_val
   - g_time_emb_results: keyed by uint64_t = (kt << 32) | kr (Turbo only; multilingual skips the mixer)
   compute_time_mlp_cached + compute_time_emb_cached wrappers. 6 graph submissions / inference -> 0 after first inference for Turbo; 9-19 -> 0 for multilingual (10-step).

3. CPU mirror cache for large per-synth weight downloads (~ -1 ms / inf). flow/input_embedding (~13.4 MB Turbo / ~28 MB MTL) + flow/spk_embed_affine/{w,b} were re-downloaded GPU->CPU on every synth call. New cached_cpu_weights_f32(t) helper + g_weight_cpu_mirror map (keyed by ggml_tensor *).

4. Three HiFT cont sites removed (perf-neutral, code quality). conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute -- all bit-exact-preserving (consumers accept strided sources: ggml_add for bias, ggml_clamp element-wise, ggml_mul_mat src1 for f32 matmul).

Test infrastructure (scripts/dump-s3gen-reference.py +65 lines, src/test_s3gen.cpp +6 lines):

- G2 dump-script gap closure: cfm_concat / cfm_h_conv / cfm_h_ln / hift_s_stft .npy files now produced. Plus ggml_set_output(xc) in stage_G2 so the gallocator preserves the diagnostic intermediate (was returning garbage because xc's slot got reused by downstream intermediates after the conv1d consumer completed).
- regress-tensor-compare.sh now runs end-to-end through G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max <= 4.7e-5 everywhere else; final waveform max_abs = 8.20e-08.

Negative result documented (inline comments + FINDINGS doc): tried skip-upload of mu/spks/cond across cfm_steps within one synthesize call. Broke F32 single-shot. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete. Skip-upload only works for inputs referenced THROUGHOUT the graph (encoder pos_emb pattern works, CFM mu/spks/cond pattern doesn't).

Deferred follow-ups (in OPTIMIZATION_PLAN_NEXT.md):

- C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM env var). Saves ~125 MB device memory + helps bandwidth-bound mobile. Needs new MD5 baselines.
- Round-4/6 QKV fusion: multilingual_merged uses zero-cont strided Q/K/V views (Metal-tuned). Our Vulkan fused mul_mat would need careful integration to compose with that approach.
- HiFT decoder graph caching: multilingual_merged HiFT rebuilds every chunk (no g_hift_cache equivalent). Same persistent-cache pattern as round-HIFT could apply.
- Multilingual model file regression: multilingual model not available locally; only Turbo F32 invariants verified bit-exact on multilingual_merged base.

Files:

src/chatterbox_tts.cpp +252 / -19
src/test_s3gen.cpp +6
scripts/dump-s3gen-reference.py +65
scripts/setup-ggml.sh +20 / -8
patches/ggml-vulkan-pipeline-cache.patch +199 (NEW)
patches/ggml-vulkan-eager-cache-save.patch +104 (NEW)
patches/README.md +13 / -8
CHANGELOG.md +603 (NEW)

Co-authored-by: Cursor <cursoragent@cursor.com>

…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip after the origin/main merge. Three are real bugs (CFG, top_k, engine crash on MTL GGUFs); one is a perf regression with audible behaviour on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly computes (1+cfg)*cond - cfg*uncond, but the else branch only computed the conditional pass and silently dropped CFG on every non-Metal backend. Restores the §3.19 (3f0a8da) behaviour: when !meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU backend where the b2 path was disabled), run cond + uncond back-to-back on the same B=1 graph (cfm_estimator_cache key (T, b2=false) reuses the cached graph across both calls) and combine via the standard CFG mix. Smoke-tested on CPU (--n-gpu-layers 0): runs cleanly, S3Gen wall-clock doubles vs meanflow as expected (12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after nth_element(begin, begin+k, end, greater) the (k+1)-th largest sits at idx[k] and positions [0, k) hold the top-k UNORDERED. The previous code took cut = l[idx[k-1]] which is some arbitrary top-k element (often not the smallest), making cut too large and the `x < cut` filter then erased legitimate top-k logits. Fix: partition to begin+(k-1) so idx[k-1] is the k-th largest exactly. Mostly masked by the default top_k=1000 vs an 8194-vocab where the threshold falls into the noise floor; the bug bites at small top_k (e.g. greedy --top-k 1 where the wrong cut could pessimise tie handling). The Turbo sample_next_token_ex in src/main.cpp uses a different (correct) approach via tmp[k] + per-element rescan for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL GGUFs into load_model_gguf_mtl (populates layers_mtl, leaves layers empty), but synthesize() unconditionally calls eval_prompt -> build_prompt_graph -> build_transformer_core, which iterates model.layers[il] -- empty std::vector, UB or crash. Add a clean rejection guard right after the load: if model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and throw a clear error pointing the user at the CLI / internal eval_*_mtl helpers. Wiring MTL through the public Engine API (extend EngineOptions with language / cfg_weight / min_p / exaggeration, branch synthesize() on variant) is left as a follow-up; this just stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the 0cad44d merge commit said the 5x speech-tokens-per-BPE-token floor (calibrated for English Turbo / GPT-2 BPE) should be gated to non-MTL because MTL's Llama tokenizer has a ~1.7x ratio. The gating wasn't actually in the code -- a clean stop-token termination on a short MTL segment looked "implausible" and triggered up to 3 spurious retries. `plausible = is_mtl || (int)generated.size() >= min_tokens;` restores the intent. The 3x-repeated-token early-stop above still guards MTL's catastrophic case. Measured on M4 Metal with the ES reference prompt + jfk/gianni voice: T3 wall time drops from ~3.9 s (4 attempts) to ~0.93 s (1 attempt) -- ~4x speedup just from removing the wasted retries. WAV md5 stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early type-equality guard implies wq/wk/wv have identical sizes today, but max over all three so a future shape divergence (e.g. an MTL variant with non-square Q/K/V) can't silently truncate a per-layer copy via undersized scratch. No behaviour change today; defensive only.

Validation (Apple M4, Metal, Release):

- cmake --build: clean, no warnings, all targets link.
- test-metal-ops: 14/14 PASS, 0 FAIL.
- End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy): md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the pre-fix baseline. T3 wall time ~3.9s -> ~0.9s (fix GustavoA1604#4).
- CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es): completes cleanly, S3Gen ~12s for 12 CFM steps × 2 forward calls (cond + uncond), produces valid 1.1s WAV.

Issues GustavoA1604#5 (redundant peek+open in load_model_gguf), GustavoA1604#7 (/g deny-list breadth in requantize-gguf.py), GustavoA1604#8 (forward-hook idiom in dump-t3-mtl-reference.py), and #9 (CMake duplicate cli_main.cpp build) are tracked but intentionally not folded in here -- the reviewer flagged them as cosmetic / trivial / fine.

Co-authored-by: Cursor <cursoragent@cursor.com>
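
The top-k fix in item 2 can be sketched as below. This is a minimal illustration of partitioning at `begin + (k - 1)`, not the code in src/t3_mtl.cpp: the function name `toy_top_k_filter` is hypothetical, and returning early when `k` is out of range is an assumption.

```cpp
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

// Keep the k largest logits (plus any ties with the k-th largest) and mask
// the rest to -inf. Hypothetical helper for illustration only.
void toy_top_k_filter(std::vector<float> & logits, int k) {
    const int n = static_cast<int>(logits.size());
    if (k <= 0 || k >= n) {
        return;  // assumption: nothing to filter in these cases
    }

    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    // After this call idx[k-1] is exactly the index of the k-th largest logit;
    // idx[0..k-2] hold larger-or-equal logits in unspecified order.
    std::nth_element(idx.begin(), idx.begin() + (k - 1), idx.end(),
                     [&](int a, int b) { return logits[a] > logits[b]; });

    const float cut = logits[idx[k - 1]];
    for (float & x : logits) {
        if (x < cut) {
            x = -std::numeric_limits<float>::infinity();
        }
    }
}
```

Partitioning at `begin + k` instead would only guarantee the element at `idx[k]`; `idx[k-1]` would then be an arbitrary member of the top-k set, which is the bug the commit describes.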

Zbig9000 mentioned this pull request May 5, 2026

QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU #6

Merged

Zbig9000 wants to merge 1 commit into GustavoA1604:main from Zbig9000:chatterbox-Vulkan-on-multilingual
