QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU by Zbig9000 · Pull Request #6 · GustavoA1604/chatterbox.cpp

Zbig9000 · 2026-05-05T12:22:05Z

CPU-side optimisation pass for chatterbox.cpp. Five layered passes that close the per-synth host-CPU envelope outside the heavy matmuls (which §3.20 already drove down via Q4_0/Q8_0 weight quantisation). All bit-exact-preserving, all model-agnostic by construction.

Round	PROGRESS.md	Theme	What it caches / changes
1	§3.32	CFM-side per-synth overhead	`g_time_mlp_results` (10 graph submissions / synth → 0 on multilingual cosine schedule), `g_time_emb_results` (Turbo only), persistent `g_cfm_estimator_cache` (graph rebuild → 0 across synths), `g_weight_cpu_mirror` for `flow/input_embedding` (~13–28 MB) + `flow/spk_embed_affine/{w,b}`.
2	§3.33	Encoder + HiFT + F0 graphs and their CPU scaffolding	Persistent encoder / HiFT decoder / F0-predictor graph caches (keyed on `T` or `pack(T_mel, T_stft)`), plus pure-compute scaffolding helpers: `cached_pos_emb` (the dominant scaffolding cost — `T × D × ~5` trig ops, fired twice per encoder run), `cached_inv_alpha` (~72 `tensor_get + per-element 1/x` calls per HiFT call), `cached_hann_window`, `cached_istft_kernel`, `cached_window_sum`.
3	§3.34	Multilingual verification + micro-fusion	Direct multilingual-GGUF testing of every cache invariant from §3.32 + §3.33 (99 / 99 checks at round 3, 113 / 113 on the final branch, on the converted-from-source `chatterbox-s3gen-mtl-q4_0.gguf`); auto-detection of variant (Turbo vs multilingual) inside the test harness; fused CFG-combine + Euler step in the multilingual non-meanflow CFG path (saves one pass over `dxdt` per step; gated on `!debug_mode && !dump_mel_path` so the slow-path debug hooks still see the post-combine vector).
4	§3.35	T3 step-graph cache (opt-in, server-mode)	Per-`(n_past, is_uncond)` graph cache for `build_step_graph_mtl`. Multilingual fires this 2× per token; a 136-token Spanish synth previously rebuilt 272 graphs at ~3 ms each (~800 ms / synth of pure host-CPU work). Opt-in via `CHATTERBOX_T3_STEP_CACHE=1` because single-utterance workloads see every step as a unique `n_past` (cache fills but nothing is re-used; bookkeeping costs ~10 % T3 regression). Server-mode wins: ~12 % T3 wall reduction on synth #2+ in the same process (~256 ms / synth on multilingual at the default 256-entry LRU cap). Lifecycle wired into `chatterbox_cli.cpp::free_t3` (both synthesis + streaming paths) and `chatterbox_engine.cpp::Impl::free_model` BEFORE `ggml_backend_free`, plus an atexit fallback.
5	§3.36	STFT graph + analysis-kernel caches	`g_stft_graph_cache` (keyed on `T_src = T_mel × 480`) and `g_stft_kernel_cache` (keyed on `n_fft`). `run_stft` previously allocated a fresh 4 MB context buffer + ggml_init + graph build + `ggml_gallocr_t` + backend buffer every synth; the cache eliminates that whole cycle. Closes the only remaining round-2 deferred-followup. Streaming-shape invalidation handled by the same `(cache.key != T_src)` rebuild rule as the other graph caches.

Headline numbers

Multilingual end-to-end CPU (this PR, three runs on the same Spanish prompt)

./build-cpu/tts-cli with the converted multilingual GGUFs, CFG enabled (cfg_weight=0.5, 8 ggml threads, seed 42, temp 0 --top-k 1), Linux 6.8 / x86_64 / 32-thread host:

Run	T3	S3Gen	Audio	RTF
1	2140 ms	5931 ms	5560 ms	1.45
2	2193 ms	5931 ms	5560 ms	1.46
3	2146 ms	5789 ms	5560 ms	1.43
avg	2160	5884	5560	1.45

136 speech tokens generated (CFG cond + uncond × 136 = 272 T3 step graphs).

Turbo, single-utterance bit-exact harness (`./build-cpu/test-cpu-caches`)

Metric	Cold caches	Warm caches	Δ
`S3GEN_INFER_MS` (synth #1 vs #2)	794 ms	619 ms	−175 ms (−22 %)

Bit-exact across cold (synth #1), warm (synth #2), and post-s3gen_unload (synth #3) WAV outputs — every diff sample = 0.

Streaming-mode wins (`tts-cli --stream-chunk-tokens 25`, 3-sentence prompt → 21 chunks)

Chunk	round-1 only	round-1 + round-2	Δ
1	980 ms	545 ms	−44 %
2	1045 ms	665 ms	−36 %
3	1155 ms	725 ms	−37 %
11	1810 ms	1253 ms	−31 %
21	2797 ms	2151 ms	−23 %
total wall	~48 s	~35 s	−27 %

Per-chunk savings shrink with chunk index because each chunk has a new T (the encoder input grows with the running prefix), so the encoder / HiFT / F0 / STFT graph caches rebuild on every chunk. The result + scaffolding caches (pos_emb, inv_alpha, istft_kernel, hann_window, window_sum, stft_kernel, plus round-1's time_mlp_results etc.) stay warm across every chunk.

Multilingual verification (round 3 — §3.34)

The multilingual GGUFs were converted from the public ResembleAI/chatterbox HuggingFace repo using the existing scripts/convert-{t3-mtl,s3gen}-to-gguf.py tooling (Q4_0 to match the §3.20 baseline; ~3.2 GB of source files, ~1.1 GB of resulting GGUFs). Every cache invariant from §3.32 / §3.33 / §3.36 runs against the actual multilingual model:

Variant	Test command	Result
Turbo	`./test-cpu-caches models/chatterbox-s3gen-turbo.gguf`	94 / 94 ✓
Multilingual	`./test-cpu-caches models/chatterbox-s3gen-mtl-q4_0.gguf`	113 / 113 ✓

The 19 extra checks on multilingual are round-3 cosine-schedule assertions: every entry of t_span = [1 − cos(i/10 · π/2)] for i in 0..9 must land in g_time_mlp_results after the first synth, each cached vector must be (1024,), and the variant auto-detection (time_mlp == 10 ∧ time_emb == 0) must classify the model correctly.
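The distinctness requirement behind those assertions can be illustrated with a minimal sketch. It assumes the schedule is evaluated in double precision and rounded to float, and that cache keys are the raw float bit patterns; `cosine_t`, `float_bits`, and `schedule_keys` are hypothetical names, not the functions in `chatterbox_tts.cpp`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <set>

// Cosine-warped timestep i of n: t = 1 - cos(i/n * pi/2), as stated in the text.
// Assumption: computed in double, then rounded to float.
inline float cosine_t(int i, int n) {
    assert(n > 0 && i >= 0 && i <= n);
    const double half_pi = 1.5707963267948966;
    return static_cast<float>(1.0 - std::cos(static_cast<double>(i) / n * half_pi));
}

// Raw bit pattern of a float, used here as an exact cache key.
inline std::uint32_t float_bits(float v) {
    std::uint32_t u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}

// Distinct keys produced by the first n steps of the schedule.
inline std::set<std::uint32_t> schedule_keys(int n) {
    std::set<std::uint32_t> keys;
    for (int i = 0; i < n; ++i) keys.insert(float_bits(cosine_t(i, n)));
    return keys;
}
```

With `n = 10` the set holds ten keys, which is the population the multilingual branch of the harness expects to find in `g_time_mlp_results` after the first synth.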

Multilingual single-utterance synth-twice on the multilingual S3Gen GGUF (24-token harness):

Synth	`S3GEN_INFER_MS`	What's warm
#1 (cold caches)	3868 ms	nothing
#2 (warm caches)	3797 ms	every round-1..5 cache
#3 (post-`s3gen_unload`)	3698 ms	reloads from cold

WAV output is byte-for-byte identical between synth #1, synth #2, and post-unload synth #3: every sample diff = 0. The relative per-synth delta is small because multilingual CFM compute is ~6× larger in absolute terms than Turbo's, so the constant per-synth host overhead is a smaller fraction of total wall time.

Why these specific levers

§3.20 (already shipped) drove down the compute-bound bulk of CPU multilingual wall time via Q4_0/Q8_0 quantisation of the CFM / Conformer / T3 linears. What's left after §3.20 is fixed per-synth host overhead that quantisation doesn't help:

graph build + gallocr_reserve for every per-pipeline graph (CFM, encoder, HiFT, F0 predictor, STFT) — none of which were cached across synth calls in multilingual_merged;
repeated compute_time_mlp graph submissions (10 / synth on multilingual's cosine schedule, with a constant set of t-values across every synth);
per-synth ggml_backend_tensor_get of flow/input_embedding (~28 MB on multilingual) and flow/spk_embed_affine/{w,b};
host-CPU rebuilds of compute_pos_emb (T × D × ~5 trig ops, fired twice per encoder run for T and 2T; on multilingual at T=350+, D=512 that's a real wedge of per-synth host time);
host-CPU rebuilds of HiFT scaffolding (hann window, istft kernel, window_sum) — small per-call but pure waste across synths because n_fft=16 and hop=4 are constants;
invert_alpha_cpu fired ~72× per HiFT call (12 ResBlocks × 6 alpha tensors), each doing a tensor_get + per-element reciprocal;
build_step_graph_mtl rebuilt every CFG token-decode step (272 graphs × ~3 ms each ≈ 800 ms / synth pure host-CPU work);
run_stft re-allocated a fresh 4 MB context + gallocator + backend buffer + rebuilt the conv1d graph every synth.

This PR closes every one of those gaps with the same teardown discipline as the existing thread_local time_mlp_cache: cleared in s3gen_release_synth_caches before ggml_backend_free so the gallocators in the graph caches release against a still-valid backend (otherwise Vulkan / Metal / CUDA backend dylibs would hit their resource-leak asserts on process exit). The T3 cache mirrors the discipline in detail::t3_release_caches() wired into chatterbox_cli.cpp and chatterbox_engine.cpp.

What this PR does NOT do — and why

No B=2 batched CFM on CPU. §3.21 measured +11 % CPU wall on M4 when batching cond+uncond into a single forward (extra permute+cont at every attention block dominates the per-dispatch overhead, which is already negligible on ggml-cpu). The existing use_b2 = !ggml_backend_is_cpu(...) gate stays.
No F16 CFM linears on CPU. §3.8 attempt 7 already measured this as a regression on CPU (~10 % slower; F16→F32 upconvert inside mul_mat isn't free against AVX-512 F32 kernels). This PR keeps F32.
No conv1d-arg-order refactor. §3.20 backlog item 4 (HiFT Q4_0/Q8_0 quantisation, ~10 % additional CPU win) is independent of caching and out of scope here.
No heterogeneous-core thread default. §3.20 backlog item 5; hardware-bound, orthogonal to graph caching.
No always-on T3 step-graph cache. Pre-allocating cache entries for every possible n_past would either waste ~10 GB of metadata arenas or pay the same per-step bookkeeping cost on synth #1. The opt-in env-var keeps single-utterance CLI users at zero cost while letting server-mode operators flip it on.
No ggml-cpu Q4_0 kernel work. The remaining ~95 % of multilingual CPU wall is real ggml-cpu Q4_0 matmul compute. Fixing that lives in ggml/src/ggml-cpu/, not chatterbox.cpp.

What this PR does

Round 1 — §3.32 (commit `b1c83b9`)

Four caches that target the dominant CFM-side per-synth overhead.

Cache	Multilingual / synth (after warm-up)	Turbo / synth (after warm-up)
`g_time_mlp_results` (`compute_time_mlp_cached`)	10 graph submissions → 0. Cosine schedule (`n_timesteps=10`) is constant across every synth; entries populated once and reused forever.	3 → 0 (`t_span = [0, 0.5, 1.0]`).
`g_time_emb_results` (`compute_time_emb_cached`)	Empty (multilingual takes the non-meanflow branch).	2 → 0 (always pairs `(0, 0.5)` and `(0.5, 1)`).
`g_cfm_estimator_cache` (promoted from local-scope)	First synth pays the build (~10 ms); every subsequent synth at the same `T` skips it. Existing `(cache.T != T) || (cache.b2 != needed)` keying handles streaming chunks that vary `T`.	Same.
`g_weight_cpu_mirror` (`cached_cpu_weights_f32`)	First synth pays one `ggml_backend_tensor_get` per tensor; every subsequent synth returns the cached pointer in O(1). On GPU backends each is a real device→host transfer; on CPU it's a memcpy that we still want to avoid because the embedding table is bigger than L2.	Same pattern, smaller absolute sizes.

Bit-cast cache key (g_float_bits / g_float_pair_bits) avoids the ambiguous std::hash<float> behaviour around +0 / -0 and NaN that varies between libstdc++ and libc++.
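A minimal sketch of the bit-cast keying idea follows. The helper names mirror `g_float_bits` / `g_float_pair_bits` from the text, but the bodies, the `result_cache` alias, and `lookup` are assumptions made for illustration rather than the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for g_float_bits: the raw IEEE-754 bit pattern of v.
inline std::uint32_t float_bits(float v) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t u;
    std::memcpy(&u, &v, sizeof u);  // well-defined, unlike a pointer cast
    return u;
}

// Hypothetical stand-in for g_float_pair_bits: two patterns packed into 64 bits.
inline std::uint64_t float_pair_bits(float a, float b) {
    return (static_cast<std::uint64_t>(float_bits(a)) << 32) | float_bits(b);
}

// Result cache keyed on the bit pattern instead of the float value.
using result_cache = std::unordered_map<std::uint32_t, std::vector<float>>;

// Lookup that requires the entry to exist already.
inline const std::vector<float>& lookup(const result_cache& cache, float t) {
    const auto it = cache.find(float_bits(t));
    assert(it != cache.end() && "entry must be populated before lookup");
    return it->second;
}
```

Hashing an integer key makes the `+0` / `-0` case explicit: the two values compare equal as floats but have different bit patterns, so they occupy separate slots regardless of the standard library in use.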

Round 2 — §3.33 (commit `c3a98e5`)

Three graph caches and five scaffolding caches that close the remaining encoder + HiFT + F0 host overhead.

Cache	What it stores	Why it's safe
`g_encoder_graph_cache`	full `run_encoder` graph + `gallocator`	Keyed on `T`; streaming chunks of varying length still produce correct output (rebuilds on key change).
`g_hift_graph_cache` (+ `g_hift_inv_alpha_entries` metadata)	full `run_hift_decode` graph + `gallocator`	Keyed on `pack(T_mel, T_stft)`. The parallel inv-alpha-input metadata lets cache hits re-feed each alpha-input slot from `g_inv_alpha_results` without rebuilding.
`g_f0_graph_cache`	full `run_f0_predictor` graph + `gallocator`	Keyed on `T_mel`.
`g_pos_emb_results` (`cached_pos_emb`)	`(T, D) → (2T-1, D) F32 vector` from `compute_pos_emb`	`compute_pos_emb` is pure compute (~5 trig ops × T × D). Fired twice per encoder run (T and 2T).
`g_inv_alpha_results` (`cached_inv_alpha`)	`ggml_tensor* → vector<float>` of inverted alphas	Alpha tensors are constant for the model lifetime. Same lifetime as `g_weight_cpu_mirror`.
`g_hann_window_cache` / `g_istft_kernel_cache` (`cached_*`)	`n_fft → vector<float>`	Pure functions of `n_fft` (constant 16 in the chatterbox HiFT vocoder).
`g_window_sum_cache` (`cached_window_sum`)	`pack(n_fft, hop, T_stft) → vector<float>`	Stable across same-shape synth calls.

A small generic graph_cache struct (used by the encoder / HiFT / F0 / STFT caches) and pack_hift_key helper centralise the destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition.
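The rebuild-on-key-change rule and the capacity-preserving `destroy()` can be sketched without ggml. Everything below is hypothetical: `shape_cache_sketch` stands in for the real `graph_cache` (which also owns a graph and a gallocator), and `pack_key` only illustrates the kind of packing `pack_hift_key` might do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-entry cache: one built artefact per shape key.
struct shape_cache_sketch {
    std::int64_t key = -1;          // -1 means "empty"
    std::vector<std::uint8_t> buf;  // arena bytes; capacity survives destroy()
    bool built = false;
    int builds = 0;                 // instrumentation for tests only

    // Drop the cached artefact but keep the allocation for the next rebuild.
    void destroy() {
        buf.clear();
        built = false;
        key = -1;
    }

    // Rebuild only when the shape key changes.
    void ensure(std::int64_t want, std::size_t arena_bytes) {
        assert(want >= 0);
        if (built && key == want) return;  // cache hit: nothing to do
        destroy();
        buf.resize(arena_bytes);
        key = want;
        built = true;
        ++builds;
    }
};

// Hypothetical two-extent key: t_mel in the high 32 bits, t_stft in the low 32.
inline std::int64_t pack_key(std::int32_t t_mel, std::int32_t t_stft) {
    assert(t_mel >= 0 && t_stft >= 0);
    return (static_cast<std::int64_t>(t_mel) << 32) | static_cast<std::uint32_t>(t_stft);
}
```

Two calls with the same key build once; a streaming chunk with a different shape triggers exactly one rebuild and reuses the existing allocation.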

Round 3 — §3.34 (commit `fff9820`)

Multilingual GGUF conversion + verification: every cache invariant from rounds 1 + 2 runs against the actual multilingual model (converted from-source via scripts/convert-{t3-mtl,s3gen}-to-gguf.py @ Q4_0 from the public ResembleAI/chatterbox HuggingFace repo).
19 new variant-aware test assertions that auto-detect the variant from cache populations:
- time_mlp == 10 ∧ time_emb == 0 ⇒ multilingual: assert every cosine t_span entry t = 1 − cos(i/10 × π/2) for i in 0..9 lands in g_time_mlp_results with shape (1024,).
- time_mlp ≤ 3 ∧ time_emb == 2 ⇒ Turbo: assert t = 0.5 is cached.
Fused CFG-combine + Euler step in the multilingual non-meanflow CFG path: saves one pass over the (T_mu × MEL) dxdt vector per step. Slow path (debug_mode && meanflow) and (s == 0 && !dump_mel_path.empty()) keep the explicit two-pass form so the post-combine dxdt_cond value is still readable from the debug prints + _step0_dxdt.npy dump.
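The fusion can be sketched as follows. This is an illustration under assumptions, not the code in `synthesize()`: it assumes the CFG combine has the form `(1 + w) * cond - w * uncond` and that the solver step is a plain Euler update `x += dt * dxdt`; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass reference: materialise the combined dxdt, then take the Euler step.
// dxdt_cond is overwritten, so debug hooks can still read the combined vector.
inline void cfg_then_euler(std::vector<float>& x, std::vector<float>& dxdt_cond,
                           const std::vector<float>& dxdt_uncond, float w, float dt) {
    assert(x.size() == dxdt_cond.size() && x.size() == dxdt_uncond.size());
    for (std::size_t i = 0; i < x.size(); ++i)  // pass 1: CFG combine (assumed form)
        dxdt_cond[i] = (1.0f + w) * dxdt_cond[i] - w * dxdt_uncond[i];
    for (std::size_t i = 0; i < x.size(); ++i)  // pass 2: Euler step
        x[i] += dt * dxdt_cond[i];
}

// Fused form: same per-element arithmetic, but the combined value is never stored.
inline void cfg_euler_fused(std::vector<float>& x, const std::vector<float>& dxdt_cond,
                            const std::vector<float>& dxdt_uncond, float w, float dt) {
    assert(x.size() == dxdt_cond.size() && x.size() == dxdt_uncond.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float d = (1.0f + w) * dxdt_cond[i] - w * dxdt_uncond[i];
        x[i] += dt * d;
    }
}
```

The fused loop reads `dxdt_cond` and `dxdt_uncond` once and never writes the combined value back, which is why a path that needs to inspect the post-combine vector has to stay on the two-pass form.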

Round 4 — §3.35 (commit `78e4275` + `8abfd9b`)

Per-(n_past, is_uncond)-keyed graph cache for build_step_graph_mtl in src/t3_mtl.cpp. Each entry holds:

int64_t key — pack(n_past, is_uncond);
ggml_context * ctx — per-entry metadata arena (no shared thread_local buf — would conflict with cached graphs);
ggml_cgraph * gf — the cached graph;
std::vector<uint8_t> buf — the arena bytes.

No per-entry gallocator. An earlier prototype gave each cached entry its own ggml_gallocr_t + ~1 MB backend buffer, which paid off on multi-synth workloads but added a ~10 % T3 regression on single-utterance runs (272 misses × 1 MB = ~270 MB of allocator churn on synth #1). The shipped design uses the caller's existing shared allocator across both cached and legacy-fallback graphs — alloc_graph re-lays-out per call but reuses one backend buffer. Cache hits still skip the ~3 ms build cost.

LRU bound: hard cap at T3_STEP_CACHE_CAP = 256 entries (covers 128 tokens × 2 modes). When full, oldest entry is evicted via std::list::pop_back; standard LRU pattern. Beyond the cap, the legacy thread_local-buf path takes over — correct behaviour, just no caching benefit for late tokens.
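A generic version of that `std::list`-backed LRU pattern might look like the sketch below. The class, its interface, and the bit layout in `pack_step_key` are assumptions for illustration; the real entries hold a `ggml_context`, a `ggml_cgraph`, and an arena buffer rather than a template value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical key packing: n_past in the high bits, is_uncond in bit 0.
inline std::int64_t pack_step_key(int n_past, bool is_uncond) {
    assert(n_past >= 0);
    return (static_cast<std::int64_t>(n_past) << 1) | (is_uncond ? 1 : 0);
}

// Bounded LRU map: most recently used entry at the front, eviction from the back.
template <typename V>
class lru_cache {
    using entry = std::pair<std::int64_t, V>;
    using order_list = std::list<entry>;

public:
    explicit lru_cache(std::size_t cap) : cap_(cap) { assert(cap > 0); }

    // Returns the cached value and marks it most recently used, or nullptr on a miss.
    V* find(std::int64_t key) {
        const auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // O(1); iterators stay valid
        return &it->second->second;
    }

    // Inserts or overwrites key, evicting the least recently used entry when full.
    void insert(std::int64_t key, V value) {
        if (V* hit = find(key)) { *hit = std::move(value); return; }
        if (order_.size() == cap_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    std::size_t cap_;
    order_list order_;
    std::unordered_map<std::int64_t, typename order_list::iterator> index_;
};
```

`std::list::splice` moves a node without invalidating iterators, so the index map never needs rewriting on a hit; that property is what keeps the per-hit bookkeeping constant-time.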

Opt-in via CHATTERBOX_T3_STEP_CACHE env var (default OFF). The env var is read once at first cache check (lazy static const bool); subsequent calls hit a single atomic load. Default-OFF imposes no measurable cost on single-utterance.
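The read-once gate can be sketched with a function-local static. The function names are hypothetical, and the rule that only the exact string `1` enables the cache is an assumption; the text only shows `CHATTERBOX_T3_STEP_CACHE=1` as the enabling value.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Pure parsing step, kept separate so it can be tested without touching the environment.
// Assumption: only the exact string "1" turns the cache on.
inline bool parse_opt_in(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads CHATTERBOX_T3_STEP_CACHE on the first call only. C++11 guarantees that the
// initialisation of a function-local static is thread-safe and runs exactly once.
inline bool t3_step_cache_enabled() {
    static const bool enabled = parse_opt_in(std::getenv("CHATTERBOX_T3_STEP_CACHE"));
    return enabled;
}
```

After the first call the function only checks the initialisation guard and returns the stored flag, so changing the variable later in the process has no effect.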

detail::t3_release_caches() is the public teardown entrypoint, called from:

chatterbox_cli.cpp::free_t3 — both the synthesis path and the streaming path;
chatterbox_engine.cpp::Impl::free_model;
an atexit handler registered on first cache insertion (fallback for code paths that don't go through the explicit teardown).

All three entry points fire BEFORE ggml_backend_free(model.backend) so the cached ggml_context and any future backend-bound resources release cleanly.

Round 5 — §3.36 (commit `d1e8bbb`)

Closes the only remaining deferred-followup from §3.33: the persistent run_stft graph cache.

Cache	What it stores	Key
`g_stft_graph_cache`	full `run_stft` conv1d graph + `gallocator` + 4 MB context buffer	`T_src = T_mel × 480`; rebuilds on streaming-shape change
`g_stft_kernel_cache` (`cached_stft_kernel`)	`n_fft → vector<float>` of the analysis kernel from `build_stft_kernel(n_fft, cached_hann_window(n_fft))`	`n_fft` (constant 16 in chatterbox HiFT)

Per synth, this eliminates: a 4 MB std::vector<uint8_t> allocation + ggml_init + ggml_new_graph_custom(8192) + a fresh ggml_gallocr_t + the backend buffer it reserves. The buffer reservation is reused across rebuilds (graph_cache::destroy() preserves the buf capacity), eliminating heap-fragmentation churn in long-running streaming sessions.

Negative result documented (round 1)

Tried adding last_mu_ptr / last_spks_ptr / last_cond_ptr tracking to cfm_estimator_cache to skip redundant ggml_backend_tensor_set for mu/spks/cond on the second CFM step. F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete, and CFM mu/spks/cond are referenced only at the start of the graph (via ggml_concat(x_in, mu_in, spks_bc, cond_in)); their slots become reusable for downstream intermediates immediately. Same finding as FINDINGS_ROUND_HIFT.md §2-bis.4. Reverted.

General rule reinforced: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes.

Test infrastructure

Two harnesses, 339 / 339 cache checks green across the full matrix:

Suite	Model arg	Checks
`test-cpu-caches`	none (cache-key + initial-state only)	27 / 27 ✓
`test-cpu-caches`	`chatterbox-s3gen-turbo.gguf`	94 / 94 ✓
`test-cpu-caches`	`chatterbox-s3gen-mtl-q4_0.gguf`	113 / 113 ✓
`test-t3-caches`	none (initial-state only)	6 / 6 ✓
`test-t3-caches`	`chatterbox-t3-mtl-q4_0.gguf`	99 / 99 ✓

src/test_cpu_caches.cpp (685 lines, NEW) covers:

Bit-cast cache-key rules (+0 ≠ -0, NaN bit-pattern stability, pair-key composition, multilingual cosine t_span distinctness).
Initial cache state (every cache empty before any synth; idempotent s3gen_unload()).
Post-synth-#1 cache populations (sizes, graph-cache shape keys, kernel-cache singleton invariants).
Warm-cache invariants (synth #2 must NOT grow any cache; every graph cache must keep its shape key).
Bit-exact wav output between cold caches (synth #1), warm caches (synth #2), and post-s3gen_unload reload (synth #3): byte-for-byte equality.
Lifecycle teardown (every cache cleared on s3gen_unload; idempotent second unload doesn't crash).
Streaming-shape invalidation: two chunks of different lengths must rebuild every graph cache (encoder, HiFT, F0, STFT) while the n_fft-keyed scaffolding caches stay at exactly 1 entry.
Variant auto-detection + multilingual cosine-schedule round-trip (round 3).

src/test_t3_caches.cpp (452 lines, NEW) covers:

Initial state (cache empty before any eval_step_mtl; idempotent t3_release_caches()).
Step lifecycle: single call populates 2 entries (cond + uncond at n_past=0); same-key second call is a hit (size unchanged, hits=2); different-n_past adds 2 new entries; bit-exact logits across cold/warm at the same (n_past, token); explicit teardown drops every entry.
Multi-synth amortisation: 16 step calls at distinct n_past (cold pass populates 32 entries) followed by re-running the same sequence (warm pass — every call is a hit); bit-exact logits across both passes; warm pass measurably faster than cold pass (asserted as inequality, not percentage threshold, to stay robust under CPU jitter).

Cache observability is exposed via src/chatterbox_tts_test_hooks.h — under src/, NOT in include/, explicitly out of the public surface so production callers cannot take a dependency on cache layout.

Reproduction

# 1. Branch setup
cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp
git checkout chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU

# 2. Build (CPU-only — no Vulkan/Metal/CUDA needed for this PR)
cmake -S . -B build-cpu -DCMAKE_BUILD_TYPE=Release \
      -DGGML_VULKAN=OFF -DGGML_METAL=OFF -DGGML_CUDA=OFF \
      -DTTS_CPP_BUILD_TESTS=ON
cmake --build build-cpu -j16

# 3. Cache validation (339 / 339 checks expected across all suites)
./build-cpu/test-cpu-caches                                          # 27 cache-key + initial-state
./build-cpu/test-cpu-caches models/chatterbox-s3gen-turbo.gguf       # 94
./build-cpu/test-cpu-caches models/chatterbox-s3gen-mtl-q4_0.gguf    # 113
./build-cpu/test-t3-caches                                           # 6
./build-cpu/test-t3-caches models/chatterbox-t3-mtl-q4_0.gguf        # 99

# 4. End-to-end multilingual smoke (default cache-OFF; round-4 cache disabled)
./build-cpu/tts-cli --model models/chatterbox-t3-mtl-q4_0.gguf \
                    --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0.gguf \
                    --text "Hola mundo, esta es una prueba multilingue del modelo CFG." \
                    --language es --out /tmp/mtl.wav --threads 8 \
                    --seed 42 --temp 0 --top-k 1 --cfg-weight 0.5

# 5. Server-mode (round-4 T3 step-graph cache enabled)
CHATTERBOX_T3_STEP_CACHE=1 ./build-cpu/tts-cli ...     # ~12 % T3 wall reduction on synth #2+

# 6. Streaming-mode regression (per-chunk RTF; rounds 2 + 5 amortise)
./build-cpu/tts-cli --model models/chatterbox-t3-turbo-q4_0.gguf \
                    --s3gen-gguf models/chatterbox-s3gen-turbo.gguf \
                    --text "First sentence. Second sentence here. Third sentence too." \
                    --out /tmp/stream.wav --threads 8 \
                    --seed 42 --temp 0 --top-k 1 \
                    --stream-first-chunk-tokens 10 --stream-chunk-tokens 25
# Each S3GEN_INFER_MS line should beat the round-1-only numbers in PROGRESS.md §3.33.

The multilingual GGUFs reproduce from the public ResembleAI/chatterbox HuggingFace repo using the existing scripts/convert-{t3-mtl,s3gen}-to-gguf.py tooling at Q4_0 (~3.2 GB source → ~1.1 GB GGUFs).

Memory cap

Every cache is bounded by the number of distinct shape keys it sees across the process lifetime. Steady-state for a single-utterance multilingual synth:

Cache	Per-entry size	Typical key set on multilingual
`g_time_mlp_results`	1024 F32 = 4 KB	10 entries (cosine `t_span`)
`g_time_emb_results`	1024 F32 = 4 KB	0 (multilingual non-meanflow)
`g_weight_cpu_mirror`	up to 28 MB (`flow/input_embedding`)	~3 entries
`g_cfm_estimator_cache`	64 MB arena	1
`g_encoder_graph_cache`	64 MB arena	1 per distinct `T`
`g_hift_graph_cache`	64 MB arena	1 per distinct `(T_mel, T_stft)`
`g_f0_graph_cache`	8 MB arena	1 per distinct `T_mel`
`g_stft_graph_cache`	4 MB arena	1 per distinct `T_src`
`g_pos_emb_results`	~2.3 MB at T=600	2 per distinct chunk size
`g_inv_alpha_results`	up to ~256 F32 / entry	~72 entries (one per alpha)
`g_hann_window_cache`	`n_fft × 4 B` ≈ 64 B	1 (constant n_fft=16)
`g_istft_kernel_cache`	`n_fft × 2 × (n_fft/2 + 1) × 4` ≈ 1.1 KB	1
`g_stft_kernel_cache`	same shape as `istft_kernel` ≈ 1.1 KB	1
`g_window_sum_cache`	`((T_stft − 1) × hop + n_fft) × 4` (≤ a few hundred KB)	1 per distinct `T_stft`
`g_t3_step_cache` (opt-in)	~1.2 MB metadata arena	up to 256 entries (LRU cap)

Total steady state for single-utterance with the round-4 cache off: ~250 MB. With round-4 on at full LRU saturation: ~560 MB.
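The per-entry sizes in the table can be re-derived with compile-time arithmetic. The sketch below only restates figures from the table (`n_fft = 16`, `T = 600`, `D = 512`, 1024-element time vectors, 256 entries at ~1.2 MB); the constant names are made up for the example.

```cpp
#include <cstddef>

constexpr std::size_t kF32 = 4;  // bytes per float

constexpr std::size_t time_mlp_entry = 1024 * kF32;                   // one cached time-MLP vector
constexpr std::size_t hann_window    = 16 * kF32;                     // n_fft floats
constexpr std::size_t istft_kernel   = 16 * 2 * (16 / 2 + 1) * kF32;  // n_fft x 2 x (n_fft/2 + 1)
constexpr std::size_t pos_emb_t600   = (2 * 600 - 1) * 512 * kF32;    // (2T-1, D) at T = 600
constexpr std::size_t graph_arenas_mb    = 64 * 3 + 8 + 4;  // CFM + encoder + HiFT, F0, STFT
constexpr std::size_t t3_cache_tenths_mb = 256 * 12;        // 256 entries x 1.2 MB, in 0.1 MB units

static_assert(time_mlp_entry == 4096, "4 KB per time-MLP entry");
static_assert(hann_window == 64, "64 B Hann window");
static_assert(istft_kernel == 1152, "about 1.1 KB iSTFT kernel");
static_assert(pos_emb_t600 == 2455552, "about 2.3 MiB positional embedding");
static_assert(graph_arenas_mb == 204, "graph arenas in MB");
static_assert(t3_cache_tenths_mb == 3072, "about 307 MB at full LRU saturation");
```

The last line accounts for the gap between the two totals: roughly 307 MB of step-cache arenas on top of ~250 MB gives the ~560 MB quoted for the round-4 cache at saturation.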

For long-running streaming sessions with many distinct chunk sizes, the graph-cache arenas (64 MB × 3 + 8 MB + 4 MB) plus g_pos_emb_results (~2.3 MB × N) grow with the number of distinct shapes — see deferred follow-up #1 below.

Deferred follow-ups (separate PRs)

Round 5 closed the run_stft cache item; the remaining list is:

Candidate	Estimated win	Why deferred
LRU eviction for the round-2 + round-5 graph + shape-keyed caches	n/a (memory cap)	The encoder / HiFT / F0 / STFT graph caches each currently hold one `(arena + gallocator)` per distinct shape. Long-running streaming with many shape changes can grow unbounded. A small LRU bound (say 8 entries) would handle server-mode deployments. Round 4 already established the LRU pattern (`std::list`-backed); applying it to round-2 + round-5 graph caches is a straight port. Out of scope for the optimisation pass.
HiFT conv weight quantisation (§3.20 backlog #4)	~10 % MTL CPU wall	Independent of caching; blocked on the `conv1d_f32` arg-order refactor. Mirrors the `conv1d_f32_b` pattern.
Heterogeneous-core thread default (§3.20 backlog #5)	~10 % on M4	Hardware-bound; needs `hwloc` or per-platform sysctl to detect efficiency cores. Orthogonal.
ggml-cpu Q4_0 / Q4_K kernel optimisation	unknown; potentially large	The remaining ~95 % of multilingual CPU wall (~5500 ms / synth) is real `ggml-cpu` Q4_0 matmul work. Out of scope for chatterbox.cpp; lives in `ggml/src/ggml-cpu/`.
Mobile validation (Adreno / Mali / Apple Silicon CPU)	Unknown on CPU; biggest unknown	Hardware-bound. Same gap noted in `OPTIMIZATION_PLAN_NEXT.md` for the Vulkan branch.

Files

CMakeLists.txt                   +18      (test-cpu-caches + test-t3-caches targets, build flags)
PROGRESS.md                      +617     (§3.32 + §3.33 + §3.34 + §3.35 + §3.36)
src/chatterbox_cli.cpp           +9       (free_t3 calls t3_release_caches in 2 paths)
src/chatterbox_engine.cpp        +5       (Impl::free_model calls t3_release_caches)
src/chatterbox_t3_internal.h     +10      (detail::t3_release_caches decl)
src/chatterbox_tts.cpp           +892 / -127  (cache infra + 6 graph caches + 11 cached helpers + test-hook bridges)
src/chatterbox_tts_test_hooks.h  +165     NEW  (round 1..5 hook decls)
src/t3_mtl.cpp                   +348     (round-4 step-graph cache + opt-in gate + test bridges)
src/test_cpu_caches.cpp          +685     NEW  (rounds 1, 2, 3, 5 — 113 multilingual / 94 Turbo checks)
src/test_t3_caches.cpp           +452     NEW  (round 4 — 99 multilingual T3 checks)

Total: 10 files, +3201 / −127 lines, 6 commits on top of 21896a3.

No public-API change. include/tts-cpp/chatterbox/s3gen_pipeline.h remains untouched. The cache observability hooks live in src/chatterbox_tts_test_hooks.h (under src/, not include/), explicitly out of the public surface so production callers cannot take a dependency on cache layout.

Status

Clean six-commit branch on top of 21896a3. Headline numbers:

−22 % S3GEN_INFER_MS on Turbo single-utterance (794 → 619 ms; bit-exact across cold / warm / post-unload).
−27 % total wall on streaming (tts-cli --stream-chunk-tokens 25, 21 chunks: ~48 s → ~35 s).
−1.8 % S3GEN_INFER_MS on multilingual single-utterance (3868 → 3797 ms; sub-noise on a single synth; compounds on streaming where multiple synths share warm caches).
~12 % T3 wall reduction on synth #2+ in multi-synth processes (~256 ms / synth on multilingual at the default 256-entry LRU cap), gated behind CHATTERBOX_T3_STEP_CACHE=1 so single-utterance CLI users see no regression.
Multilingual end-to-end CPU RTF 1.45 at Q4_0 / 8 ggml threads / 16-thread x86_64 host.

All cache + bit-exact + shape-invalidation checks green: 339 total across test-cpu-caches (27 + 94 + 113) and test-t3-caches (6 + 99).

After this PR, every host-side per-synth allocation cycle on the multilingual CPU pipeline is cached:

T3:        prompt + step graphs   → step cached (opt-in, round 4)
Encoder:   graph + pos_emb        → cached (round 2)
CFM:       estimator graph + time_mlp/time_emb + weight mirror → cached (round 1)
F0:        graph                  → cached (round 2)
HiFT:      graph + inv_alpha + hann + istft kernel + window_sum → cached (round 2)
STFT:      graph + analysis kernel → cached (round 5)
SineGen:   pure compute (per-call RNG-seeded; not cacheable)

The remaining ~95 % of multilingual CPU wall is real ggml-cpu Q4_0 matmul work, which lives in ggml/src/ggml-cpu/ and is out of scope for chatterbox.cpp.

…d 2) PROGRESS.md §3.33 — persistent encoder/HiFT/F0 graph caches + pos_emb / inv_alpha / hann_window / istft_kernel / window_sum scaffolding caches on top of the round-1 CFM caches (§3.32). Turbo single-utterance S3GEN_INFER_MS -22 %, streaming wall -27 %. Tests: 79/79 pass (49 new round-2 checks).

…d 3) PROGRESS.md §3.34 — multilingual verification (Turbo 80/80, multilingual 99/99 checks pass; bit-exact synth-twice on the converted-from-source MTL Q4_0 GGUF) + 19 new multilingual-specific test assertions (cosine schedule produces exactly 10 distinct g_time_mlp_results entries) + fused CFG-combine + Euler step in the non-meanflow CFG path of synthesize(). Sub-noise wall-time saving on a single multilingual synth (~8 s); biggest remaining host-side win is T3 step-graph caching, documented as deferred follow-up.

…d 4) PROGRESS.md §3.35: T3 step-graph cache (multilingual CFG token decode) opt-in via CHATTERBOX_T3_STEP_CACHE. Per-(n_past, is_uncond) std::list-LRU cache (cap 256) for build_step_graph_mtl; saves ~3 ms per cache hit. Single-utterance default-OFF (no hits-to-amortise on synth #1) keeps the existing path regression-free; server-mode opt-in shows ~15 % per-pass speedup (~256 ms / synth #2 of multilingual at 136 tokens). Tests: src/test_t3_caches.cpp NEW with 99 checks (lifecycle + bit-exact cold/warm logits + multi-synth amortisation timing). Lifecycle wired into free_t3 (CLI, both paths), Impl::free_model (Engine), and an atexit fallback, all firing BEFORE ggml_backend_free. Total cache test suite green: 80 + 99 + 6 + 99 = 284 / 284.

…timize-cpp-backend-multilingual-for-CPU QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU Co-authored-by: Cursor <cursoragent@cursor.com>

Mirrors the layout established for the other test_*.cpp harnesses (791759c). Pure rename (git detects 100 % similarity) plus the matching CMakeLists.txt path updates for the test-cpu-caches and test-t3-caches executable targets. No source changes. Co-authored-by: Cursor <cursoragent@cursor.com>

Mirrors the parakeet-cpp port README layout so a downstream consumer can answer 'what does this library do, how do I link it, and which CMake knobs do I need to know about?' from the top of the README without scrolling through the 1300-line standalone development walkthrough. No content removed; existing standalone material stays verbatim, just shifted down by ~80 lines.

Adds three new blocks near the top:

- ## API overview (between the benchmark tables and 'Pipeline at a glance'). Two-row table for the high-level entry points exported through TTS_CPP_API:
  * tts_cpp::chatterbox::Engine::synthesize - Chatterbox T3+S3Gen+HiFT
  * tts_cpp::supertonic::synthesize - Supertonic CPU TTS
  Trailing paragraph mentions the lower-level helpers (s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload / tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly flags that detail-namespaced symbols (used by the supertonic / chatterbox test harnesses) are not part of the public API and are hidden in SHARED builds.
- ### Consumer integration (subsection of API overview). Calls out that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port consumes ggml from the qvac-ext-ggml/speech branch directly (Metal / OpenCL / Vulkan patches included) and does NOT ship scripts/setup-ggml.sh or patches/ - those are standalone-dev tools maintained in this repo only. Provides the find_package(tts-cpp CONFIG REQUIRED) + target_link_libraries(... tts-cpp::tts-cpp) + 8-line Engine::synthesize C++ snippet that's the entire consumer-side integration.
- ### Useful CMake options (inside section 1, between the GPU backend paragraph and the binaries table). Full table of the project-namespaced flags: TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8), TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL, TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE (new from items 7+8). Plus a secondary table for the ctest-fixture cache paths (TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the REQUIRES auto-disable behaviour from item 7.

Touches existing prose in two places:

- The setup-ggml.sh paragraph in section 1 gets a one-paragraph follow-up clarifying it (and patches/) are standalone-development tools only, with a back-link to the Consumer integration section (item 9: 'document setup-ggml.sh inertness' folded into this framing rather than landed as a separate doc-only commit). Also strengthens the existing 'Re-running is safe' line to 'idempotent and destructive' so a dev hacking on ./ggml is warned before losing local edits.
- The ### Alternative: consume ggml from vcpkg subsection now opens with one sentence positioning it as the CMake-mechanic detail behind the Consumer integration story, with a forward link to the qvac-ext-ggml/speech branch.

Also updates the binaries table in section 1 to list the missing PR #6 + PR #7 binaries that landed since the README was last refreshed: supertonic-cli, supertonic-bench, test-cpu-caches, test-t3-caches, and the test-supertonic-* family. Trailing paragraph notes that test-* binaries register with CTest so `ctest -C Release -L unit` / `ctest -C Release -L fixture` works out of the build directory.

No code changes, no CMake changes, no install behaviour changes. README.md +128 / -10 lines.

Co-authored-by: Cursor <cursoragent@cursor.com>

Zbig9000 force-pushed the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch from 6fb0a3f to d1e8bbb Compare May 6, 2026 14:45

Zbig9000 added 6 commits May 6, 2026 18:33

QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU

2ead7ed

PROGRESS.md changes were added

7ffa1aa

round 5 of optimizations

eadf88f

Zbig9000 force-pushed the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch from d1e8bbb to eadf88f Compare May 6, 2026 16:51

GustavoA1604 merged commit dd5b3f3 into GustavoA1604:main May 6, 2026

Zbig9000 deleted the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch May 7, 2026 07:57

GustavoA1604 merged 6 commits into GustavoA1604:main from Zbig9000:chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU
