QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU#6

Merged
GustavoA1604 merged 6 commits into GustavoA1604:main from Zbig9000:chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU
May 6, 2026
Zbig9000 commented May 5, 2026

CPU-side optimisation pass for chatterbox.cpp. Five layered passes that close the per-synth host-CPU envelope outside the heavy matmuls (which §3.20 already drove down via Q4_0/Q8_0 weight quantisation). All bit-exact-preserving, all model-agnostic by construction.

| Round | PROGRESS.md | Theme | What it caches / changes |
|---|---|---|---|
| 1 | §3.32 | CFM-side per-synth overhead | g_time_mlp_results (10 graph submissions / synth → 0 on multilingual cosine schedule), g_time_emb_results (Turbo only), persistent g_cfm_estimator_cache (graph rebuild → 0 across synths), g_weight_cpu_mirror for flow/input_embedding (~13–28 MB) + flow/spk_embed_affine/{w,b}. |
| 2 | §3.33 | Encoder + HiFT + F0 graphs and their CPU scaffolding | Persistent encoder / HiFT decoder / F0-predictor graph caches (keyed on T or pack(T_mel, T_stft)), plus pure-compute scaffolding helpers: cached_pos_emb (the dominant scaffolding cost: T × D × ~5 trig ops, fired twice per encoder run), cached_inv_alpha (~72 tensor_get + per-element 1/x calls per HiFT call), cached_hann_window, cached_istft_kernel, cached_window_sum. |
| 3 | §3.34 | Multilingual verification + micro-fusion | Direct multilingual-GGUF testing of every cache invariant from §3.32 + §3.33 (99 / 99 checks passed at round 3, 113 / 113 after round 5, on the converted-from-source chatterbox-s3gen-mtl-q4_0.gguf); auto-detection of variant (Turbo vs multilingual) inside the test harness; fused CFG-combine + Euler step in the multilingual non-meanflow CFG path (saves one pass over dxdt per step; gated on !debug_mode && !dump_mel_path so the slow-path debug hooks still see the post-combine vector). |
| 4 | §3.35 | T3 step-graph cache (opt-in, server-mode) | Per-(n_past, is_uncond) graph cache for build_step_graph_mtl. Multilingual fires this 2× per token; a 136-token Spanish synth previously rebuilt 272 graphs at ~3 ms each (~800 ms / synth of pure host-CPU work). Opt-in via CHATTERBOX_T3_STEP_CACHE=1 because single-utterance workloads see every step as a unique n_past (the cache fills but nothing is re-used; bookkeeping costs a ~10 % T3 regression). Server-mode wins: ~12 % T3 wall reduction on synth #2+ in the same process (~256 ms / synth on multilingual at the default 256-entry LRU cap). Lifecycle wired into chatterbox_cli.cpp::free_t3 (both synthesis + streaming paths) and chatterbox_engine.cpp::Impl::free_model BEFORE ggml_backend_free, plus an atexit fallback. |
| 5 | §3.36 | STFT graph + analysis-kernel caches | g_stft_graph_cache (keyed on T_src = T_mel × 480) and g_stft_kernel_cache (keyed on n_fft). run_stft previously allocated a fresh 4 MB context buffer + ggml_init + graph build + ggml_gallocr_t + backend buffer every synth; the cache eliminates that whole cycle. Closes the only remaining round-2 deferred follow-up. Streaming-shape invalidation is handled by the same (cache.key != T_src) rebuild rule as the other graph caches. |

Headline numbers

Multilingual end-to-end CPU (this PR, three runs on the same Spanish prompt)

./build-cpu/tts-cli with the converted multilingual GGUFs, CFG enabled (cfg_weight=0.5, 8 ggml threads, seed 42, temp 0 --top-k 1), Linux 6.8 / x86_64 / 32-thread host:

| Run | T3 | S3Gen | Audio | RTF |
|---|---|---|---|---|
| 1 | 2140 ms | 5931 ms | 5560 ms | 1.45 |
| 2 | 2193 ms | 5931 ms | 5560 ms | 1.46 |
| 3 | 2146 ms | 5789 ms | 5560 ms | 1.43 |
| avg | 2160 ms | 5884 ms | 5560 ms | 1.45 |

136 speech tokens generated (CFG cond + uncond × 136 = 272 T3 step graphs).
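
The RTF column above is consistent with (T3 + S3Gen) / Audio for every row. That definition is inferred from the numbers, not stated in the PR, and `real_time_factor` is a hypothetical helper name. A minimal sketch for re-deriving the column:

```cpp
#include <cassert>
#include <cmath>

// Real-time factor: wall time spent synthesising divided by the duration of
// the audio produced. ASSUMPTION: the table's RTF is (T3 + S3Gen) / Audio;
// this reproduces all three rows but is not confirmed by the source.
double real_time_factor(double t3_ms, double s3gen_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return (t3_ms + s3gen_ms) / audio_ms;
}
```

With the run-1 figures, `real_time_factor(2140, 5931, 5560)` evaluates to 8071 / 5560, which rounds to the 1.45 shown in the table.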

Turbo, single-utterance bit-exact harness (./build-cpu/test-cpu-caches)

| Metric | Cold caches | Warm caches | Δ |
|---|---|---|---|
| S3GEN_INFER_MS (synth #1 vs #2) | 794 ms | 619 ms | −175 ms (−22 %) |

Bit-exact across cold (synth #1), warm (synth #2), and post-s3gen_unload (synth #3) WAV outputs — every diff sample = 0.

Streaming-mode wins (tts-cli --stream-chunk-tokens 25, 3-sentence prompt → 21 chunks)

| Chunk | round-1 only | round-1 + round-2 | Δ |
|---|---|---|---|
| 1 | 980 ms | 545 ms | −44 % |
| 2 | 1045 ms | 665 ms | −36 % |
| 3 | 1155 ms | 725 ms | −37 % |
| 11 | 1810 ms | 1253 ms | −31 % |
| 21 | 2797 ms | 2151 ms | −23 % |
| total wall | ~48 s | ~35 s | −27 % |

Per-chunk savings shrink with chunk index because each chunk has a new T (the encoder input grows with the running prefix), so the encoder / HiFT / F0 / STFT graph caches rebuild on every chunk. The result + scaffolding caches (pos_emb, inv_alpha, istft_kernel, hann_window, window_sum, stft_kernel, plus round-1's time_mlp_results etc.) stay warm across every chunk.


Multilingual verification (round 3 — §3.34)

The multilingual GGUFs were converted from the public ResembleAI/chatterbox HuggingFace repo using the existing scripts/convert-{t3-mtl,s3gen}-to-gguf.py tooling (Q4_0 to match the §3.20 baseline; ~3.2 GB of source files, ~1.1 GB of resulting GGUFs). Every cache invariant from §3.32 / §3.33 / §3.36 runs against the actual multilingual model:

| Variant | Test command | Result |
|---|---|---|
| Turbo | ./test-cpu-caches models/chatterbox-s3gen-turbo.gguf | 94 / 94 |
| Multilingual | ./test-cpu-caches models/chatterbox-s3gen-mtl-q4_0.gguf | 113 / 113 |

The 19 extra checks on multilingual are round-3 cosine-schedule assertions: every entry of t_span = [1 − cos(i/10 · π/2)] for i in 0..9 must land in g_time_mlp_results after the first synth, each cached vector must be (1024,), and the variant auto-detection (time_mlp == 10 ∧ time_emb == 0) must classify the model correctly.
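
To make the cosine-schedule assertion concrete, here is a minimal sketch that generates the ten t-values the test expects to find in g_time_mlp_results. `cosine_t_values` is a hypothetical name, and evaluating in double before narrowing to float is an assumption; the real code may compute in float, which matters if the values are used as bit-exact cache keys.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns t_i = 1 - cos(i / n * pi / 2) for i in [0, n).
// ASSUMPTION: evaluated in double and narrowed to float at the end.
std::vector<float> cosine_t_values(int n_timesteps) {
    assert(n_timesteps > 0);
    const double half_pi = std::acos(-1.0) / 2.0;
    std::vector<float> t;
    t.reserve(static_cast<std::size_t>(n_timesteps));
    for (int i = 0; i < n_timesteps; ++i) {
        t.push_back(static_cast<float>(1.0 - std::cos(half_pi * i / n_timesteps)));
    }
    return t;
}
```

For `n_timesteps = 10` the values start at exactly 0 and increase monotonically while staying below 1, so the ten entries are distinct and the "time_mlp == 10" detection rule has ten separate keys to count.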

Multilingual single-utterance synth-twice on the multilingual S3Gen GGUF (24-token harness):

| Synth | S3GEN_INFER_MS | What's warm |
|---|---|---|
| #1 (cold caches) | 3868 ms | nothing |
| #2 (warm caches) | 3797 ms | every round-1..5 cache |
| #3 (post-s3gen_unload) | 3698 ms | reloads from cold |

WAV output is byte-for-byte identical between synth #1, synth #2, and post-unload synth #3: every sample diff = 0. The relative per-synth delta is small because multilingual CFM compute is ~6× larger in absolute terms than Turbo's, so the constant per-synth host overhead is a smaller fraction of total wall time.


Why these specific levers

§3.20 (already shipped) drove down the compute-bound bulk of CPU multilingual wall time via Q4_0/Q8_0 quantisation of the CFM / Conformer / T3 linears. What's left after §3.20 is fixed per-synth host overhead that quantisation doesn't help:

  • graph build + gallocr_reserve for every per-pipeline graph (CFM, encoder, HiFT, F0 predictor, STFT) — none of which were cached across synth calls in multilingual_merged;
  • repeated compute_time_mlp graph submissions (10 / synth on multilingual's cosine schedule, with a constant set of t-values across every synth);
  • per-synth ggml_backend_tensor_get of flow/input_embedding (~28 MB on multilingual) and flow/spk_embed_affine/{w,b};
  • host-CPU rebuilds of compute_pos_emb (T × D × ~5 trig ops, fired twice per encoder run for T and 2T; on multilingual at T=350+, D=512 that's a real wedge of per-synth host time);
  • host-CPU rebuilds of HiFT scaffolding (hann window, istft kernel, window_sum) — small per-call but pure waste across synths because n_fft=16 and hop=4 are constants;
  • invert_alpha_cpu fired ~72× per HiFT call (12 ResBlocks × 6 alpha tensors), each doing a tensor_get + per-element reciprocal;
  • build_step_graph_mtl rebuilt every CFG token-decode step (272 graphs × ~3 ms each ≈ 800 ms / synth pure host-CPU work);
  • run_stft re-allocated a fresh 4 MB context + gallocator + backend buffer + rebuilt the conv1d graph every synth.

This PR closes every one of those gaps with the same teardown discipline as the existing thread_local time_mlp_cache: cleared in s3gen_release_synth_caches before ggml_backend_free so the gallocators in the graph caches release against a still-valid backend (otherwise Vulkan / Metal / CUDA backend dylibs would hit their resource-leak asserts on process exit). The T3 cache mirrors the discipline in detail::t3_release_caches() wired into chatterbox_cli.cpp and chatterbox_engine.cpp.

What this PR does NOT do — and why

  • No B=2 batched CFM on CPU. §3.21 measured +11 % CPU wall on M4 when batching cond+uncond into a single forward (extra permute+cont at every attention block dominates the per-dispatch overhead, which is already negligible on ggml-cpu). The existing use_b2 = !ggml_backend_is_cpu(...) gate stays.
  • No F16 CFM linears on CPU. §3.8 attempt 7 already measured this as a regression on CPU (~10 % slower; F16→F32 upconvert inside mul_mat isn't free against AVX-512 F32 kernels). This PR keeps F32.
  • No conv1d-arg-order refactor. §3.20 backlog item 4 (HiFT Q4_0/Q8_0 quantisation, ~10 % additional CPU win) is independent of caching and out of scope here.
  • No heterogeneous-core thread default. §3.20 backlog item 5; hardware-bound, orthogonal to graph caching.
  • No always-on T3 step-graph cache. Pre-allocating cache entries for every possible n_past would either waste ~10 GB of metadata arenas or pay the same per-step bookkeeping cost on synth #1. The opt-in env var keeps single-utterance CLI users at zero cost while letting server-mode operators flip it on.
  • No ggml-cpu Q4_0 kernel work. The remaining ~95 % of multilingual CPU wall is real ggml-cpu Q4_0 matmul compute. Fixing that lives in ggml/src/ggml-cpu/, not chatterbox.cpp.

What this PR does

Round 1 — §3.32 (commit b1c83b9)

Four caches that target the dominant CFM-side per-synth overhead.

| Cache | Multilingual / synth (after warm-up) | Turbo / synth (after warm-up) |
|---|---|---|
| g_time_mlp_results (compute_time_mlp_cached) | 10 graph submissions → 0. Cosine schedule (n_timesteps=10) is constant across every synth; entries populated once and reused forever. | 3 → 0 (t_span = [0, 0.5, 1.0]). |
| g_time_emb_results (compute_time_emb_cached) | Empty (multilingual takes the non-meanflow branch). | 2 → 0 (always pairs (0, 0.5) and (0.5, 1)). |
| g_cfm_estimator_cache (promoted from local-scope) | First synth pays the build (~10 ms); every subsequent synth at the same T skips it. Existing (cache.T != T) \|\| (cache.b2 != needed) keying handles streaming chunks that vary T. | Same. |
| g_weight_cpu_mirror (cached_cpu_weights_f32) | First synth pays one ggml_backend_tensor_get per tensor; every subsequent synth returns the cached pointer in O(1). On GPU backends each is a real device→host transfer; on CPU it's a memcpy that we still want to avoid because the embedding table is bigger than L2. | Same pattern, smaller absolute sizes. |

Bit-cast cache key (g_float_bits / g_float_pair_bits) avoids the ambiguous std::hash<float> behaviour around +0 / -0 and NaN that varies between libstdc++ and libc++.
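
The bit-cast key idea can be sketched as follows. The helper names `float_bits_key` and `float_pair_bits_key` are hypothetical stand-ins, not the PR's actual functions; the point is only that the map is keyed on the raw IEEE-754 bit pattern, so no float hashing or float equality is involved.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets a float as its 32-bit pattern. memcpy is the portable way to
// do this before C++20's std::bit_cast.
inline std::uint32_t float_bits_key(float v) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "float must be 32-bit");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

// Packs two floats into one 64-bit key, e.g. for a (t, r) pair.
inline std::uint64_t float_pair_bits_key(float a, float b) {
    return (static_cast<std::uint64_t>(float_bits_key(a)) << 32) |
           static_cast<std::uint64_t>(float_bits_key(b));
}
```

Under this scheme +0 and -0 are simply two different integer keys, and a NaN key matches only the identical NaN bit pattern, so lookup behaviour does not depend on the standard library in use.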

Round 2 — §3.33 (commit c3a98e5)

Three graph caches and five scaffolding caches that close the remaining encoder + HiFT + F0 host overhead.

| Cache | What it stores | Why it's safe |
|---|---|---|
| g_encoder_graph_cache | full run_encoder graph + gallocator | Keyed on T; streaming chunks of varying length still produce correct output (rebuilds on key change). |
| g_hift_graph_cache (+ g_hift_inv_alpha_entries metadata) | full run_hift_decode graph + gallocator | Keyed on pack(T_mel, T_stft). The parallel inv-alpha-input metadata lets cache hits re-feed each alpha-input slot from g_inv_alpha_results without rebuilding. |
| g_f0_graph_cache | full run_f0_predictor graph + gallocator | Keyed on T_mel. |
| g_pos_emb_results (cached_pos_emb) | (T, D) → (2T-1, D) F32 vector from compute_pos_emb | compute_pos_emb is pure compute (~5 trig ops × T × D). Fired twice per encoder run (T and 2T). |
| g_inv_alpha_results (cached_inv_alpha) | ggml_tensor* → vector<float> of inverted alphas | Alpha tensors are constant for the model lifetime. Same lifetime as g_weight_cpu_mirror. |
| g_hann_window_cache / g_istft_kernel_cache (cached_*) | n_fft → vector<float> | Pure functions of n_fft (constant 16 in the chatterbox HiFT vocoder). |
| g_window_sum_cache (cached_window_sum) | pack(n_fft, hop, T_stft) → vector<float> | Stable across same-shape synth calls. |
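
The scaffolding caches in the table all follow the same memoise-a-pure-function shape. A minimal sketch for the Hann window case, with two assumptions: the name `cached_hann_window_sketch` is illustrative, and the periodic window formula w[n] = 0.5 − 0.5·cos(2πn / n_fft) is a common convention that the source does not confirm for chatterbox.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Returns a reference to a memoised window for n_fft, computing it once.
// std::map never relocates its nodes, so the returned reference stays valid
// until the map itself is cleared.
const std::vector<float>& cached_hann_window_sketch(int n_fft) {
    static std::mutex mu;
    static std::map<int, std::vector<float>> cache;
    assert(n_fft > 0);
    std::lock_guard<std::mutex> lock(mu);
    auto it = cache.find(n_fft);
    if (it == cache.end()) {
        const double two_pi = 2.0 * std::acos(-1.0);
        std::vector<float> w(static_cast<std::size_t>(n_fft));
        for (int n = 0; n < n_fft; ++n) {
            // ASSUMPTION: periodic Hann window.
            w[static_cast<std::size_t>(n)] =
                static_cast<float>(0.5 - 0.5 * std::cos(two_pi * n / n_fft));
        }
        it = cache.emplace(n_fft, std::move(w)).first;
    }
    return it->second;
}
```

Because n_fft is constant (16) in the HiFT vocoder, this map would hold a single entry in practice; the second and later calls cost one lock plus one lookup.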

A small generic graph_cache struct (used by the encoder / HiFT / F0 / STFT caches) and pack_hift_key helper centralise the destroy()-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition.
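
The rebuild-on-key-change rule shared by these caches can be sketched without any ggml dependency. Everything here is hypothetical: `ShapeKeyedCache` and `pack_shape_key` are illustrative names, and a byte vector stands in for the real metadata arena, graph and gallocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs two non-negative 32-bit shape values into one 64-bit key.
inline std::int64_t pack_shape_key(std::int32_t a, std::int32_t b) {
    assert(a >= 0 && b >= 0);
    const std::uint64_t hi = static_cast<std::uint32_t>(a);
    const std::uint64_t lo = static_cast<std::uint32_t>(b);
    return static_cast<std::int64_t>((hi << 32) | lo);
}

struct ShapeKeyedCache {
    std::int64_t key = -1;          // -1 means "nothing cached"
    std::vector<std::uint8_t> buf;  // stand-in for the per-graph arena
    int builds = 0;                 // how many times the graph was (re)built

    // Returns true on a cache hit; otherwise tears down and rebuilds.
    bool ensure(std::int64_t want, std::size_t arena_bytes) {
        assert(want >= 0);
        if (key == want) return true;
        destroy();
        buf.resize(arena_bytes);    // the real code would build the graph here
        key = want;
        ++builds;
        return false;
    }

    // Drops the cached state but keeps the vector's capacity for reuse.
    void destroy() {
        buf.clear();
        key = -1;
    }
};
```

A streaming session that alternates between shapes rebuilds on every change with this single-slot design, which is the behaviour behind the shrinking per-chunk savings reported in the streaming table.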

Round 3 — §3.34 (commit fff9820)

  • Multilingual GGUF conversion + verification: every cache invariant from rounds 1 + 2 runs against the actual multilingual model (converted from-source via scripts/convert-{t3-mtl,s3gen}-to-gguf.py @ Q4_0 from the public ResembleAI/chatterbox HuggingFace repo).
  • 19 new variant-aware test assertions that auto-detect the variant from cache populations:
    • time_mlp == 10 ∧ time_emb == 0 ⇒ multilingual: assert every cosine t_span entry t = 1 − cos(i/10 × π/2) for i in 0..9 lands in g_time_mlp_results with shape (1024,).
    • time_mlp ≤ 3 ∧ time_emb == 2 ⇒ Turbo: assert t = 0.5 is cached.
  • Fused CFG-combine + Euler step in the multilingual non-meanflow CFG path: saves one pass over the (T_mu × MEL) dxdt vector per step. Slow path (debug_mode && meanflow) and (s == 0 && !dump_mel_path.empty()) keep the explicit two-pass form so the post-combine dxdt_cond value is still readable from the debug prints + _step0_dxdt.npy dump.
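
The fusion in the last bullet can be illustrated as below. This is a sketch under a stated assumption: the CFG combine is taken to be d = (1 + w)·cond − w·uncond, a common classifier-free-guidance form that the PR text does not spell out. Function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass reference: pass 1 overwrites dxdt_cond with the combined vector
// (so debug hooks can read it), pass 2 applies the Euler update.
void cfg_then_euler(std::vector<float>& x, std::vector<float>& dxdt_cond,
                    const std::vector<float>& dxdt_uncond, float cfg, float dt) {
    assert(x.size() == dxdt_cond.size() && x.size() == dxdt_uncond.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        dxdt_cond[i] = (1.0f + cfg) * dxdt_cond[i] - cfg * dxdt_uncond[i];
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += dt * dxdt_cond[i];
}

// Fused form: one pass, same per-element operations in the same order, but
// the combined value lives only in a local and is never written back.
void cfg_euler_fused(std::vector<float>& x, const std::vector<float>& dxdt_cond,
                     const std::vector<float>& dxdt_uncond, float cfg, float dt) {
    assert(x.size() == dxdt_cond.size() && x.size() == dxdt_uncond.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float d = (1.0f + cfg) * dxdt_cond[i] - cfg * dxdt_uncond[i];
        x[i] += dt * d;
    }
}
```

The fused loop leaves dxdt_cond untouched, which is why a debug or dump path that needs the post-combine vector has to stay on the two-pass form.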

Round 4 — §3.35 (commit 78e4275 + 8abfd9b)

Per-(n_past, is_uncond)-keyed graph cache for build_step_graph_mtl in src/t3_mtl.cpp. Each entry holds:

  • int64_t keypack(n_past, is_uncond);
  • ggml_context * ctx — per-entry metadata arena (no shared thread_local buf — would conflict with cached graphs);
  • ggml_cgraph * gf — the cached graph;
  • std::vector<uint8_t> buf — the arena bytes.

No per-entry gallocator. An earlier prototype gave each cached entry its own ggml_gallocr_t + ~1 MB backend buffer, which paid off on multi-synth workloads but added a ~10 % T3 regression on single-utterance runs (272 misses × 1 MB = ~270 MB of allocator churn on synth #1). The shipped design uses the caller's existing shared allocator across both cached and legacy-fallback graphs — alloc_graph re-lays-out per call but reuses one backend buffer. Cache hits still skip the ~3 ms build cost.

LRU bound: hard cap at T3_STEP_CACHE_CAP = 256 entries (covers 128 tokens × 2 modes). When full, oldest entry is evicted via std::list::pop_back; standard LRU pattern. Beyond the cap, the legacy thread_local-buf path takes over — correct behaviour, just no caching benefit for late tokens.
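
A std::list-backed LRU of the kind described can be sketched as follows. This is a generic illustration, not the code in src/t3_mtl.cpp: `LruCache` and `pack_step_key` are hypothetical names, the key packing is an assumption, and the sketch always evicts when full rather than falling back to an uncached path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// ASSUMPTION: one possible packing of (n_past, is_uncond) into a 64-bit key.
inline std::int64_t pack_step_key(std::int32_t n_past, bool is_uncond) {
    assert(n_past >= 0);
    return (static_cast<std::int64_t>(n_past) << 1) | (is_uncond ? 1 : 0);
}

template <typename V>
class LruCache {
public:
    explicit LruCache(std::size_t cap) : cap_(cap) { assert(cap > 0); }

    // Returns the cached value and marks it most recently used, or nullptr.
    V* get(std::int64_t key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // move to front
        return &it->second->second;
    }

    // Inserts or overwrites; evicts the least recently used entry when full.
    void put(std::int64_t key, V value) {
        if (V* hit = get(key)) { *hit = std::move(value); return; }
        if (order_.size() == cap_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<std::int64_t, V>;
    std::size_t cap_;
    std::list<Entry> order_;  // front = most recent, back = oldest
    std::unordered_map<std::int64_t, typename std::list<Entry>::iterator> index_;
};
```

List iterators stay valid across splice, so the index map never needs fixing up on a hit; only insertion and eviction touch it.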

Opt-in via CHATTERBOX_T3_STEP_CACHE env var (default OFF). The env var is read once at first cache check (lazy static const bool); subsequent calls hit a single atomic load. Default-OFF imposes no measurable cost on single-utterance.
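
The read-once gate can be sketched with a function-local static, whose initialisation C++11 guarantees to be thread-safe. The function names are hypothetical, and treating only the exact value "1" as "on" is an assumption; the source documents `CHATTERBOX_T3_STEP_CACHE=1` but not how other values are handled.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// ASSUMPTION: only the literal string "1" enables the cache.
inline bool parse_opt_in(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the environment on the first call only; later calls return the
// already-initialised static.
inline bool t3_step_cache_enabled_sketch() {
    static const bool enabled =
        parse_opt_in(std::getenv("CHATTERBOX_T3_STEP_CACHE"));
    return enabled;
}
```

One consequence of this pattern is that changing the variable after the first check has no effect for the rest of the process lifetime.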

detail::t3_release_caches() is the public teardown entrypoint, called from:

  • chatterbox_cli.cpp::free_t3 — both the synthesis path and the streaming path;
  • chatterbox_engine.cpp::Impl::free_model;
  • an atexit handler registered on first cache insertion (fallback for code paths that don't go through the explicit teardown).

All three entry points fire BEFORE ggml_backend_free(model.backend) so the cached ggml_context and any future backend-bound resources release cleanly.
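
The teardown contract (explicit release, idempotent, with an atexit fallback registered on first insertion) can be sketched as below. The `sketch` namespace and all names in it are hypothetical, and a byte vector stands in for the cached ggml_context and graph.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

namespace sketch {

std::mutex g_mu;
std::map<std::int64_t, std::vector<std::uint8_t>> g_step_cache;

// Safe to call any number of times; callers invoke it before freeing the
// backend so cached resources never outlive it.
void release_caches() {
    std::lock_guard<std::mutex> lock(g_mu);
    g_step_cache.clear();
}

void insert_entry(std::int64_t key, std::vector<std::uint8_t> arena) {
    // Registers the fallback exactly once, on the first insertion.
    static const bool registered = (std::atexit(release_caches) == 0);
    (void)registered;
    std::lock_guard<std::mutex> lock(g_mu);
    g_step_cache[key] = std::move(arena);
}

std::size_t cache_size() {
    std::lock_guard<std::mutex> lock(g_mu);
    return g_step_cache.size();
}

}  // namespace sketch
```

Because the handler is registered after the namespace-scope objects were constructed, it runs before their destructors at exit, so the fallback still sees a live mutex and map.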

Round 5 — §3.36 (commit d1e8bbb)

Closes the only remaining deferred-followup from §3.33: the persistent run_stft graph cache.

| Cache | What it stores | Key |
|---|---|---|
| g_stft_graph_cache | full run_stft conv1d graph + gallocator + 4 MB context buffer | T_src = T_mel × 480; rebuilds on streaming-shape change |
| g_stft_kernel_cache (cached_stft_kernel) | n_fft → vector<float> of the analysis kernel from build_stft_kernel(n_fft, cached_hann_window(n_fft)) | n_fft (constant 16 in chatterbox HiFT) |

Per synth, this eliminates: a 4 MB std::vector<uint8_t> allocation + ggml_init + ggml_new_graph_custom(8192) + a fresh ggml_gallocr_t + the backend buffer it reserves. The buffer reservation is reused across rebuilds (graph_cache::destroy() preserves the buf capacity), eliminating heap-fragmentation churn in long-running streaming sessions.


Negative result documented (round 1)

Tried adding last_mu_ptr / last_spks_ptr / last_cond_ptr tracking to cfm_estimator_cache to skip redundant ggml_backend_tensor_set for mu/spks/cond on the second CFM step. F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete, and CFM mu/spks/cond are referenced only at the start of the graph (via ggml_concat(x_in, mu_in, spks_bc, cond_in)); their slots become reusable for downstream intermediates immediately. Same finding as FINDINGS_ROUND_HIFT.md §2-bis.4. Reverted.

General rule reinforced: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes.


Test infrastructure

Two harnesses, 339 / 339 cache checks green across the full matrix:

| Suite | Model arg | Checks |
|---|---|---|
| test-cpu-caches | none (cache-key + initial-state only) | 27 / 27 |
| test-cpu-caches | chatterbox-s3gen-turbo.gguf | 94 / 94 |
| test-cpu-caches | chatterbox-s3gen-mtl-q4_0.gguf | 113 / 113 |
| test-t3-caches | none (initial-state only) | 6 / 6 |
| test-t3-caches | chatterbox-t3-mtl-q4_0.gguf | 99 / 99 |

src/test_cpu_caches.cpp (685 lines, NEW) covers the S3Gen-side caches from rounds 1, 2, 3 and 5.

src/test_t3_caches.cpp (452 lines, NEW) covers:

  • Initial state (cache empty before any eval_step_mtl; idempotent t3_release_caches()).
  • Step lifecycle: single call populates 2 entries (cond + uncond at n_past=0); same-key second call is a hit (size unchanged, hits=2); different-n_past adds 2 new entries; bit-exact logits across cold/warm at the same (n_past, token); explicit teardown drops every entry.
  • Multi-synth amortisation: 16 step calls at distinct n_past (cold pass populates 32 entries) followed by re-running the same sequence (warm pass — every call is a hit); bit-exact logits across both passes; warm pass measurably faster than cold pass (asserted as inequality, not percentage threshold, to stay robust under CPU jitter).

Cache observability is exposed via src/chatterbox_tts_test_hooks.h — under src/, NOT in include/, explicitly out of the public surface so production callers cannot take a dependency on cache layout.


Reproduction

# 1. Branch setup
cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp
git checkout chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU

# 2. Build (CPU-only — no Vulkan/Metal/CUDA needed for this PR)
cmake -S . -B build-cpu -DCMAKE_BUILD_TYPE=Release \
      -DGGML_VULKAN=OFF -DGGML_METAL=OFF -DGGML_CUDA=OFF \
      -DTTS_CPP_BUILD_TESTS=ON
cmake --build build-cpu -j16

# 3. Cache validation (339 / 339 checks expected across all suites)
./build-cpu/test-cpu-caches                                          # 27 cache-key + initial-state
./build-cpu/test-cpu-caches models/chatterbox-s3gen-turbo.gguf       # 94
./build-cpu/test-cpu-caches models/chatterbox-s3gen-mtl-q4_0.gguf    # 113
./build-cpu/test-t3-caches                                           # 6
./build-cpu/test-t3-caches models/chatterbox-t3-mtl-q4_0.gguf        # 99

# 4. End-to-end multilingual smoke (default cache-OFF; round-4 cache disabled)
./build-cpu/tts-cli --model models/chatterbox-t3-mtl-q4_0.gguf \
                    --s3gen-gguf models/chatterbox-s3gen-mtl-q4_0.gguf \
                    --text "Hola mundo, esta es una prueba multilingue del modelo CFG." \
                    --language es --out /tmp/mtl.wav --threads 8 \
                    --seed 42 --temp 0 --top-k 1 --cfg-weight 0.5

# 5. Server-mode (round-4 T3 step-graph cache enabled)
CHATTERBOX_T3_STEP_CACHE=1 ./build-cpu/tts-cli ...     # ~12 % T3 wall reduction on synth #2+

# 6. Streaming-mode regression (per-chunk RTF; rounds 2 + 5 amortise)
./build-cpu/tts-cli --model models/chatterbox-t3-turbo-q4_0.gguf \
                    --s3gen-gguf models/chatterbox-s3gen-turbo.gguf \
                    --text "First sentence. Second sentence here. Third sentence too." \
                    --out /tmp/stream.wav --threads 8 \
                    --seed 42 --temp 0 --top-k 1 \
                    --stream-first-chunk-tokens 10 --stream-chunk-tokens 25
# Each S3GEN_INFER_MS line should beat the round-1-only numbers in PROGRESS.md §3.33.

The multilingual GGUFs reproduce from the public ResembleAI/chatterbox HuggingFace repo using the existing scripts/convert-{t3-mtl,s3gen}-to-gguf.py tooling at Q4_0 (~3.2 GB source → ~1.1 GB GGUFs).


Memory cap

Every cache is bounded by the number of distinct shape keys it sees across the process lifetime. Steady-state for a single-utterance multilingual synth:

| Cache | Per-entry size | Typical key set on multilingual |
|---|---|---|
| g_time_mlp_results | 1024 F32 = 4 KB | 10 entries (cosine t_span) |
| g_time_emb_results | 1024 F32 = 4 KB | 0 (multilingual non-meanflow) |
| g_weight_cpu_mirror | up to 28 MB (flow/input_embedding) | ~3 entries |
| g_cfm_estimator_cache | 64 MB arena | 1 |
| g_encoder_graph_cache | 64 MB arena | 1 per distinct T |
| g_hift_graph_cache | 64 MB arena | 1 per distinct (T_mel, T_stft) |
| g_f0_graph_cache | 8 MB arena | 1 per distinct T_mel |
| g_stft_graph_cache | 4 MB arena | 1 per distinct T_src |
| g_pos_emb_results | ~2.3 MB at T=600 | 2 per distinct chunk size |
| g_inv_alpha_results | up to ~256 F32 / entry | ~72 entries (one per alpha) |
| g_hann_window_cache | n_fft × 4 B ≈ 64 B | 1 (constant n_fft=16) |
| g_istft_kernel_cache | n_fft × 2 × (n_fft/2 + 1) × 4 B ≈ 1.1 KB | 1 |
| g_stft_kernel_cache | same shape as istft_kernel ≈ 1.1 KB | 1 |
| g_window_sum_cache | ((T_stft − 1) × hop + n_fft) × 4 B (≤ a few hundred KB) | 1 per distinct T_stft |
| g_t3_step_cache (opt-in) | ~1.2 MB metadata arena | up to 256 entries (LRU cap) |

Total steady state for single-utterance with the round-4 cache off: ~250 MB. With round-4 on at full LRU saturation: ~560 MB.

For long-running streaming sessions with many distinct chunk sizes, the graph-cache arenas (64 MB × 3 + 8 MB + 4 MB) plus g_pos_emb_results (~2.3 MB × N) grow with the number of distinct shapes — see deferred follow-up #1 below.


Deferred follow-ups (separate PRs)

Round 5 closed the run_stft cache item; the remaining list is:

| Candidate | Estimated win | Why deferred |
|---|---|---|
| LRU eviction for the round-2 + round-5 graph + shape-keyed caches | n/a (memory cap) | The encoder / HiFT / F0 / STFT graph caches each currently hold one (arena + gallocator) per distinct shape. Long-running streaming with many shape changes can grow unbounded. A small LRU bound (say 8 entries) would handle server-mode deployments. Round 4 already established the LRU pattern (std::list-backed); applying it to round-2 + round-5 graph caches is a straight port. Out of scope for the optimisation pass. |
| HiFT conv weight quantisation (§3.20 backlog #4) | ~10 % MTL CPU wall | Independent of caching; blocked on the conv1d_f32 arg-order refactor. Mirrors the conv1d_f32_b pattern. |
| Heterogeneous-core thread default (§3.20 backlog #5) | ~10 % on M4 | Hardware-bound; needs hwloc or per-platform sysctl to detect efficiency cores. Orthogonal. |
| ggml-cpu Q4_0 / Q4_K kernel optimisation | unknown; potentially large | The remaining ~95 % of multilingual CPU wall (~5500 ms / synth) is real ggml-cpu Q4_0 matmul work. Out of scope for chatterbox.cpp; lives in ggml/src/ggml-cpu/. |
| Mobile validation (Adreno / Mali / Apple Silicon CPU) | Unknown on CPU; biggest unknown | Hardware-bound. Same gap noted in OPTIMIZATION_PLAN_NEXT.md for the Vulkan branch. |

Files

CMakeLists.txt                   +18      (test-cpu-caches + test-t3-caches targets, build flags)
PROGRESS.md                      +617     (§3.32 + §3.33 + §3.34 + §3.35 + §3.36)
src/chatterbox_cli.cpp           +9       (free_t3 calls t3_release_caches in 2 paths)
src/chatterbox_engine.cpp        +5       (Impl::free_model calls t3_release_caches)
src/chatterbox_t3_internal.h     +10      (detail::t3_release_caches decl)
src/chatterbox_tts.cpp           +892 / -127  (cache infra + 6 graph caches + 11 cached helpers + test-hook bridges)
src/chatterbox_tts_test_hooks.h  +165     NEW  (round 1..5 hook decls)
src/t3_mtl.cpp                   +348     (round-4 step-graph cache + opt-in gate + test bridges)
src/test_cpu_caches.cpp          +685     NEW  (rounds 1, 2, 3, 5 — 113 multilingual / 94 Turbo checks)
src/test_t3_caches.cpp           +452     NEW  (round 4 — 99 multilingual T3 checks)

Total: 10 files, +3201 / −127 lines (net +3074), 6 commits on top of 21896a3.

No public-API change. include/tts-cpp/chatterbox/s3gen_pipeline.h remains untouched. The cache observability hooks live in src/chatterbox_tts_test_hooks.h (under src/, not include/), explicitly out of the public surface so production callers cannot take a dependency on cache layout.


Status

Clean six-commit branch on top of 21896a3. Headline numbers:

  • −22 % S3GEN_INFER_MS on Turbo single-utterance (794 → 619 ms; bit-exact across cold / warm / post-unload).
  • −27 % total wall on streaming (tts-cli --stream-chunk-tokens 25, 21 chunks: ~48 s → ~35 s).
  • −1.8 % S3GEN_INFER_MS on multilingual single-utterance (3868 → 3797 ms on the 24-token harness; sub-noise on a single synth; compounds on streaming where multiple synths share warm caches).
  • ~12 % T3 wall reduction on synth #2+ in multi-synth processes (~256 ms / synth on multilingual at the default 256-entry LRU cap), gated behind CHATTERBOX_T3_STEP_CACHE=1 so single-utterance CLI users see no regression.
  • Multilingual end-to-end CPU RTF 1.45 at Q4_0 / 8 ggml threads / 16-thread x86_64 host.

All cache + bit-exact + shape-invalidation checks green: 339 total across test-cpu-caches (27 + 94 + 113) and test-t3-caches (6 + 99).

After this PR, every host-side per-synth allocation cycle on the multilingual CPU pipeline is cached:

T3:        prompt + step graphs   → step cached (opt-in, round 4)
Encoder:   graph + pos_emb        → cached (round 2)
CFM:       estimator graph + time_mlp/time_emb + weight mirror → cached (round 1)
F0:        graph                  → cached (round 2)
HiFT:      graph + inv_alpha + hann + istft kernel + window_sum → cached (round 2)
STFT:      graph + analysis kernel → cached (round 5)
SineGen:   pure compute (per-call RNG-seeded; not cacheable)

The remaining ~95 % of multilingual CPU wall is real ggml-cpu Q4_0 matmul work, which lives in ggml/src/ggml-cpu/ and is out of scope for chatterbox.cpp.

Zbig9000 force-pushed the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch from 6fb0a3f to d1e8bbb on May 6, 2026 14:45
Zbig9000 added 6 commits May 6, 2026 18:33
…d 2)

PROGRESS.md §3.33 — persistent encoder/HiFT/F0 graph caches +
pos_emb / inv_alpha / hann_window / istft_kernel / window_sum
scaffolding caches on top of the round-1 CFM caches (§3.32).
Turbo single-utterance S3GEN_INFER_MS -22 %, streaming wall -27 %.
Tests: 79/79 pass (49 new round-2 checks).
…d 3)

PROGRESS.md §3.34 — multilingual verification (Turbo 80/80,
multilingual 99/99 checks pass; bit-exact synth-twice on the
converted-from-source MTL Q4_0 GGUF) + 19 new multilingual-specific
test assertions (cosine schedule produces exactly 10 distinct
g_time_mlp_results entries) + fused CFG-combine + Euler step in the
non-meanflow CFG path of synthesize().  Sub-noise wall-time saving
on a single multilingual synth (~8 s); biggest remaining host-side
win is T3 step-graph caching, documented as deferred follow-up.
…d 4)

PROGRESS.md §3.35 — T3 step-graph cache (multilingual CFG token
decode) opt-in via CHATTERBOX_T3_STEP_CACHE.  Per-(n_past,
is_uncond) std::list-LRU cache (cap 256) for build_step_graph_mtl;
saves ~3 ms per cache hit.  Single-utterance default-OFF (no
hits-to-amortise on synth #1) keeps the existing path
regression-free; server-mode opt-in shows ~15 % per-pass speedup
(~256 ms / synth #2 of multilingual at 136 tokens).  Tests:
src/test_t3_caches.cpp NEW with 99 checks (lifecycle + bit-exact
cold/warm logits + multi-synth amortisation timing).  Lifecycle
wired into free_t3 (CLI, both paths), Impl::free_model (Engine),
and an atexit fallback — all firing BEFORE ggml_backend_free.
Total cache test suite green: 80 + 99 + 6 + 99 = 284 / 284.
Zbig9000 force-pushed the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch from d1e8bbb to eadf88f on May 6, 2026 16:51
GustavoA1604 merged commit dd5b3f3 into GustavoA1604:main on May 6, 2026
GustavoA1604 added a commit that referenced this pull request May 6, 2026
…timize-cpp-backend-multilingual-for-CPU

QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request May 6, 2026
Mirrors the layout established for the other test_*.cpp harnesses
(791759c).  Pure rename (git detects 100 % similarity) plus the
matching CMakeLists.txt path updates for the test-cpu-caches and
test-t3-caches executable targets.  No source changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 deleted the chatterbox-QVAC-18422-TTS-GGML-Optimize-cpp-backend-multilingual-for-CPU branch on May 7, 2026 07:57
GustavoA1604 added a commit that referenced this pull request May 7, 2026
Mirrors the parakeet-cpp port README layout so a downstream consumer
can answer 'what does this library do, how do I link it, and which
CMake knobs do I need to know about?' from the top of the README
without scrolling through the 1300-line standalone development walk-
through.  No content removed; existing standalone material stays
verbatim, just shifted down by ~80 lines.

Adds three new blocks near the top:

- ## API overview (between the benchmark tables and 'Pipeline at a
  glance').  Two-row table for the high-level entry points exported
  through TTS_CPP_API:
    * tts_cpp::chatterbox::Engine::synthesize  - Chatterbox T3+S3Gen+HiFT
    * tts_cpp::supertonic::synthesize          - Supertonic CPU TTS
  Trailing paragraph mentions the lower-level helpers
  (s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload /
  tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly
  flags that detail-namespaced symbols (used by the supertonic /
  chatterbox test harnesses) are not part of the public API and are
  hidden in SHARED builds.

- ### Consumer integration (subsection of API overview).  Calls out
  that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port
  consumes ggml from the qvac-ext-ggml/speech branch directly
  (Metal / OpenCL / Vulkan patches included) and does NOT ship
  scripts/setup-ggml.sh or patches/ - those are standalone-dev tools
  maintained in this repo only.  Provides the
  find_package(tts-cpp CONFIG REQUIRED) +
  target_link_libraries(... tts-cpp::tts-cpp) + 8-line
  Engine::synthesize C++ snippet that's the entire consumer-side
  integration.

- ### Useful CMake options (inside section 1, between the GPU backend
  paragraph and the binaries table).  Full table of the project-
  namespaced flags:
    TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8),
    TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL,
    TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE
    (new from items 7+8).
  Plus a secondary table for the ctest-fixture cache paths
  (TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the
  REQUIRES auto-disable behaviour from item 7.

Touches existing prose in two places:

- The setup-ggml.sh paragraph in section 1 gets a one-paragraph
  follow-up clarifying it (and patches/) are standalone-development
  tools only, with a back-link to the Consumer integration section
  (item 9: 'document setup-ggml.sh inertness' folded into this
  framing rather than landed as a separate doc-only commit).  Also
  strengthens the existing 'Re-running is safe' line to 'idempotent
  and destructive' so a dev hacking on ./ggml is warned before
  losing local edits.

- The ### Alternative: consume ggml from vcpkg subsection now opens
  with one sentence positioning it as the CMake-mechanic detail
  behind the Consumer integration story, with a forward link to the
  qvac-ext-ggml/speech branch.

Also updates the binaries table in section 1 to list the missing
PR #6 + PR #7 binaries that landed since the README was last
refreshed: supertonic-cli, supertonic-bench, test-cpu-caches,
test-t3-caches, and the test-supertonic-* family.  Trailing paragraph
notes that test-* binaries register with CTest so
`ctest -C Release -L unit` / `ctest -C Release -L fixture` works
out of the build directory.

No code changes, no CMake changes, no install behaviour changes.
README.md +128 / -10 lines.

Co-authored-by: Cursor <cursoragent@cursor.com>