QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU #6
Merged
6fb0a3f to
d1e8bbb
Compare
…d 2) PROGRESS.md §3.33 — persistent encoder/HiFT/F0 graph caches + pos_emb / inv_alpha / hann_window / istft_kernel / window_sum scaffolding caches on top of the round-1 CFM caches (§3.32). Turbo single-utterance S3GEN_INFER_MS -22 %, streaming wall -27 %. Tests: 79/79 pass (49 new round-2 checks).
…d 3) PROGRESS.md §3.34 — multilingual verification (Turbo 80/80, multilingual 99/99 checks pass; bit-exact synth-twice on the converted-from-source MTL Q4_0 GGUF) + 19 new multilingual-specific test assertions (cosine schedule produces exactly 10 distinct g_time_mlp_results entries) + fused CFG-combine + Euler step in the non-meanflow CFG path of synthesize(). Sub-noise wall-time saving on a single multilingual synth (~8 s); biggest remaining host-side win is T3 step-graph caching, documented as deferred follow-up.
…d 4) PROGRESS.md §3.35: T3 step-graph cache (multilingual CFG token decode), opt-in via CHATTERBOX_T3_STEP_CACHE. Per-(n_past, is_uncond) std::list-LRU cache (cap 256) for build_step_graph_mtl; saves ~3 ms per cache hit. Default-OFF for single-utterance runs (no hits to amortise on synth #1), which keeps the existing path regression-free; server-mode opt-in shows ~15 % per-pass speedup (~256 ms / synth #2 of multilingual at 136 tokens). Tests: src/test_t3_caches.cpp NEW with 99 checks (lifecycle + bit-exact cold/warm logits + multi-synth amortisation timing). Lifecycle wired into free_t3 (CLI, both paths), Impl::free_model (Engine), and an atexit fallback, all firing BEFORE ggml_backend_free. Total cache test suite green: 80 + 99 + 6 + 99 = 284 / 284.
d1e8bbb to
eadf88f
Compare
GustavoA1604 added a commit that referenced this pull request on May 6, 2026
…timize-cpp-backend-multilingual-for-CPU QVAC-18422 [TTS GGML] Optimize cpp backend multilingual for CPU Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request on May 6, 2026
Mirrors the layout established for the other test_*.cpp harnesses (791759c). Pure rename (git detects 100 % similarity) plus the matching CMakeLists.txt path updates for the test-cpu-caches and test-t3-caches executable targets. No source changes. Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit that referenced this pull request on May 7, 2026
Mirrors the parakeet-cpp port README layout so a downstream consumer
can answer 'what does this library do, how do I link it, and which
CMake knobs do I need to know about?' from the top of the README
without scrolling through the 1300-line standalone development walk-
through. No content removed; existing standalone material stays
verbatim, just shifted down by ~80 lines.
Adds three new blocks near the top:
- ## API overview (between the benchmark tables and 'Pipeline at a
glance'). Two-row table for the high-level entry points exported
through TTS_CPP_API:
* tts_cpp::chatterbox::Engine::synthesize - Chatterbox T3+S3Gen+HiFT
* tts_cpp::supertonic::synthesize - Supertonic CPU TTS
Trailing paragraph mentions the lower-level helpers
(s3gen_synthesize_to_wav / s3gen_preload / s3gen_unload /
tts_cpp_cli_main), points at <tts-cpp/export.h>, and explicitly
flags that detail-namespaced symbols (used by the supertonic /
chatterbox test harnesses) are not part of the public API and are
hidden in SHARED builds.
- ### Consumer integration (subsection of API overview). Calls out
that the qvac speech-stack qvac-ext-lib-whisper.cpp wrapper port
consumes ggml from the qvac-ext-ggml/speech branch directly
(Metal / OpenCL / Vulkan patches included) and does NOT ship
scripts/setup-ggml.sh or patches/ - those are standalone-dev tools
maintained in this repo only. Provides the
find_package(tts-cpp CONFIG REQUIRED) +
target_link_libraries(... tts-cpp::tts-cpp) + 8-line
Engine::synthesize C++ snippet that's the entire consumer-side
integration.
- ### Useful CMake options (inside section 1, between the GPU backend
paragraph and the binaries table). Full table of the project-
namespaced flags:
TTS_CPP_BUILD_LIBRARY, TTS_CPP_BUILD_SHARED (new from items 7+8),
TTS_CPP_BUILD_EXECUTABLES, TTS_CPP_BUILD_TESTS, TTS_CPP_INSTALL,
TTS_CPP_USE_SYSTEM_GGML, TTS_CPP_GGML_LIB_PREFIX, TTS_CPP_CCACHE
(new from items 7+8).
Plus a secondary table for the ctest-fixture cache paths
(TTS_CPP_TEST_{MODEL,AUDIO,REF}_DIR) and a one-liner explaining the
REQUIRES auto-disable behaviour from item 7.
Touches existing prose in two places:
- The setup-ggml.sh paragraph in section 1 gets a one-paragraph
follow-up clarifying it (and patches/) are standalone-development
tools only, with a back-link to the Consumer integration section
(item 9: 'document setup-ggml.sh inertness' folded into this
framing rather than landed as a separate doc-only commit). Also
strengthens the existing 'Re-running is safe' line to 'idempotent
and destructive' so a dev hacking on ./ggml is warned before
losing local edits.
- The ### Alternative: consume ggml from vcpkg subsection now opens
with one sentence positioning it as the CMake-mechanic detail
behind the Consumer integration story, with a forward link to the
qvac-ext-ggml/speech branch.
Also updates the binaries table in section 1 to list the missing
PR #6 + PR #7 binaries that landed since the README was last
refreshed: supertonic-cli, supertonic-bench, test-cpu-caches,
test-t3-caches, and the test-supertonic-* family. Trailing paragraph
notes that test-* binaries register with CTest so
`ctest -C Release -L unit` / `ctest -C Release -L fixture` works
out of the build directory.
No code changes, no CMake changes, no install behaviour changes.
README.md +128 / -10 lines.
Co-authored-by: Cursor <cursoragent@cursor.com>
CPU-side optimisation pass for chatterbox.cpp. Five layered passes that close the per-synth host-CPU envelope outside the heavy matmuls (which §3.20 already drove down via Q4_0/Q8_0 weight quantisation). All bit-exact-preserving, all model-agnostic by construction.

- Round 1 (§3.32): `g_time_mlp_results` (10 graph submissions / synth → 0 on multilingual cosine schedule), `g_time_emb_results` (Turbo only), persistent `g_cfm_estimator_cache` (graph rebuild → 0 across synths), `g_weight_cpu_mirror` for `flow/input_embedding` (~13–28 MB) + `flow/spk_embed_affine/{w,b}`.
- Round 2 (§3.33): persistent encoder / HiFT / F0 graph caches (keyed on `T` or `pack(T_mel, T_stft)`), plus pure-compute scaffolding helpers: `cached_pos_emb` (the dominant scaffolding cost: `T × D × ~5` trig ops, fired twice per encoder run), `cached_inv_alpha` (~72 `tensor_get + per-element 1/x` calls per HiFT call), `cached_hann_window`, `cached_istft_kernel`, `cached_window_sum`.
- Round 3 (§3.34): multilingual verification on the converted Q4_0 GGUF (`chatterbox-s3gen-mtl-q4_0.gguf`); auto-detection of variant (Turbo vs multilingual) inside the test harness; fused CFG-combine + Euler step in the multilingual non-meanflow CFG path (saves one pass over `dxdt` per step; gated on `!debug_mode && !dump_mel_path` so the slow-path debug hooks still see the post-combine vector).
- Round 4 (§3.35): per-`(n_past, is_uncond)` graph cache for `build_step_graph_mtl`. Multilingual fires this 2× per token; a 136-token Spanish synth previously rebuilt 272 graphs at ~3 ms each (~800 ms / synth of pure host-CPU work). Opt-in via `CHATTERBOX_T3_STEP_CACHE=1` because single-utterance workloads see every step as a unique `n_past` (cache fills but nothing is re-used; bookkeeping costs ~10 % T3 regression). Server-mode wins: ~12 % T3 wall reduction on synth #2+ in the same process (~256 ms / synth on multilingual at the default 256-entry LRU cap). Lifecycle wired into `chatterbox_cli.cpp::free_t3` (both synthesis + streaming paths) and `chatterbox_engine.cpp::Impl::free_model` BEFORE `ggml_backend_free`, plus an atexit fallback.
- Round 5 (§3.36): `g_stft_graph_cache` (keyed on `T_src = T_mel × 480`) and `g_stft_kernel_cache` (keyed on `n_fft`). `run_stft` previously allocated a fresh 4 MB context buffer + ggml_init + graph build + `ggml_gallocr_t` + backend buffer every synth; the cache eliminates that whole cycle. Closes the only remaining round-2 deferred-followup. Streaming-shape invalidation is handled by the same `(cache.key != T_src)` rebuild rule as the other graph caches.

Headline numbers
Multilingual end-to-end CPU (this PR, three runs on the same Spanish prompt)

`./build-cpu/tts-cli` with the converted multilingual GGUFs, CFG enabled (`cfg_weight=0.5`, 8 ggml threads, seed 42, `temp 0 --top-k 1`), Linux 6.8 / x86_64 / 32-thread host:

136 speech tokens generated (CFG cond + uncond × 136 = 272 T3 step graphs).

Turbo, single-utterance bit-exact harness (`./build-cpu/test-cpu-caches`)

`S3GEN_INFER_MS` (synth #1 vs #2)

Bit-exact across cold (synth #1), warm (synth #2), and post-`s3gen_unload` (synth #3) WAV outputs: every diff sample = 0.

Streaming-mode wins (`tts-cli --stream-chunk-tokens 25`, 3-sentence prompt → 21 chunks)

Per-chunk savings shrink with chunk index because each chunk has a new `T` (the encoder input grows with the running prefix), so the encoder / HiFT / F0 / STFT graph caches rebuild on every chunk. The result + scaffolding caches (`pos_emb`, `inv_alpha`, `istft_kernel`, `hann_window`, `window_sum`, `stft_kernel`, plus round-1's `time_mlp_results` etc.) stay warm across every chunk.

Multilingual verification (round 3, §3.34)
The multilingual GGUFs were converted from the public `ResembleAI/chatterbox` HuggingFace repo using the existing `scripts/convert-{t3-mtl,s3gen}-to-gguf.py` tooling (Q4_0 to match the §3.20 baseline; ~3.2 GB of source files, ~1.1 GB of resulting GGUFs). Every cache invariant from §3.32 / §3.33 / §3.36 runs against the actual multilingual model:

- `./test-cpu-caches models/chatterbox-s3gen-turbo.gguf`
- `./test-cpu-caches models/chatterbox-s3gen-mtl-q4_0.gguf`

The 19 extra checks on multilingual are round-3 cosine-schedule assertions: every entry of `t_span = [1 − cos(i/10 · π/2)]` for `i in 0..9` must land in `g_time_mlp_results` after the first synth, each cached vector must be `(1024,)`, and the variant auto-detection (`time_mlp == 10 ∧ time_emb == 0`) must classify the model correctly.

Multilingual single-utterance synth-twice on the multilingual S3Gen GGUF (24-token harness):

(`S3GEN_INFER_MS` per synth, including the post-`s3gen_unload` run)

WAV output is byte-for-byte identical between synth #1, synth #2, and post-unload synth #3: every sample diff = 0. The relative per-synth delta is small because multilingual CFM compute is ~6× larger in absolute terms than on Turbo, so the constant per-synth host overhead amortises into a smaller fraction of total wall.
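The cosine-schedule assertion described in this section (ten `t_span` entries, each landing in `g_time_mlp_results` under its own key) can be illustrated with a minimal sketch. It assumes the cache key is the raw IEEE-754 bit pattern of a 32-bit float, in the spirit of the PR's `g_float_bits`; `float_bits`, `cosine_t_span`, and `distinct_keys` are hypothetical names used only for illustration, and the real harness may compute `t` at a different precision.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_set>
#include <vector>

// Hypothetical helper: reinterpret a float as its raw 32-bit pattern so that
// +0.0f and -0.0f (and distinct NaN payloads) map to distinct, stable keys.
static std::uint32_t float_bits(float v) {
    std::uint32_t u;
    std::memcpy(&u, &v, sizeof(u));
    return u;
}

// Cosine schedule from the text: t_i = 1 - cos(i/n * pi/2) for i in [0, n).
static std::vector<float> cosine_t_span(int n) {
    assert(n > 0);
    const double half_pi = std::acos(-1.0) / 2.0;
    std::vector<float> t(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        t[static_cast<std::size_t>(i)] =
            static_cast<float>(1.0 - std::cos(static_cast<double>(i) / n * half_pi));
    }
    return t;
}

// Number of distinct bit-cast keys a schedule would occupy in a result cache.
static std::size_t distinct_keys(const std::vector<float>& t_span) {
    std::unordered_set<std::uint32_t> keys;
    for (float t : t_span) keys.insert(float_bits(t));
    return keys.size();
}
```

Under these assumptions `distinct_keys(cosine_t_span(10))` is 10, and `float_bits(0.0f)` differs from `float_bits(-0.0f)`, which is the property the bit-cast key relies on.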
Why these specific levers
§3.20 (already shipped) drove down the compute-bound bulk of CPU multilingual wall time via Q4_0/Q8_0 quantisation of the CFM / Conformer / T3 linears. What's left after §3.20 is fixed per-synth host overhead that quantisation doesn't help:
- `gallocr_reserve` for every per-pipeline graph (CFM, encoder, HiFT, F0 predictor, STFT), none of which were cached across synth calls in `multilingual_merged`;
- `compute_time_mlp` graph submissions (10 / synth on multilingual's cosine schedule, with a constant set of t-values across every synth);
- `ggml_backend_tensor_get` of `flow/input_embedding` (~28 MB on multilingual) and `flow/spk_embed_affine/{w,b}`;
- `compute_pos_emb` (`T × D × ~5` trig ops, fired twice per encoder run for `T` and `2T`; on multilingual at T=350+, D=512 that's a real wedge of per-synth host time);
- … `n_fft=16` and `hop=4` are constants;
- `invert_alpha_cpu` fired ~72× per HiFT call (12 ResBlocks × 6 alpha tensors), each doing a `tensor_get` + per-element reciprocal;
- `build_step_graph_mtl` rebuilt every CFG token-decode step (272 graphs × ~3 ms each ≈ 800 ms / synth pure host-CPU work);
- `run_stft` re-allocated a fresh 4 MB context + gallocator + backend buffer + rebuilt the conv1d graph every synth.

This PR closes every one of those gaps with the same teardown discipline as the existing `thread_local time_mlp_cache`: cleared in `s3gen_release_synth_caches` before `ggml_backend_free` so the gallocators in the graph caches release against a still-valid backend (otherwise Vulkan / Metal / CUDA backend dylibs would hit their resource-leak asserts on process exit). The T3 cache mirrors the discipline in `detail::t3_release_caches()`, wired into `chatterbox_cli.cpp` and `chatterbox_engine.cpp`.

What this PR does NOT do (and why)

- … `permute` + `cont` at every attention block dominates the per-dispatch overhead, which is already negligible on `ggml-cpu`). The existing `use_b2 = !ggml_backend_is_cpu(...)` gate stays.
- … `mul_mat` isn't free against AVX-512 F32 kernels). This PR keeps F32.
- … `n_past` would either waste ~10 GB of metadata arenas or pay the same per-step bookkeeping cost on synth #1. The opt-in env-var keeps single-utterance CLI users at zero cost while letting server-mode operators flip it on.
- … `ggml-cpu` Q4_0 matmul compute. Fixing that lives in `ggml/src/ggml-cpu/`, not chatterbox.cpp.

What this PR does
Round 1: §3.32 (commit `b1c83b9`)

Four caches that target the dominant CFM-side per-synth overhead.

- `g_time_mlp_results` (`compute_time_mlp_cached`): … `n_timesteps=10`) is constant across every synth; entries populated once and reused forever. … `t_span = [0, 0.5, 1.0]`).
- `g_time_emb_results` (`compute_time_emb_cached`): … `(0, 0.5)` and `(0.5, 1)`).
- `g_cfm_estimator_cache` (promoted from local-scope): … `T` skips it. Existing `(cache.T != T) || (cache.b2 != needed)` keying handles streaming chunks that vary `T`.
- `g_weight_cpu_mirror` (`cached_cpu_weights_f32`): … `ggml_backend_tensor_get` per tensor; every subsequent synth returns the cached pointer in O(1). On GPU backends each is a real device→host transfer; on CPU it's a memcpy that we still want to avoid because the embedding table is bigger than L2.

Bit-cast cache key (`g_float_bits` / `g_float_pair_bits`) avoids the ambiguous `std::hash<float>` behaviour around `+0`/`-0` and NaN that varies between libstdc++ and libc++.

Round 2: §3.33 (commit `c3a98e5`)

Five new graph-and-scaffolding caches that close the remaining encoder + HiFT + F0 host overhead.
| Cache | Holds | Key |
| --- | --- | --- |
| `g_encoder_graph_cache` | `run_encoder` graph + `gallocator` | `T`; streaming chunks of varying length still produce correct output (rebuilds on key change). |
| `g_hift_graph_cache` (+ `g_hift_inv_alpha_entries` metadata) | `run_hift_decode` graph + `gallocator` | `pack(T_mel, T_stft)`. The parallel inv-alpha-input metadata lets cache hits re-feed each alpha-input slot from `g_inv_alpha_results` without rebuilding. |
| `g_f0_graph_cache` | `run_f0_predictor` graph + `gallocator` | `T_mel`. |
| `g_pos_emb_results` (`cached_pos_emb`) | `(T, D) → (2T-1, D) F32 vector` from `compute_pos_emb` | `compute_pos_emb` is pure compute (~5 trig ops × T × D). Fired twice per encoder run (T and 2T). |
| `g_inv_alpha_results` (`cached_inv_alpha`) | `ggml_tensor* → vector<float>` of inverted alphas | … `g_weight_cpu_mirror`. |
| `g_hann_window_cache` / `g_istft_kernel_cache` (`cached_*`) | `n_fft → vector<float>` | `n_fft` (constant 16 in the chatterbox HiFT vocoder). |
| `g_window_sum_cache` (`cached_window_sum`) | `pack(n_fft, hop, T_stft) → vector<float>` | |

A small generic `graph_cache` struct (used by the encoder / HiFT / F0 / STFT caches) and `pack_hift_key` helper centralise the `destroy()`-on-teardown pattern so future per-stage caches can plug in with one struct + one mutex acquisition.

Round 3: §3.34 (commit `fff9820`)

- … `scripts/convert-{t3-mtl,s3gen}-to-gguf.py` @ Q4_0 from the public `ResembleAI/chatterbox` HuggingFace repo).
- `time_mlp == 10 ∧ time_emb == 0` ⇒ multilingual: assert every cosine `t_span` entry `t = 1 − cos(i/10 × π/2)` for `i in 0..9` lands in `g_time_mlp_results` with shape `(1024,)`.
- `time_mlp ≤ 3 ∧ time_emb == 2` ⇒ Turbo: assert `t = 0.5` is cached.
- … `(T_mu × MEL)` `dxdt` vector per step. Slow path (`debug_mode && meanflow`) and `(s == 0 && !dump_mel_path.empty())` keep the explicit two-pass form so the post-combine `dxdt_cond` value is still readable from the debug prints + `_step0_dxdt.npy` dump.

Round 4: §3.35 (commits `78e4275` + `8abfd9b`)

Per-`(n_past, is_uncond)`-keyed graph cache for `build_step_graph_mtl` in `src/t3_mtl.cpp`. Each entry holds:

- `int64_t key`: `pack(n_past, is_uncond)`;
- `ggml_context * ctx`: per-entry metadata arena (no shared `thread_local` buf, which would conflict with cached graphs);
- `ggml_cgraph * gf`: the cached graph;
- `std::vector<uint8_t> buf`: the arena bytes.

No per-entry `gallocator`. An earlier prototype gave each cached entry its own `ggml_gallocr_t` + ~1 MB backend buffer, which paid off on multi-synth workloads but added a ~10 % T3 regression on single-utterance runs (272 misses × 1 MB = ~270 MB of allocator churn on synth #1). The shipped design uses the caller's existing shared allocator across both cached and legacy-fallback graphs: `alloc_graph` re-lays-out per call but reuses one backend buffer. Cache hits still skip the ~3 ms build cost.

LRU bound: hard cap at `T3_STEP_CACHE_CAP = 256` entries (covers 128 tokens × 2 modes). When full, the oldest entry is evicted via `std::list::pop_back` (standard LRU pattern). Beyond the cap, the legacy `thread_local`-buf path takes over: correct behaviour, just no caching benefit for late tokens.

Opt-in via the `CHATTERBOX_T3_STEP_CACHE` env var (default OFF). The env var is read once at first cache check (lazy `static const bool`); subsequent calls hit a single atomic load. Default-OFF imposes no measurable cost on single-utterance runs.

`detail::t3_release_caches()` is the public teardown entrypoint, called from:

- `chatterbox_cli.cpp::free_t3`: both the synthesis path and the streaming path;
- `chatterbox_engine.cpp::Impl::free_model`;
- … `atexit` handler registered on first cache insertion (fallback for code paths that don't go through the explicit teardown).

All three entry points fire BEFORE `ggml_backend_free(model.backend)` so the cached `ggml_context` and any future backend-bound resources release cleanly.

Round 5: §3.36 (commit `d1e8bbb`)

Closes the only remaining deferred-followup from §3.33: the persistent `run_stft` graph cache.

| Cache | Holds | Key |
| --- | --- | --- |
| `g_stft_graph_cache` | `run_stft` conv1d graph + `gallocator` + 4 MB context buffer | `T_src = T_mel × 480`; rebuilds on streaming-shape change |
| `g_stft_kernel_cache` (`cached_stft_kernel`) | `n_fft → vector<float>` of the analysis kernel from `build_stft_kernel(n_fft, cached_hann_window(n_fft))` | `n_fft` (constant 16 in chatterbox HiFT) |

Per synth, this eliminates: a 4 MB `std::vector<uint8_t>` allocation + `ggml_init` + `ggml_new_graph_custom(8192)` + a fresh `ggml_gallocr_t` + the backend buffer it reserves. The buffer reservation is reused across rebuilds (`graph_cache::destroy()` preserves the `buf` capacity), eliminating heap-fragmentation churn in long-running streaming sessions.

Negative result documented (round 1)

Tried adding `last_mu_ptr / last_spks_ptr / last_cond_ptr` tracking to `cfm_estimator_cache` to skip redundant `ggml_backend_tensor_set` for `mu`/`spks`/`cond` on the second CFM step. F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete, and CFM `mu`/`spks`/`cond` are referenced only at the start of the graph (via `ggml_concat(x_in, mu_in, spks_bc, cond_in)`); their slots become reusable for downstream intermediates immediately. Same finding as `FINDINGS_ROUND_HIFT.md` §2-bis.4. Reverted.

General rule reinforced: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes.
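For the round-4 LRU bound described earlier in this section (hard cap, oldest entry dropped via `std::list::pop_back`), the following is a minimal sketch of a `std::list`-backed LRU keyed on a packed `(n_past, is_uncond)` pair. It is not the code in `src/t3_mtl.cpp`: `pack_key`, its bit layout, and the `LruCache` template are assumptions made for illustration, and the cached value is left generic rather than holding a `ggml_context` / `ggml_cgraph`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical key layout: n_past in the upper bits, is_uncond in bit 0.
static std::int64_t pack_key(int n_past, bool is_uncond) {
    return (static_cast<std::int64_t>(n_past) << 1) | (is_uncond ? 1 : 0);
}

// Minimal LRU: most-recently-used entry at the front, eviction from the back.
template <typename V>
class LruCache {
public:
    explicit LruCache(std::size_t cap) : cap_(cap) { assert(cap > 0); }

    // Returns the cached value and marks it most recent, or nullptr on a miss.
    V* find(std::int64_t key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // move node to front
        return &it->second->second;
    }

    // Inserts (or overwrites) an entry, evicting the least recently used one
    // when the cache is already at capacity.
    void insert(std::int64_t key, V value) {
        if (V* hit = find(key)) { *hit = std::move(value); return; }
        if (order_.size() == cap_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Node = std::pair<std::int64_t, V>;
    std::size_t cap_;
    std::list<Node> order_;
    std::unordered_map<std::int64_t, typename std::list<Node>::iterator> index_;
};
```

In this sketch a lookup costs one hash probe plus an O(1) `splice`, so a cache hit stays cheap relative to the ~3 ms graph build it is meant to skip.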
Test infrastructure
Two harnesses, 339 / 339 cache checks green across the full matrix:
| Harness | Model | Checks |
| --- | --- | --- |
| `test-cpu-caches` | (none) | 27 |
| `test-cpu-caches` | `chatterbox-s3gen-turbo.gguf` | 94 |
| `test-cpu-caches` | `chatterbox-s3gen-mtl-q4_0.gguf` | 113 |
| `test-t3-caches` | (none) | 6 |
| `test-t3-caches` | `chatterbox-t3-mtl-q4_0.gguf` | 99 |

`src/test_cpu_caches.cpp` (685 lines, NEW) covers:

- … `+0` ≠ `-0`, NaN bit-pattern stability, pair-key composition, multilingual cosine `t_span` distinctness).
- … `s3gen_unload()`).
- … `s3gen_unload` reload (synth #3): byte-for-byte equality.
- … `s3gen_unload`; idempotent second unload doesn't crash).

`src/test_t3_caches.cpp` (452 lines, NEW) covers:

- … `eval_step_mtl`; idempotent `t3_release_caches()`).
- … `n_past` adds 2 new entries; bit-exact logits across cold/warm at the same `(n_past, token)`; explicit teardown drops every entry.
- … `n_past` (cold pass populates 32 entries) followed by re-running the same sequence (warm pass, where every call is a hit); bit-exact logits across both passes; warm pass measurably faster than cold pass (asserted as an inequality, not a percentage threshold, to stay robust under CPU jitter).

Cache observability is exposed via `src/chatterbox_tts_test_hooks.h`, which sits under `src/`, NOT in `include/`, explicitly out of the public surface so production callers cannot take a dependency on cache layout.

Reproduction

The multilingual GGUFs reproduce from the public `ResembleAI/chatterbox` HuggingFace repo using the existing `scripts/convert-{t3-mtl,s3gen}-to-gguf.py` tooling at Q4_0 (~3.2 GB source → ~1.1 GB GGUFs).

Memory cap
Every cache is bounded by the number of distinct shape keys it sees across the process lifetime. Steady-state for a single-utterance multilingual synth:
| Cache | Size | Key |
| --- | --- | --- |
| `g_time_mlp_results` | … `t_span`) | |
| `g_time_emb_results` | | |
| `g_weight_cpu_mirror` | … `flow/input_embedding`) | |
| `g_cfm_estimator_cache` | | |
| `g_encoder_graph_cache` | | `T` |
| `g_hift_graph_cache` | | `(T_mel, T_stft)` |
| `g_f0_graph_cache` | | `T_mel` |
| `g_stft_graph_cache` | | `T_src` |
| `g_pos_emb_results` | | |
| `g_inv_alpha_results` | | |
| `g_hann_window_cache` | `n_fft × 4 B` ≈ 64 B | |
| `g_istft_kernel_cache` | `n_fft × 2 × (n_fft/2 + 1) × 4` ≈ 1.1 KB | |
| `g_stft_kernel_cache` | … `istft_kernel` ≈ 1.1 KB | |
| `g_window_sum_cache` | `((T_stft − 1) × hop + n_fft) × 4` (≤ a few hundred KB) | `T_stft` |
| `g_t3_step_cache` (opt-in) | | |

Total steady state for single-utterance with the round-4 cache off: ~250 MB. With round-4 on at full LRU saturation: ~560 MB.
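The closed-form sizes quoted above for the STFT scaffolding caches can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and it assumes every cached element is a 4-byte F32 value, as the `× 4` factors in the formulas suggest.

```cpp
#include <cassert>
#include <cstddef>

// Assumed element size: one F32 value.
constexpr std::size_t kF32Bytes = 4;

// Hann window: n_fft floats.
constexpr std::size_t hann_window_bytes(std::size_t n_fft) {
    return n_fft * kF32Bytes;
}

// iSTFT kernel: n_fft x 2 x (n_fft/2 + 1) floats.
constexpr std::size_t istft_kernel_bytes(std::size_t n_fft) {
    return n_fft * 2 * (n_fft / 2 + 1) * kF32Bytes;
}

// Window sum: ((T_stft - 1) * hop + n_fft) floats; assumes t_stft >= 1.
constexpr std::size_t window_sum_bytes(std::size_t n_fft, std::size_t hop,
                                       std::size_t t_stft) {
    return ((t_stft - 1) * hop + n_fft) * kF32Bytes;
}

// With the constants from the text (n_fft = 16):
static_assert(hann_window_bytes(16) == 64, "16 floats = 64 B");
static_assert(istft_kernel_bytes(16) == 1152, "16 * 2 * 9 floats = 1152 B, about 1.1 KB");
```

With `n_fft = 16` and `hop = 4`, the window-sum cache grows by 16 bytes per extra STFT frame, which is why it is the only one of the three whose size depends on `T_stft`.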
For long-running streaming sessions with many distinct chunk sizes, the graph-cache arenas (64 MB × 3 + 8 MB + 4 MB) plus `g_pos_emb_results` (~2.3 MB × N) grow with the number of distinct shapes; see deferred follow-up #1 below.

Deferred follow-ups (separate PRs)

Round 5 closed the `run_stft` cache item; the remaining list is:

1. … `(arena + gallocator)` per distinct shape. Long-running streaming with many shape changes can grow without bound. A small LRU bound (say 8 entries) would handle server-mode deployments. Round 4 already established the LRU pattern (`std::list`-backed); applying it to the round-2 + round-5 graph caches is a straight port. Out of scope for the optimisation pass.
2. … `conv1d_f32` arg-order refactor. Mirrors the `conv1d_f32_b` pattern.
3. … `hwloc` or per-platform sysctl to detect efficiency cores. Orthogonal.
4. … `ggml-cpu` Q4_0 matmul work. Out of scope for chatterbox.cpp; lives in `ggml/src/ggml-cpu/`.
5. … `OPTIMIZATION_PLAN_NEXT.md` for the Vulkan branch.

Files
No public-API change.
`include/tts-cpp/chatterbox/s3gen_pipeline.h` remains untouched. The cache observability hooks live in `src/chatterbox_tts_test_hooks.h` (under `src/`, not `include/`), explicitly out of the public surface so production callers cannot take a dependency on cache layout.

Status

Clean six-commit branch on top of `21896a3`. Headline numbers:

- -22 % `S3GEN_INFER_MS` on Turbo single-utterance (794 → 619 ms; bit-exact across cold / warm / post-unload).
- -27 % streaming wall (`tts-cli --stream-chunk-tokens 25`, 21 chunks: ~48 s → ~35 s).
- … `S3GEN_INFER_MS` on multilingual single-utterance (sub-noise on a single synth; compounds on streaming where multiple synths share warm caches).
- … `CHATTERBOX_T3_STEP_CACHE=1` so single-utterance CLI users see no regression.

All cache + bit-exact + shape-invalidation checks green: 339 total across `test-cpu-caches` (27 + 94 + 113) and `test-t3-caches` (6 + 99).

After this PR, every host-side per-synth allocation cycle on the multilingual CPU pipeline is cached.

The remaining ~95 % of multilingual CPU wall is real `ggml-cpu` Q4_0 matmul work, which lives in `ggml/src/ggml-cpu/` and is out of scope for chatterbox.cpp.