Chatterbox vulkan on multilingual#5

Open
Zbig9000 wants to merge 1 commit into GustavoA1604:main from Zbig9000:chatterbox-Vulkan-on-multilingual

Zbig9000 commented May 4, 2026

Vulkan-side optimization work for chatterbox.cpp on the multilingual_merged
base. Two ggml-vulkan patches + four host-side optimizations in
src/chatterbox_tts.cpp that benefit BOTH the Turbo (meanflow) and the
multilingual (standard CFM with CFG) variants. All bit-exact on F32
across NVIDIA + AMD/RADV.

| Change | Bit-exact F32 NV+AMD | Net RTX 5090 perf | Notes |
|---|---|---|---|
| patches/ggml-vulkan-pipeline-cache.patch | yes | −2.44 s cold-start | Persistent VkPipelineCache keyed by <vendorID>-<deviceID>-<driverVersion>. Disabled via empty GGML_VK_PIPELINE_CACHE_DIR. |
| patches/ggml-vulkan-eager-cache-save.patch | yes | crash-safety | Write back the pipeline cache after every ggml_vk_load_shaders compile batch. |
| Persistent CFM estimator graph cache | yes | ~−10 ms / chunk | cfm_estimator_cache was local-scope in synthesize(); promoted to global with explicit destroy() lifetime tied to s3gen_model_cache_release. |
| Time-embedding result memoisation | yes | ~−4 ms / inf | Two-layer cache by t-value (Turbo + multilingual) and (t, r) pair (Turbo only). Eliminates 6 graph submissions / inf for Turbo, 9-19 for multilingual. |
| CPU mirror cache for large per-synth weight downloads | yes | ~−1 ms / inf | flow/input_embedding (~13.4 MB Turbo / ~28 MB multilingual) + speaker affine matrices were re-downloaded every synth. |
| 3 HiFT cont sites removed | yes | noise (code-quality) | conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp permute. |
| G2 dump-script gap closure | n/a (test-infra) | n/a | regress-tensor-compare.sh now runs end-to-end through G2/G3/G4/H1/H3/H4/H5. |

Headline numbers — RTX 5090 + NVIDIA 590.48 + Vulkan 1.4.325, Turbo model, regress-tight aggregate, n=75 chunks

| Metric | upstream/multilingual_merged | this PR | Δ |
|---|---|---|---|
| S3GEN_INFER | 76.6 ms | 65.4 ms | −11.2 ms (−14.6 %) |
| cfm_total | 40.3 ms | 28.7 ms | −11.6 ms (−28.8 %) |
| encoder | 19.9 ms | 20.7 ms | noise |
| hift_decode | 10.9 ms | 11.6 ms | noise |

cfm_total ranges fully separated on n=120 total samples
(base [38.3, 42.8] vs final [27.1, 30.1]) — real signal, not noise.

Cold-start (round-1 patch)

| Scenario (fresh process, RTX 5090) | T3 (ms) | S3Gen (ms) | Wall (ms) |
|---|---|---|---|
| Both caches cold (fresh machine / Mesa) | 947 | 1741 | 2688 |
| ggml cache warm, NVIDIA cache cold | 80 | 166 | 246 |
| Both caches warm (steady state) | 69 | 154 | 223 |

With only ggml's cache populated, round-1 alone cuts fresh-process wall time by 91 % (2688 ms → 246 ms), which closes about 99 % of the cold→fully-warm gap (2688 ms → 223 ms). This is the headline win for Mesa / Adreno / Mali, where there's no per-driver shader cache to fall back on.

Bit-exactness

  • 3 RTX 5090 F32 invariants PASS (round-1 single-shot, round-2 multi-synth identical, round-3 multi-synth varied):
    • 454b4cc14538e8ef917930b110d1e504
    • 4c83f367e6ca2b02fefbd480519ea3f6
    • 9252253ee532cb7928639a0f644a25da
  • 3 AMD/RADV F32 invariants PASS (locked AMD MD5s):
    • 713fe5aed997002379a12383d3795584
    • a84623b784b5e47dc95f62229773e81b
    • 694410826f1025e9888d1029a5cf2bc0
  • F16 invariants are NOT verified — C1 (F16 CFM matmul weights opt-in env var) is deferred (see §"Deferred follow-ups" below).
  • Multilingual model bit-exactness was NOT verified because the multilingual model files (chatterbox-s3gen.gguf for the MTL variant, MTL T3 GGUFs) were not available locally. The optimizations apply at the host-side / cache-management layer and are model-agnostic by construction; they should continue to be bit-exact for multilingual too.

What this PR does

Part 1 — ggml-vulkan patches (no chatterbox-source dependency)

Two opt-in patches applied via scripts/setup-ggml.sh, completely inert when configuring without -DGGML_VULKAN=ON:

  • patches/ggml-vulkan-pipeline-cache.patch (199 lines): persistent VkPipelineCache across processes, keyed by <vendorID>-<deviceID>-<driverVersion>. The cache directory is resolved in order: $GGML_VK_PIPELINE_CACHE_DIR, then $XDG_CACHE_HOME/ggml/vulkan, then $HOME/.cache/ggml/vulkan. Disabled by setting GGML_VK_PIPELINE_CACHE_DIR to the empty string (byte-identical to upstream).
  • patches/ggml-vulkan-eager-cache-save.patch (104 lines): write back the pipeline-cache blob after every compiles.wait() batch in ggml_vk_load_shaders (crash-safety against SIGKILL/abort losing freshly compiled pipelines). Tracks pipeline_cache_last_size so warm-cache hits skip the disk write.
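
The lookup order and cache key described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: resolve_cache_dir and cache_file_name are hypothetical names, the hex formatting and the .bin suffix are invented for the example, and treating an empty XDG_CACHE_HOME as "unset" is an assumption. It requires C++17 for std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>

// Hypothetical helper: picks the cache directory from environment values that
// the caller has already read. std::nullopt means "variable not set".
// An override that is set but empty disables the cache entirely.
static std::optional<std::string> resolve_cache_dir(
        const std::optional<std::string> & override_dir,    // GGML_VK_PIPELINE_CACHE_DIR
        const std::optional<std::string> & xdg_cache_home,  // XDG_CACHE_HOME
        const std::optional<std::string> & home) {          // HOME
    if (override_dir) {
        if (override_dir->empty()) {
            return std::nullopt;  // explicit opt-out
        }
        return *override_dir;
    }
    if (xdg_cache_home && !xdg_cache_home->empty()) {
        return *xdg_cache_home + "/ggml/vulkan";
    }
    if (home && !home->empty()) {
        return *home + "/.cache/ggml/vulkan";
    }
    return std::nullopt;  // nowhere to store the cache
}

// Hypothetical file name built from <vendorID>-<deviceID>-<driverVersion>.
// The zero-padded hex layout and the suffix are assumptions of this sketch.
static std::string cache_file_name(uint32_t vendor_id, uint32_t device_id, uint32_t driver_version) {
    char buf[64];
    const int n = std::snprintf(buf, sizeof(buf), "%08x-%08x-%08x.bin",
                                static_cast<unsigned>(vendor_id),
                                static_cast<unsigned>(device_id),
                                static_cast<unsigned>(driver_version));
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    (void) n;
    return std::string(buf);
}
```

Keying the file on the driver version means a driver upgrade naturally starts from a fresh cache file instead of feeding stale pipeline data to the new driver.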

Part 2 — Persistent CFM estimator graph cache (the headline)

multilingual_merged's cfm_estimator_cache was local-scope in synthesize() — every synth call paid the full graph rebuild cost (~5500-node CFM graph build + gallocr_reserve allocates the device-side buffer pool, ~10 ms wall on RTX 5090 with the 64 MB buf).

Refactored to follow the same explicit-destroy() global-lifetime pattern as the existing thread_local time_mlp_cache (which already documents the same Vulkan/Metal device-teardown ordering constraint):

// before (multilingual_merged) — local-scope, every synth pays the rebuild
cfm_estimator_cache cfm_cache;

// after (this PR) — global with explicit destroy() in s3gen_model_cache_release
cfm_estimator_cache & cfm_cache = g_cfm_estimator_cache;

Both cfm_estimator_forward (batch=1, Turbo) and cfm_estimator_forward_b2 (batch=2 CFG, multilingual) use the same cache object — the existing (cache.T != T) || (cache.b2 != current_b2) rebuild logic handles mode switches correctly.

Cache is destroyed in s3gen_model_cache_release BEFORE ggml_backend_free (Vulkan gallocr_free against a dangling vk_device would assert) and on s3gen_model_cache_get cache-miss (backend swap). Same constraint already documented for thread_local time_mlp_cache.
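
A minimal sketch of the lifetime pattern described above, using a toy cache in place of the real ggml objects. Everything prefixed toy_ is hypothetical; only the rebuild condition on (T, b2) and the "destroy before the backend is freed" ordering are taken from the description.

```cpp
#include <cassert>

// Toy stand-in for the cached CFM graph. The real cache holds ggml graph and
// allocator state; this one only tracks the shape key and a rebuild counter.
struct toy_graph_cache {
    bool built    = false;
    int  T        = 0;      // sequence length the graph was built for
    bool b2       = false;  // true when built for the batch=2 CFG path
    int  rebuilds = 0;      // instrumentation for this sketch only

    // Rebuild only when the key changes, mirroring
    // (cache.T != T) || (cache.b2 != current_b2).
    void ensure(int new_T, bool new_b2) {
        assert(new_T > 0);
        if (built && T == new_T && b2 == new_b2) {
            return;  // cache hit: reuse the existing graph
        }
        T = new_T;
        b2 = new_b2;
        built = true;
        ++rebuilds;
    }

    // Must run while the device is still alive.
    void destroy() {
        built = false;
        T = 0;
        b2 = false;
    }
};

// Process-lifetime cache, analogous in role to g_cfm_estimator_cache.
static toy_graph_cache g_toy_cache;

static void toy_release_model() {
    g_toy_cache.destroy();  // 1. drop cached graph state first
    // 2. only then would the backend be freed (ggml_backend_free in the PR)
}
```

With this shape, repeated calls at the same (T, b2) pay the build cost once, a switch between the batch=1 and batch=2 paths triggers exactly one rebuild, and releasing the model leaves nothing pointing at a freed device.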

Part 3 — Time-embedding result memoisation

multilingual_merged has thread_local time_mlp_cache (graph cached) but no result cache. Two-layer cache transparently plugs in:

static std::unordered_map<uint32_t, std::vector<float>> g_time_mlp_results;  // both variants
static std::unordered_map<uint64_t, std::vector<float>> g_time_emb_results;  // Turbo only

static std::vector<float> compute_time_mlp_cached(const model_ctx & m, float t_val);
static std::vector<float> compute_time_emb_cached(const model_ctx & m, float t_val, float r_val);
  • Turbo (meanflow, t_span = [0, 0.5, 1]): compute_time_mlp(0.5) is called twice per inference (as r in step 0, as t in step 1). After warm-up: 6 graph submissions / inference → 0.
  • Multilingual (cosine-scheduled, default 10 steps): 10 distinct t-values, all repeat across every subsequent synth. After warm-up: 9-19 graph submissions / inference → 0.

Each compute_time_mlp graph has 3 dispatches (~18 µs GPU compute), but the wall-clock cost is ~700 µs because of fixed cmd-buffer + queue-submit + sync + tensor_get overhead, so the per-graph fixed cost is roughly 40× the actual compute (700 / 18 ≈ 39). Memoisation saves the full submit cost.

Caches cleared in g_cfm_estimator_cache_destroy alongside the graph cache (this also handles a future CHATTERBOX_F16_CFM opt-in mode flip — model reload → cache clear).

Float keys use bitcast → uint32_t so IEEE equality matches the literal const-folded values from t_span[i].
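
The bit-pattern keying can be illustrated with a small self-contained sketch. The names float_bits, pair_key, toy_compute and toy_compute_cached are hypothetical, and toy_compute merely stands in for a graph submission; only the uint32_t bitcast and the (kt << 32) | kr packing come from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Reinterpret a float as its 32-bit pattern (memcpy avoids aliasing issues).
static uint32_t float_bits(float v) {
    static_assert(sizeof(uint32_t) == sizeof(float), "float must be 32-bit");
    uint32_t bits = 0;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

// Pack two float bit patterns into one 64-bit key: (kt << 32) | kr.
static uint64_t pair_key(float t, float r) {
    return (static_cast<uint64_t>(float_bits(t)) << 32) | static_cast<uint64_t>(float_bits(r));
}

static std::unordered_map<uint32_t, std::vector<float>> toy_results;
static int toy_submissions = 0;  // counts "graph submissions" in this sketch

// Hypothetical expensive computation standing in for one graph submission.
static std::vector<float> toy_compute(float t) {
    ++toy_submissions;
    return {t, 2.0f * t};
}

// Returns the memoised result for t, computing it only on the first request.
static const std::vector<float> & toy_compute_cached(float t) {
    const uint32_t key = float_bits(t);
    auto it = toy_results.find(key);
    if (it == toy_results.end()) {
        it = toy_results.emplace(key, toy_compute(t)).first;
    }
    assert(!it->second.empty());
    return it->second;
}
```

One consequence of keying on bits rather than on float comparison is that +0.0f and -0.0f map to different entries, which is harmless here because the t-values come from a fixed schedule.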

Part 4 — CPU mirror cache for large per-synth weight downloads

synthesize() reads three large model tensors via ggml_backend_tensor_get on every call:

| Tensor | Turbo size | Multilingual size |
|---|---|---|
| flow/input_embedding | 13.4 MB | ~28 MB |
| flow/spk_embed_affine/w | 60 KB | 60 KB |
| flow/spk_embed_affine/b | 320 B | 320 B |

On a GPU backend each is a real device→host transfer plus sync (~600-1000 µs / synth on RTX 5090 for input_embedding). These weights are CONSTANT for the model lifetime — cache them.

static std::unordered_map<const ggml_tensor *, std::vector<float>> g_weight_cpu_mirror;

static const float * cached_cpu_weights_f32(const ggml_tensor * t);

Three call-site swaps in synthesize(). Cleared in g_cfm_estimator_cache_destroy because the ggml_tensor * keys belong to the soon-to-be-freed model context.
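
As a rough illustration of the mirror-cache idea, the sketch below uses a toy tensor type instead of ggml_tensor. toy_tensor, toy_download, toy_cached_weights and toy_mirror_clear are hypothetical names; toy_download stands in for a device-to-host read such as ggml_backend_tensor_get.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy stand-in for a device-resident weight tensor (not ggml_tensor).
struct toy_tensor {
    std::vector<float> device_data;
};

static int toy_downloads = 0;  // counts device-to-host transfers in this sketch

// Stand-in for the expensive device-to-host copy.
static std::vector<float> toy_download(const toy_tensor * t) {
    ++toy_downloads;
    return t->device_data;
}

// Host-side mirror keyed by tensor address; valid only while the model lives.
static std::unordered_map<const toy_tensor *, std::vector<float>> toy_mirror;

// Returns a host pointer to the weights, downloading them at most once.
static const float * toy_cached_weights(const toy_tensor * t) {
    assert(t != nullptr);
    auto it = toy_mirror.find(t);
    if (it == toy_mirror.end()) {
        it = toy_mirror.emplace(t, toy_download(t)).first;
    }
    return it->second.data();
}

// Call before the model context is freed: the keys are raw pointers, and a
// later allocation could reuse the same address for a different tensor.
static void toy_mirror_clear() {
    toy_mirror.clear();
}
```

The returned pointer stays valid across later insertions because std::unordered_map does not move its elements on rehash; it is invalidated only by the clear.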

Part 5 — Three HiFT cont sites removed

Round-AUDIT-style cleanup applied to the HiFT decoder:

| Site | Calls/inf | Consumer | Why safe |
|---|---|---|---|
| conv_transpose_1d_f32 exit cont | 3 | ggml_add(x, reshape_2d(bias)) | Same strided-tolerant pattern as round-AUDIT's pre_lookahead exit. |
| ISTFT y_trim exit cont | 1 | ggml_clamp (element-wise) → output | Clamp's output is fresh contiguous; tensor_get reads from contig. |
| f0_predictor xp permute cont | 1 | ggml_mul_mat src1 | Vulkan/Metal/CUDA mul_mat shaders accept strided src1 for f32 matmul. |

Perf-neutral on RTX 5090 (HiFT-section CONT contribution is 0.13 % of HiFT runtime per the perf logger). Code-quality + future-proofing wins, same character as the upstream's earlier cont-removal work.

Part 6 — G2 dump-script gap closure (test-infra)

regress-tensor-compare.sh was previously aborting at stage G2 with cannot open cfm_concat.npy. Four files added to scripts/dump-s3gen-reference.py:

| File | Stage | Why missing |
|---|---|---|
| cfm_concat.npy | G2 | Concat happens inside ConditionalDecoder.forward. |
| cfm_h_conv.npy | G2 | Output of block1.block[0] (CausalConv1d). |
| cfm_h_ln.npy | G2 | Output of block1.block[3] (Transpose back to (B, C, T) after LayerNorm). |
| hift_s_stft.npy | H3, H4 | Output of hift._stft followed by cat([real, imag], dim=1). |

Plus a one-line C++ fix in test_s3gen.cpp's stage_G2: add ggml_set_output(xc) so the gallocator preserves the diagnostic intermediate (was returning garbage because xc's slot got reused by downstream intermediates after the conv1d consumer completed).

Full pipeline now runs end-to-end through G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT (PyTorch FFT vs hand-built DFT, expected), max ≤ 4.7e-5 everywhere else, final waveform max_abs = 8.20e-08.


Negative result documented

Tried adding last_mu_ptr / last_spks_ptr / last_cond_ptr tracking to cfm_estimator_cache to skip the redundant ggml_backend_tensor_set for mu/spks/cond on the 2nd CFM step within one synthesize() call (those inputs are constant across cfm_steps).

F32 single-shot WAV diverged on the first test. Root cause: ggml's gallocator REUSES input-tensor buffer slots once their consumers complete. In CFM:

xc = ggml_concat(x_in, mu_in, spks_bc, cond_in);  // <-- last use of mu/spks/cond
// rest of the graph operates on `xc`; mu/spks/cond's slots are now free for
// the gallocator to reuse for downstream intermediates.

Skip-upload only works for inputs referenced THROUGHOUT the graph (like pos_emb in encoder, which feeds every conformer block). CFM mu/spks/cond are referenced only at the start, so they're recyclable immediately.

Reverted. General rule for ggml's gallocator: pointer-equality skip-upload is unsafe for any input that isn't referenced past the first few graph nodes. Detailed analysis in FINDINGS_ROUND_HIFT.md §2-bis.4.
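
The rule can be made concrete with a toy liveness model. This is a deliberately simplified sketch and not ggml's allocator: toy_node, toy_last_use and toy_skip_upload_safe are hypothetical, and the safety criterion (the tensor is still read by the final node) is a conservative stand-in for "referenced throughout the graph".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy graph node: only the names of the tensors it reads.
struct toy_node {
    std::vector<std::string> inputs;
};

// Index of the last node that reads each tensor. In this simplified model the
// tensor's buffer slot may be handed to another tensor right after that node.
static std::unordered_map<std::string, std::size_t> toy_last_use(const std::vector<toy_node> & graph) {
    std::unordered_map<std::string, std::size_t> last;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        for (const std::string & name : graph[i].inputs) {
            last[name] = i;  // later readers overwrite earlier ones
        }
    }
    return last;
}

// Conservative check: skipping the re-upload is treated as safe only when the
// tensor is still read by the final node, so nothing can reuse its slot.
static bool toy_skip_upload_safe(const std::vector<toy_node> & graph, const std::string & name) {
    assert(!graph.empty());
    const auto last = toy_last_use(graph);
    const auto it = last.find(name);
    return it != last.end() && it->second == graph.size() - 1;
}
```

In this model an input consumed only by the first concat has a last use of node 0 and fails the check, while an input read by every block up to the end passes it, which matches the mu/spks/cond versus pos_emb contrast above.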


Why a fresh squashed commit instead of rebase

The original optimization work was 8 commits on chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan (branched from upstream/main). git rebase upstream/multilingual_merged hit conflicts immediately at commit 2 (e3d5707, round-2 stage-graph caches) because multilingual_merged already added similar caches in different form (thread_local time_mlp_cache, cfm_estimator_cache with b2 flag, etc.). Subsequent commits would have stacked conflicts.

A clean linear port re-applies only the optimizations that still provide measurable value on top of the multilingual_merged base, with comments explaining the fit on the new code structure. Detailed analysis in FINDINGS_ROUND_MULTILINGUAL_PORT.md.

The original main-base branch is preserved at the backup-pre-multilingual-rebase tag for reference.


Deferred follow-ups (separate PRs)

| Candidate | RTX 5090 estimated win | Why deferred |
|---|---|---|
| C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM env var) | ~125 MB device memory + bandwidth-bound mobile win | multilingual_merged's load_s3gen_gguf uses ggml_dup_tensor + ggml_backend_alloc_ctx_tensors (different from main); needs adapting our F16 conversion path. ~100 lines + new MD5 baselines (NVIDIA + AMD, F32 + F16). |
| Round-4 / 6 Q/K/V batched matmul fusion | ~1.3 ms RTX 5090 + larger on bandwidth-starved targets | multilingual_merged uses zero-cont strided Q/K/V views (their 849507a). Composing with our batched matmul approach is non-trivial. Pick one approach + bench on Vulkan. |
| HiFT decoder graph caching | ~5-10 ms / chunk on multilingual variant | multilingual_merged's run_hift_decode allocates ggml_gallocr_t + ggml_context * fresh on every call. Same persistent-cache pattern as round-HIFT could apply. |
| Multilingual model regression | n/a (verification) | Multilingual model files were not available locally; only Turbo F32 invariants verified bit-exact on multilingual_merged base. Optimizations are model-agnostic by construction; explicit verification is a follow-up. |
| Mobile validation (Adreno / Mali / Apple) | n/a (hardware-bound) | Biggest remaining evidence gap. AMD/RADV proxy refuted the mobile-bandwidth projection on rounds 2/3/5/6/C1 of the original work; real mobile runs would either confirm or force revision. |

How was it tested

cd inputFilesForAI/qvac-17872-findings/chatterbox.cpp

# Apply patches (Metal + OpenCL + 2 Vulkan)
bash scripts/setup-ggml.sh

# Build
cmake -S . -B build-mtl -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON
cmake --build build-mtl -j$(nproc) --target tts-cli test-s3gen

# 1. RTX 5090 — F32 invariants
bash ../bench-logs-vk-c1/regress-c1.sh build-mtl 1
# Expected: round-1/2/3 F32 invariants PASS (3/3).
# F16 invariants will FAIL — C1 not in this PR.

# 2. AMD/RADV — F32 invariants
VK_LOADER_DRIVERS_SELECT='radeon_icd*' \
    bash ../bench-logs-vk-amd/regress-amd.sh build-mtl 1
# Expected: AMD F32 invariants PASS (3/3).

# 3. Aggregate perf — RTX 5090, 5 iters × 16 chunks = 80 chunks per build
bash ../bench-logs-vk-round3/regress-tight.sh build-mtl mtl-final 5
# Expected: S3GEN_INFER ~65 ms, cfm_total ~29 ms, n=75
# vs upstream baseline ~77 ms / ~40 ms.

# 4. Tensor-level Python ↔ C++ stage compare (test-infra unblocked by G2 fix)
bash ../bench-logs-vk-c1/regress-tensor-compare.sh
# Expected: end-to-end through G2/G3/G4/H1/H3/H4/H5; max rel err ≤ 4.7e-5
# (except STFT 7.9e-3 — PyTorch FFT vs DFT-via-conv1d).

# 5. Cold-start measurement (round-1 ggml-vulkan patch)
rm -rf ~/.cache/ggml/vulkan ~/.cache/nvidia
./build-mtl/chatterbox …  # first run: ~2.7 s cold
./build-mtl/chatterbox …  # second run: ~250 ms (ggml cache warm)

Ports the Vulkan-side optimizations from
chatterbox-Optimize-cpp-backend-multilingual-model-for-Vulkan
(branched from upstream/main) onto the multilingual_merged base.

Two ggml-vulkan patches + chatterbox-source host-side optimizations
that benefit BOTH the Turbo (meanflow) and the multilingual (standard
CFM with CFG) variants.  No GGUF format change.

Headline (RTX 5090, regress-tight aggregate, n=75 chunks, Turbo model
since multilingual model files weren't available locally):

  metric        | mtl_merged base |  + this PR  |          Δ
  S3GEN_INFER   |       76.6 ms   |   65.4 ms   |  -11.2 ms (-14.6 %)
  cfm_total     |       40.3 ms   |   28.7 ms   |  -11.6 ms (-28.8 %)
  encoder       |       19.9 ms   |   20.7 ms   |  noise
  hift_decode   |       10.9 ms   |   11.6 ms   |  noise

cfm_total ranges fully separated on n=120 samples
(base [38.3, 42.8] vs final [27.1, 30.1]).  All F32 invariants
(round-1 single-shot, round-2 multi-synth ident, round-3 multi-synth
varied) stay bit-exact; F16 invariant requires the C1 follow-up
(opt-in CHATTERBOX_F16_CFM env var, not in this commit).

ggml-vulkan patches (apply via setup-ggml.sh, inert without
GGML_VULKAN=ON):

  - patches/ggml-vulkan-pipeline-cache.patch     (+199 lines)
    Persistent VkPipelineCache across processes, keyed by
    <vendorID>-<deviceID>-<driverVersion>.  Cuts fresh-process wall
    time by ~91 % on the first warm run (~99 % of the cold-to-warm
    gap).  Disabled by setting GGML_VK_PIPELINE_CACHE_DIR="".  Entire
    bench: 2.69 s -> 0.25 s fresh-process wall on RTX 5090 with cache
    populated.

  - patches/ggml-vulkan-eager-cache-save.patch   (+104 lines)
    Write back the pipeline cache after every ggml_vk_load_shaders
    compile batch (crash-safety against SIGKILL/abort losing freshly
    compiled pipelines).  Stacks on the first patch.

Chatterbox-source host-side optimizations (src/chatterbox_tts.cpp,
+~250 lines):

  1. Persistent CFM estimator graph cache (~ -10 ms / chunk).
     cfm_estimator_cache was local-scope in s3gen_synthesize_to_wav()
     -- every synth call paid the full graph rebuild cost.  Refactored
     to follow the same explicit-destroy() global-lifetime pattern as
     the existing thread_local time_mlp_cache.  Both the batch=1
     (Turbo) and batch=2 (multilingual CFG) paths reuse the same
     cache; the cache.b2 flag triggers a rebuild when mode changes.

     Cache cleared in s3gen_model_cache_release BEFORE the backend is
     freed (Vulkan/Metal device-teardown ordering).  Also cleared on
     s3gen_model_cache_get cache-miss (backend swap).

  2. Time-embedding result memoisation (~ -4 ms / inf).
     Both Turbo (t_span = [0, 0.5, 1]) and multilingual (cosine-
     scheduled, default 10 steps) produce the same set of t-values
     across all subsequent synth calls.  Added two-layer cache:
       - g_time_mlp_results: keyed by uint32_t bitcast of t_val
       - g_time_emb_results: keyed by uint64_t = (kt << 32) | kr
                             (Turbo only; multilingual skips the mixer)
     compute_time_mlp_cached + compute_time_emb_cached wrappers.
     6 graph submissions / inference -> 0 after first inference for
     Turbo; 9-19 -> 0 for multilingual (10-step).

  3. CPU mirror cache for large per-synth weight downloads (~ -1 ms
     / inf).  flow/input_embedding (~13.4 MB Turbo / ~28 MB MTL) +
     flow/spk_embed_affine/{w,b} were re-downloaded GPU->CPU on every
     synth call.  New cached_cpu_weights_f32(t) helper +
     g_weight_cpu_mirror map (keyed by ggml_tensor *).

  4. Three HiFT cont sites removed (perf-neutral, code quality).
     conv_transpose_1d_f32 exit, ISTFT y_trim exit, f0_predictor xp
     permute -- all bit-exact-preserving (consumers accept strided
     sources: ggml_add for bias, ggml_clamp element-wise, ggml_mul_mat
     src1 for f32 matmul).

Test infrastructure (scripts/dump-s3gen-reference.py +65 lines,
src/test_s3gen.cpp +6 lines):

  - G2 dump-script gap closure: cfm_concat / cfm_h_conv / cfm_h_ln /
    hift_s_stft .npy files now produced.  Plus ggml_set_output(xc)
    in stage_G2 so the gallocator preserves the diagnostic
    intermediate (was returning garbage because xc's slot got reused
    by downstream intermediates after the conv1d consumer completed).
  - regress-tensor-compare.sh now runs end-to-end through
    G2/G3/G4/H1/H3/H4/H5: max relative error 7.92e-3 on STFT
    (PyTorch FFT vs hand-built DFT, expected), max <= 4.7e-5
    everywhere else; final waveform max_abs = 8.20e-08.

Negative result documented (inline comments + FINDINGS doc): tried
skip-upload of mu/spks/cond across cfm_steps within one synthesize
call.  Broke F32 single-shot.  Root cause: ggml's gallocator REUSES
input-tensor buffer slots once their consumers complete.  Skip-upload
only works for inputs referenced THROUGHOUT the graph (encoder
pos_emb pattern works, CFM mu/spks/cond pattern doesn't).

Deferred follow-ups (in OPTIMIZATION_PLAN_NEXT.md):

  - C1: F16 CFM matmul weights (opt-in CHATTERBOX_F16_CFM env var).
    Saves ~125 MB device memory + helps bandwidth-bound mobile.
    Needs new MD5 baselines.
  - Round-4/6 QKV fusion: multilingual_merged uses zero-cont strided
    Q/K/V views (Metal-tuned).  Our Vulkan fused mul_mat would need
    careful integration to compose with that approach.
  - HiFT decoder graph caching: multilingual_merged HiFT rebuilds
    every chunk (no g_hift_cache equivalent).  Same persistent-cache
    pattern as round-HIFT could apply.
  - Multilingual model file regression: multilingual model not
    available locally; only Turbo F32 invariants verified bit-exact
    on multilingual_merged base.

Files:
  src/chatterbox_tts.cpp                         +252 / -19
  src/test_s3gen.cpp                             +6
  scripts/dump-s3gen-reference.py                +65
  scripts/setup-ggml.sh                          +20 / -8
  patches/ggml-vulkan-pipeline-cache.patch       +199 (NEW)
  patches/ggml-vulkan-eager-cache-save.patch     +104 (NEW)
  patches/README.md                              +13 / -8
  CHANGELOG.md                                   +603 (NEW)

Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether pushed a commit to ogad-tether/chatterbox.cpp that referenced this pull request May 6, 2026
…etry/scratch)

Five targeted fixes surfaced by review of the multilingual_merged tip
after the origin/main merge.  Three are real bugs (CFG, top_k, engine
crash on MTL GGUFs); one is a perf regression with audible behaviour
on MTL (spurious T3 retries); one is a defensive cleanup.

1. src/chatterbox_tts.cpp (CFM step loop): the use_b2 branch correctly
   computes (1+cfg)*cond - cfg*uncond, but the else branch only
   computed the conditional pass and silently dropped CFG on every
   non-Metal backend.  Restores the §3.19 (3f0a8da) behaviour: when
   !meanflow && cfg_rate != 0 and use_b2 is false (CPU and any GPU
   backend where the b2 path was disabled), run cond + uncond
   back-to-back on the same B=1 graph (cfm_estimator_cache key
   (T, b2=false) reuses the cached graph across both calls) and
   combine via the standard CFG mix.  Smoke-tested on CPU
   (--n-gpu-layers 0): runs cleanly, S3Gen wall-clock doubles vs
   meanflow as expected (12 CFM steps × 2 forward calls).

2. src/t3_mtl.cpp::sample_next_token_mtl top-k filter: after
   nth_element(begin, begin+k, end, greater) the (k+1)-th largest
   sits at idx[k] and positions [0, k) hold the top-k UNORDERED.
   The previous code took cut = l[idx[k-1]] which is some
   arbitrary top-k element (often not the smallest), making cut
   too large and the `x < cut` filter then erased legitimate
   top-k logits.  Fix: partition to begin+(k-1) so idx[k-1] is
   the k-th largest exactly.  Mostly masked by the default
   top_k=1000 vs an 8194-vocab where the threshold falls into
   the noise floor; the bug bites at small top_k (e.g. greedy
   --top-k 1 where the wrong cut could pessimise tie handling).
   The Turbo sample_next_token_ex in src/main.cpp uses a
   different (correct) approach via tmp[k] + per-element rescan
   for ties; left untouched.

3. src/chatterbox_engine.cpp: load_model_gguf dispatches MTL
   GGUFs into load_model_gguf_mtl (populates layers_mtl, leaves
   layers empty), but synthesize() unconditionally calls
   eval_prompt -> build_prompt_graph -> build_transformer_core,
   which iterates model.layers[il] -- empty std::vector, UB or
   crash.  Add a clean rejection guard right after the load: if
   model.hparams.variant != CHBX_VARIANT_TURBO, free_model() and
   throw a clear error pointing the user at the CLI / internal
   eval_*_mtl helpers.  Wiring MTL through the public Engine API
   (extend EngineOptions with language / cfg_weight / min_p /
   exaggeration, branch synthesize() on variant) is left as a
   follow-up; this just stops the crash on the public surface.

4. src/chatterbox_cli.cpp::run_t3_for_segment retry trigger: the
   0cad44d merge commit said the 5x speech-tokens-per-BPE-token
   floor (calibrated for English Turbo / GPT-2 BPE) should be
   gated to non-MTL because MTL's Llama tokenizer has a ~1.7x
   ratio.  The gating wasn't actually in the code -- a clean
   stop-token termination on a short MTL segment looked
   "implausible" and triggered up to 3 spurious retries.
   `plausible = is_mtl || (int)generated.size() >= min_tokens;`
   restores the intent.  The 3x-repeated-token early-stop above
   still guards MTL's catastrophic case.  Measured on M4 Metal
   with the ES reference prompt + jfk/gianni voice: T3 wall
   time drops from ~3.9 s (4 attempts) to ~0.93 s (1 attempt) --
   ~4x speedup just from removing the wasted retries.  WAV md5
   stays byte-exact at 57cc80f27a122f03435fd05f47d1b3d2.

5. src/t3_mtl.cpp stacked-QKV loader scratch sizing: the early
   type-equality guard implies wq/wk/wv have identical sizes
   today, but max over all three so a future shape divergence
   (e.g. an MTL variant with non-square Q/K/V) can't silently
   truncate a per-layer copy via undersized scratch.  No
   behaviour change today; defensive only.
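
The top-k fix in item 2 can be sketched in isolation as below. This is an illustrative reimplementation, not the code in src/t3_mtl.cpp: toy_top_k_filter is a hypothetical name, and masking with -inf is an assumption about how filtered logits are represented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Masks every logit strictly below the k-th largest value with -inf.
// Ties with the k-th largest are kept, so more than k entries can survive.
static void toy_top_k_filter(std::vector<float> & logits, std::size_t k) {
    assert(k >= 1);
    if (k >= logits.size()) {
        return;  // nothing to filter
    }
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Partition around position k-1: afterwards idx[k-1] is exactly the index
    // of the k-th largest logit, and idx[0..k-1) hold larger-or-equal ones in
    // unspecified order. Partitioning around position k instead would leave
    // idx[k-1] as an arbitrary member of the top k.
    std::nth_element(idx.begin(),
                     idx.begin() + static_cast<std::ptrdiff_t>(k - 1),
                     idx.end(),
                     [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });

    const float cut = logits[idx[k - 1]];
    for (float & x : logits) {
        if (x < cut) {
            x = -std::numeric_limits<float>::infinity();
        }
    }
}
```

For example, filtering {0.1, 3.0, 2.0, 5.0, 1.0} with k = 2 keeps 5.0 and 3.0 and masks the other three entries.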

Validation (Apple M4, Metal, Release):
  - cmake --build: clean, no warnings, all targets link.
  - test-metal-ops: 14/14 PASS, 0 FAIL.
  - End-to-end synthesis (ES prompt, gianni.wav, --seed 42, greedy):
    md5 57cc80f27a122f03435fd05f47d1b3d2 -- byte-exact vs the
    pre-fix baseline.  T3 wall time ~3.9s -> ~0.9s (fix #4).
  - CPU CFG smoke test (--n-gpu-layers 0, --text "Hola.", es):
    completes cleanly, S3Gen ~12s for 12 CFM steps × 2 forward
    calls (cond + uncond), produces valid 1.1s WAV.

Issues #5 (redundant peek+open in load_model_gguf), #7
(/g deny-list breadth in requantize-gguf.py), #8 (forward-hook
idiom in dump-t3-mtl-reference.py), and #9 (CMake duplicate
cli_main.cpp build) are tracked but intentionally not folded in
here -- the reviewer flagged them as cosmetic / trivial / fine.

Co-authored-by: Cursor <cursoragent@cursor.com>