Skip to content

Zbig9000/chatterbox.cpp

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

106 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

chatterbox.cpp

Chatterbox Turbo (Resemble AI, MIT-licensed zero-shot text-to-speech) ported to ggml. Pure C++/ggml inference on CPU / Metal / CUDA / Vulkan, with no runtime dependency on Python or PyTorch.

End-to-end inference on a short sentence with voice cloning from an 11 s reference wav (T3 + S3Gen + HiFT, warm runs, excludes model load):

Backend Wall RTF vs real-time vs ONNX Runtime
Vulkan (RTX 5090, Q4_0) 463 ms 0.07 14.2× 13.8× faster
Metal (Mac Studio M3 Ultra, Q4_0) 985 ms 0.16 6.4× 17.5× faster
CPU (AMD Ryzen 9 9950X, AVX, Q4_0) 5 397 ms 0.82 1.2× 1.2× faster
CPU (Mac Studio M3 Ultra, NEON) 7 568 ms 1.05 0.96× 2.3× faster
Reference (ONNX Runtime, CPU Q4) 6.4–17 s 1.2–3.2 0.3–0.85×

See the full benchmark section below for the per-stage breakdown, or PROGRESS.md for the full chronological development journal — every numerical-parity stage and optimization pass (T3 Flash Attention, KV-cache layout rework, Metal kernel patches, CAMPPlus + VoiceEncoder + S3TokenizerV2 ported to ggml graphs, mel extraction via STFT matmul, legacy Q4/Q5/Q8 quantization, etc.).


Pipeline at a glance

      text                                                 24 kHz wav
       │                                                        ▲
       ▼                                                        │
  ┌────────────────────────────────────────────────────────────────┐
  │                       chatterbox                               │
  │                                                                │
  │   T3 (GPT-2 Medium)  ──►  S3Gen encoder  ──►  CFM (meanflow)   │
  │   text → speech toks      speech toks → h      h → mel         │
  │                                                                │
  │                          HiFT vocoder  ──►  24 kHz wav         │
  └────────────────────────────────────────────────────────────────┘
       ▲                                              ▲
   BPE tokenizer                               reference voice
   (embedded in T3 GGUF metadata)              (embedded in S3Gen GGUF)

One binary, one invocation, end to end — scripts/synthesize.sh is a thin convenience wrapper that fills in the two GGUF paths.

Prerequisites

  • C++17 compiler (clang or gcc)
  • cmake ≥ 3.14
  • Python 3.10+ with torch, numpy, gguf, safetensors, scipy, librosa, resampy — needed once, at setup time only, to run the weight converters (which bake the precomputed mel filterbanks into the GGUFs) and the optional reference-dump scripts. Once the GGUFs exist, the C++ binary has zero runtime dependency on Python.

The easiest way to get the Python side is:

git clone https://github.com/resemble-ai/chatterbox.git chatterbox-ref
cd chatterbox-ref
python -m venv .venv && . .venv/bin/activate
pip install -e .
pip install gguf safetensors scipy librosa resampy
cd -

1. Clone and build

# (from wherever you want the repo to live)
git clone git@github.com:gianni-cor/chatterbox.cpp.git
cd chatterbox.cpp

# Clone ggml at the pinned commit and apply our Metal op patches.
# Skip the patch part (i.e. just `git clone ... ggml`) if you're not
# building with -DGGML_METAL=ON.
./scripts/setup-ggml.sh

# Configure + build every target in one shot.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

scripts/setup-ggml.sh is idempotent: it clones upstream ggml into ./ggml, checks out the commit our patch is pinned against (GGML_COMMIT at the top of the script), and applies patches/ggml-metal-chatterbox-ops.patch. Re-running it is a no-op; bump the pinned commit inside the script whenever the patch is re-generated against a newer upstream.

To enable GPU acceleration, add the matching backend flag at configure time: -DGGML_METAL=ON on Apple Silicon, -DGGML_VULKAN=ON on Linux/Windows with a Vulkan loader, or -DGGML_CUDA=ON if you have the CUDA toolkit. Pass --n-gpu-layers 99 at runtime to actually use the GPU. See patches/README.md for what the Metal patch does and why.

This produces the main binary plus a set of per-stage validation harnesses:

Binary What it does
build/chatterbox End-to-end: text → speech tokens (T3) → wav (S3Gen + HiFT). Also handles voice cloning via --reference-audio.
build/mel2wav HiFT only: mel.npy → wav (demo)
build/test-s3gen Staged numerical validation of S3Gen encoder + CFM vs Python dumps
build/test-resample Round-trip SNR of the C++ Kaiser-windowed sinc resampler
build/test-voice-features 24 kHz 80-ch mel parity (prompt_feat)
build/test-fbank 16 kHz 80-ch Kaldi fbank parity
build/test-voice-encoder VoiceEncoder 256-d speaker embedding parity
build/test-campplus CAMPPlus 192-d embedding parity
build/test-voice-embedding wav → fbank → CAMPPlus end-to-end parity
build/test-s3tokenizer S3TokenizerV2 log-mel + speech-token parity
build/test-metal-ops Metal-only: parity check for diag_mask_inf, pad_ext, and fast conv_transpose_1d (only useful when built with -DGGML_METAL=ON)

You'll normally only need build/chatterbox; the test-* binaries are there for the staged-verification methodology in PROGRESS.md.

2. One-time: convert weights

# Activate the Python environment from the Prerequisites step
. ../chatterbox-ref/.venv/bin/activate

# Convert T3 weights + tokenizer + voice conditionals
python scripts/convert-t3-turbo-to-gguf.py --out models/chatterbox-t3-turbo.gguf

# Convert S3Gen encoder + CFM + HiFT weights
# (the built-in reference voice is embedded inside this GGUF)
python scripts/convert-s3gen-to-gguf.py --out models/chatterbox-s3gen.gguf

The scripts pull ResembleAI/chatterbox-turbo from Hugging Face Hub on first run (about 1.5 GB). The BPE tokenizer (vocab.json + merges.txt + added_tokens.json) is embedded directly into the T3 GGUF as tokenizer.ggml.* metadata, so you don't need to keep those three files around on disk.

You should now have:

models/
  chatterbox-t3-turbo.gguf   (~742 MB) — T3 GPT-2 Medium + embedded GPT-2 BPE
                               tokenizer + VoiceEncoder weights + built-in voice
  chatterbox-s3gen.gguf      (~1.0 GB) — S3Gen encoder/CFM + HiFT vocoder
                               + CAMPPlus speaker encoder + S3TokenizerV2
                               (everything needed for voice cloning, on top of
                               the built-in reference voice)

For numerical validation against PyTorch (optional, step 4), also run:

python scripts/dump-s3gen-reference.py \
  --text "Hello from ggml." --out artifacts/s3gen-ref \
  --seed 42 --n-predict 64 --device cpu

Optional: quantize the models (smaller + faster)

Both GGUFs can be quantized to Q8_0 (near-lossless) or Q4_0 (different CFM sample but same subjective quality, smaller). llama-quantize doesn't recognize the chatterbox / chatterbox-s3gen custom architectures, so we ship a small standalone rewriter that works on either model:

# T3
python scripts/requantize-gguf.py \
  models/chatterbox-t3-turbo.gguf \
  models/t3-q8_0.gguf q8_0

# S3Gen
python scripts/requantize-gguf.py \
  models/chatterbox-s3gen.gguf \
  models/chatterbox-s3gen-q8_0.gguf q8_0

Swap q8_0q4_0 (or q5_0) for a more aggressive variant. T3's original converter also accepts --quant if you prefer to quantize at conversion time instead of after.

Measured on the QVAC paragraph (M3 Ultra, Metal, streaming mode --stream-chunk-tokens 25 --max-sentence-chars 100):

T3 / S3Gen total size first-audio total wall cos sim¹
F16 / F32 (baseline) 1 757 MB 1 604 ms 28.6 s 1.000
Q8_0 / F32 1 476 MB 1 451 ms 27.2 s
F16 / Q8_0 1 532 MB 1 646 ms 28.2 s 0.991
Q8_0 / Q8_0 1 251 MB 1 399 ms 26.4 s 0.991
Q4_0 / Q4_0 1 071 MB 1 510 ms 26.7 s 0.66²

¹ Cosine similarity of the final waveform vs the F16/F32 baseline.

² Q4_0 quantization shifts the CFM diffusion ODE's trajectory enough to land on a different sample from the same noise seed. Subjective quality is essentially the same (in-distribution speech, correct phonemes, stable voice); it's just a different legitimate sample rather than a lower-fidelity version of the baseline.

Using both Q8_0 variants cuts ~500 MB off disk, drops first-audio latency ~13 %, speeds total wall-clock ~8 %, and produces audibly-identical output (cos-sim > 0.99 vs F32 reference waveform). Q4_0 trims another ~180 MB on top for roughly the same speed — best choice on memory-constrained targets (mobile, low-end CPUs) when you don't need per-seed reproducibility against the F32 baseline.

Note: the S3Gen requantize script only compresses the 385 big 2-D matmul weights (encoder attention/MLPs + CFM projections + flow FFs). The 1 664 other tensors — biases, norms, spectral filterbanks, the input-embedding table, the 3-D convolution weights — remain at their source dtype to keep numerics clean. That's why Q4_0 ends up only ~15 % smaller than Q8_0 rather than 2× smaller; the bulk not covered by block quantization dominates.

Pass the quantized GGUFs to chatterbox exactly like the defaults:

./build/chatterbox \
  --model      models/t3-q8_0.gguf \
  --s3gen-gguf models/chatterbox-s3gen-q8_0.gguf \
  --text "Hello from the quantized port." \
  --n-gpu-layers 99 --out out.wav

3. Run — end-to-end text → wav

The easiest way:

./scripts/synthesize.sh "Hello from native C plus plus." /tmp/out.wav

That's equivalent to running the binary directly:

./build/chatterbox \
  --model       models/chatterbox-t3-turbo.gguf \
  --s3gen-gguf  models/chatterbox-s3gen.gguf \
  --text        "Hello from native C plus plus." \
  --out         /tmp/out.wav

Everything is self-contained in the two .gguf files:

  • chatterbox-t3-turbo.gguf embeds the BPE tokenizer (vocab + merges + added tokens) as standard tokenizer.ggml.* metadata, which the C++ binary loads out of GGUF at startup.
  • chatterbox-s3gen.gguf embeds the built-in reference voice (embedding, prompt token, prompt mel) under s3gen/builtin/*.

Advanced modes:

  • T3 only — drop --s3gen-gguf + --out; write tokens with --output tokens.txt. Useful for piping into other tools.

  • S3Gen + HiFT only — pass --s3gen-gguf + --tokens-file FILE with already-generated speech tokens and no --model.

  • Custom voice (voice cloning) — point --reference-audio at a reference .wav and the C++ binary does everything else natively (no Python, no preprocessing step):

    ./build/chatterbox --model models/chatterbox-t3-turbo.gguf \
                       --s3gen-gguf models/chatterbox-s3gen.gguf \
                       --reference-audio me.wav \
                       --text "Hello in my voice." \
                       --out out.wav

    Requirements for the reference wav:

    • Strictly more than 5 s of clean mono speech (the binary enforces this and fails fast; 10–15 s gives the best similarity).
    • Any sample rate, any PCM bit-depth (binary resamples + downmixes).

    Prep helperscripts/extract-voice.py automates the usual chore of picking a good clip out of a messy recording (podcast, WhatsApp voice note, .mov screen capture, etc.):

    # auto-detect codec, pick the best 10 s speech block, write voices/alice.wav:
    ./scripts/extract-voice.py ~/Downloads/alice.m4a --name alice
    # same, but also bake the .npy profile in one go:
    ./scripts/extract-voice.py ~/Downloads/alice.m4a --name alice --bake

    It probes the file, runs silencedetect to find speech regions, picks the longest clean 5–15 s block from the middle of the recording (or concatenates the two best short blocks if no single long block exists), then applies a codec-aware filter chain:

    source codec chain applied
    WAV / FLAC / ≥ 96 kbps AAC / ≥ 128 kbps MP3 highpass + alimiter — minimal, trusts the source
    Opus / Vorbis at any bitrate, low-bitrate AAC/MP3 highpass + afftdn + 3-band EQ + loudnorm + alimiter — restores presence/air past the codec's brick-wall low-pass

    The lossy chain is what takes an 18 kbps Opus voice note from "clone sounds wrong" to "clone sounds like the speaker". See ./scripts/extract-voice.py --help for the full flag set.

    Loudness is normalised to -27 LUFS (ITU-R BS.1770-4 / EBU R 128) internally before preprocessing, so a quiet recording like a phone memo works as well as a studio track. All five voice-conditioning tensors are produced in C++:

    tensor source
    speaker_emb C++ VoiceEncoder (T3 GGUF)
    cond_prompt_speech_tokens C++ S3TokenizerV2 (S3Gen GGUF)
    prompt_token C++ S3TokenizerV2 (S3Gen GGUF)
    embedding C++ CAMPPlus (S3Gen GGUF)
    prompt_feat C++ mel extraction
  • Cache a voice for fast reuse (--save-voice) — voice preprocessing (VoiceEncoder + CAMPPlus + S3TokenizerV2 + mel) adds ≈ 2 minutes on a Mac before every synthesis. The five tensors don't depend on the text, so bake them once:

    # Bake the profile (no --text needed; just preprocesses + saves).
    ./build/chatterbox --model models/chatterbox-t3-turbo.gguf \
                       --s3gen-gguf models/chatterbox-s3gen.gguf \
                       --reference-audio me.wav \
                       --save-voice voices/me/
    # Writes voices/me/{speaker_emb, cond_prompt_speech_tokens,
    # embedding, prompt_token, prompt_feat}.npy (~160 KB total).
    
    # Reuse (≈ 17× faster; VoiceEncoder / CAMPPlus / S3TokenizerV2
    # / mel extraction are all skipped).
    ./build/chatterbox --model models/chatterbox-t3-turbo.gguf \
                       --s3gen-gguf models/chatterbox-s3gen.gguf \
                       --ref-dir voices/me/ \
                       --text "Anything you want." \
                       --out  out.wav

    You can mix the two: --ref-dir D --reference-audio X.wav will load any .npy present in D and compute the rest from X.wav. Useful during development when you want to iterate on one tensor.

Play the result:

afplay /tmp/out.wav         # macOS
aplay  /tmp/out.wav         # Linux (alsa)
ffplay /tmp/out.wav         # any OS with ffmpeg

Live / streaming input

When you want a long-running process that keeps the model loaded and synthesises whatever text arrives as it arrives — e.g. the output of a streaming LLM, a live transcription, or just a human typing — use --input-file. The binary tail -f's the file, splits on sentence terminators (or \n in --input-by-line mode), and pipes raw PCM (s16le, 24 kHz, mono) to stdout chunk-by-chunk.

# Two-process demo: background writer appends sentences, chatterbox
# tail-follows, sox plays in real time.
./build/chatterbox \
    --model       models/t3-q8_0.gguf \
    --s3gen-gguf  models/chatterbox-s3gen-q8_0.gguf \
    --ref-dir     voices/alice \
    --input-file  ./speech.txt \
    --input-by-line \
    --stream-chunk-tokens 25 --stream-cfm-steps 2 \
    --n-gpu-layers 99 \
    --out -                     \
  | play -q -t raw -r 24000 -b 16 -e signed -c 1 -   # sox(1)

# Another process (LLM, transcriber, shell, etc.) writes here:
echo "First request." >> speech.txt
echo "Second request with, internal, punctuation." >> speech.txt

Interactive mode on a TTY — pass --input-file - to read from stdin. On a terminal you get a > prompt; each Enter-terminated line is spoken immediately, Ctrl-D exits:

./build/chatterbox \
    --model       models/t3-q8_0.gguf \
    --s3gen-gguf  models/chatterbox-s3gen-q8_0.gguf \
    --ref-dir     voices/alice \
    --input-file  - --input-by-line \
    --stream-chunk-tokens 25 --stream-cfm-steps 2 \
    --n-gpu-layers 99 \
    --out -                     \
  | play -q -t raw -r 24000 -b 16 -e signed -c 1 -

Relevant flags:

flag effect
--input-file PATH Tail-follow PATH; - means read stdin (interactive on a TTY).
--input-by-line One Enter-terminated line = one request. . ! ? inside a line stay part of the same utterance (no mid-line restart).
--input-eof-marker STR Exit cleanly after seeing STR anywhere in the input (useful for scripted pipelines).
--stream-chunk-tokens N Speech-token chunk granularity for the S3Gen streaming loop. 25 is a good default.
--stream-cfm-steps N CFM Euler steps per chunk. 2 is the minimum the model was designed for; 4–5 gives crisper word endings on cloned voices.
--stream-first-chunk-tokens N Override the first chunk's size to minimise first-audio-out latency.

The process keeps the T3 + S3Gen models warm across requests, so after the initial load (~150 ms), each request only pays T3 + S3Gen inference cost (well under real-time on any GPU backend).

Useful flags

  • --seed N — change the RNG seed for the CFM initial noise and the SineGen excitation (same text, different voice "take").
  • --threads N — override the default std::thread::hardware_concurrency(). The sweet spot on a 10-core CPU is 10.
  • --n-gpu-layers N — move layers to the GPU backend when built with -DGGML_METAL=ON / -DGGML_CUDA=ON / -DGGML_VULKAN=ON. Pass 99 (or any large number) to move everything.
  • --reference-audio PATH — voice cloning input (see the Custom voice section above).
  • --save-voice DIR — cache the five voice-conditioning tensors for reuse via --ref-dir DIR.
  • --ref-dir DIR — load previously-baked voice tensors (or a subset) from DIR/*.npy.
  • --input-file PATH — long-running mode; tail-follow PATH and synthesise text as it arrives. Pass - to read from stdin (see the Live / streaming input section above).
  • --input-by-line — treat one newline as one complete request; . ! ? inside a line stay part of the same utterance.
  • --debug (requires --ref-dir) — substitute Python-dumped reference values for the random bits so every stage can be bit-exactly compared to PyTorch.

Performance

Reproducible perf check vs an ONNX Runtime Q4 baseline (same architecture) on the same machine. Shared setup:

  • Text: "Hello from native C plus plus. This audio was generated end to end on CPU using ggml."
  • Reference voice: test/reference-audio/jfk.wav (11 s mono 16 kHz)
  • Seed: 42, warm 3-run average, inference only (excludes model load)

Mac Studio M3 Ultra (96 GB unified memory)

Implementation Backend T3 gen S3Gen+HiFT gen Total inference RTF vs real-time
chatterbox.cpp Q4_0 Metal 573 ms / 155 tok 412 ms 985 ms 0.16 6.4×
chatterbox.cpp Q4_0 CPU (NEON+Accel) 2 045 ms / 178 tok 5 523 ms 7 568 ms 1.05 0.96×
ONNX Runtime Q4 baseline CPU 17 190 ms 3.18 0.31×

chatterbox.cpp (Metal) is 17.5× faster than ONNX Runtime on the same machine; the CPU-only build is still 2.3× faster.

Linux RTX 5090 + AMD Ryzen 9 9950X

Implementation Backend T3 gen S3Gen+HiFT gen Total inference RTF vs real-time
chatterbox.cpp Q4_0 Vulkan 241 ms / 161 tok 222 ms 463 ms 0.07 14.2×
chatterbox.cpp Q4_0 CPU (AVX) 2 161 ms / 161 tok 3 236 ms 5 397 ms 0.82 1.2×
ONNX Runtime Q4 baseline CPU 6 373 ms 1.18 0.85×

chatterbox.cpp (Vulkan) is 13.8× faster than ONNX Runtime on the same machine. Note that the ONNX Runtime baseline here only uses the CPU execution provider; a CUDA build would narrow the gap, but is not included in this comparison.

Per-stage S3Gen + HiFT breakdown (GPU builds)

Stage M3 Ultra Metal RTX 5090 Vulkan
T3 per token 3.70 ms / tok 1.50 ms/tok
encoder 38 ms 35 ms
cfm_step0 69 ms 84 ms
cfm_step1 49 ms 13 ms
cfm_total 124 ms 100 ms
f0_predictor 3.1 ms 1.1 ms
sinegen (CPU) 15 ms 16 ms
stft 3.1 ms 1.0 ms
hift_decode 225 ms 66 ms
hift_total 246 ms 84 ms

HiFT conv_transpose_1d upsampling is the single biggest stage on Metal today; the 5090 chews through it 3.4× faster, which is where the remaining end-to-end gap comes from.

Reproducing these numbers

# Build chatterbox.cpp, then:
./build/chatterbox \
    --model       models/chatterbox-t3-turbo.gguf \
    --s3gen-gguf  models/chatterbox-s3gen.gguf \
    --reference-audio test/reference-audio/jfk.wav \
    --text "Hello from native C plus plus. This audio was generated end to end on CPU using ggml." \
    --out /tmp/bench.wav \
    --seed 42 \
    --n-gpu-layers 99   # 0 or omit for CPU

The binary prints both the per-stage timings and BENCH: lines that scripts can scrape. Note: the binary also prints an inner === pipeline: … RTF=… === line — that RTF covers only the S3Gen + HiFT phase (the timer around s3gen_synthesize_to_wav, which runs after T3 is already done). The tables above report the full end-to-end number (T3_INFER + S3GEN_INFER).

gen_RTF = (T3_INFER_MS + S3GEN_INFER_MS) / AUDIO_MS

Token counts vary slightly across backends because the CPU-side sampler reads logits that come out of different float-reduction orders per backend; per-token T3 cost is the directly-comparable figure. Full development history and older backend combinations (F16 vs Q4_0 / Q5_0 / Q8_0, plus other machines) are in PROGRESS.md §3.10 / §3.13.

Streaming mode — low-latency playback

For interactive use cases, the binary can emit audio chunk-by-chunk as it's generated instead of waiting for the whole sentence to finish. Any non-zero --stream-chunk-tokens N turns streaming on.

Flags:

  • --stream-chunk-tokens N — main knob; N speech tokens per chunk (25 ≈ 1 s of audio, 50 ≈ 2 s).
  • --stream-first-chunk-tokens N — override the first chunk's size so first-audio-out lands early while later chunks stay big and keep overall RTF low. Typical: 10.
  • --stream-cfm-steps N — CFM Euler step count. Default 2 (matches Python meanflow). 1 halves CFM cost with a small quality penalty; Turbo's meanflow training makes 1-step a valid sampling mode per the paper.
  • --out - — emit raw s16le mono @ 24 kHz to stdout instead of writing a wav file, so the output can be piped straight into a player.

Recommended low-latency preset for interactive use:

brew install sox      # one-time, for the `play` command

./build/chatterbox \
    --model      models/chatterbox-t3-turbo.gguf \
    --s3gen-gguf models/chatterbox-s3gen.gguf \
    --text       "Hello from streaming Chatterbox." \
    --stream-first-chunk-tokens 10 \
    --stream-chunk-tokens       25 \
    --stream-cfm-steps          1 \
    --n-gpu-layers              99 \
    --out - \
  | play -q -t raw -r 24000 -b 16 -e signed -c 1 -

play ships with sox and routes straight to CoreAudio. If you prefer, the same stdout stream works with ffplay -f s16le -ar 24000 -ch_layout mono -nodisp -i - or piped through a Python sounddevice.play() one-liner; on some macOS 26 builds ffplay's SDL output is silent for raw piped audio, so sox play is the safest default.

You can also drop the --out - to get a regular wav:

./build/chatterbox … --stream-chunk-tokens 50 --out out.wav
afplay out.wav

Latency and throughput on an Apple M4 with the Metal backend and the preset above, feeding the sentence "Hello from streaming Chatterbox, I am John and I work in Google since 2010. I love to go out with my friends, eat some pizza and also drink some wine. I also love to travel around the world alone." (produces 317 speech tokens, ~12.7 s of audio):

metric value
first-audio-out latency 279 ms
chunk 1 (10-token bootstrap) RTF 0.99
chunks 2–13 (steady-state, 25 tokens each) RTF 0.30 – 0.63
chunk 14 (tail finalise) RTF 1.42
total wall time 11.5 s for 12.7 s of audio
overall RTF 0.90

The steady-state RTFs stay comfortably below 1.0, so the streamer sustainably pushes audio faster than real-time playback consumes it. Chunk 1 is small by design so first audio lands in ~280 ms; the final chunk is short and relatively slow (fixed encoder/CFM overhead amortised over only 0.4 s of audio).

For the full journal of how streaming got there — bit-exact CFM parity, cache_source + trim_fade port, --out - stdout wiring, per-chunk tuning — see PROGRESS.md §B1.

4. Optional: validate against PyTorch

Every stage of the pipeline has a numerical regression test against Python-dumped reference tensors:

./build/test-s3gen models/chatterbox-s3gen.gguf artifacts/s3gen-ref ALL

Expected output (rel error per stage):

Stage A  speaker_emb_affine    rel ≈ 1e-7
Stage B  input_embedded        rel = 0
Stage C  encoder_embed         rel ≈ 4e-7
Stage D  pre_lookahead         rel ≈ 3e-7
Stage E  enc_block0_out        rel ≈ 1e-7
Stage F  encoder_proj (mu)     rel ≈ 5e-7
Stage G1 time_mixer            rel ≈ 7e-7
Stage G2 cfm_resnet_out        rel ≈ 3e-7
Stage G3 tfm_out               rel ≈ 2e-7
Stage G4 cfm_step0_dxdt        rel ≈ 1e-6
Stage H1 f0                    rel ≈ 4e-6
Stage H3 conv_post             rel ≈ 6e-7
Stage H4 stft                  rel ≈ 8e-3 (boundary-bound)
Stage H5 waveform              rel ≈ 1e-4

For T3 bit-exact validation against the Python reference:

python scripts/reference-t3-turbo.py \
  --text "Hello from ggml." \
  --out-dir artifacts \
  --cpp-bin ./build/chatterbox \
  --cpp-model models/chatterbox-t3-turbo.gguf

Repository layout

chatterbox.cpp/
  ggml/                          pristine ggml clone (not tracked)
  src/
    main.cpp                     CLI + T3 runtime            (chatterbox)
    chatterbox_tts.cpp           S3Gen + HiFT pipeline       (linked into chatterbox)
    s3gen_pipeline.h             public API for the S3Gen+HiFT back half
    mel2wav.cpp                  HiFT-only demo              (mel2wav)
    gpt2_bpe.{h,cpp}             self-contained GPT-2 BPE tokenizer

    voice_features.{h,cpp}       WAV I/O, sinc resampler, LUFS meter,
                                   24 kHz & 16 kHz log-mel extraction,
                                   Kaldi-style 80-ch fbank
    voice_encoder.{h,cpp}        3-layer LSTM → 256-d speaker_emb
                                   (matches Resemble VoiceEncoder)
    campplus.{h,cpp}              FunASR x-vector port (FCM + 3× CAMDense
                                   TDNN) → 192-d embedding
    s3tokenizer.{h,cpp}          6-layer FSMN-attn transformer + FSQ →
                                   25-Hz speech tokens
    dr_wav.h                     vendored single-header WAV reader
    npy.h                        minimal .npy load / save + compare

    test_*.cpp                   per-stage numerical-parity harnesses
  scripts/
    synthesize.sh                text → wav wrapper
    convert-t3-turbo-to-gguf.py  T3 weights + tokenizer + VE + builtin
                                   voice → T3 GGUF
    convert-s3gen-to-gguf.py     S3Gen encoder + CFM + HiFT + CAMPPlus +
                                   S3TokenizerV2 + mel filterbanks →
                                   S3Gen GGUF
    dump-*-reference.py          PyTorch → .npy intermediates for the
                                   per-stage harnesses
    reference-t3-turbo.py        PyTorch T3 bit-exact compare vs C++
    compare-tokenizer.py         10-case BPE tokenizer compare vs HF
  patches/
    ggml-metal-chatterbox-ops.patch  ggml-metal fixes (see patches/README.md)
    README.md                    applies-to / what-it-does notes
  voices/                        baked voice profiles (not tracked; populated
                                   by --save-voice)
  models/                        generated GGUFs (not tracked)
  artifacts/                     .npy dumps for validation (not tracked)
  CMakeLists.txt                 top-level build
  README.md                      this file
  PROGRESS.md                    chronological development journal

Troubleshooting

error: this GGUF has no embedded tokenizer — you're running against a legacy T3 GGUF built before the tokenizer was embedded. Re-run the converter to produce a fresh GGUF:

python scripts/convert-t3-turbo-to-gguf.py --out models/chatterbox-t3-turbo.gguf

--debug requires --ref-dir — debug mode substitutes Python-dumped random bits to make every intermediate tensor bit-exactly comparable. Run python scripts/dump-s3gen-reference.py --out artifacts/s3gen-ref … first, then pass --ref-dir artifacts/s3gen-ref.

Output is much louder than the Python reference — expected: the Python reference dump uses a very short utterance (mostly silence). Generate a longer sentence and compare RMS. Differences up to ~2.5 % in spectrogram magnitude are from the stochastic SineGen excitation (non-bit-exact RNG between std::mt19937 and torch.rand).

Slower than real-time — make sure you built -DCMAKE_BUILD_TYPE=Release and that --threads picks up all your cores. The binary defaults to std::thread::hardware_concurrency().

License

Released under the MIT License — Copyright (c) 2026 Gianfranco Cordella. The bundled ggml/ is also MIT-licensed (ggml/LICENSE). The upstream Python implementation (Chatterbox, Copyright (c) 2025 Resemble AI) is likewise MIT-licensed; see LICENSE for the third-party attribution block.

About

Chatterbox Turbo text-to-speech ported to ggml

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • C++ 55.3%
  • C 33.0%
  • Python 11.0%
  • Other 0.7%