tts: chunked streaming acoustic decode to bound decode VRAM by VelvetBeans · Pull Request #5 · localai-org/vibevoice.cpp

VelvetBeans · 2026-05-30T05:44:36Z

🤖 Implemented with AI assistance (Claude Opus 4.8); authored, reviewed, and tested by the submitter.

Problem

The TTS acoustic decode (decode_latent_sequence → decoder_forward) upsamples the entire latent sequence to 24 kHz audio in a single ggml graph. The decoder's total upsample factor is the product of the ratios (8·5·5·4·2·2 = 3200), so each latent frame becomes 3200 samples (7.5 frames per second of audio) and the last stages materialize activation tensors at full audio resolution for all frames at once. Peak VRAM therefore grows roughly linearly with clip length, measured at ~161 MiB per latent frame on this build, and a 12 GB GPU OOMs in the decode once a clip passes ~20 frames (~2.7 s of audio), even though the LM and encoder fit comfortably. In practice, neither the Q8_0 1.5B model nor any clip longer than a couple of seconds could be decoded on CUDA at all, and there was no knob to chunk the decode.
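The arithmetic above can be checked with a small sketch. It uses only the figures quoted in this description (the six ratios, 24 kHz, ~161 MiB per frame); the function names are illustrative and do not exist in the codebase, and the memory estimate is a plain linear model that ignores weights and LM memory.

```cpp
#include <cassert>
#include <cstdint>

// Upsample ratios, sample rate, and per-frame cost as quoted in the PR text.
static const int kUpsampleRatios[] = {8, 5, 5, 4, 2, 2};
static const int kSampleRateHz = 24000;
static const double kMiBPerLatentFrame = 161.0;  // measured on the author's build

// Total upsample factor = product of the per-stage ratios.
inline int total_upsample_factor() {
    int factor = 1;
    for (int r : kUpsampleRatios) factor *= r;
    return factor;
}

// Audio samples produced by n latent frames.
inline std::int64_t samples_for_frames(std::int64_t n_frames) {
    assert(n_frames >= 0);
    return n_frames * total_upsample_factor();
}

// Clip duration in seconds for n latent frames.
inline double seconds_for_frames(std::int64_t n_frames) {
    return static_cast<double>(samples_for_frames(n_frames)) / kSampleRateHz;
}

// Rough single-shot decode activation estimate (linear model, illustrative only).
inline double est_decode_mib(std::int64_t n_frames) {
    assert(n_frames >= 0);
    return kMiBPerLatentFrame * static_cast<double>(n_frames);
}
```

With these numbers, 20 frames come out at roughly 2.7 s, and 200 frames at roughly 26.7 s, matching the clip lengths quoted in this PR.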

Fix

Decode the latent sequence in fixed-size frame chunks through a streaming decoder that keeps a small per-conv left-context cache, mirroring the existing long-form ASR encoder streaming path (encoder_forward_streaming + StreamingCache). Each chunk pushes only C frames (the chunk size) through the whole decoder, so peak activation memory is bounded to one chunk regardless of total length. The per-conv caches carry just K-1 frames of context for regular convs and ceil((K-1)/stride) frames for the transposed upsamplers, where K is the kernel size. That is all the left context each conv needs, so the concatenated output matches a single-shot decode: bit-exact on CPU, and within floating-point rounding on CUDA (see Verification).
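To illustrate why a K-1 sample left-context cache is enough for a regular causal conv, here is a toy single-channel, stride-1 sketch. It is not the ggml implementation: ToyStreamingCache and the function names are hypothetical stand-ins, and the real StreamingCache layout is not shown in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy causal conv, stride 1: y[t] = sum_k w[k] * x[t - (K-1) + k].
// Samples before the start of `x` come from `left_ctx` (exactly K-1 values).
std::vector<float> causal_conv1d(const std::vector<float>& left_ctx,
                                 const std::vector<float>& x,
                                 const std::vector<float>& w) {
    const std::size_t K = w.size();
    assert(K >= 1 && left_ctx.size() == K - 1);
    std::vector<float> padded(left_ctx);
    padded.insert(padded.end(), x.begin(), x.end());
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t)
        for (std::size_t k = 0; k < K; ++k)
            y[t] += w[k] * padded[t + k];
    return y;
}

// Hypothetical per-conv cache: the last K-1 input samples (zeros at the start).
struct ToyStreamingCache {
    std::vector<float> ctx;
    explicit ToyStreamingCache(std::size_t K) : ctx(K - 1, 0.0f) {}
};

// Convolve one chunk against the cached context, then roll the cache forward.
std::vector<float> causal_conv1d_streaming(ToyStreamingCache& cache,
                                           const std::vector<float>& chunk,
                                           const std::vector<float>& w) {
    std::vector<float> y = causal_conv1d(cache.ctx, chunk, w);
    std::vector<float> joined(cache.ctx);
    joined.insert(joined.end(), chunk.begin(), chunk.end());
    const std::ptrdiff_t keep = static_cast<std::ptrdiff_t>(cache.ctx.size());
    cache.ctx.assign(joined.end() - keep, joined.end());
    return y;
}

// Run the whole sequence in chunks of `chunk_len` and concatenate the outputs.
std::vector<float> run_chunked(const std::vector<float>& x,
                               const std::vector<float>& w,
                               std::size_t chunk_len) {
    assert(chunk_len >= 1 && !w.empty());
    ToyStreamingCache cache(w.size());
    std::vector<float> out;
    for (std::size_t i = 0; i < x.size(); i += chunk_len) {
        const std::size_t n = std::min(chunk_len, x.size() - i);
        const std::vector<float> chunk(x.begin() + static_cast<std::ptrdiff_t>(i),
                                       x.begin() + static_cast<std::ptrdiff_t>(i + n));
        const std::vector<float> y = causal_conv1d_streaming(cache, chunk, w);
        out.insert(out.end(), y.begin(), y.end());
    }
    return out;
}
```

Because each output sample sums the same terms in the same order in both paths, the chunked result equals the single-shot result exactly in this toy, for any chunk length.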

The one genuinely new primitive is a streaming causal transposed convolution (sconv_transpose1d_causal_streaming): the ASR encoder only ever downsamples (regular strided convs), whereas the decoder upsamples, so the existing sconv1d_causal_streaming couldn't be reused for the upsamplers. Everything else reuses the established streaming building blocks (sconv1d_causal_streaming, block1d_forward_streaming).
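As a rough illustration of the idea behind the new primitive, the toy below streams a single-channel causal transposed conv by re-running the cached input frames together with the new chunk and discarding the output samples that were already emitted. It is a sketch under assumptions, not the code in src/conv1d.cpp: the names are hypothetical, the "trim the trailing K - stride samples" causal convention is assumed, and K >= stride is required.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy causal transposed conv: input frame n adds x[n] * w[k] to output sample
// n*stride + k. The trailing K - stride samples are trimmed (assumed causal
// convention), so T input frames yield exactly T*stride output samples.
std::vector<float> conv_transpose1d_causal(const std::vector<float>& x,
                                           const std::vector<float>& w,
                                           std::size_t stride) {
    const std::size_t K = w.size();
    assert(stride >= 1 && K >= stride);
    if (x.empty()) return {};
    std::vector<float> full((x.size() - 1) * stride + K, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < K; ++k)
            full[n * stride + k] += x[n] * w[k];
    full.resize(x.size() * stride);  // drop the non-causal tail
    return full;
}

// Left context in input frames, using the formula quoted in the PR text.
std::size_t transpose_ctx_frames(std::size_t K, std::size_t stride) {
    return (K - 1 + stride - 1) / stride;  // ceil((K-1)/stride)
}

// Hypothetical cache: up to transpose_ctx_frames() previous input frames.
struct ToyTransposeCache {
    std::vector<float> ctx;
};

std::vector<float> conv_transpose1d_causal_streaming(ToyTransposeCache& cache,
                                                     const std::vector<float>& chunk,
                                                     const std::vector<float>& w,
                                                     std::size_t stride) {
    const std::size_t n_ctx = cache.ctx.size();
    std::vector<float> joined(cache.ctx);
    joined.insert(joined.end(), chunk.begin(), chunk.end());
    const std::vector<float> y = conv_transpose1d_causal(joined, w, stride);
    // Samples produced by the cached frames were emitted by earlier chunks.
    std::vector<float> out(y.begin() + static_cast<std::ptrdiff_t>(n_ctx * stride), y.end());
    const std::size_t keep = std::min(joined.size(), transpose_ctx_frames(w.size(), stride));
    cache.ctx.assign(joined.end() - static_cast<std::ptrdiff_t>(keep), joined.end());
    return out;
}

// Run the whole sequence in chunks of `chunk_len` and concatenate the outputs.
std::vector<float> run_transpose_chunked(const std::vector<float>& x,
                                         const std::vector<float>& w,
                                         std::size_t stride,
                                         std::size_t chunk_len) {
    assert(chunk_len >= 1);
    ToyTransposeCache cache;
    std::vector<float> out;
    for (std::size_t i = 0; i < x.size(); i += chunk_len) {
        const std::size_t n = std::min(chunk_len, x.size() - i);
        const std::vector<float> chunk(x.begin() + static_cast<std::ptrdiff_t>(i),
                                       x.begin() + static_cast<std::ptrdiff_t>(i + n));
        const std::vector<float> y = conv_transpose1d_causal_streaming(cache, chunk, w, stride);
        out.insert(out.end(), y.begin(), y.end());
    }
    return out;
}
```

An output sample only receives contributions from the current frame and a few earlier ones, never from later frames, which is why a short input-side cache suffices and the upsampler can be streamed without lookahead.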

Sequences <= chunk still take the original single-shot path (renamed decode_latent_single_shot), so short clips are byte-for-byte unchanged. Default chunk size is 15 frames on CUDA (safely under ggml-cuda's IM2COL gridDim.y 65535 cap) and 64 on CPU, overridable with VIBEVOICE_DECODE_CHUNK_FRAMES.
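The dispatch described above can be sketched as follows. Only the environment variable name and the 15 / 64 defaults come from this PR; the helper names are hypothetical, and falling back to the default on an empty, non-numeric, or non-positive value is an assumption about reasonable behaviour rather than a statement of what decode_chunk_frames does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <utility>
#include <vector>

// Defaults quoted in the PR text.
static const int kDefaultChunkCuda = 15;
static const int kDefaultChunkCpu = 64;

// Parse an override string. Assumption: invalid or non-positive values fall
// back to the backend default instead of raising an error.
int resolve_chunk_frames(const char* env_value, bool is_cuda) {
    const int fallback = is_cuda ? kDefaultChunkCuda : kDefaultChunkCpu;
    if (env_value == nullptr || *env_value == '\0') return fallback;
    char* end = nullptr;
    const long v = std::strtol(env_value, &end, 10);
    if (*end != '\0' || v <= 0 || v > std::numeric_limits<int>::max()) return fallback;
    return static_cast<int>(v);
}

// Thin wrapper that reads the real environment variable.
int chunk_frames_from_env(bool is_cuda) {
    return resolve_chunk_frames(std::getenv("VIBEVOICE_DECODE_CHUNK_FRAMES"), is_cuda);
}

// Sequences of at most one chunk keep the original single-shot path.
bool use_single_shot(int n_frames, int chunk) {
    return n_frames <= chunk;
}

// (start, length) frame ranges for the streaming path.
std::vector<std::pair<int, int> > plan_chunks(int n_frames, int chunk) {
    assert(n_frames >= 0 && chunk > 0);
    std::vector<std::pair<int, int> > plan;
    for (int start = 0; start < n_frames; start += chunk)
        plan.push_back(std::make_pair(start, std::min(chunk, n_frames - start)));
    return plan;
}
```

For the 200-frame clip mentioned under Verification, a chunk size of 15 gives 14 chunks, the last one holding the remaining 5 frames.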

Files

src/conv1d.{hpp,cpp} — sconv_transpose1d_causal_streaming (the new, scrutiny-worthy piece)
src/acoustic_tokenizer.{hpp,cpp} — decoder_forward_streaming (mirrors encoder_forward_streaming)
src/vibevoice_tts.cpp — chunk driver (decode_latent_sequence dispatcher + run_decoder_chunk_streaming + decode_chunk_frames), original body renamed to decode_latent_single_shot

Verification

Bit-exact on CPU (deterministic backend): single-shot vs forced 4-frame streaming, same seed → max abs diff = 0.0, 0 / 35200 samples differ.
CUDA differs only by floating-point rounding (convs run over different tensor shapes per chunk): RMS difference 0.082% of signal RMS — inaudible.
Previously-OOM cases now pass on CUDA: a 26.7 s clip (200 frames) and the Q8_0 1.5B model end-to-end, both of which previously OOM'd in the decode.
Test suite: ctest green, no regressions.
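For reference, the three figures reported above (max abs diff, number of differing samples, RMS difference relative to signal RMS) can be computed along these lines. This is a minimal sketch with hypothetical names, not the script used to produce the numbers in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityStats {
    double max_abs_diff;     // largest |ref[i] - test[i]|
    std::size_t n_differ;    // samples that are not bit-identical in value
    double rms_diff_ratio;   // RMS(ref - test) / RMS(ref), 0 if ref is silent
};

// Compare a reference (single-shot) waveform against a test (chunked) one.
ParityStats compare_audio(const std::vector<float>& ref, const std::vector<float>& test) {
    assert(ref.size() == test.size());
    ParityStats s = {0.0, 0, 0.0};
    double sum_sq_diff = 0.0;
    double sum_sq_ref = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(test[i]);
        if (std::fabs(d) > s.max_abs_diff) s.max_abs_diff = std::fabs(d);
        if (ref[i] != test[i]) ++s.n_differ;
        sum_sq_diff += d * d;
        sum_sq_ref += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    if (sum_sq_ref > 0.0) s.rms_diff_ratio = std::sqrt(sum_sq_diff / sum_sq_ref);
    return s;
}
```

A bit-exact run reports 0.0, 0, and 0.0; the CUDA figure quoted above corresponds to rms_diff_ratio of about 0.00082.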

(Hardware: RTX 4070, 12 GB; CUDA build.)

Caveats (not addressed here)

Speed is LM-bound, not decode-bound. This PR removes the decode memory wall; it is not a throughput optimization.
Long-form generation stability (drift on long single generations, short-input speech-end behavior) is a separate, pre-existing issue in the autoregressive path and is untouched here — the decoder renders faithfully whatever latents it's given.

Open questions for maintainers

CLI flag vs env var. Chunk size is currently VIBEVOICE_DECODE_CHUNK_FRAMES only. Would you prefer an explicit --decode-chunk-frames CLI flag too, to match --cfg / --steps / --max-frames?
Decoder parity test. Happy to add tests/test_decoder_chunked_parity.cpp mirroring the existing tests/test_encoder_chunked_parity.cpp (assert chunked == single-shot within tolerance, gated by VIBEVOICE_TEST_LARGE). In this PR or a follow-up?
README note. Should this also document VIBEVOICE_DECODE_CHUNK_FRAMES in the README?
Default chunk sizes. 15 (CUDA) / 64 (CPU) target a 12 GB card + the IM2COL cap. Prefer a fixed conservative default or a VRAM-aware heuristic for larger GPUs?
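If the VRAM-aware option is preferred, one possible shape is sketched below, purely as a starting point for discussion. It reuses the figures from this description (~161 MiB per frame, factor 3200, cap 65535). Two things are assumptions: that the caller can supply a decode memory budget in MiB, and that the IM2COL gridDim.y limit scales with chunk frames times the upsample factor. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Figures quoted in the PR text.
static const std::int64_t kMiBPerFrame = 161;      // measured on the author's build
static const std::int64_t kUpsampleFactor = 3200;  // 8*5*5*4*2*2
static const std::int64_t kIm2colGridCap = 65535;  // ggml-cuda IM2COL gridDim.y cap

// Pick the largest chunk that fits the memory budget and stays under the cap.
// Assumption: the capped dimension is chunk_frames * kUpsampleFactor.
int vram_aware_chunk_frames(std::int64_t decode_budget_mib) {
    assert(decode_budget_mib >= 0);
    const std::int64_t by_vram = decode_budget_mib / kMiBPerFrame;
    const std::int64_t by_grid = kIm2colGridCap / kUpsampleFactor;
    const std::int64_t chunk = std::min(by_vram, by_grid);
    return static_cast<int>(std::max<std::int64_t>(1, chunk));  // always at least 1 frame
}
```

Under these assumptions the cap, not VRAM, becomes the binding limit on larger GPUs, so the practical gain over the fixed default of 15 would be modest.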

Signed-off-by: VelvetBeans <velvetbeanvibes@gmail.com> Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>

VelvetBeans wants to merge 1 commit into localai-org:master from VelvetBeans:tts-chunked-streaming-decode
