tts: chunked streaming acoustic decode to bound decode VRAM #5

Open
VelvetBeans wants to merge 1 commit into localai-org:master from VelvetBeans:tts-chunked-streaming-decode

@VelvetBeans VelvetBeans commented May 30, 2026

🤖 Implemented with AI assistance (Claude Opus 4.8); authored, reviewed, and tested by the submitter.

Problem

The TTS acoustic decode (decode_latent_sequence → decoder_forward) upsamples the entire latent sequence to 24 kHz audio in a single ggml graph. The decoder's total upsample factor is the product of the ratios (8·5·5·4·2·2 = 3200 samples per latent frame), so the last stages materialize activation tensors at full audio resolution for all frames at once. Peak VRAM therefore grows roughly linearly with clip length (measured at ~161 MiB per latent frame on this build), and a 12 GB GPU OOMs in the decode once a clip passes ~20 frames (~2.7 s of audio), even though the LM and encoder fit comfortably. In practice, neither the Q8_0 1.5B model nor any clip longer than a couple of seconds could be decoded on CUDA at all, and there was no knob to chunk the decode.
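To make the arithmetic concrete, here is a small sketch of the numbers quoted above. The function names are hypothetical (they are not in the codebase), and the linear memory model is an assumption that ignores model weights and fixed overhead.

```cpp
// Figures quoted in the description; the linear VRAM model is an assumption.
static const int    kRatios[]    = {8, 5, 5, 4, 2, 2};  // decoder upsample ratios
static const int    kSampleRate  = 24000;               // 24 kHz output audio
static const double kMiBPerFrame = 161.0;               // measured on the submitter's build

// Output samples produced per latent frame (product of the ratios).
int total_upsample() {
    int factor = 1;
    for (int r : kRatios) factor *= r;
    return factor;
}

// Audio duration in seconds covered by n_frames latent frames.
double frames_to_seconds(int n_frames) {
    return static_cast<double>(n_frames) * total_upsample() / kSampleRate;
}

// Rough decode activation estimate in MiB. chunk_frames == 0 means single-shot,
// where every frame is live at once; otherwise at most one chunk is live.
double est_decode_mib(int n_frames, int chunk_frames) {
    const int live = (chunk_frames > 0 && chunk_frames < n_frames) ? chunk_frames
                                                                   : n_frames;
    return live * kMiBPerFrame;
}
```

Under this model the estimate for a 200-frame clip no longer depends on clip length once a chunk size is set, which is the property the fix below relies on.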

Fix

Decode the latent sequence in fixed-size frame chunks through a streaming decoder that keeps a small per-conv left-context cache, mirroring the existing long-form ASR encoder streaming path (encoder_forward_streaming + StreamingCache). Each chunk pushes only C frames through the whole decoder, so peak activation memory is bounded by one chunk regardless of total length. The per-conv caches carry just K - 1 frames of context for regular convs and ceil((K - 1) / stride) frames for transposed upsamplers, where K is the kernel size. That context is what makes the concatenated output match a single-shot decode: bit-exact on CPU, and within floating-point rounding on CUDA (see Verification).
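The left-context idea can be illustrated with a toy single-channel, stride-1 causal convolution. This is only a sketch: ToyStreamingConv and run_chunked are hypothetical names, and the real path works on ggml tensors through sconv1d_causal_streaming and StreamingCache rather than std::vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Toy causal conv: y[t] = sum_k w[k] * x[t - (K-1) + k], with x == 0 before the start.
struct ToyStreamingConv {
    std::vector<float> w;      // kernel taps, size K (assumed K >= 1)
    std::vector<float> cache;  // last K-1 inputs seen so far, zeros initially

    explicit ToyStreamingConv(std::vector<float> taps)
        : w(std::move(taps)), cache(w.size() - 1, 0.0f) {}

    // Process one chunk; the cache supplies the left context.
    std::vector<float> push(const std::vector<float>& x) {
        std::vector<float> buf(cache);  // [cache | chunk]
        buf.insert(buf.end(), x.begin(), x.end());
        const std::size_t K = w.size();
        std::vector<float> y(x.size());
        for (std::size_t t = 0; t < x.size(); ++t) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += w[k] * buf[t + k];
            y[t] = acc;
        }
        // Keep the trailing K-1 samples as context for the next chunk.
        cache.assign(buf.end() - static_cast<std::ptrdiff_t>(K - 1), buf.end());
        return y;
    }
};

// Feed x through a fresh conv in chunks of `chunk` samples and concatenate.
std::vector<float> run_chunked(const std::vector<float>& taps,
                               const std::vector<float>& x, std::size_t chunk) {
    ToyStreamingConv conv(taps);
    std::vector<float> out;
    for (std::size_t i = 0; i < x.size(); i += chunk) {
        const std::size_t n = std::min(chunk, x.size() - i);
        const auto first = x.begin() + static_cast<std::ptrdiff_t>(i);
        std::vector<float> part(first, first + static_cast<std::ptrdiff_t>(n));
        const std::vector<float> y = conv.push(part);
        out.insert(out.end(), y.begin(), y.end());
    }
    return out;
}
```

Because each output sample is accumulated from the same inputs in the same order, the chunked result equals a single push of the whole sequence in this toy; only the working buffer shrinks to one chunk plus K - 1 samples.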

The one genuinely new primitive is a streaming causal transposed convolution (sconv_transpose1d_causal_streaming): the ASR encoder only ever downsamples (regular strided convs), whereas the decoder upsamples, so the existing sconv1d_causal_streaming couldn't be reused for the upsamplers. Everything else reuses the established streaming building blocks (sconv1d_causal_streaming, block1d_forward_streaming).
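A toy version of the streaming causal transposed convolution looks roughly like the following. It is a sketch under stated assumptions, not the code in src/conv1d.cpp: it is single-channel, it assumes the causal variant keeps the first T * stride output samples and drops the right-hand tail, and the names tconv_causal and ToyStreamingTConv are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reference causal transposed conv: input frame t scatters x[t] * w[k] to output
// sample t*S + k. Only the first T*S samples are kept (assumed causal trim).
std::vector<float> tconv_causal(const std::vector<float>& x,
                                const std::vector<float>& w, std::size_t S) {
    std::vector<float> y(x.size() * S, 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t)
        for (std::size_t k = 0; k < w.size(); ++k) {
            const std::size_t n = t * S + k;
            if (n < y.size()) y[n] += x[t] * w[k];
        }
    return y;
}

// Streaming variant: cache the last P = ceil((K-1)/S) input frames, since those
// are the only earlier frames whose kernels can reach into the current chunk.
struct ToyStreamingTConv {
    std::vector<float> w;
    std::size_t S;
    std::vector<float> cache;  // zeros initially; zero frames contribute nothing

    ToyStreamingTConv(std::vector<float> taps, std::size_t stride)
        : w(std::move(taps)), S(stride),
          cache((w.size() - 1 + stride - 1) / stride, 0.0f) {}

    // Returns x.size() * S output samples for this chunk.
    std::vector<float> push(const std::vector<float>& x) {
        const std::size_t P = cache.size();
        std::vector<float> buf(cache);  // [cached frames | chunk]
        buf.insert(buf.end(), x.begin(), x.end());
        const std::vector<float> full = tconv_causal(buf, w, S);
        cache.assign(buf.end() - static_cast<std::ptrdiff_t>(P), buf.end());
        // Skip the P*S samples that belong to the cached (already emitted) frames.
        return std::vector<float>(full.begin() + static_cast<std::ptrdiff_t>(P * S),
                                  full.end());
    }
};
```

The difference from the downsampling case is that the cache is measured in input frames while the overlap it resolves lives in the output, which is why the context length depends on both the kernel size and the stride.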

Sequences of at most one chunk still take the original single-shot path (renamed decode_latent_single_shot), so short clips are byte-for-byte unchanged. The default chunk size is 15 frames on CUDA (safely under ggml-cuda's IM2COL gridDim.y cap of 65535) and 64 on CPU, overridable with the VIBEVOICE_DECODE_CHUNK_FRAMES environment variable.
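The chunk-size resolution and dispatch rule can be sketched as follows. The helper names, the DecodePath enum, and the fallback policy for invalid values are assumptions made for illustration; only the environment variable name and the two defaults come from this PR.

```cpp
#include <cstdlib>

// Hypothetical helper: parse an override string, falling back to the defaults
// quoted in this PR (15 frames on CUDA, 64 on CPU) when it is unset or invalid.
int resolve_chunk_frames(bool on_cuda, const char* env_value) {
    const int fallback = on_cuda ? 15 : 64;
    if (env_value == nullptr || *env_value == '\0') return fallback;
    char* end = nullptr;
    const long v = std::strtol(env_value, &end, 10);
    // Rejecting trailing garbage and non-positive values is an assumed policy.
    if (*end != '\0' || v <= 0 || v > 1000000) return fallback;
    return static_cast<int>(v);
}

// Reads the real environment variable named in the PR.
int chunk_frames_from_env(bool on_cuda) {
    return resolve_chunk_frames(on_cuda, std::getenv("VIBEVOICE_DECODE_CHUNK_FRAMES"));
}

enum class DecodePath { SingleShot, ChunkedStreaming };

// Sequences of at most one chunk keep the original single-shot path.
DecodePath pick_path(int n_frames, int chunk_frames) {
    return n_frames <= chunk_frames ? DecodePath::SingleShot
                                    : DecodePath::ChunkedStreaming;
}
```

Splitting the parsing from the getenv call keeps the rule testable without touching the process environment.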

Files

  • src/conv1d.{hpp,cpp}: sconv_transpose1d_causal_streaming (the new, scrutiny-worthy piece)
  • src/acoustic_tokenizer.{hpp,cpp}: decoder_forward_streaming (mirrors encoder_forward_streaming)
  • src/vibevoice_tts.cpp: chunk driver (decode_latent_sequence dispatcher + run_decoder_chunk_streaming + decode_chunk_frames), original body renamed to decode_latent_single_shot

Verification

  • Bit-exact on CPU (deterministic backend): single-shot vs forced 4-frame streaming, same seed → max abs diff = 0.0, 0 / 35200 samples differ.
  • CUDA differs only by floating-point rounding (convs run over different tensor shapes per chunk): RMS difference 0.082% of signal RMS — inaudible.
  • Previously-OOM cases now pass on CUDA: a 26.7 s clip (200 frames) and the Q8_0 1.5B model end-to-end, both of which previously OOM'd in the decode.
  • Test suite: ctest green, no regressions.

(Hardware: RTX 4070, 12 GB; CUDA build.)
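For reference, the three parity figures reported above (max abs diff, number of differing samples, RMS difference relative to signal RMS) can be computed along these lines. This is a minimal sketch with hypothetical names, not the script used for the measurements.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityStats {
    double      max_abs_diff;  // largest per-sample absolute difference
    std::size_t n_differ;      // samples that are not exactly equal
    double      rms_ratio;     // RMS(test - ref) / RMS(ref); 0 if ref is silent
};

// Compares two decoded waveforms over their common length.
ParityStats compare_audio(const std::vector<float>& ref,
                          const std::vector<float>& test) {
    const std::size_t n = std::min(ref.size(), test.size());
    double max_abs = 0.0, sum_d2 = 0.0, sum_r2 = 0.0;
    std::size_t differ = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double r = static_cast<double>(ref[i]);
        const double d = static_cast<double>(test[i]) - r;
        max_abs = std::max(max_abs, std::fabs(d));
        if (d != 0.0) ++differ;
        sum_d2 += d * d;
        sum_r2 += r * r;
    }
    // The 1/n factors of the two RMS values cancel in the ratio.
    const double ratio = sum_r2 > 0.0 ? std::sqrt(sum_d2 / sum_r2) : 0.0;
    return {max_abs, differ, ratio};
}
```

A bit-exact run corresponds to all three fields being zero; the CUDA figure above corresponds to an rms_ratio of about 0.00082.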

Caveats (not addressed here)

  • Speed is LM-bound, not decode-bound. This PR removes the decode memory wall; it is not a throughput optimization.
  • Long-form generation stability (drift on long single generations, short-input speech-end behavior) is a separate, pre-existing issue in the autoregressive path and is untouched here — the decoder renders faithfully whatever latents it's given.

Open questions for maintainers

  1. CLI flag vs env var. Chunk size is currently VIBEVOICE_DECODE_CHUNK_FRAMES only. Would you prefer an explicit --decode-chunk-frames CLI flag too, to match --cfg / --steps / --max-frames?
  2. Decoder parity test. Happy to add tests/test_decoder_chunked_parity.cpp mirroring the existing tests/test_encoder_chunked_parity.cpp (assert chunked == single-shot within tolerance, gated by VIBEVOICE_TEST_LARGE). In this PR or a follow-up?
  3. README note. Should this also document VIBEVOICE_DECODE_CHUNK_FRAMES in the README?
  4. Default chunk sizes. 15 (CUDA) / 64 (CPU) target a 12 GB card + the IM2COL cap. Prefer a fixed conservative default or a VRAM-aware heuristic for larger GPUs?

Verification:

  • CPU streaming vs single-shot: bit-exact (max abs diff 0.0, 0/35200 samples differ).
  • CUDA: differs only by float rounding (RMS diff 0.082% of signal).
  • A 26.7 s clip and the Q8_0 1.5B model -- both previously OOM in the decode -- now complete end-to-end on a 12 GB RTX 4070.
  • ctest green, no regressions.

Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>

The TTS acoustic decode upsampled the entire latent sequence to 24 kHz
audio in a single ggml graph, so peak VRAM grew ~linearly with clip
length (~161 MiB per latent frame on this build). A 12 GB GPU OOMs in
the decode once a clip passes ~20 frames (~2.7 s), even though the LM
and encoder fit comfortably -- making long clips and the Q8_0 1.5B model
undecodable on CUDA. There was no knob to chunk it.

Decode the latent sequence in fixed-size frame chunks through a streaming
decoder that keeps a small per-conv left-context cache, mirroring the
existing long-form ASR encoder path (encoder_forward_streaming +
StreamingCache). Each chunk pushes only C frames through the decoder, so
peak activation memory is bounded to one chunk regardless of total
length, while the per-conv caches carry kernel-1 (regular convs) /
ceil((K-1)/stride) (transposed upsamplers) frames of context -- making
the concatenated output bit-exact with a single-shot decode.

New primitive: sconv_transpose1d_causal_streaming, the streaming causal
transposed convolution for the decoder's upsamplers (the ASR path only
streams downsampling convs, so the existing helper could not be reused).
Everything else reuses the established streaming building blocks
(sconv1d_causal_streaming, block1d_forward_streaming). Sequences <= chunk
still take the original single-shot path (renamed
decode_latent_single_shot), so short clips are byte-for-byte unchanged.
Default chunk: 15 frames on CUDA (safely under ggml-cuda's IM2COL
gridDim.y 65535 cap), 64 on CPU; override with VIBEVOICE_DECODE_CHUNK_FRAMES.

Verification:
  * CPU streaming vs single-shot: bit-exact (max abs diff 0.0,
    0/35200 samples differ).
  * CUDA: differs only by float rounding (RMS diff 0.082% of signal).
  * A 26.7 s clip and the Q8_0 1.5B model -- both previously OOM in the
    decode -- now complete end-to-end on a 12 GB RTX 4070.
  * ctest green, no regressions.

Signed-off-by: VelvetBeans <velvetbeanvibes@gmail.com>
Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>