tts: chunked streaming acoustic decode to bound decode VRAM #5
Open
VelvetBeans wants to merge 1 commit into
The TTS acoustic decode upsampled the entire latent sequence to 24 kHz
audio in a single ggml graph, so peak VRAM grew ~linearly with clip
length (~161 MiB per latent frame on this build). A 12 GB GPU OOMs in
the decode once a clip passes ~20 frames (~2.7 s), even though the LM
and encoder fit comfortably -- making long clips and the Q8_0 1.5B model
undecodable on CUDA. There was no knob to chunk it.
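
The figures above can be tied together with a little arithmetic. The sketch below is illustrative only: it assumes the decode cost is exactly linear at the measured ~161 MiB per frame (the text says only "~linearly") and uses the total upsample factor 8·5·5·4·2·2 = 3200 at 24 kHz quoted in the PR description. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed constants, taken from the numbers quoted in this PR.
constexpr int    kSampleRateHz = 24000;
constexpr int    kUpsample     = 8 * 5 * 5 * 4 * 2 * 2;  // 3200 samples per latent frame
constexpr double kMiBPerFrame  = 161.0;                  // measured, build-specific

// Audio duration produced by `frames` latent frames.
constexpr double frames_to_seconds(int frames) {
    return static_cast<double>(frames) * kUpsample / kSampleRateHz;
}

// Rough decode activation estimate under the linear assumption.
constexpr double decode_mib(int frames) {
    return frames * kMiBPerFrame;
}
```

Under these assumptions, 20 frames is about 2.67 s of audio and roughly 3220 MiB of decode activations, while a 15-frame chunk would cap the estimate near 2415 MiB regardless of clip length.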
Decode the latent sequence in fixed-size frame chunks through a streaming
decoder that keeps a small per-conv left-context cache, mirroring the
existing long-form ASR encoder path (encoder_forward_streaming +
StreamingCache). Each chunk pushes only C frames through the decoder, so
peak activation memory is bounded to one chunk regardless of total
length, while the per-conv caches carry kernel-1 (regular convs) /
ceil((K-1)/stride) (transposed upsamplers) frames of context -- making
the concatenated output bit-exact with a single-shot decode.
New primitive: sconv_transpose1d_causal_streaming, the streaming causal
transposed convolution for the decoder's upsamplers (the ASR path only
streams downsampling convs, so the existing helper could not be reused).
Everything else reuses the established streaming building blocks
(sconv1d_causal_streaming, block1d_forward_streaming). Sequences <= chunk
still take the original single-shot path (renamed
decode_latent_single_shot), so short clips are byte-for-byte unchanged.
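
To illustrate why a small input-side cache is enough for a causal transposed convolution, here is a minimal single-channel sketch in plain C++. It is not the ggml-based sconv_transpose1d_causal_streaming from this PR; the names tconv1d_causal, TConvStream, and tconv1d_causal_chunked are hypothetical, and the cache length ceil((K-1)/stride) is simply the figure quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-shot causal transposed conv, one channel: scatter-add every input
// frame, then keep exactly T * stride samples (for K >= stride this drops
// the trailing K - stride samples).
std::vector<float> tconv1d_causal(const std::vector<float>& x,
                                  const std::vector<float>& w,
                                  std::size_t stride) {
    assert(!w.empty() && stride > 0);
    const std::size_t K = w.size();
    std::vector<float> y(x.empty() ? 0 : (x.size() - 1) * stride + K, 0.0f);
    for (std::size_t t = 0; t < x.size(); ++t)
        for (std::size_t k = 0; k < K; ++k)
            y[t * stride + k] += x[t] * w[k];
    y.resize(x.size() * stride);
    return y;
}

// Hypothetical streaming wrapper: remembers the last ceil((K-1)/stride)
// input frames so each chunk sees the left context it needs.
struct TConvStream {
    std::vector<float> w;
    std::size_t stride;
    std::vector<float> cache;

    std::vector<float> push(const std::vector<float>& chunk) {
        const std::size_t ctx = (w.size() - 1 + stride - 1) / stride;
        std::vector<float> buf(cache);
        buf.insert(buf.end(), chunk.begin(), chunk.end());
        std::vector<float> y = tconv1d_causal(buf, w, stride);
        // Outputs of the cached frames were already emitted by earlier calls.
        y.erase(y.begin(),
                y.begin() + static_cast<std::ptrdiff_t>(cache.size() * stride));
        const std::size_t keep = std::min(ctx, buf.size());
        cache.assign(buf.end() - static_cast<std::ptrdiff_t>(keep), buf.end());
        return y;
    }
};

// Feed x through the streaming wrapper `chunk` frames at a time.
std::vector<float> tconv1d_causal_chunked(const std::vector<float>& x,
                                          const std::vector<float>& w,
                                          std::size_t stride,
                                          std::size_t chunk) {
    assert(chunk > 0);
    TConvStream s{w, stride, {}};
    std::vector<float> out;
    for (std::size_t t0 = 0; t0 < x.size(); t0 += chunk) {
        const std::size_t t1 = std::min(x.size(), t0 + chunk);
        const std::vector<float> part(x.begin() + static_cast<std::ptrdiff_t>(t0),
                                      x.begin() + static_cast<std::ptrdiff_t>(t1));
        const std::vector<float> y = s.push(part);
        out.insert(out.end(), y.begin(), y.end());
    }
    return out;
}
```

In this sketch each output sample accumulates the same products in the same order whether or not the input is chunked, which is why the concatenated result matches the single-shot one exactly. Real backends may order their reductions differently.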
Default chunk: 15 frames on CUDA (safely under ggml-cuda's IM2COL
gridDim.y 65535 cap), 64 on CPU; override with VIBEVOICE_DECODE_CHUNK_FRAMES.
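
One way such a default-plus-override could be resolved is sketched below. This is an assumption-laden illustration, not the PR's decode_chunk_frames: the function name, the fallback on invalid input, and the upper sanity bound are all hypothetical; only the defaults (15 on CUDA, 64 on CPU) and the variable name come from the text above.

```cpp
#include <cstdlib>

// Hypothetical resolver: returns the backend default unless the override
// string parses as a positive integer within a sanity bound.
int resolve_decode_chunk_frames(bool on_cuda, const char* env_value) {
    const int fallback = on_cuda ? 15 : 64;  // defaults quoted in this PR
    if (env_value == nullptr || *env_value == '\0') return fallback;
    char* end = nullptr;
    const long v = std::strtol(env_value, &end, 10);
    // Assumed policy: ignore trailing junk, non-positive, or absurd values.
    if (*end != '\0' || v <= 0 || v > 1000000) return fallback;
    return static_cast<int>(v);
}
```

A caller would pass std::getenv("VIBEVOICE_DECODE_CHUNK_FRAMES") as the second argument. Taking the string as a parameter keeps the helper easy to test.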
Verification:
* CPU streaming vs single-shot: bit-exact (max abs diff 0.0,
0/35200 samples differ).
* CUDA: differs only by float rounding (RMS diff 0.082% of signal).
* A 26.7 s clip and the Q8_0 1.5B model -- both previously OOM in the
decode -- now complete end-to-end on a 12 GB RTX 4070.
* ctest green, no regressions.
Signed-off-by: VelvetBeans <velvetbeanvibes@gmail.com>
Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
The TTS acoustic decode (decode_latent_sequence → decoder_forward) upsamples the entire latent sequence to 24 kHz audio in a single ggml graph. The decoder's total upsample factor is the product of the ratios (8·5·5·4·2·2 = 3200), so the last stages materialize activation tensors at full audio resolution for all frames at once. Peak VRAM therefore grows roughly linearly with clip length (measured at ~161 MiB per latent frame on this build), and a 12 GB GPU OOMs in the decode once a clip passes ~20 frames (~2.7 s of audio), even though the LM and encoder fit comfortably. In practice the Q8_0 1.5B model and any clip longer than a couple of seconds could not be decoded on CUDA at all. There was no knob to chunk it.

Fix
Decode the latent sequence in fixed-size frame chunks through a streaming decoder that keeps a small per-conv left-context cache, mirroring the existing long-form ASR encoder streaming path (encoder_forward_streaming + StreamingCache). Each chunk pushes only C frames through the whole decoder, so peak activation memory is bounded to one chunk regardless of total length, while the per-conv caches carry just kernel-1 (regular convs) or ceil((K-1)/stride) (transposed upsamplers) frames of context, making the concatenated output bit-exact with a single-shot decode.

The one genuinely new primitive is a streaming causal transposed convolution (sconv_transpose1d_causal_streaming): the ASR encoder only ever downsamples (regular strided convs), whereas the decoder upsamples, so the existing sconv1d_causal_streaming couldn't be reused for the upsamplers. Everything else reuses the established streaming building blocks (sconv1d_causal_streaming, block1d_forward_streaming).

Sequences <= chunk still take the original single-shot path (renamed decode_latent_single_shot), so short clips are byte-for-byte unchanged. Default chunk size is 15 frames on CUDA (safely under ggml-cuda's IM2COL gridDim.y 65535 cap) and 64 on CPU, overridable with VIBEVOICE_DECODE_CHUNK_FRAMES.

Files
* src/conv1d.{hpp,cpp}: sconv_transpose1d_causal_streaming (the new, scrutiny-worthy piece)
* src/acoustic_tokenizer.{hpp,cpp}: decoder_forward_streaming (mirrors encoder_forward_streaming)
* src/vibevoice_tts.cpp: chunk driver (decode_latent_sequence dispatcher + run_decoder_chunk_streaming + decode_chunk_frames), original body renamed to decode_latent_single_shot

Verification
* CPU streaming vs single-shot: max abs diff = 0.0, 0 / 35200 samples differ.
* ctest green, no regressions.

(Hardware: RTX 4070, 12 GB; CUDA build.)
Caveats (not addressed here)
Open questions for maintainers
* VIBEVOICE_DECODE_CHUNK_FRAMES only. Would you prefer an explicit --decode-chunk-frames CLI flag too, to match --cfg / --steps / --max-frames?
* tests/test_decoder_chunked_parity.cpp mirroring the existing tests/test_encoder_chunked_parity.cpp (assert chunked == single-shot within tolerance, gated by VIBEVOICE_TEST_LARGE). In this PR or a follow-up?
* VIBEVOICE_DECODE_CHUNK_FRAMES in the README?
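
For the proposed parity test (chunked == single-shot within tolerance), the comparison could be as small as the sketch below. It is a hypothetical helper, not code from this PR or from tests/test_encoder_chunked_parity.cpp, and it assumes the reported metrics are max absolute difference, a count of differing samples, and RMS of the difference relative to RMS of the reference; how the PR actually computed its RMS figure is not stated.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ParityReport {
    float       max_abs_diff;  // largest |ref - got| over all samples
    std::size_t n_differ;      // samples that are not bitwise-equal in value
    double      rms_rel;       // RMS(diff) / RMS(ref); 0 if ref is silent
};

// Compare a chunked decode (`got`) against a single-shot decode (`ref`).
ParityReport compare_audio(const std::vector<float>& ref,
                           const std::vector<float>& got) {
    assert(ref.size() == got.size());
    ParityReport r{0.0f, 0, 0.0};
    double diff_sq = 0.0, ref_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float d = std::fabs(ref[i] - got[i]);
        if (d > r.max_abs_diff) r.max_abs_diff = d;
        if (ref[i] != got[i]) ++r.n_differ;
        diff_sq += static_cast<double>(d) * d;
        ref_sq  += static_cast<double>(ref[i]) * ref[i];
    }
    if (ref_sq > 0.0) r.rms_rel = std::sqrt(diff_sq / ref_sq);
    return r;
}
```

A CPU test could then require n_differ == 0, while a CUDA test could bound rms_rel by a small tolerance instead, since that backend is reported to differ by float rounding.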