Improve inference consistency and runtime stability for streaming and vLLM backends by KEVINTUAN12 · Pull Request #1864 · FunAudioLLM/CosyVoice

KEVINTUAN12 · 2026-03-25T09:19:13Z

This PR improves inference consistency and runtime stability across different CosyVoice inference backends.

While testing recent local changes on top of the latest upstream branch, we observed three practical issues:

- the vLLM path could produce noticeably different generation results from the original PyTorch path, especially in streaming TTS, where generated speech could become longer due to decoding divergence;
- the streaming inference path still relied on polling-based waits, which introduced unnecessary CPU overhead and wakeup jitter;
- after optimizing TensorRT (TRT) single-concurrency execution to avoid an extra dedicated CUDA stream, the estimator path still assumed the stream object was always valid, which could cause runtime failures.
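
To illustrate the second issue, here is a minimal sketch of replacing a polling loop with a condition-variable wait. It is not taken from the PR diff: `TokenBuffer` and its methods are hypothetical names, and the only assumption is that a producer thread appends speech tokens while a streaming consumer waits for enough of them to synthesize the next chunk.

```python
import threading


class TokenBuffer:
    """Hypothetical buffer shared by a token producer and a streaming consumer."""

    def __init__(self):
        self._tokens = []
        self._done = False
        self._cond = threading.Condition()

    def extend(self, new_tokens):
        # Producer side: append tokens and wake any waiting consumer.
        with self._cond:
            self._tokens.extend(new_tokens)
            self._cond.notify_all()

    def finish(self):
        # Producer side: signal that no more tokens will arrive.
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def wait_for(self, n, timeout=None):
        """Block until at least n tokens exist or the producer has finished.

        Returns a snapshot of the tokens and the finished flag. The thread
        sleeps inside the condition variable instead of waking up on a
        fixed polling interval.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: len(self._tokens) >= n or self._done,
                timeout=timeout,
            )
            return list(self._tokens), self._done
```

A consumer would call `buf.wait_for(required_tokens)` where it previously looped over a short `time.sleep(...)`; the wakeup then happens when the producer notifies, rather than on the next polling tick.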

This PR keeps the latest upstream features and applies the local fixes on top of them, so generation behavior is closer across backends and the runtime path is more robust.
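
The stream-validity issue described above can be sketched as a small guard. This is an illustration under stated assumptions, not the PR's actual change: `stream_context`, `run_estimator`, and `FakeStream` are hypothetical names, and `FakeStream` merely stands in for whatever stream context the real code enters (in a PyTorch setting that might be the object returned by `torch.cuda.stream(stream)`).

```python
from contextlib import nullcontext


class FakeStream:
    """Stand-in for a dedicated stream context (hypothetical, for illustration)."""

    def __init__(self):
        self.entered = 0

    def __enter__(self):
        self.entered += 1
        return self

    def __exit__(self, *exc):
        return False


def stream_context(stream):
    # When no dedicated stream was created (for example, single-concurrency
    # execution), fall back to a no-op context instead of assuming the
    # stream object is valid.
    return stream if stream is not None else nullcontext()


def run_estimator(step, stream=None):
    # Run one estimator step, inside the dedicated stream only if one exists.
    with stream_context(stream):
        return step()
```

With this shape, the caller does not need to know whether a dedicated stream was allocated: `run_estimator(step)` and `run_estimator(step, stream)` follow the same code path.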

fix(inference): Improve inference consistency and runtime stability for streaming and vLLM backends

b4388fc

KEVINTUAN12 wants to merge 1 commit into FunAudioLLM:main from KEVINTUAN12:fix-cosyvoice2-vllm-consistency
