cfm model trtllm conversion #1796

Draft

zmy1116 wants to merge 2 commits into FunAudioLLM:main from zmy1116:flow_trtllm_conversion

Conversation

@zmy1116

@zmy1116 zmy1116 commented Jan 19, 2026

This is the conversion code for converting the estimator of the CFM model to TRTLLM.

The conversion code mimics the DiT and STDiT model code in TRTLLM:
https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models/stdit
https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models/dit

The FP16 engine runs 5x faster on an L4 GPU compared with the original Torch model, and the generated audio sounds clean.
example.wav

Conversion is done directly in the docker environment soar97/triton-cosyvoice:25.06; no extra installation is needed. Instructions are in README.md.

Unresolved Issue

The converted engine currently does not accept an attention_mask (hence it cannot do streaming generation). The attention module extends BertAttention (from tensorrt_llm.layers.attention import BertAttention) to add the RoPE part. However, the BERT attention plugin does not accept an attention mask:

        # Excerpt from BertAttention.forward in TensorRT-LLM: the plugin path
        # takes input_lengths but exposes no attention_mask argument.
        if default_net().plugin_config.bert_attention_plugin:
            # TRT plugin mode
            assert input_lengths is not None
            context = bert_attention(
                qkv,
                input_lengths,
                self.num_attention_heads,
                self.attention_head_size,
                q_scaling=self.q_scaling,
                relative_attention=self.relative_attention,
                max_distance=self.max_distance,
                relative_attention_bias=self.rel_attn_table.value
                if self.relative_attention else None,
                max_input_length=max_input_length,
                cp_group=self.cp_group,
                cp_size=self.cp_size,
                cp_rank=self.cp_rank)
        else:
            # (the non-plugin implementation follows in the original source)
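
For reference, here is a minimal, hypothetical sketch (plain PyTorch, not the PR's conversion code and not TRT-LLM graph ops) of the RoPE application mentioned above, i.e. what the BertAttention extension has to add on top of the stock layer:

    import torch

    def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        # x: [batch, heads, seq_len, head_dim]; head_dim must be even.
        b, h, t, d = x.shape
        inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
        angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)  # [t, d/2]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    # q and k would be rotated like this before the attention call:
    # q, k = apply_rope(q), apply_rope(k)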

The alternative is to disable the bert_attention_plugin, but then the audio quality is lower for many of our test voices.

@yuekaizhang, could you take a look and let me know how to deal with this? Meanwhile, I will try to see if I can use from tensorrt_llm.layers.attention import Attention directly. The issue is that it has many more parameters, and I keep getting errors when converting.

@zmy1116 zmy1116 marked this pull request as draft January 19, 2026 02:43
@yuekaizhang
Contributor

@zmy1116 Thanks for the amazing work!

The current code is, at the very least, very useful for offline TTS.

I think you could also try testing with the new PyTorch workflow. A BERT example: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_bert.py

It has a much easier attention interface and supports a custom attention mask: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/attention_backend/interface.py#L634
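
As a plain-PyTorch illustration of that kind of interface (this is not the TRT-LLM _torch API, just what an attention call with an arbitrary user-supplied mask looks like):

    import torch
    import torch.nn.functional as F

    # Illustration only (plain PyTorch, not the TRT-LLM _torch backend): an
    # attention call that accepts an arbitrary boolean mask.
    q = k = v = torch.randn(1, 8, 6, 64)                    # [batch, heads, seq, head_dim]
    custom_mask = torch.ones(6, 6, dtype=torch.bool).tril() # any custom pattern works here
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)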

@zmy1116 zmy1116 mentioned this pull request Jan 21, 2026
@zmy1116
Author

zmy1116 commented Jan 21, 2026

> @zmy1116 Thanks for the amazing work!
>
> The current code is, at the very least, very useful for offline TTS.
>
> I think you could also try testing with the new PyTorch workflow. A BERT example: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_bert.py
>
> It has a much easier attention interface and supports a custom attention mask: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/attention_backend/interface.py#L634

Thanks! A torch-like API interface is definitely helpful... I will try to go through it this weekend.

@zmy1116
Author

zmy1116 commented Jan 30, 2026

@yuekaizhang I went through the _torch part of TRTLLM, but it appears that it also does not accept a custom mask type:

            if attention_mask == PredefinedAttentionMask.CAUSAL:
                mask_type = AttentionMaskType.causal
            elif attention_mask == PredefinedAttentionMask.FULL:
                mask_type = AttentionMaskType.padding
            else:
                raise ValueError("Unexpected attention mask type")

https://github.com/NVIDIA/TensorRT-LLM/blob/29a203aedbd65630f68f5b0b91e420437d87bea3/tensorrt_llm/_torch/attention_backend/trtllm.py#L424C1-L430C1

If I am not mistaken, for streaming our mask needs to look like:
[1 1 1 0 0 0 ]
[1 1 1 1 0 0 ]
[1 1 1 1 1 1 ]
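
A quick sketch of that mask in plain PyTorch (hypothetical values, just reading the per-row visible key counts off the example rows above):

    import torch

    # Per-query visible key counts, read off the example rows above.
    visible = torch.tensor([3, 4, 6])
    num_keys = 6
    mask = (torch.arange(num_keys)[None, :] < visible[:, None]).int()
    # -> [[1, 1, 1, 0, 0, 0],
    #     [1, 1, 1, 1, 0, 0],
    #     [1, 1, 1, 1, 1, 1]]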

The same issue seems to exist even if I want to use from tensorrt_llm.layers.attention import Attention directly:

        if default_net().plugin_config.gpt_attention_plugin:
            if self.cross_attention and (past_key_value is not None):
                past_key_value = kv_cache_params.past_key_value[1]
            assert self.attention_mask_type in [
                AttentionMaskType.causal, AttentionMaskType.bidirectional,
                AttentionMaskType.bidirectionalglm,
                AttentionMaskType.blocksparse
            ], 'Plugin only support masked MHA.'

Oh well... I think the other PR with the custom LN plugin already works, so there's no need to investigate this further.
