Conversation
@zmy1116 Thanks for the amazing work! The current code is at least very useful for offline TTS. You might also try the new PyTorch workflow. A BERT example: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/models/modeling_bert.py. It has a much easier attention interface and supports a custom attention mask: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/attention_backend/interface.py#L634.
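For illustration, this is roughly what "a torch-like attention interface that supports a custom attention mask" means. The sketch below uses plain PyTorch (`torch.nn.functional.scaled_dot_product_attention`), not the TensorRT-LLM `_torch` backend itself, and the shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

batch, heads, seq, dim = 1, 8, 16, 64
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Arbitrary boolean mask: True = attend, False = masked out.
# Any pattern can be passed, not just the built-in causal/padding masks.
custom_mask = torch.ones(seq, seq, dtype=torch.bool).tril()  # e.g. causal

out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```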
Thanks! A torch-like API interface is definitely helpful... I will try to go through it this weekend.
@yuekaizhang I went through the _torch part of trtllm, but it appears that it also does not accept a custom mask type. If I am not mistaken, for streaming our mask needs a block-wise layout that the built-in mask types don't cover (a sketch of that kind of mask is below). The same issue seems to exist even if I want to use `Attention` directly. Oh well.. I think the other PR with the custom LN plugin works already, so no need to investigate this further.
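The exact mask was elided from the comment above; what follows is only a guess at the chunk-based (block-causal) layout that streaming generation typically needs, with an assumed chunk size:

```python
import torch

def chunked_streaming_mask(seq_len: int, chunk: int) -> torch.Tensor:
    """Block-causal mask: each position attends to every position up to the
    end of its own chunk (True = attend). The chunk size is an assumption."""
    idx = torch.arange(seq_len)
    # Last index visible to position i is the end of i's chunk.
    visible_until = (idx // chunk + 1) * chunk - 1
    return idx.unsqueeze(0) <= visible_until.unsqueeze(1)

print(chunked_streaming_mask(6, 2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```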
This PR adds TensorRT-LLM conversion code for the estimator of the CFM model.
The conversion code mimics the DiT and STDiT model code in TensorRT-LLM:
https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models/stdit
https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models/dit
The FP16 engine runs 5x faster on an L4 GPU compared with the original Torch model, and the generated audio sounds clean.
example.wav
Conversion is done directly in the Docker environment `soar97/triton-cosyvoice:25.06`; no extra installation is needed. Instructions are in README.md.

Unresolved Issue
The converted engine currently does not accept an `attention_mask` (hence, it cannot do streaming generation). The attention module is extended from `tensorrt_llm.layers.attention.BertAttention` to add in the RoPE part. However, the bert attention plugin does not accept an attention mask. The alternative solution is to disable the `bert_attention_plugin`, but the audio quality would be lower for many of our testing voices.

@yuekaizhang if you can, take a look and let me know how to deal with this. Meanwhile I will try to see if I can use `tensorrt_llm.layers.attention.Attention` directly.. the thing is it has many more parameters and I keep getting errors when converting.
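For reference, a conceptual sketch of the "BERT attention + RoPE" extension described above, written in plain PyTorch rather than the actual `tensorrt_llm.layers` API (the function name and shapes are illustrative assumptions, not this PR's code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, heads, seq, dim) with even dim; rotates channel pairs by a
    position-dependent angle (standard interleaved RoPE)."""
    b, h, s, d = x.shape
    pos = torch.arange(s, dtype=x.dtype)
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    angles = torch.einsum("s,f->sf", pos, freqs)  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# RoPE is applied to q and k only, then bidirectional (BERT-style,
# non-causal) attention proceeds as usual over q, k, v.
q = apply_rope(torch.randn(1, 8, 16, 64))
k = apply_rope(torch.randn(1, 8, 16, 64))
```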