Hi! Interesting approach — meeting transcription using just an LLM.
For the speech-to-text step, have you considered SenseVoice? It could complement or replace the LLM-only approach for initial transcription:
SenseVoice advantages
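- Fast: non-autoregressive architecture; the project reports SenseVoice-Small as roughly 5x faster than Whisper-Small
- ~234M params (SenseVoiceSmall): lightweight, can run on CPU
- Built-in emotion detection and audio event classification; speaker diarization comes from a separate FunASR model (cam++) rather than SenseVoice itself
- Multilingual with automatic language detection: the project cites 50+ languages, though it is worth checking which ones the Small checkpoint covers
- OpenAI-compatible API: drop-in replacement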
Quick start
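from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",  # voice activity detection, splits long recordings into segments
    spk_model="cam++",  # speaker diarization; relies on timestamps, so check that your FunASR version supports it with SenseVoice
)
result = model.generate(input="meeting.wav", language="auto", use_itn=True)
# result is a list of dicts; result[0]["text"] carries inline language, emotion, and audio event tags
print(rich_transcription_postprocess(result[0]["text"]))  # cleaned-up transcript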
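If you would rather keep the language, emotion, and event information as structured fields, the `text` value of each result can be split into tags and plain transcript. This is only a sketch: it assumes the tags appear inline in the form `<|en|><|NEUTRAL|><|Speech|>`, and `split_tags` is a hypothetical helper, not part of FunASR.

```python
import re

# Assumed tag format: <|something|> placed inline in the transcript text.
TAG = re.compile(r"<\|([^|]*)\|>")

def split_tags(raw: str) -> tuple[list[str], str]:
    """Return (tags, plain_text) for a string carrying inline <|...|> tags."""
    tags = TAG.findall(raw)           # tag names, in order of appearance
    text = TAG.sub("", raw).strip()   # transcript with the tags removed
    return tags, text

tags, text = split_tags("<|en|><|NEUTRAL|><|Speech|><|woitn|>let's start the meeting")
print(tags)  # ['en', 'NEUTRAL', 'Speech', 'woitn']
print(text)  # let's start the meeting
```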
Combining SenseVoice (fast ASR) with your LLM approach (structuring/summarization) could give the best of both worlds: accurate transcription with intelligent formatting.
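As a rough idea of what the handoff could look like, here is a sketch that turns ASR segments into a speaker-labeled transcript and wraps it in a prompt. The segment shape (`speaker`, `text`) and both helper names are assumptions for illustration; the real output will depend on the ASR setup, and the prompt wording is just a placeholder.

```python
def format_transcript(segments: list[dict]) -> str:
    """Merge consecutive segments from the same speaker into labeled lines."""
    lines: list[str] = []
    last_speaker = None
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments (e.g. silence)
        if lines and seg["speaker"] == last_speaker:
            lines[-1] += " " + text  # same speaker keeps talking
        else:
            lines.append(f"Speaker {seg['speaker']}: {text}")
            last_speaker = seg["speaker"]
    return "\n".join(lines)

def build_prompt(segments: list[dict]) -> str:
    """Wrap the transcript in a placeholder summarization instruction."""
    return (
        "Summarize this meeting and list action items.\n\n"
        + format_transcript(segments)
    )

# Hypothetical segments, shaped the way a diarized ASR step might emit them.
segments = [
    {"speaker": 0, "text": "Let's review the release plan."},
    {"speaker": 0, "text": "Any blockers?"},
    {"speaker": 1, "text": "CI is still flaky."},
]
print(build_prompt(segments))
```

The string from `build_prompt` would then go to whatever LLM call the project already uses, so the model only has to structure and summarize text.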
Links