Feature/fix 3039 speaker #3817
Conversation
…tion

Fixes BasedHardware#3039
- Replace regex-based speaker identification with Qwen2.5-1.5B-Instruct LLM
- Distinguish between addressed vs mentioned speakers
- Support multiple addressees (returns list)
- Add GPU acceleration (Metal/CUDA)
- Thread-safe singleton pattern
- Add README_SPEAKER_ID.md with setup instructions

Performance:
- 100% accuracy on test suite
- ~300ms latency with GPU
- Model: qwen_1.5b_speaker.gguf (1.1GB, Apache 2.0)
…tection

- Restore detect_speaker_from_text() with original multi-language regex patterns for self-identification (e.g., 'I am Alice', 'My name is Bob')
- Keep identify_speaker_from_transcript() for LLM-based addressee detection (e.g., 'Hey Alice, help' -> ['Alice'])
- Fix warmup exception handler with proper logging
- Both functions now coexist for different use cases
- Add STRICT EXCLUSION RULE in prompt for verbs like told/said/saw/asked
- 'I told Alice' now correctly returns null (mentioned, not addressed)
- 'Hey Alice, help' still correctly returns ['Alice'] (addressed)
- Fixes false positive detection of mentioned names
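For context, a minimal sketch of the kind of system prompt these commits describe — the actual prompt text lives in the PR's module and is not reproduced on this page, so the wording below is illustrative only:

```python
# Illustrative only: the PR's real prompt wording is not shown on this page.
SYSTEM_PROMPT = """You identify who a speaker is talking TO in a transcript.
Return a JSON list of addressed names (e.g. ["Alice"]), or null if nobody
is directly addressed.

STRICT EXCLUSION RULE: a name that follows a reporting verb such as
"told", "said", "saw", or "asked" is only MENTIONED, never addressed.

Examples:
  "Hey Alice, help" -> ["Alice"]   (addressed)
  "I told Alice"    -> null        (mentioned, not addressed)
"""
```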
Code Review
This pull request introduces a significant improvement by replacing the previous regex-based speaker detection with a more robust, self-hosted LLM solution. The implementation is well-done, incorporating thread-safe model loading, GPU acceleration, and a warmup mechanism to reduce latency. The code is clean, and the inclusion of new documentation and comprehensive tests is excellent. I have one suggestion to improve the performance of the legacy regex function to better align with its stated goal.
Interesting. @uulasb what do you think is the best model we can use for speaker identification and transcript cleaning? @thainguyensunya please remind me once we have the self-hosted Llama so we can continue with this ticket.
ah, one more thing, it would be great if you guys could work together to make it happen. we need to self-host the model so we can test it on our dev environment first, then move to production later. thank you.
@uulasb For your information, we will have a separate/external self-hosted LLM inference service with an OpenAI-compatible API server endpoint. Specifically, we will use vLLM for inference, the Llama-3.1-8B-Instruct model, and 1 x NVIDIA L4 GPU. This approach supports high-throughput inference in production and uses the GPU efficiently. Please let me know your thoughts. (Do you think Llama-3.1-8B-Instruct is overkill for speaker identification and transcript cleaning?)
… transcript cleaning

- Replace llama-cpp-python with strict AsyncOpenAI client
- Add transcript cleaning to system prompt
- Update unit tests with AsyncMock and edge cases (12/12 pass)
- Update documentation for VLLM_ env vars
- Remove local dependencies
@thainguyensunya @beastoin Thanks for the guidance! I completely agree with the move to external vLLM; it makes the backend much lighter and easier to scale. I've just pushed the refactor to match your roadmap.

@thainguyensunya On the 8B model size: if this were just for name detection, I'd agree it's overkill. However, to justify using the L4 GPU, I updated the prompt to also handle transcript cleaning in the same pass. It now identifies the speaker AND scrubs filler words ("um", "uh") / fixes grammar simultaneously. We get a much better user experience for the same inference cost, which makes the 8B model a great fit.

@beastoin As requested, I have removed the local GGUF/llama-cpp code entirely. I've switched the backend to use AsyncOpenAI, which clears out the heavy llama-cpp dependencies and keeps the event loop non-blocking. I also made sure to keep the original regex function synchronous, so we don't accidentally break any legacy calls. I verified the logic on Groq (simulating your setup) and it's hitting ~300ms.

For configuration, I standardized the environment variables for your vLLM deployment under the VLLM_ prefix (e.g., VLLM_API_BASE).
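As a rough sketch of that client setup against the OpenAI-compatible vLLM endpoint — VLLM_API_BASE is named above, while VLLM_API_KEY, VLLM_MODEL, and the prompt text are assumed names for illustration, not the PR's actual configuration:

```python
import asyncio
import os

from openai import AsyncOpenAI

# VLLM_API_BASE is named in the comment above; VLLM_API_KEY and VLLM_MODEL
# are assumed variable names for this sketch.
client = AsyncOpenAI(
    base_url=os.environ["VLLM_API_BASE"],             # e.g. "http://vllm-host:8000/v1"
    api_key=os.environ.get("VLLM_API_KEY", "EMPTY"),  # vLLM accepts any key by default
)

async def identify_and_clean(transcript: str) -> str:
    # One pass over the transcript: addressee detection plus
    # filler-word removal and grammar cleanup, per the comment above.
    resp = await client.chat.completions.create(
        model=os.environ.get("VLLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        messages=[
            {"role": "system", "content": "Identify who is being addressed and return a cleaned transcript."},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(asyncio.run(identify_and_clean("Hey Alice, um, can you help me out?")))
```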
- Moved speaker_identification.py -> text_speaker_detection.py to avoid conflict with upstream audio code
- Updated imports in transcribe.py and verify_llama_8b.py
- Renamed and updated tests/test_speaker_identification.py -> tests/test_text_speaker_detection.py
@thainguyensunya @beastoin I noticed main recently introduced a new "speaker_identification.py" for audio embedding logic. To resolve the merge conflict and keep concerns separate, I have renamed my module to "backend/utils/text_speaker_detection.py". This ensures my vLLM/Text logic coexists cleanly with your new Audio logic without overwriting it.
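Concretely, call sites only needed their import path updated — a hypothetical before/after (the exact import root may differ in the repo):

```python
# Before the rename (this path now belongs to the upstream audio-embedding module):
# from utils.speaker_identification import identify_speaker_from_transcript

# After the rename (this PR's text-based module):
from utils.text_speaker_detection import identify_speaker_from_transcript
```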
Fixes #3039

Summary

Replaces the outdated regex-based speaker detection with a robust, self-hosted LLM solution using Qwen2.5-1.5B-Instruct and llama-cpp-python. This PR significantly improves accuracy by distinguishing between addressed speakers (e.g., "Hey Alice") and mentioned names (e.g., "I told Alice"), while preserving legacy compatibility.

Changes

Core Functionality
- Implemented Addressee Detection: uses Qwen2.5-1.5B to identify who is being spoken TO. Returns a list of addressees such as ["Alice", "Bob"], or None when no one is directly addressed (see the usage sketch after this list).
- Restored Legacy Compatibility: the detect_speaker_from_text function now correctly uses the original multi-language regex patterns for self-identification (e.g., "I am Alice"), ensuring existing backend logic remains unbroken.

Performance & Reliability
- Performance Optimization: GPU acceleration (Metal/CUDA) via n_gpu_layers=-1
- Reliability: thread-safe singleton model loading plus a warmup pass to cut first-request latency
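A usage sketch of the two coexisting functions — the return shapes come from the commits above, but the exact signatures are assumed (and after the later vLLM refactor the LLM path became async):

```python
from utils.speaker_identification import (  # later renamed to text_speaker_detection
    detect_speaker_from_text,               # legacy regex, synchronous
    identify_speaker_from_transcript,       # LLM-based addressee detection
)

detect_speaker_from_text("My name is Bob")           # -> "Bob"      (self-identification)
identify_speaker_from_transcript("Hey Alice, help")  # -> ["Alice"]  (addressed)
identify_speaker_from_transcript("I told Alice")     # -> None       (mentioned, not addressed)
```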
Documentation & Testing
- Added backend/README_SPEAKER_ID.md with setup/usage instructions
- Added backend/tests/test_speaker_identification.py

Verification Results

Ran the comprehensive test suite in test_speaker_identification.py.

Setup

    pip install llama-cpp-python

    # Download Model (1.1GB)
    curl -L -o backend/utils/qwen_1.5b_speaker.gguf \
      https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf

AI Disclosure

Tools used: Cursor / Gemini