Skip to content

Conversation

@uulasb
Copy link

@uulasb uulasb commented Dec 17, 2025

Fixes #3039

Summary

Replaces the outdated regex-based speaker detection with a robust, self-hosted LLM solution using Qwen2.5-1.5B-Instruct and llama-cpp-python. This PR significantly improves accuracy by distinguishing between addressed speakers (e.g., "Hey Alice") and mentioned names (e.g., "I told Alice"), while preserving legacy compatibility.


Changes

Core Functionality

  • Implemented Addressee Detection: Uses Qwen2.5-1.5B to identify who is being spoken TO.

    • Supports multiple addressees: "Alice and Bob, come here" → ["Alice", "Bob"]
    • Strict exclusion for mentioned names: "I told Alice" → None
  • Restored Legacy Compatibility: The detect_speaker_from_text function now correctly uses the original multi-language regex patterns for self-identification (e.g., "I am Alice"), ensuring existing backend logic remains unbroken.

Performance & Reliability

  • Performance Optimization:

    • Thread-safe Singleton: Model loads only once across the application lifecycle
    • GPU Acceleration: Auto-offloads to Metal (Mac) or CUDA (NVIDIA) via n_gpu_layers=-1
    • Silent Warmup: Eliminates cold-start latency on first request
  • Reliability:

    • Strict JSON output schema enforcement
    • Proper logging for initialization and warmup failures

Documentation & Testing

  • Documentation: Added backend/README_SPEAKER_ID.md with setup/usage instructions
  • Testing: Added comprehensive unit tests in backend/tests/test_speaker_identification.py

Verification Results

Ran comprehensive test suite test_speaker_identification.py:

=======================================================
      OMI SPEAKER IDENTIFICATION VERIFICATION
=======================================================
[TEST 1] Legacy Regex: Self-Identification
------------------------------------------
✅ Input: 'I am Alice' -> Alice
✅ Input: 'My name is Bob' -> Bob
✅ Input: 'Je m'appelle Pierre' -> Pierre
✅ Input: 'Hey Alice, help me' -> None

[TEST 2] LLM: Addressee Detection
---------------------------------
✅ Input: 'Hey Alice, can you help?' -> ['Alice']
✅ Input: 'Bob, come here quickly.' -> ['Bob']
✅ Input: 'John and Mary, listen up.' -> ['John', 'Mary']
✅ Input: 'I told Alice about the meeting.' -> None
✅ Input: 'I saw Bob yesterday.' -> None

=======================================================
REGEX RESULTS: 6/6 passed
LLM RESULTS:   8/8 passed
=======================================================

Setup

pip install llama-cpp-python

# Download Model (1.1GB)
curl -L -o backend/utils/qwen_1.5b_speaker.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf

AI Disclosure

Tools used: Cursor / Gemini

…tion

Fixes BasedHardware#3039

- Replace regex-based speaker identification with Qwen2.5-1.5B-Instruct LLM
- Distinguish between addressed vs mentioned speakers
- Support multiple addressees (returns list)
- Add GPU acceleration (Metal/CUDA)
- Thread-safe singleton pattern
- Add README_SPEAKER_ID.md with setup instructions

Performance:
- 100% accuracy on test suite
- ~300ms latency with GPU
- Model: qwen_1.5b_speaker.gguf (1.1GB, Apache 2.0)
…tection

- Restore detect_speaker_from_text() with original multi-language regex patterns
  for self-identification (e.g., 'I am Alice', 'My name is Bob')
- Keep identify_speaker_from_transcript() for LLM-based addressee detection
  (e.g., 'Hey Alice, help' -> ['Alice'])
- Fix warmup exception handler with proper logging
- Both functions now coexist for different use cases
- Add STRICT EXCLUSION RULE in prompt for verbs like told/said/saw/asked
- 'I told Alice' now correctly returns null (mentioned, not addressed)
- 'Hey Alice, help' still correctly returns ['Alice'] (addressed)
- Fixes false positive detection of mentioned names
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant improvement by replacing the previous regex-based speaker detection with a more robust, self-hosted LLM solution. The implementation is well-done, incorporating thread-safe model loading, GPU acceleration, and a warmup mechanism to reduce latency. The code is clean, and the inclusion of new documentation and comprehensive tests is excellent. I have one suggestion to improve the performance of the legacy regex function to better align with its stated goal.

@beastoin
Copy link
Collaborator

Interesting.

@uulasb what do you think is the best model we can use for speaker identification and transcript cleaning?

@thainguyensunya please remind me once we have the self-hosted Llama so we can continue with this ticket.

@beastoin
Copy link
Collaborator

ah, one more thing, it would be great if you guys could work together to make it happen.

https://github.com/BasedHardware/omi/pull/3817/changes#diff-52e392e28d7c9113854b824355974c705167c7f3e95cc44cd7b8baf360fb849eR26-R45

we need to self-host the model so we can test it on our dev environment first, then move to production later.

thank you.

@thainguyensunya
Copy link
Collaborator

thainguyensunya commented Dec 25, 2025

@uulasb For your information, we will have a separate/external self-hosted LLM inference with OpenAI-compatiable API Server endpoint. Specifically, we will use vLLM for inference, Llama-3.1-8B-Instruct model and 1 x NVIDIA L4 GPU.

This approach supports serving high throughput inference on production and optimize GPU power effectively.
So you may need to modify your code to adapt to this external self-hosted LLM approach.

Please let me know your thought. (Do you think Llama-3.1-8B-Instruct is overkill for speaker identification and transcript cleaning?)

… transcript cleaning

- Replace llama-cpp-python with strict AsyncOpenAI client
- Add transcript cleaning to system prompt
- Update unit tests with AsyncMock and edge cases (12/12 pass)
- Update documentation for VLLM_ env vars
- Remove local dependencies
@uulasb
Copy link
Author

uulasb commented Dec 25, 2025

@thainguyensunya @beastoin Thanks for the guidance! I completely agree with the move to external vLLM, it makes the backend much lighter and easier to scale. I've just pushed the refactor to match your roadmap.

@thainguyensunya On the 8B model size, If this were just for name detection, I'd agree it's overkill. However, to justify using the L4 GPU, I updated the prompt to also handle Transcript Cleaning in the same pass. It now identifies the speaker AND scrubs filler words ("um", "uh") / fixes grammar simultaneously. We get a much better user experience for the same inference cost, which makes the 8B model a great fit.

@beastoin As requested, I have removed the local GGUF/llama-cpp code entirely. I've switched the backend to use AsyncOpenAI, which clears out the heavy llama-cpp dependencies and keeps the event loop non blocking. I also made sure to keep the original regex function synchronous, so we don't accidentally break any legacy calls.

I verified the logic on Groq (simulating your setup) and it's hitting 300ms.

For Configuration, I standardized the environment variables for your vLLM deployment as follows:

VLLM_API_BASE
VLLM_API_KEY
VLLM_MODEL_NAME

uulasb added 3 commits January 1, 2026 18:13
- Moved speaker_identification.py -> text_speaker_detection.py to avoid conflict with upstream audio code
- Updated imports in transcribe.py and verify_llama_8b.py
- Renamed and updated tests/test_speaker_identification.py -> tests/test_text_speaker_detection.py
@uulasb
Copy link
Author

uulasb commented Jan 1, 2026

@thainguyensunya @beastoin I noticed main recently introduced a new "speaker_identification.py" for audio embedding logic. To resolve the merge conflict and keep concerns separate, I have renamed my module to "backend/utils/text_speaker_detection.py". This ensures my vLLM/Text logic coexists cleanly with your new Audio logic without overwriting it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use NER (Named Entity Recognition) or better techniques (like self-hosted LLM) to improve speaker detection based on transcripts ($500)

3 participants