Feature/fix 3039 speaker #3817

uulasb · 2025-12-17T11:03:33Z

Fixes #3039

Summary

Replaces the outdated regex-based speaker detection with a robust, self-hosted LLM solution using Qwen2.5-1.5B-Instruct and llama-cpp-python. This PR significantly improves accuracy by distinguishing between addressed speakers (e.g., "Hey Alice") and mentioned names (e.g., "I told Alice"), while preserving legacy compatibility.

Changes

Core Functionality

Implemented Addressee Detection: Uses Qwen2.5-1.5B to identify who is being spoken TO.
- Supports multiple addressees: "Alice and Bob, come here" → ["Alice", "Bob"]
- Strict exclusion for mentioned names: "I told Alice" → None
Restored Legacy Compatibility: The detect_speaker_from_text function now correctly uses the original multi-language regex patterns for self-identification (e.g., "I am Alice"), ensuring existing backend logic remains unbroken.

Performance & Reliability

Performance Optimization:
- Thread-safe Singleton: Model loads only once across the application lifecycle
- GPU Acceleration: Auto-offloads to Metal (Mac) or CUDA (NVIDIA) via n_gpu_layers=-1
- Silent Warmup: Eliminates cold-start latency on first request
Reliability:
- Strict JSON output schema enforcement
- Proper logging for initialization and warmup failures

Documentation & Testing

Documentation: Added backend/README_SPEAKER_ID.md with setup/usage instructions
Testing: Added comprehensive unit tests in backend/tests/test_speaker_identification.py

Verification Results

Ran comprehensive test suite test_speaker_identification.py:

=======================================================
      OMI SPEAKER IDENTIFICATION VERIFICATION
=======================================================
[TEST 1] Legacy Regex: Self-Identification
------------------------------------------
✅ Input: 'I am Alice' -> Alice
✅ Input: 'My name is Bob' -> Bob
✅ Input: 'Je m'appelle Pierre' -> Pierre
✅ Input: 'Hey Alice, help me' -> None

[TEST 2] LLM: Addressee Detection
---------------------------------
✅ Input: 'Hey Alice, can you help?' -> ['Alice']
✅ Input: 'Bob, come here quickly.' -> ['Bob']
✅ Input: 'John and Mary, listen up.' -> ['John', 'Mary']
✅ Input: 'I told Alice about the meeting.' -> None
✅ Input: 'I saw Bob yesterday.' -> None

=======================================================
REGEX RESULTS: 6/6 passed
LLM RESULTS:   8/8 passed
=======================================================

Setup

pip install llama-cpp-python

# Download Model (1.1GB)
curl -L -o backend/utils/qwen_1.5b_speaker.gguf \
  https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF/resolve/main/qwen2.5-1.5b-instruct-q4_k_m.gguf

AI Disclosure

Tools used: Cursor / Gemini

…tion Fixes BasedHardware#3039 - Replace regex-based speaker identification with Qwen2.5-1.5B-Instruct LLM - Distinguish between addressed vs mentioned speakers - Support multiple addressees (returns list) - Add GPU acceleration (Metal/CUDA) - Thread-safe singleton pattern - Add README_SPEAKER_ID.md with setup instructions Performance: - 100% accuracy on test suite - ~300ms latency with GPU - Model: qwen_1.5b_speaker.gguf (1.1GB, Apache 2.0)

…tection - Restore detect_speaker_from_text() with original multi-language regex patterns for self-identification (e.g., 'I am Alice', 'My name is Bob') - Keep identify_speaker_from_transcript() for LLM-based addressee detection (e.g., 'Hey Alice, help' -> ['Alice']) - Fix warmup exception handler with proper logging - Both functions now coexist for different use cases

- Add STRICT EXCLUSION RULE in prompt for verbs like told/said/saw/asked - 'I told Alice' now correctly returns null (mentioned, not addressed) - 'Hey Alice, help' still correctly returns ['Alice'] (addressed) - Fixes false positive detection of mentioned names

… logic

gemini-code-assist

Code Review

This pull request introduces a significant improvement by replacing the previous regex-based speaker detection with a more robust, self-hosted LLM solution. The implementation is well-done, incorporating thread-safe model loading, GPU acceleration, and a warmup mechanism to reduce latency. The code is clean, and the inclusion of new documentation and comprehensive tests is excellent. I have one suggestion to improve the performance of the legacy regex function to better align with its stated goal.

backend/utils/speaker_identification.py

beastoin · 2025-12-25T04:26:28Z

Interesting.

@uulasb what do you think is the best model we can use for speaker identification and transcript cleaning?

@thainguyensunya please remind me once we have the self-hosted Llama so we can continue with this ticket.

beastoin · 2025-12-25T04:28:00Z

ah, one more thing, it would be great if you guys could work together to make it happen.

https://github.com/BasedHardware/omi/pull/3817/changes#diff-52e392e28d7c9113854b824355974c705167c7f3e95cc44cd7b8baf360fb849eR26-R45

we need to self-host the model so we can test it on our dev environment first, then move to production later.

thank you.

thainguyensunya · 2025-12-25T13:33:39Z

@uulasb For your information, we will have a separate/external self-hosted LLM inference with OpenAI-compatiable API Server endpoint. Specifically, we will use vLLM for inference, Llama-3.1-8B-Instruct model and 1 x NVIDIA L4 GPU.

This approach supports serving high throughput inference on production and optimize GPU power effectively.
So you may need to modify your code to adapt to this external self-hosted LLM approach.

Please let me know your thought. (Do you think Llama-3.1-8B-Instruct is overkill for speaker identification and transcript cleaning?)

… transcript cleaning - Replace llama-cpp-python with strict AsyncOpenAI client - Add transcript cleaning to system prompt - Update unit tests with AsyncMock and edge cases (12/12 pass) - Update documentation for VLLM_ env vars - Remove local dependencies

uulasb · 2025-12-25T16:39:58Z

@thainguyensunya @beastoin Thanks for the guidance! I completely agree with the move to external vLLM, it makes the backend much lighter and easier to scale. I've just pushed the refactor to match your roadmap.

@thainguyensunya On the 8B model size, If this were just for name detection, I'd agree it's overkill. However, to justify using the L4 GPU, I updated the prompt to also handle Transcript Cleaning in the same pass. It now identifies the speaker AND scrubs filler words ("um", "uh") / fixes grammar simultaneously. We get a much better user experience for the same inference cost, which makes the 8B model a great fit.

@beastoin As requested, I have removed the local GGUF/llama-cpp code entirely. I've switched the backend to use AsyncOpenAI, which clears out the heavy llama-cpp dependencies and keeps the event loop non blocking. I also made sure to keep the original regex function synchronous, so we don't accidentally break any legacy calls.

I verified the logic on Groq (simulating your setup) and it's hitting 300ms.

For Configuration, I standardized the environment variables for your vLLM deployment as follows:

VLLM_API_BASE
VLLM_API_KEY
VLLM_MODEL_NAME

- Moved speaker_identification.py -> text_speaker_detection.py to avoid conflict with upstream audio code - Updated imports in transcribe.py and verify_llama_8b.py - Renamed and updated tests/test_speaker_identification.py -> tests/test_text_speaker_detection.py

…ection

uulasb · 2026-01-01T15:25:21Z

@thainguyensunya @beastoin I noticed main recently introduced a new "speaker_identification.py" for audio embedding logic. To resolve the merge conflict and keep concerns separate, I have renamed my module to "backend/utils/text_speaker_detection.py". This ensures my vLLM/Text logic coexists cleanly with your new Audio logic without overwriting it.

uulasb added 6 commits December 17, 2025 13:55

feat(deps): Add .gitignore and requirements.txt for project cleanup

8e7ccf6

fix(.gitignore): Restore original and append *.gguf rule

50c3af5

test(speaker-id): Add comprehensive unit tests covering regex and LLM…

5934c2a

… logic

gemini-code-assist bot reviewed Dec 17, 2025

View reviewed changes

backend/utils/speaker_identification.py Outdated Show resolved Hide resolved

uulasb added 2 commits December 17, 2025 14:13

perf(speaker-id): Pre-compile regex patterns at module load

d53e22a

docs(speaker-id): Add testing instructions to README

a22d2fb

uulasb added 3 commits January 1, 2026 18:13

Merge main: resolve conflict by renaming to text_speaker_detection.py

922c93b

docs(speaker-id): update README to reflect rename to text_speaker_det…

7d03317

…ection

Merge branch 'main' into feature/fix-3039-speaker-id

bbc67ea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature/fix 3039 speaker #3817

Feature/fix 3039 speaker #3817

uulasb commented Dec 17, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

beastoin commented Dec 25, 2025

Uh oh!

beastoin commented Dec 25, 2025

Uh oh!

thainguyensunya commented Dec 25, 2025 •

edited

Loading

Uh oh!

uulasb commented Dec 25, 2025

Uh oh!

uulasb commented Jan 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Feature/fix 3039 speaker #3817

Are you sure you want to change the base?

Feature/fix 3039 speaker #3817

Conversation

uulasb commented Dec 17, 2025

Fixes #3039

Summary

Changes

Core Functionality

Performance & Reliability

Documentation & Testing

Verification Results

Setup

AI Disclosure

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

beastoin commented Dec 25, 2025

Uh oh!

beastoin commented Dec 25, 2025

Uh oh!

thainguyensunya commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

uulasb commented Dec 25, 2025

Uh oh!

uulasb commented Jan 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

thainguyensunya commented Dec 25, 2025 •

edited

Loading

uulasb commented Jan 1, 2026 •

edited

Loading