Skip to content

Conversation

@michalkulakowski
Copy link
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.
    ``

@michalkulakowski michalkulakowski changed the title Add speaker ambeddings support Add speaker embeddings support Dec 11, 2025
@michalkulakowski michalkulakowski force-pushed the mkulakow/support_voice_embeddings branch from ec35a1a to 4f45830 Compare January 12, 2026 12:32
@dtrawins
Copy link
Collaborator

Can you include the unit tests?

TtsServable(const mediapipe::T2sCalculatorOptions& nodeOptions, const std::string& graphPath) {
auto fsModelsPath = std::filesystem::path(nodeOptions.models_path());
TtsServable(const std::string& modelDir, const std::string& targetDevice, const google::protobuf::RepeatedPtrField<mediapipe::T2sCalculatorOptions_SpeakerEmbeddings>& graphVoices, const std::string& pluginConfig, const std::string& graphPath) {
auto fsModelsPath = std::filesystem::path(modelDir);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this constructor to cpp? this way you could move read_speaker_embedding to cpp as well.

input.seekg(0, std::ios::beg);

// Check size is multiple of float
OPENVINO_ASSERT(buffer_size % sizeof(float) == 0, "File size is not a multiple of float size.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it might be the first time we use openvino_assert
why are we introducing yet another way of reporting errors?
for now we have:

  • ovms::Status
  • absl::Status
  • std::variant<Obj, std::string(error)>
  • and now this... :(

for (auto voice : graphVoices) {
if (!std::filesystem::exists(voice.path()))
throw std::runtime_error{"Requested voice speaker embeddings file does not exist."};
voices[voice.name()] = read_speaker_embedding(voice.path());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we catch exceptions thrown inside read_speaker_embedding?

@michalkulakowski michalkulakowski force-pushed the mkulakow/support_voice_embeddings branch from 0baece7 to aa8b340 Compare January 20, 2026 13:08

Instead of generating speech with default model voice you can create speaker embeddings with [this script](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py)
```bash
curl --output create_speaker_embedding.py "https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/master/samples/python/speech_generation/create_speaker_embedding.py"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yet another link for manual change, it wouldnt even hit sed command which i use because its genai and master
do we want to keep this reference to master and risk of change and getting outdated demos? or do we want to hardcode it? or maybe do we want to keep it in our repo?

if [ -f "$1/$TTS_MODEL/$TOKENIZER_FILE" ]; then
echo "Model file $1/$TTS_MODEL/$TOKENIZER_FILE exists. Skipping downloading models."
else
python3 demos/common/export_models/export_model.py text2speech --source_model "$TTS_MODEL" --weight-format int4 --model_repository_path $1 --vocoder microsoft/speecht5_hifigan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about windows script?

"input": "The quick brown fox jumped over the lazy dog."
}
)";
ASSERT_EQ(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests check only "OK" and no errors, but what about actual output?

@mzegla mzegla requested a review from Copilot January 21, 2026 09:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds speaker embeddings support to the text-to-speech (TTS) functionality, allowing users to customize the voice used for speech generation by providing speaker embedding files.

Changes:

  • Added configuration and validation for speaker embeddings in TTS calculator
  • Refactored test infrastructure to reuse common test base class across multiple test files
  • Added comprehensive test coverage for TTS functionality including speaker embeddings

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
windows_prepare_llm_models.bat Added TTS model download configuration with vocoder
src/test/test_http_utils.hpp Extracted V3HttpTest base class for reuse across test files
src/test/reranknode_test.cpp Refactored to use extracted V3HttpTest base class
src/test/embeddingsnode_test.cpp Refactored to use extracted V3HttpTest base class
src/test/audio/text2speech_test.cpp Added comprehensive tests for TTS functionality including speaker embeddings
src/test/audio/graph.pbtxt Added test graph configuration for TTS with speaker embeddings
src/test/audio/config.json Added test configuration for TTS mediapipe graph
src/mediapipe_internal/mediapipegraphdefinition.cpp Added error handling for TTS servable initialization
src/audio/text_to_speech/t2s_servable.hpp Refactored to use constructor parameters instead of inline initialization
src/audio/text_to_speech/t2s_servable.cpp Implemented constructor with speaker embeddings loading logic
src/audio/text_to_speech/t2s_calculator.proto Added SpeakerEmbeddings message definition
src/audio/text_to_speech/t2s_calculator.cc Added voice parameter handling and speaker embeddings usage
src/audio/text_to_speech/BUILD Updated build dependencies for new implementation
src/BUILD Added text2speech_test to build targets
prepare_llm_models.sh Added TTS model download script for Linux
demos/audio/README.md Added documentation for speaker embeddings feature

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

voice = voiceIt->value.GetString();
}
if (voice.has_value()) {
if (pipe->voices.find(voice.value()) == pipe->voices.end())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are voices static at this point? Don't we need to lock it for the thread as well for the access?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalkulakowski whats the answer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

answer is yes, they are static, loaded once during pipeline initialization

throw std::runtime_error("File size is not a multiple of float size.");
}
size_t num_floats = buffer_size / sizeof(float);
if (num_floats != 512) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 512?
Can we have more context for this number for future reference?
Is it like embedding dim size or something?

if (streamIt != payload.parsedJson->MemberEnd()) {
return absl::InvalidArgumentError("streaming is not supported");
}
std::optional<std::string> voice;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::optional<std::string> voice;
std::optional<std::string> voiceName;

?
To distinguish from pipe->voices elements

if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) {
voice = voiceIt->value.GetString();
}
if (voice.has_value()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need separate condition here? Can we bring content from this block to the condition above?
if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) I suppose it's the same.

michalkulakowski and others added 5 commits January 21, 2026 10:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@michalkulakowski michalkulakowski force-pushed the mkulakow/support_voice_embeddings branch 2 times, most recently from a93b865 to 2452659 Compare January 21, 2026 12:21
@michalkulakowski michalkulakowski merged commit c9a9610 into main Jan 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants