-
Notifications
You must be signed in to change notification settings - Fork 237
Add speaker embeddings support #3855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ec35a1a to
4f45830
Compare
|
Can you include the unit tests? |
| TtsServable(const mediapipe::T2sCalculatorOptions& nodeOptions, const std::string& graphPath) { | ||
| auto fsModelsPath = std::filesystem::path(nodeOptions.models_path()); | ||
| TtsServable(const std::string& modelDir, const std::string& targetDevice, const google::protobuf::RepeatedPtrField<mediapipe::T2sCalculatorOptions_SpeakerEmbeddings>& graphVoices, const std::string& pluginConfig, const std::string& graphPath) { | ||
| auto fsModelsPath = std::filesystem::path(modelDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this constructor to cpp? this way you could move read_speaker_embedding to cpp as well.
| input.seekg(0, std::ios::beg); | ||
|
|
||
| // Check size is multiple of float | ||
| OPENVINO_ASSERT(buffer_size % sizeof(float) == 0, "File size is not a multiple of float size."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it might be the first time we use openvino_assert
why are we introducing yet another way of reporting errors?
for now we have:
- ovms::Status
- absl::Status
- std::variant<Obj, std::string(error)>
- and now this... :(
| for (auto voice : graphVoices) { | ||
| if (!std::filesystem::exists(voice.path())) | ||
| throw std::runtime_error{"Requested voice speaker embeddings file does not exist."}; | ||
| voices[voice.name()] = read_speaker_embedding(voice.path()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we catch exceptions thrown inside read_speaker_embedding?
0baece7 to
aa8b340
Compare
|
|
||
| Instead of generating speech with default model voice you can create speaker embeddings with [this script](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/speech_generation/create_speaker_embedding.py) | ||
| ```bash | ||
| curl --output create_speaker_embedding.py "https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/master/samples/python/speech_generation/create_speaker_embedding.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yet another link for manual change, it wouldnt even hit sed command which i use because its genai and master
do we want to keep this reference to master and risk of change and getting outdated demos? or do we want to hardcode it? or maybe do we want to keep it in our repo?
| if [ -f "$1/$TTS_MODEL/$TOKENIZER_FILE" ]; then | ||
| echo "Model file $1/$TTS_MODEL/$TOKENIZER_FILE exists. Skipping downloading models." | ||
| else | ||
| python3 demos/common/export_models/export_model.py text2speech --source_model "$TTS_MODEL" --weight-format int4 --model_repository_path $1 --vocoder microsoft/speecht5_hifigan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about windows script?
| "input": "The quick brown fox jumped over the lazy dog." | ||
| } | ||
| )"; | ||
| ASSERT_EQ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests check only "OK" and no errors, but what about actual output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds speaker embeddings support to the text-to-speech (TTS) functionality, allowing users to customize the voice used for speech generation by providing speaker embedding files.
Changes:
- Added configuration and validation for speaker embeddings in TTS calculator
- Refactored test infrastructure to reuse common test base class across multiple test files
- Added comprehensive test coverage for TTS functionality including speaker embeddings
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| windows_prepare_llm_models.bat | Added TTS model download configuration with vocoder |
| src/test/test_http_utils.hpp | Extracted V3HttpTest base class for reuse across test files |
| src/test/reranknode_test.cpp | Refactored to use extracted V3HttpTest base class |
| src/test/embeddingsnode_test.cpp | Refactored to use extracted V3HttpTest base class |
| src/test/audio/text2speech_test.cpp | Added comprehensive tests for TTS functionality including speaker embeddings |
| src/test/audio/graph.pbtxt | Added test graph configuration for TTS with speaker embeddings |
| src/test/audio/config.json | Added test configuration for TTS mediapipe graph |
| src/mediapipe_internal/mediapipegraphdefinition.cpp | Added error handling for TTS servable initialization |
| src/audio/text_to_speech/t2s_servable.hpp | Refactored to use constructor parameters instead of inline initialization |
| src/audio/text_to_speech/t2s_servable.cpp | Implemented constructor with speaker embeddings loading logic |
| src/audio/text_to_speech/t2s_calculator.proto | Added SpeakerEmbeddings message definition |
| src/audio/text_to_speech/t2s_calculator.cc | Added voice parameter handling and speaker embeddings usage |
| src/audio/text_to_speech/BUILD | Updated build dependencies for new implementation |
| src/BUILD | Added text2speech_test to build targets |
| prepare_llm_models.sh | Added TTS model download script for Linux |
| demos/audio/README.md | Added documentation for speaker embeddings feature |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| voice = voiceIt->value.GetString(); | ||
| } | ||
| if (voice.has_value()) { | ||
| if (pipe->voices.find(voice.value()) == pipe->voices.end()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are voices static at this point? Don't we need to lock it for the thread as well for the access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michalkulakowski whats the answer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answer is yes, they are static, loaded once during pipeline initialization
| throw std::runtime_error("File size is not a multiple of float size."); | ||
| } | ||
| size_t num_floats = buffer_size / sizeof(float); | ||
| if (num_floats != 512) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 512?
Can we have more context for this number for future reference?
Is it like embedding dim size or something?
| if (streamIt != payload.parsedJson->MemberEnd()) { | ||
| return absl::InvalidArgumentError("streaming is not supported"); | ||
| } | ||
| std::optional<std::string> voice; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::optional<std::string> voice; | |
| std::optional<std::string> voiceName; |
?
To distinguish from pipe->voices elements
| if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) { | ||
| voice = voiceIt->value.GetString(); | ||
| } | ||
| if (voice.has_value()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need separate condition here? Can we bring content from this block to the condition above?
if (voiceIt != payload.parsedJson->MemberEnd() && voiceIt->value.IsString()) I suppose it's the same.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
a93b865 to
2452659
Compare
🛠 Summary
JIRA/Issue if applicable.
Describe the changes.
🧪 Checklist
``