feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths by freddy311082 · Pull Request #19 · tetherto/qvac-ext-lib-whisper.cpp

freddy311082 · 2026-05-14T23:35:27Z

🎯 What problem does this PR solve?

tts-cpp only supported 18 tier-1 languages; 5 tier-2 languages (ja, he, ru, zh, hi) were missing
MeCab dictionary path for Japanese was hardcoded to /opt/homebrew/lib/mecab/dic/ipadic, breaking cross-platform builds
Cangjie TSV path for Chinese was baked in at compile time, same problem

📝 How does it solve it?

Add text_preprocess.h/.cpp with Japanese (MeCab), Chinese (Cangjie), and Korean (Jamo) preprocessing
Expand supported_languages() from 18 to 23
Replace compile-time dictionary paths with runtime resolution: EngineOptions::mecab_dict_path / cangjie_tsv_path, CLI --mecab-dict / --cangjie-tsv, env var fallbacks CHATTERBOX_MECAB_DICT / CHATTERBOX_CANGJIE_TSV
Add scripts/build_mecab_dict.py to materialize IPAdic from pip (same pattern as tts-onnx)
Add 23-language CTest integration tests (L0 registry, L1 wav structure, L2 audio sanity)
Add optional L3 ASR verification tests via whisper-cli

🧪 How was it tested?

All 23 languages pass L0/L1/L2 synthesis tests
L3 ASR matrix: 22/23 pass, zh fails (CER ~97% — Cangjie approach needs rework)

🔌 API Changes

struct EngineOptions {
    // ...existing fields...
    std::string mecab_dict_path;   // runtime path to MeCab IPAdic dir (for ja)
    std::string cangjie_tsv_path;  // runtime path to Cangjie5_TC.tsv (for zh)
};

- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0

QVAC-7457: Add seed parameter for reproducible sampling

add_codeowners file

added approval check worker

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0

chore: rebase fork to whisper.cpp v1.8.4

Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor

- Add n_audio_window_size and n_audio_last_window_layer hparams - When present, encoder self-attention is restricted to a local window for layers up to last_window_layer - Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder - Populate window_mask data on the encoder graph (not the cross graph) - Add proper SOS token (language + transcribe) initialization for BCI models Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models. Made-with: Cursor

Made-with: Cursor

Address review feedback: 1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind a n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields — reading them unconditionally would corrupt the file position and break model loading. 2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading. 3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization. 4. Log BCI-specific hparams when a BCI model is detected. Made-with: Cursor

The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor

…, Threads 1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor. 2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h. 3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.

…ash attention 1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call. 2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder. 3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.

…alidation 1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift. 2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active. 3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths. 4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage. 5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks. Made-with: Cursor

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)

@gianni-cor

…review) Address @gianni-cor review on PR #11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy. - parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated. - parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix. Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively. Co-authored-by: Cursor <cursoragent@cursor.com>

Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml

The standalone setup-ggml.sh + patches/ tooling was dropped from qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the CMakeLists.txt still: * defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and * unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-" on the bundled ggml target. That combination quietly broke standalone bundled-ggml builds: the filename-prefix patch was no longer applied, so libspeech-ggml-*.so files existed on disk but ggml's runtime loader still searched for libggml-*.so under GGML_BACKEND_DL=ON. Vulkan / OpenCL / CUDA backends silently failed to load on Android. Fix per reviewer guidance: converge the speech stack on a single ggml source-of-truth. Standalone-bundled-ggml is no longer a supported build mode out of this in-tree subtree; the canonical path is `-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack `ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships the patches pre-applied. Edits: - TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this tree. Docstring spells out the rationale + points users at the standalone github.com/gianni-cor/chatterbox.cpp repo if they need a bundled-ggml dev build with patches/ present. - The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to configure when patches/ is absent: a FATAL_ERROR points at the right consumption path (vcpkg ggml-speech) and the standalone fallback. Doesn't break in-tree-with-patches builds (parakeet-cpp in this same repo still ships patches/, so its bundled path is unaffected by this guard inside tts-cpp). - Verified locally: `cmake -S tts-cpp -B build` (no flags) errors out at find_package(ggml CONFIG REQUIRED) with our new message pointing at the ggml-speech port; `cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard with the no-patches message. - tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/ that no longer exist; running it would have errored out anyway. The standalone repo keeps its own setup-ggml.sh; only the in-tree subtree drops it. The standalone chatterbox.cpp repo (the one tts-cpp/ was copied from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/ folder + scripts/setup-ggml.sh. This commit is therefore an integration-time delta against that source, not a change to the standalone build flow. Co-authored-by: Cursor <cursoragent@cursor.com>

…_token Mirrors src/chatterbox_cli.cpp's MTL tokenisation path and the Python ChatterboxMultilingualTTS.generate reference (chatterbox-ref/src/chatterbox/ mtl_tts.py:288-291). The MTL T3 prompt graph anchors position 0 on start_text_token (255); without it the autoregressive decode drops the first speech tokens, audible as a missing leading syllable ("Hello" -> "lo from the multilingual"). Turbo (gpt2_bpe) is unaffected and keeps the existing single-line tokenise + punc_norm path. Co-authored-by: Cursor <cursoragent@cursor.com>

…gine::run_t3 MTL T3 occasionally emits a plausible end-of-speech silence cadence (three identical tokens in a row) mid-utterance and then hallucinates low-energy content -- silence, hissing, garbage tokens -- until n_predict (1000) is reached, producing ~40 s of trailing junk on a short input. chatterbox_cli.cpp already guards against this via the AlignmentStreamAnalyzer token_repetition port, but Engine::run_t3 was missing the same check, so the addon path (which doesn't go through the CLI) saw the regression on whichever language/seed combinations happen to hit the cadence (most reliably reproduced on German with the default seed=42). Mirrors the CLI's existing guard 1:1, gated on is_mtl since the Turbo codebook has a different cadence signature. Co-authored-by: Cursor <cursoragent@cursor.com>

Add tts-cpp/ subtree (Chatterbox Turbo + Multilingual + Supertonic TTS) + integration fixes

Two interleaved chatterbox concurrency fixes for iOS, collapsed into one history-preserving commit on top of the upstream merge so the qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604 SHA (no chained port-version bumps). 1) gguf_init_from_file race (the SIGABRT seen before this commit): bake_voice_conditioning() must run BEFORE we spawn the s3gen preload thread. Both paths funnel into gguf_init_from_file() (voice_encoder opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the ggml_init / gguf_init_from_file pair underneath is not safe to invoke concurrently from two threads against ggml's process-global state. Empirically races on Apple Silicon with a fast SIGABRT inside ggml_abort coming from the preload thread's ggml_init while the main thread is still executing voice_encoder_load. 2) Metal shared-buffer-type init race (the SIGSEGV in ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the preload thread spawns we now block on wait_for_preload() before the constructor returns, so the SDK e2e bootstrap's load -> immediate unload pattern ("preLoadUnload") can no longer tear down the engine while s3gen_preload is still inside ggml_backend_metal_buffer_type_shared_alloc_buffer -> ggml_metal_buffer_is_shared. Defeats the parallel-preload optimisation (s3gen_preload no longer overlaps with first T3 inference inside synthesize()); revisit once ggml-metal's shared buffer-type init is safe to use from a preload thread concurrent with construction-time teardown. Together these two changes unblock chatterbox load on iPhone 16e (iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal enabled — qvac/pull/1992. Co-authored-by: Cursor <cursoragent@cursor.com>

…and Cangjie for Chinese (need more review) Add runtime MeCab/Cangjie dictionary paths for cross-platform multilingual TTS Replace compile-time hardcoded Homebrew paths with runtime dictionary resolution via CLI flags, EngineOptions, and env var fallbacks (CHATTERBOX_MECAB_DICT, CHATTERBOX_CANGJIE_TSV).

… handling

GustavoA1604

Need to check the following. Probably can just feed to cursor and it will be able to take care

Should zh indeed be in supported_languages()? The PR description states "zh fails (CER ~97% — Cangjie approach needs rework". Also on windows the zh test crashed on windows (see blocker #3)
generated.resize(n - 2) in the repetition early-stop trims too aggressively
In chatterbox_cli.cpp, the 3-token early-stop was changed to 2-token with a kMtlMinTokensBeforeCadence = 60 guard. But generated.resize(n - 2) removes both the repeated token and the legitimate one before it. The old code didn't resize at all — the downstream pop_back() handled the trailing token. This should be resize(n - 1). Add a comment explaining the intent.
preprocess_japanese/preprocess_chinese exceptions cause std::terminate instead of a clean error exit. Both ja and zh crash rather than printing an error and returning 1. Root cause: s3gen_preload_thread (line 1031 of chatterbox_cli.cpp) is spawned when --s3gen-gguf is provided. When a preprocessing exception propagates out of the try block during stack unwind, this std::thread's destructor is called while it is still joinable, which unconditionally calls std::terminate(). The codebase has a comment at line 1184 explicitly noting this hazard, and guards against it in the --input-file path — but the normal synthesis path is unguarded. Fix: wrap s3gen_preload_thread in a RAII join guard, or join it in the catch block before rethrowing.
--output flag in test_multilingual_synth.cpp::run_synthesis() does not exist in the CLI

if (!args.tokens_dump_path.empty()) {
cli_args.push_back("--output"); // no such CLI arg
cli_args.push_back(args.tokens_dump_path);
}
chatterbox_cli.cpp has --out for the wav file, no --output. If --dump-tokens is passed to the test binary, tts_cpp_cli_main hits the unknown-argument handler and returns failure. Either remove the dead branch or add the matching CLI flag.

create_mecab_tagger does not pass -r, so a broken system mecabrc silently kills ja synthesis

Confirmed by local testing: MeCab reads the system mecabrc before honoring -d. On any host where the system rc points to a non-existent dictionary (e.g. MSYS2's default naist-jdic that isn't installed), mecab_new returns nullptr and MeCabTagger::load throws with an empty error message. Fix by passing the materialized mecabrc:

mecab_t * create_mecab_tagger(const std::string & dic_str) {
std::string a0 = "mecab";
std::string a1 = "-d";
std::string a2 = dic_str;
std::string a3 = "-r";
std::string a4 = (std::filesystem::path(dic_str) / "mecabrc").string();
char * argv[] = { a0.data(), a1.data(), a2.data(), a3.data(), a4.data() };
return mecab_new(5, argv);
}
build_mecab_dict.py already copies a neutral mecabrc (; This file can be empty but it needs to exist), so this is self-contained.

target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR}) must be PRIVATE

Confirmed by build: PUBLIC propagates the entire MinGW include/ directory to every consumer of tts-cpp. On a Windows Clang-cl build this causes a catastrophic cascade of MinGW vs MSVC header conflicts (hundreds of errors) across every translation unit that includes any tts-cpp header. mecab.h is an implementation detail of text_preprocess.cpp — no public header exposes MeCab types.

was:
target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR})
fix:
target_include_directories(tts-cpp PRIVATE ${MECAB_INCLUDE_DIR})

set_mecab_dict_path / set_cangjie_tsv_path — silent no-op if called after first synthesis

The std::call_once blocks in mtl_tokenizer.cpp capture the global path at first call. If set_mecab_dict_path is called after the first ja synthesis (e.g. constructing a second Engine with a different dict), the once_flag is spent and the new path is silently ignored. At minimum document this constraint on the setters; ideally assert or warn.

Korean (ko) skips decompose_korean_to_jamo — intentional?

apply_language_preprocessing dispatches on "ja" and "zh" but not "ko". The new decompose_korean_to_jamo function exists but is never called. If ko already works without Jamo decomposition this is fine, but it needs a comment explaining why the function exists without being called here.

Duplicate UTF-8 decoder

test_multilingual_asr.cpp contains its own utf8_to_codepoints (lines ~14–40) that duplicates decode_utf8 from text_preprocess.h, which is already a public header in the same library. The ASR test should #include "text_preprocess.h" and use the shared function.

…only once and improving token handling

GustavoA1604

The changes look good now. The only thing left is that there is no acquisition script for the Cangjie TSV (unlike MeCab, which has build_mecab_dict.py). After some investigation the canonical source is Jackchows/Cangjie5 — the Cangjie5_TC.txt file is already TAB-separated, so it's largely compatible with CangjieTable::load() already. I've written a build_cangjie_tsv.py script following the same pattern as build_mecab_dict.py

Also Cangjie5_TC.txt has ~2,529 lines with a third annotation column, e.g.:

昸 ahem [u]
parse_tsv_line uses line.find('\t') to find only the first tab, then sets row.code = line.substr(tab + 1). For annotated lines this gives code = "ahem\t[u]", which would emit tokens like [cj_a][cj_h][cj_e][cj_m] followed by a literal \t and [u] — breaking those characters. Fix is one line in text_preprocess.cpp:

row.code = line.substr(tab + 1);
// strip optional annotation column
const auto tab2 = row.code.find('\t');
if (tab2 != std::string::npos) row.code = row.code.substr(0, tab2);

build_cangjie_tsv.py strips the annotation at conversion time (so the TSV written to disk is already clean), but parse_tsv_line should still be hardened for any raw files callers might point at.

Resolves 37 add/add conflicts that accumulated since the last master merge (May 7). Master moved 326 commits forward, mainly landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend registry refactor (`backend_selection.{h,cpp}`, registry-only device walk replacing the per-backend `#ifdef GGML_USE_<X>` cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the `backends_dir` / `opencl_cache_dir` Engine knobs. Resolution strategy: - parakeet-cpp/ (19 files): taken from master verbatim. The PR branch only carried the original port (commits d7ab516 / c6c3fd7 / 761eca0, all <= May 7); master has 13 newer commits including TDT/EOU/Sortformer v2.1 + AOSC and the word-start signal already integrated. Nothing of the PR was lost on this side. - .github/CODEOWNERS: taken from master (team reorg to `qvac-internal-dev` / `qvac-internal-merge`). - tts-cpp/ stale-from-initial-drop (7 files: voice_encoder, t3_mtl, s3tokenizer, mel_extract_stft, main, campplus, campplus_forward.inc): taken from master. Their only PR commit is the original `ef840d5c Add tts-cpp files` drop; master has since rewritten them for the registry refactor. - tts-cpp/ mirror-only (4 files: supertonic/engine.h, supertonic_engine, supertonic_gguf, chatterbox_tts): taken from master. The PR's only authored commits on these mirror pre-existing fixes from chatterbox.cpp that are already on master. - tts-cpp/CMakeLists.txt: hybrid merge. Master's Android dynamic-backend stack, registry-only backend-defs interface (with `src/backend_selection.cpp` in the source list), and `target_compile_definitions(test-metal-ops PRIVATE GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp` source entry, MeCab/Cangjie find_library block (PRIVATE include per gianni-cor review), and 23-language multilingual test matrix retained. - tts-cpp/include/tts-cpp/chatterbox/engine.h: master's updated `n_gpu_layers` doc (Adreno-tier policy) and new `backends_dir` / `opencl_cache_dir` fields retained. PR's `mecab_dict_path` / `cangjie_tsv_path` fields retained. - tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` + `text_preprocess.h` includes, 23 supported_languages, preprocess_japanese / preprocess_chinese helpers with call_once-cached MeCab tagger + Cangjie table, apply_language_preprocessing dispatch, and `set_mecab_dict_path` / `set_cangjie_tsv_path` setters (with already-initialised warn) retained. Master's `// ---- Encode ----` divider kept. - tts-cpp/src/chatterbox_engine.cpp: master's `#include "backend_selection.h"` and `backends_dir` / `opencl_cache_dir` wiring retained. PR's per-Engine `mtl_tokenizer::set_mecab_dict_path` / `set_cangjie_tsv_path` calls retained. - tts-cpp/src/chatterbox_cli.cpp: master's removal of the per-backend `#include "ggml-{cuda,metal,vulkan}.h"` cascade (registry-only refactor) and the new voice-cloning backend comment retained. PR's `--mecab-dict` / `--cangjie-tsv` flags (declaration, help, parsing, and per-Engine setter call) retained. PR's RAII `thread_join_guard` on the s3gen preload thread retained (addresses GustavoA1604 review #3: std::terminate hazard during stack unwind). PR's 2-token MTL early-stop with `kMtlMinTokensBeforeCadence = 60` guard and `generated.resize(n - 1)` retained (addresses GustavoA1604 review #2: previous over-aggressive `resize(n - 2)` trimmed a legitimate token); the log line was updated to surface the repeated token id. PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp}, tts-cpp/scripts/build_mecab_dict.py, tts-cpp/scripts/build_cangjie_tsv.py, tts-cpp/test/test_multilingual_{synth,asr}.cpp are all preserved as-is by the merge. Co-authored-by: Cursor <cursoragent@cursor.com>

…ng issues

freddy311082 · 2026-05-29T22:35:54Z

@GustavoA1604 CI here: https://github.com/tetherto/qvac/actions/runs/26643991409

ogad-tether

Review — omar.gad@tether.io

I reviewed the full diff (16 files, +2080/−17) against the prior review threads from @GustavoA1604. All 9 items from the first review and both items from the second review appear to be addressed — nice work. Below are new findings.

Blockers

1. Data race on global path strings (`mtl_tokenizer.cpp`)

g_mecab_dict_path and g_cangjie_tsv_path are bare std::string globals written by set_mecab_dict_path() / set_cangjie_tsv_path() and read by resolve_mecab_dict_path() / resolve_cangjie_tsv_path(). The std::call_once guards protect the tagger/table initialization, but the path strings themselves are read/written without synchronization. If two Engine instances are constructed on different threads (plausible in the addon), this is a data race — undefined behavior per the C++ standard.

Fix options:

Protect the globals with a std::mutex, or
Move the path into the std::call_once lambda's capture (snapshot at set time), or
Use std::atomic<std::string*> with a publish pattern.

The simplest fix is probably a std::mutex guarding both the path and the initialized flag together.

2. `TTS_CPP_HAS_MECAB` compile definition should be `PRIVATE`, not `PUBLIC`

target_compile_definitions(tts-cpp PUBLIC TTS_CPP_HAS_MECAB)

MeCabTagger lives in src/text_preprocess.h — an internal header, not under include/. No public header exposes MeCab types. Making this PUBLIC leaks an implementation detail to every consumer of tts-cpp and can cause surprising behavior if a consumer happens to #include "text_preprocess.h" transitively. Same rationale as GustavoA1604's review #6 (include dirs) — this should be PRIVATE.

3. MeCab link library should be `PRIVATE`, not `PUBLIC`

target_link_libraries(tts-cpp PUBLIC ${MECAB_LIBRARY})

MeCab is only used in text_preprocess.cpp. No public header exposes MeCab symbols. PUBLIC linkage means install(EXPORT) records the dependency, and every consumer doing find_package(tts-cpp) must also have libmecab installed — even if they never synthesize Japanese. This should be PRIVATE.

Medium

4. `build_cangjie_tsv.py` defaults to `master` — not reproducible

DEFAULT_TAG = "master"

The MeCab script (build_mecab_dict.py) pins to a specific PyPI package version via ipadic. The Cangjie script downloads from Jackchows/Cangjie5 at whatever master happens to be. If upstream changes column format or renames the file, builds silently break. Pin to a specific commit or tag (e.g., --tag v1.0.0 or a commit SHA) as the default, with master available as an override.

5. `text_preprocess.h` includes `<fstream>` unnecessarily

The header declares CangjieTable::read_tsv_entries(std::ifstream &, ...) which requires <fstream>. But this is a private method — it doesn't need to be in the header at all. Consider moving it to the .cpp file (or making it a free function in the anonymous namespace), which lets you drop <fstream> from the header and reduces include bloat for every TU that includes this header.

Nits

6. CER metric doesn't match standard definition

In test_multilingual_asr.cpp:

const size_t denom = std::max(exp_n.size(), got_n.size());

Standard CER uses ref_len (expected length) as the denominator, not max(ref, hyp). Using max caps the metric at 1.0, which is convenient for thresholding but makes the logged "CER" values incomparable with published benchmarks. Consider renaming to "normalized edit distance" or switching to exp_n.size() as denominator. Not a blocker since this is test-only, but the log output says "CER" which could confuse someone comparing against paper numbers.

7. `sh_quote()` in ASR test is POSIX-only

test_multilingual_asr.cpp uses POSIX single-quote escaping + std::system(). On Windows this won't work (cmd.exe quoting is different). Fine if the L3 tests are POSIX-only, but worth a comment or a #ifdef _WIN32 skip.

8. Parakeet patches in this PR

The diff includes 4 new files under parakeet-cpp/patches/ (+659 lines). These appear to have come through the merge commit with master. They're unrelated to the multilingual feature — just flagging in case the intent was a clean feature-only diff for review purposes. If they were already reviewed and merged via another PR, no action needed.

Summary

The core multilingual implementation is solid — the MeCab/Cangjie architecture, runtime path resolution with env-var fallbacks, the tiered test matrix, and the zh exclusion are all well-thought-out. The three blockers above are focused on thread safety and build hygiene; they should be straightforward to fix. Happy to re-review once those land.

Blockers: - mtl_tokenizer: guard global MeCab/Cangjie path strings + initialized flags with a mutex. The std::call_once flags only serialized tagger/ table init; the path globals were read by resolve_*()/written by set_*() unsynchronized, a data race when two Engines are constructed on different threads. set_*/resolve_* and the in-call_once flag write now take path_mutex(); the flag write is scoped so it releases before resolve_* re-locks (no deadlock on the non-recursive mutex). - CMakeLists: TTS_CPP_HAS_MECAB compile definition PUBLIC -> PRIVATE. - CMakeLists: MECAB_LIBRARY link PUBLIC -> PRIVATE. MeCab is an impl detail of text_preprocess.cpp; no public header exposes its types or symbols, so consumers of tts-cpp no longer inherit a libmecab dep. Medium: - build_cangjie_tsv.py: default ref master -> pinned commit b76ed7fd4b8365529c056096b6a3a2ba749f5a40 for reproducible builds (Jackchows/Cangjie5 has no tags); master still available via --tag. - text_preprocess.h: drop <fstream> (-> <iosfwd>) and take the private read_tsv_entries() by std::istream& instead of std::ifstream& so the header no longer pulls <fstream> into every consumer TU. Nits: - test_multilingual_asr: CER denominator max(ref,hyp) -> ref_len so the metric matches the standard definition and stays comparable with published benchmarks. - test_multilingual_asr: document that sh_quote()/run_asr() are POSIX-only.

ogad-tether

Re-review — omar.gad@tether.io

Commit 63fc0c93 addresses all 7 items from my first review. Verified each fix:

#	Issue	Status
1	Data race on global path strings	✅ `path_mutex()` guards all reads/writes
2	`TTS_CPP_HAS_MECAB` PUBLIC → PRIVATE	✅
3	MeCab link PUBLIC → PRIVATE	✅
4	`build_cangjie_tsv.py` unpinned default	✅ Pinned to `b76ed7fd`
5	`text_preprocess.h` includes `<fstream>`	✅ Now `<iosfwd>` + `std::istream &`
6	CER denominator non-standard	✅ Now `exp_n.size()`
7	`sh_quote` POSIX-only undocumented	✅ Comment added

One remaining nit (non-blocking)

g_mecab_initialized is set before the tagger is actually initialized — misleading warning in a narrow race window

In mecab_tagger():

std::call_once(once, [] {
    {
        std::lock_guard<std::mutex> lk(path_mutex());
        g_mecab_initialized = true;   // ← set HERE
    }
    const std::string path = resolve_mecab_dict_path();  // ← but init happens AFTER
    ...
    t->load(path);
});

If Thread B calls set_mecab_dict_path("new_path") in the window between the flag being set and resolve_mecab_dict_path() being called:

Thread B sees g_mecab_initialized == true, prints "new path will be ignored"
Thread B then sets g_mecab_dict_path = "new_path"
Thread A calls resolve_mecab_dict_path(), reads the new path, and uses it

So the warning says "will be ignored" but the path IS used. The fix is to move g_mecab_initialized = true (and the Cangjie equivalent) to the end of the call_once lambda, after t->load() completes. That way:

set_*() before any mecab_tagger() call → sets path, no warning, tagger uses it ✅
set_*() after mecab_tagger() completes → flag is true, warning fires, path genuinely ignored ✅
set_*() during call_once (inherently racy by the caller) → no flag yet, no misleading warning ✅

Same pattern for cangjie_table().

This is a very narrow window and the practical impact is a spurious warning message (not UB), so it's non-blocking — but it's a quick one-line move that makes the semantics correct.

Verdict

Everything else looks good. The mutex pattern is correct (no deadlocks — the scoped lock releases before resolve_*() re-acquires, and std::call_once prevents re-entrant init). The thread_join_guard RAII is correct (safe on default-constructed thread). The 2-token early-stop with kMtlMinTokensBeforeCadence = 60 and resize(n-1) is correct. The Cangjie strip_trailing_tsv_columns hardening handles annotated lines properly.

Ship it once the flag-ordering nit above is addressed (or documented as intentional).

The g_mecab_initialized / g_cangjie_initialized flags were set at the start of the call_once lambda, before resolve_*_path() + load() ran. In the narrow window between setting the flag and resolving the path, a concurrent set_*_path() would see the flag as true and warn that the new path "will be ignored" -- yet resolve_*_path() could still pick that path up and use it, making the warning incorrect. Move the flag assignment to the end of the lambda, after the load attempt completes, so the "ignored" warning only fires when the path is genuinely too late to take effect. Also restructure to move the unique_ptr into the static only on success instead of constructing-then-resetting on failure. Addresses the remaining non-blocking nit from ogad-tether's re-review on PR #19.

nik and others added 30 commits November 10, 2025 13:02

Add seed parameter for reproducible sampling

80f95ec

- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0

add_codeowners file

e421573

added approval check worker

962c380

Merge pull request #4 from aegioscy/master

71d3308

QVAC-7457: Add seed parameter for reproducible sampling

Merge pull request #5 from shikha-tether/add_codeowners_file

010d8f9

add_codeowners file

Merge pull request #6 from shikha-tether/add_approval_check_worker

eda3d21

added approval check worker

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

06c478e

Merge pull request #7 from Proletter/DEVOPS-916_update_codeowners

65d5225

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

Add seed parameter for reproducible sampling

7ce31d4

- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0

add_codeowners file

2cc2313

added approval check worker

6befb6f

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

2a94ba2

Merge branch 'master' into rebase-v1.8.4

8519283

Merge pull request #8 from tetherto/rebase-v1.8.4

e361028

chore: rebase fork to whisper.cpp v1.8.4

feat(bci): add variable conv1 kernel size support

775d436

Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor

fix vcpkg build

e6fcbaa

Made-with: Cursor

fix apple silicon cross compile

461f07d

Made-with: Cursor

fix windows pthread

bbb3535

Made-with: Cursor

Merge pull request #10 from tetherto/feat/bci-patches-v184

2b1e04f

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)

Add parakeet-cpp port

d7ab516

Merge pull request #11 from GustavoA1604/add-parakeet-cpp

a6785de

Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml

Add tts-cpp files

ef840d5

GustavoA1604 and others added 6 commits May 7, 2026 11:07

Merge pull request #14 from GustavoA1604/tts-cpp

be913c8

Add tts-cpp/ subtree (Chatterbox Turbo + Multilingual + Supertonic TTS) + integration fixes

feat: improve Chinese preprocessing output format by adjusting suffix…

6bece95

… handling

freddy311082 requested review from a team as code owners May 14, 2026 23:35

freddy311082 requested a review from GustavoA1604 May 14, 2026 23:35

GustavoA1604 requested changes May 15, 2026

View reviewed changes

feat: enhance multilingual support by initializing MeCab and Cangjie …

cb2817b

…only once and improving token handling

GustavoA1604 requested changes May 18, 2026

View reviewed changes

feat: add script to download and prepare Cangjie5 TSV for tts-cpp

38737c8

GustavoA1604 previously approved these changes May 19, 2026

View reviewed changes

gianni-cor force-pushed the master branch from f8af247 to eabcf6d Compare May 28, 2026 12:36

gianni-cor requested review from a team as code owners May 28, 2026 12:36

freddy311082 dismissed GustavoA1604’s stale review via d45b473 May 29, 2026 15:34

refactor: exclude Chinese from supported languages due to preprocessi…

d3db516

…ng issues

freddy311082 requested review from gianni-cor and jpgaribotti May 29, 2026 22:36

ogad-tether reviewed Jun 1, 2026

View reviewed changes

ogad-tether approved these changes Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19

feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19
freddy311082 wants to merge 54 commits into
masterfrom
feat/tts-multilingual-23-langs

freddy311082 commented May 14, 2026

Uh oh!

GustavoA1604 left a comment •

edited

Loading

Uh oh!

GustavoA1604 left a comment

Uh oh!

freddy311082 commented May 29, 2026

Uh oh!

ogad-tether left a comment

Uh oh!

ogad-tether left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

freddy311082 commented May 14, 2026

🎯 What problem does this PR solve?

📝 How does it solve it?

🧪 How was it tested?

🔌 API Changes

Uh oh!

GustavoA1604 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GustavoA1604 left a comment

Choose a reason for hiding this comment

Uh oh!

freddy311082 commented May 29, 2026

Uh oh!

ogad-tether left a comment

Choose a reason for hiding this comment

Review — omar.gad@tether.io

Blockers

1. Data race on global path strings (mtl_tokenizer.cpp)

2. TTS_CPP_HAS_MECAB compile definition should be PRIVATE, not PUBLIC

3. MeCab link library should be PRIVATE, not PUBLIC

Medium

4. build_cangjie_tsv.py defaults to master — not reproducible

5. text_preprocess.h includes <fstream> unnecessarily

Nits

6. CER metric doesn't match standard definition

7. sh_quote() in ASR test is POSIX-only

8. Parakeet patches in this PR

Summary

Uh oh!

ogad-tether left a comment

Choose a reason for hiding this comment

Re-review — omar.gad@tether.io

One remaining nit (non-blocking)

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

GustavoA1604 left a comment •

edited

Loading

1. Data race on global path strings (`mtl_tokenizer.cpp`)

2. `TTS_CPP_HAS_MECAB` compile definition should be `PRIVATE`, not `PUBLIC`

3. MeCab link library should be `PRIVATE`, not `PUBLIC`

4. `build_cangjie_tsv.py` defaults to `master` — not reproducible

5. `text_preprocess.h` includes `<fstream>` unnecessarily

7. `sh_quote()` in ASR test is POSIX-only