feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19
feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19freddy311082 wants to merge 54 commits into
Conversation
- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0
QVAC-7457: Add seed parameter for reproducible sampling
add_codeowners file
added approval check worker
DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
- Add seed field to whisper_full_params structure - Default seed value is 0 (maintains backward compatibility) - Each decoder uses seed + decoder_index for unique seeds - Enables reproducible results when temperature > 0
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams - When present, encoder self-attention is restricted to a local window for layers up to last_window_layer - Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder - Populate window_mask data on the encoder graph (not the cross graph) - Add proper SOS token (language + transcribe) initialization for BCI models Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Address review feedback: 1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind a n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields — reading them unconditionally would corrupt the file position and break model loading. 2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading. 3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization. 4. Log BCI-specific hparams when a BCI model is detected. Made-with: Cursor
The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor
…, Threads 1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor. 2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h. 3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.
…ash attention 1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call. 2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder. 3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.
…alidation 1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift. 2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active. 3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths. 4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage. 5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks. Made-with: Cursor
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
…review) Address @gianni-cor review on PR #11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy. - parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated. - parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix. Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively. Co-authored-by: Cursor <cursoragent@cursor.com>
Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml
The standalone setup-ggml.sh + patches/ tooling was dropped from
qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the
CMakeLists.txt still:
* defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
* unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-"
on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the
filename-prefix patch was no longer applied, so libspeech-ggml-*.so
files existed on disk but ggml's runtime loader still searched for
libggml-*.so under GGML_BACKEND_DL=ON. Vulkan / OpenCL / CUDA
backends silently failed to load on Android.
Fix per reviewer guidance: converge the speech stack on a single ggml
source-of-truth. Standalone-bundled-ggml is no longer a supported
build mode out of this in-tree subtree; the canonical path is
`-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack
`ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships
the patches pre-applied.
Edits:
- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this
tree. Docstring spells out the rationale + points users at the
standalone github.com/gianni-cor/chatterbox.cpp repo if they need
a bundled-ggml dev build with patches/ present.
- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to
configure when patches/ is absent: a FATAL_ERROR points at the
right consumption path (vcpkg ggml-speech) and the standalone
fallback. Doesn't break in-tree-with-patches builds (parakeet-cpp
in this same repo still ships patches/, so its bundled path is
unaffected by this guard inside tts-cpp).
- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors
out at find_package(ggml CONFIG REQUIRED) with our new message
pointing at the ggml-speech port; `cmake -S tts-cpp -B build
-DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard
with the no-patches message.
- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/
that no longer exist; running it would have errored out anyway.
The standalone repo keeps its own setup-ggml.sh; only the in-tree
subtree drops it.
The standalone chatterbox.cpp repo (the one tts-cpp/ was copied
from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/
folder + scripts/setup-ggml.sh. This commit is therefore an
integration-time delta against that source, not a change to the
standalone build flow.
Co-authored-by: Cursor <cursoragent@cursor.com>
…_token
Mirrors src/chatterbox_cli.cpp's MTL tokenisation path and the Python
ChatterboxMultilingualTTS.generate reference (chatterbox-ref/src/chatterbox/
mtl_tts.py:288-291). The MTL T3 prompt graph anchors position 0 on
start_text_token (255); without it the autoregressive decode drops the
first speech tokens, audible as a missing leading syllable
("Hello" -> "lo from the multilingual").
Turbo (gpt2_bpe) is unaffected and keeps the existing single-line tokenise
+ punc_norm path.
Co-authored-by: Cursor <cursoragent@cursor.com>
…gine::run_t3 MTL T3 occasionally emits a plausible end-of-speech silence cadence (three identical tokens in a row) mid-utterance and then hallucinates low-energy content -- silence, hissing, garbage tokens -- until n_predict (1000) is reached, producing ~40 s of trailing junk on a short input. chatterbox_cli.cpp already guards against this via the AlignmentStreamAnalyzer token_repetition port, but Engine::run_t3 was missing the same check, so the addon path (which doesn't go through the CLI) saw the regression on whichever language/seed combinations happen to hit the cadence (most reliably reproduced on German with the default seed=42). Mirrors the CLI's existing guard 1:1, gated on is_mtl since the Turbo codebook has a different cadence signature. Co-authored-by: Cursor <cursoragent@cursor.com>
Add tts-cpp/ subtree (Chatterbox Turbo + Multilingual + Supertonic TTS) + integration fixes
Two interleaved chatterbox concurrency fixes for iOS, collapsed into one history-preserving commit on top of the upstream merge so the qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604 SHA (no chained port-version bumps). 1) gguf_init_from_file race (the SIGABRT seen before this commit): bake_voice_conditioning() must run BEFORE we spawn the s3gen preload thread. Both paths funnel into gguf_init_from_file() (voice_encoder opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the ggml_init / gguf_init_from_file pair underneath is not safe to invoke concurrently from two threads against ggml's process-global state. Empirically races on Apple Silicon with a fast SIGABRT inside ggml_abort coming from the preload thread's ggml_init while the main thread is still executing voice_encoder_load. 2) Metal shared-buffer-type init race (the SIGSEGV in ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the preload thread spawns we now block on wait_for_preload() before the constructor returns, so the SDK e2e bootstrap's load -> immediate unload pattern ("preLoadUnload") can no longer tear down the engine while s3gen_preload is still inside ggml_backend_metal_buffer_type_shared_alloc_buffer -> ggml_metal_buffer_is_shared. Defeats the parallel-preload optimisation (s3gen_preload no longer overlaps with first T3 inference inside synthesize()); revisit once ggml-metal's shared buffer-type init is safe to use from a preload thread concurrent with construction-time teardown. Together these two changes unblock chatterbox load on iPhone 16e (iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal enabled — qvac/pull/1992. Co-authored-by: Cursor <cursoragent@cursor.com>
…and Cangjie for Chinese (need more review) Add runtime MeCab/Cangjie dictionary paths for cross-platform multilingual TTS Replace compile-time hardcoded Homebrew paths with runtime dictionary resolution via CLI flags, EngineOptions, and env var fallbacks (CHATTERBOX_MECAB_DICT, CHATTERBOX_CANGJIE_TSV).
There was a problem hiding this comment.
Need to check the following. Probably can just feed to cursor and it will be able to take care
-
Should zh indeed be in supported_languages()? The PR description states "zh fails (CER ~97% — Cangjie approach needs rework". Also on windows the zh test crashed on windows (see blocker #3)
-
generated.resize(n - 2) in the repetition early-stop trims too aggressively
In chatterbox_cli.cpp, the 3-token early-stop was changed to 2-token with a kMtlMinTokensBeforeCadence = 60 guard. But generated.resize(n - 2) removes both the repeated token and the legitimate one before it. The old code didn't resize at all — the downstream pop_back() handled the trailing token. This should be resize(n - 1). Add a comment explaining the intent. -
preprocess_japanese/preprocess_chinese exceptions cause std::terminate instead of a clean error exit. Both ja and zh crash rather than printing an error and returning 1. Root cause: s3gen_preload_thread (line 1031 of chatterbox_cli.cpp) is spawned when --s3gen-gguf is provided. When a preprocessing exception propagates out of the try block during stack unwind, this std::thread's destructor is called while it is still joinable, which unconditionally calls std::terminate(). The codebase has a comment at line 1184 explicitly noting this hazard, and guards against it in the --input-file path — but the normal synthesis path is unguarded. Fix: wrap s3gen_preload_thread in a RAII join guard, or join it in the catch block before rethrowing.
-
--output flag in test_multilingual_synth.cpp::run_synthesis() does not exist in the CLI
if (!args.tokens_dump_path.empty()) {
cli_args.push_back("--output"); // no such CLI arg
cli_args.push_back(args.tokens_dump_path);
}
chatterbox_cli.cpp has --out for the wav file, no --output. If --dump-tokens is passed to the test binary, tts_cpp_cli_main hits the unknown-argument handler and returns failure. Either remove the dead branch or add the matching CLI flag.
- create_mecab_tagger does not pass -r, so a broken system mecabrc silently kills ja synthesis
Confirmed by local testing: MeCab reads the system mecabrc before honoring -d. On any host where the system rc points to a non-existent dictionary (e.g. MSYS2's default naist-jdic that isn't installed), mecab_new returns nullptr and MeCabTagger::load throws with an empty error message. Fix by passing the materialized mecabrc:
mecab_t * create_mecab_tagger(const std::string & dic_str) {
std::string a0 = "mecab";
std::string a1 = "-d";
std::string a2 = dic_str;
std::string a3 = "-r";
std::string a4 = (std::filesystem::path(dic_str) / "mecabrc").string();
char * argv[] = { a0.data(), a1.data(), a2.data(), a3.data(), a4.data() };
return mecab_new(5, argv);
}
build_mecab_dict.py already copies a neutral mecabrc (; This file can be empty but it needs to exist), so this is self-contained.
- target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR}) must be PRIVATE
Confirmed by build: PUBLIC propagates the entire MinGW include/ directory to every consumer of tts-cpp. On a Windows Clang-cl build this causes a catastrophic cascade of MinGW vs MSVC header conflicts (hundreds of errors) across every translation unit that includes any tts-cpp header. mecab.h is an implementation detail of text_preprocess.cpp — no public header exposes MeCab types.
was:
target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR})
fix:
target_include_directories(tts-cpp PRIVATE ${MECAB_INCLUDE_DIR})
- set_mecab_dict_path / set_cangjie_tsv_path — silent no-op if called after first synthesis
The std::call_once blocks in mtl_tokenizer.cpp capture the global path at first call. If set_mecab_dict_path is called after the first ja synthesis (e.g. constructing a second Engine with a different dict), the once_flag is spent and the new path is silently ignored. At minimum document this constraint on the setters; ideally assert or warn.
- Korean (ko) skips decompose_korean_to_jamo — intentional?
apply_language_preprocessing dispatches on "ja" and "zh" but not "ko". The new decompose_korean_to_jamo function exists but is never called. If ko already works without Jamo decomposition this is fine, but it needs a comment explaining why the function exists without being called here.
- Duplicate UTF-8 decoder
test_multilingual_asr.cpp contains its own utf8_to_codepoints (lines ~14–40) that duplicates decode_utf8 from text_preprocess.h, which is already a public header in the same library. The ASR test should #include "text_preprocess.h" and use the shared function.
…only once and improving token handling
GustavoA1604
left a comment
There was a problem hiding this comment.
The changes look good now. The only thing left is that there is no acquisition script for the Cangjie TSV (unlike MeCab, which has build_mecab_dict.py). After some investigation the canonical source is Jackchows/Cangjie5 — the Cangjie5_TC.txt file is already TAB-separated, so it's largely compatible with CangjieTable::load() already. I've written a build_cangjie_tsv.py script following the same pattern as build_mecab_dict.py
Also Cangjie5_TC.txt has ~2,529 lines with a third annotation column, e.g.:
昸 ahem [u]
parse_tsv_line uses line.find('\t') to find only the first tab, then sets row.code = line.substr(tab + 1). For annotated lines this gives code = "ahem\t[u]", which would emit tokens like [cj_a][cj_h][cj_e][cj_m] followed by a literal \t and [u] — breaking those characters. Fix is one line in text_preprocess.cpp:
row.code = line.substr(tab + 1);
// strip optional annotation column
const auto tab2 = row.code.find('\t');
if (tab2 != std::string::npos) row.code = row.code.substr(0, tab2);
build_cangjie_tsv.py strips the annotation at conversion time (so the TSV written to disk is already clean), but parse_tsv_line should still be hardened for any raw files callers might point at.
Resolves 37 add/add conflicts that accumulated since the last
master merge (May 7). Master moved 326 commits forward, mainly
landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend
registry refactor (`backend_selection.{h,cpp}`, registry-only
device walk replacing the per-backend `#ifdef GGML_USE_<X>`
cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the
`backends_dir` / `opencl_cache_dir` Engine knobs.
Resolution strategy:
- parakeet-cpp/ (19 files): taken from master verbatim. The PR
branch only carried the original port (commits d7ab516 /
c6c3fd7 / 761eca0, all <= May 7); master has 13 newer
commits including TDT/EOU/Sortformer v2.1 + AOSC and the
word-start signal already integrated. Nothing of the PR was
lost on this side.
- .github/CODEOWNERS: taken from master (team reorg to
`qvac-internal-dev` / `qvac-internal-merge`).
- tts-cpp/ stale-from-initial-drop (7 files: voice_encoder,
t3_mtl, s3tokenizer, mel_extract_stft, main, campplus,
campplus_forward.inc): taken from master. Their only PR
commit is the original `ef840d5c Add tts-cpp files` drop;
master has since rewritten them for the registry refactor.
- tts-cpp/ mirror-only (4 files: supertonic/engine.h,
supertonic_engine, supertonic_gguf, chatterbox_tts): taken
from master. The PR's only authored commits on these mirror
pre-existing fixes from chatterbox.cpp that are already on
master.
- tts-cpp/CMakeLists.txt: hybrid merge. Master's Android
dynamic-backend stack, registry-only backend-defs interface
(with `src/backend_selection.cpp` in the source list), and
`target_compile_definitions(test-metal-ops PRIVATE
GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp`
source entry, MeCab/Cangjie find_library block (PRIVATE
include per gianni-cor review), and 23-language multilingual
test matrix retained.
- tts-cpp/include/tts-cpp/chatterbox/engine.h: master's
updated `n_gpu_layers` doc (Adreno-tier policy) and new
`backends_dir` / `opencl_cache_dir` fields retained. PR's
`mecab_dict_path` / `cangjie_tsv_path` fields retained.
- tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` +
`text_preprocess.h` includes, 23 supported_languages,
preprocess_japanese / preprocess_chinese helpers with
call_once-cached MeCab tagger + Cangjie table,
apply_language_preprocessing dispatch, and
`set_mecab_dict_path` / `set_cangjie_tsv_path` setters
(with already-initialised warn) retained. Master's
`// ---- Encode ----` divider kept.
- tts-cpp/src/chatterbox_engine.cpp: master's `#include
"backend_selection.h"` and `backends_dir` /
`opencl_cache_dir` wiring retained. PR's per-Engine
`mtl_tokenizer::set_mecab_dict_path` /
`set_cangjie_tsv_path` calls retained.
- tts-cpp/src/chatterbox_cli.cpp: master's removal of the
per-backend `#include "ggml-{cuda,metal,vulkan}.h"`
cascade (registry-only refactor) and the new
voice-cloning backend comment retained. PR's
`--mecab-dict` / `--cangjie-tsv` flags (declaration, help,
parsing, and per-Engine setter call) retained. PR's RAII
`thread_join_guard` on the s3gen preload thread retained
(addresses GustavoA1604 review #3: std::terminate hazard
during stack unwind). PR's 2-token MTL early-stop with
`kMtlMinTokensBeforeCadence = 60` guard and
`generated.resize(n - 1)` retained (addresses
GustavoA1604 review #2: previous over-aggressive
`resize(n - 2)` trimmed a legitimate token); the log line
was updated to surface the repeated token id.
PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp},
tts-cpp/scripts/build_mecab_dict.py,
tts-cpp/scripts/build_cangjie_tsv.py,
tts-cpp/test/test_multilingual_{synth,asr}.cpp are all
preserved as-is by the merge.
Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether
left a comment
There was a problem hiding this comment.
Review — omar.gad@tether.io
I reviewed the full diff (16 files, +2080/−17) against the prior review threads from @GustavoA1604. All 9 items from the first review and both items from the second review appear to be addressed — nice work. Below are new findings.
Blockers
1. Data race on global path strings (mtl_tokenizer.cpp)
g_mecab_dict_path and g_cangjie_tsv_path are bare std::string globals written by set_mecab_dict_path() / set_cangjie_tsv_path() and read by resolve_mecab_dict_path() / resolve_cangjie_tsv_path(). The std::call_once guards protect the tagger/table initialization, but the path strings themselves are read/written without synchronization. If two Engine instances are constructed on different threads (plausible in the addon), this is a data race — undefined behavior per the C++ standard.
Fix options:
- Protect the globals with a
std::mutex, or - Move the path into the
std::call_oncelambda's capture (snapshot at set time), or - Use
std::atomic<std::string*>with a publish pattern.
The simplest fix is probably a std::mutex guarding both the path and the initialized flag together.
2. TTS_CPP_HAS_MECAB compile definition should be PRIVATE, not PUBLIC
target_compile_definitions(tts-cpp PUBLIC TTS_CPP_HAS_MECAB)MeCabTagger lives in src/text_preprocess.h — an internal header, not under include/. No public header exposes MeCab types. Making this PUBLIC leaks an implementation detail to every consumer of tts-cpp and can cause surprising behavior if a consumer happens to #include "text_preprocess.h" transitively. Same rationale as GustavoA1604's review #6 (include dirs) — this should be PRIVATE.
3. MeCab link library should be PRIVATE, not PUBLIC
target_link_libraries(tts-cpp PUBLIC ${MECAB_LIBRARY})MeCab is only used in text_preprocess.cpp. No public header exposes MeCab symbols. PUBLIC linkage means install(EXPORT) records the dependency, and every consumer doing find_package(tts-cpp) must also have libmecab installed — even if they never synthesize Japanese. This should be PRIVATE.
Medium
4. build_cangjie_tsv.py defaults to master — not reproducible
DEFAULT_TAG = "master"The MeCab script (build_mecab_dict.py) pins to a specific PyPI package version via ipadic. The Cangjie script downloads from Jackchows/Cangjie5 at whatever master happens to be. If upstream changes column format or renames the file, builds silently break. Pin to a specific commit or tag (e.g., --tag v1.0.0 or a commit SHA) as the default, with master available as an override.
5. text_preprocess.h includes <fstream> unnecessarily
The header declares CangjieTable::read_tsv_entries(std::ifstream &, ...) which requires <fstream>. But this is a private method — it doesn't need to be in the header at all. Consider moving it to the .cpp file (or making it a free function in the anonymous namespace), which lets you drop <fstream> from the header and reduces include bloat for every TU that includes this header.
Nits
6. CER metric doesn't match standard definition
In test_multilingual_asr.cpp:
const size_t denom = std::max(exp_n.size(), got_n.size());Standard CER uses ref_len (expected length) as the denominator, not max(ref, hyp). Using max caps the metric at 1.0, which is convenient for thresholding but makes the logged "CER" values incomparable with published benchmarks. Consider renaming to "normalized edit distance" or switching to exp_n.size() as denominator. Not a blocker since this is test-only, but the log output says "CER" which could confuse someone comparing against paper numbers.
7. sh_quote() in ASR test is POSIX-only
test_multilingual_asr.cpp uses POSIX single-quote escaping + std::system(). On Windows this won't work (cmd.exe quoting is different). Fine if the L3 tests are POSIX-only, but worth a comment or a #ifdef _WIN32 skip.
8. Parakeet patches in this PR
The diff includes 4 new files under parakeet-cpp/patches/ (+659 lines). These appear to have come through the merge commit with master. They're unrelated to the multilingual feature — just flagging in case the intent was a clean feature-only diff for review purposes. If they were already reviewed and merged via another PR, no action needed.
Summary
The core multilingual implementation is solid — the MeCab/Cangjie architecture, runtime path resolution with env-var fallbacks, the tiered test matrix, and the zh exclusion are all well-thought-out. The three blockers above are focused on thread safety and build hygiene; they should be straightforward to fix. Happy to re-review once those land.
Blockers: - mtl_tokenizer: guard global MeCab/Cangjie path strings + initialized flags with a mutex. The std::call_once flags only serialized tagger/ table init; the path globals were read by resolve_*()/written by set_*() unsynchronized, a data race when two Engines are constructed on different threads. set_*/resolve_* and the in-call_once flag write now take path_mutex(); the flag write is scoped so it releases before resolve_* re-locks (no deadlock on the non-recursive mutex). - CMakeLists: TTS_CPP_HAS_MECAB compile definition PUBLIC -> PRIVATE. - CMakeLists: MECAB_LIBRARY link PUBLIC -> PRIVATE. MeCab is an impl detail of text_preprocess.cpp; no public header exposes its types or symbols, so consumers of tts-cpp no longer inherit a libmecab dep. Medium: - build_cangjie_tsv.py: default ref master -> pinned commit b76ed7fd4b8365529c056096b6a3a2ba749f5a40 for reproducible builds (Jackchows/Cangjie5 has no tags); master still available via --tag. - text_preprocess.h: drop <fstream> (-> <iosfwd>) and take the private read_tsv_entries() by std::istream& instead of std::ifstream& so the header no longer pulls <fstream> into every consumer TU. Nits: - test_multilingual_asr: CER denominator max(ref,hyp) -> ref_len so the metric matches the standard definition and stays comparable with published benchmarks. - test_multilingual_asr: document that sh_quote()/run_asr() are POSIX-only.
ogad-tether
left a comment
There was a problem hiding this comment.
Re-review — omar.gad@tether.io
Commit 63fc0c93 addresses all 7 items from my first review. Verified each fix:
| # | Issue | Status |
|---|---|---|
| 1 | Data race on global path strings | ✅ path_mutex() guards all reads/writes |
| 2 | TTS_CPP_HAS_MECAB PUBLIC → PRIVATE |
✅ |
| 3 | MeCab link PUBLIC → PRIVATE | ✅ |
| 4 | build_cangjie_tsv.py unpinned default |
✅ Pinned to b76ed7fd |
| 5 | text_preprocess.h includes <fstream> |
✅ Now <iosfwd> + std::istream & |
| 6 | CER denominator non-standard | ✅ Now exp_n.size() |
| 7 | sh_quote POSIX-only undocumented |
✅ Comment added |
One remaining nit (non-blocking)
g_mecab_initialized is set before the tagger is actually initialized — misleading warning in a narrow race window
In mecab_tagger():
std::call_once(once, [] {
{
std::lock_guard<std::mutex> lk(path_mutex());
g_mecab_initialized = true; // ← set HERE
}
const std::string path = resolve_mecab_dict_path(); // ← but init happens AFTER
...
t->load(path);
});If Thread B calls set_mecab_dict_path("new_path") in the window between the flag being set and resolve_mecab_dict_path() being called:
- Thread B sees
g_mecab_initialized == true, prints "new path will be ignored" - Thread B then sets
g_mecab_dict_path = "new_path" - Thread A calls
resolve_mecab_dict_path(), reads the new path, and uses it
So the warning says "will be ignored" but the path IS used. The fix is to move g_mecab_initialized = true (and the Cangjie equivalent) to the end of the call_once lambda, after t->load() completes. That way:
set_*()before anymecab_tagger()call → sets path, no warning, tagger uses it ✅set_*()aftermecab_tagger()completes → flag is true, warning fires, path genuinely ignored ✅set_*()duringcall_once(inherently racy by the caller) → no flag yet, no misleading warning ✅
Same pattern for cangjie_table().
This is a very narrow window and the practical impact is a spurious warning message (not UB), so it's non-blocking — but it's a quick one-line move that makes the semantics correct.
Verdict
Everything else looks good. The mutex pattern is correct (no deadlocks — the scoped lock releases before resolve_*() re-acquires, and std::call_once prevents re-entrant init). The thread_join_guard RAII is correct (safe on default-constructed thread). The 2-token early-stop with kMtlMinTokensBeforeCadence = 60 and resize(n-1) is correct. The Cangjie strip_trailing_tsv_columns hardening handles annotated lines properly.
Ship it once the flag-ordering nit above is addressed (or documented as intentional).
The g_mecab_initialized / g_cangjie_initialized flags were set at the start of the call_once lambda, before resolve_*_path() + load() ran. In the narrow window between setting the flag and resolving the path, a concurrent set_*_path() would see the flag as true and warn that the new path "will be ignored" -- yet resolve_*_path() could still pick that path up and use it, making the warning incorrect. Move the flag assignment to the end of the lambda, after the load attempt completes, so the "ignored" warning only fires when the path is genuinely too late to take effect. Also restructure to move the unique_ptr into the static only on success instead of constructing-then-resetting on failure. Addresses the remaining non-blocking nit from ogad-tether's re-review on PR #19.
🎯 What problem does this PR solve?
/opt/homebrew/lib/mecab/dic/ipadic, breaking cross-platform builds📝 How does it solve it?
text_preprocess.h/.cppwith Japanese (MeCab), Chinese (Cangjie), and Korean (Jamo) preprocessingsupported_languages()from 18 to 23EngineOptions::mecab_dict_path/cangjie_tsv_path, CLI--mecab-dict/--cangjie-tsv, env var fallbacksCHATTERBOX_MECAB_DICT/CHATTERBOX_CANGJIE_TSVscripts/build_mecab_dict.pyto materialize IPAdic from pip (same pattern as tts-onnx)🧪 How was it tested?
🔌 API Changes