Skip to content

feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19

Open
freddy311082 wants to merge 54 commits into
masterfrom
feat/tts-multilingual-23-langs
Open

feat(tts-cpp): add 23-language multilingual support with runtime MeCab/Cangjie paths#19
freddy311082 wants to merge 54 commits into
masterfrom
feat/tts-multilingual-23-langs

Conversation

@freddy311082
Copy link
Copy Markdown

🎯 What problem does this PR solve?

  • tts-cpp only supported 18 tier-1 languages; 5 tier-2 languages (ja, he, ru, zh, hi) were missing
  • MeCab dictionary path for Japanese was hardcoded to /opt/homebrew/lib/mecab/dic/ipadic, breaking cross-platform builds
  • Cangjie TSV path for Chinese was baked in at compile time, same problem

📝 How does it solve it?

  • Add text_preprocess.h/.cpp with Japanese (MeCab), Chinese (Cangjie), and Korean (Jamo) preprocessing
  • Expand supported_languages() from 18 to 23
  • Replace compile-time dictionary paths with runtime resolution: EngineOptions::mecab_dict_path / cangjie_tsv_path, CLI --mecab-dict / --cangjie-tsv, env var fallbacks CHATTERBOX_MECAB_DICT / CHATTERBOX_CANGJIE_TSV
  • Add scripts/build_mecab_dict.py to materialize IPAdic from pip (same pattern as tts-onnx)
  • Add 23-language CTest integration tests (L0 registry, L1 wav structure, L2 audio sanity)
  • Add optional L3 ASR verification tests via whisper-cli

🧪 How was it tested?

  • All 23 languages pass L0/L1/L2 synthesis tests
  • L3 ASR matrix: 22/23 pass, zh fails (CER ~97% — Cangjie approach needs rework)

🔌 API Changes

struct EngineOptions {
    // ...existing fields...
    std::string mecab_dict_path;   // runtime path to MeCab IPAdic dir (for ja)
    std::string cangjie_tsv_path;  // runtime path to Cangjie5_TC.tsv (for zh)
};

nik and others added 30 commits November 10, 2025 13:02
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
QVAC-7457: Add seed parameter for reproducible sampling
DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models
to use a non-standard first convolution kernel size. Standard
whisper models default to kernel size 3.

Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window
  for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does
  not support custom F32 masks); flash attention remains enabled for
  non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI
  models

Backward-compatible: n_audio_window_size defaults to 0 and
n_audio_last_window_layer defaults to -1, disabling windowed
attention entirely for standard whisper models.

Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Address review feedback:

1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel,
   n_audio_window_size, n_audio_last_window_layer) behind a
   n_mels > 256 check. Standard whisper models have n_mels <= 128
   and do not contain these fields — reading them unconditionally
   would corrupt the file position and break model loading.

2. Add explicit is_bci flag to hparams struct, set when BCI fields
   are detected during loading.

3. Use is_bci flag (instead of n_audio_window_size > 0) to guard
   the BCI-specific decoder SOS token initialization.

4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and
window_size, both fixed after model load. Move the O(n_ctx^2)
computation from whisper_encode_internal (called every encode)
to whisper_init_state (called once). The encode path now just
copies the precomputed data to the graph tensor.

Made-with: Cursor
…, Threads

1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed
   mask uses hparams.n_audio_ctx, but the graph tensor is sized from
   exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back
   to recomputing the mask at the effective n_ctx when sizes differ,
   preventing a buffer overflow into the smaller tensor.

2. Update whisper.pc.in: the install interface was changed to
   include/whisper but the pkg-config includedir still pointed to
   include/. Consumers using pkg-config would not find whisper.h.

3. Fix whisper-config.cmake.in: the whisper target publicly links
   Threads::Threads but find_dependency(Threads) was skipped on
   Windows, leaving downstream find_package(whisper) with an
   unresolved imported target. Now always resolve Threads.
…ash attention

1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the
   default n_audio_ctx, the window mask is now recomputed once and
   cached in wstate (keyed on window_mask_n_ctx) instead of allocating
   a new std::vector on every whisper_encode_internal call.

2. Per-layer flash attention: layers above last_window_layer no longer
   need the windowed attention mask. The flash attention path is now
   used for those layers even when BCI windowed attention is active,
   instead of globally falling back to the softmax path for the entire
   encoder.

3. Use std::abs instead of C abs in both init-time and encode-time
   mask computation paths.
…alidation

1. Extract compute_window_mask() helper on whisper_state to eliminate
   the duplicated O(n_ctx^2) mask fill loop that appeared in both
   whisper_init_state and whisper_encode_internal. Both call sites
   now use the single helper, preventing future drift.

2. Guard the encode-time mask block with hparams.is_bci before doing
   the ggml_graph_get_tensor lookup. Cheaper and more explicit than
   relying on the tensor name string to determine whether BCI
   windowed attention is active.

3. Add hparams.is_bci to the graph builder guard for window_mask
   tensor creation, aligning it with the other BCI code paths.

4. Add validation for BCI hparams after reading from file:
   n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0.
   Log an error and return false on invalid values instead of
   proceeding with garbage.

5. Add comment explaining the n_mels > 256 threshold used to
   discriminate BCI models from standard whisper models, and noting
   that a dedicated file-format marker should be introduced if this
   assumption ever breaks.

Made-with: Cursor
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
…review)

Address @gianni-cor review on PR #11: switch the bundled ggml filename
prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech
stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a
single ggml file set instead of each library shipping its own copy.

  - parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`,
    GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`,
    option blurb + status message updated.
  - parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh,
    patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example
    updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure
prints `bundled ggml libraries will be emitted as libspeech-ggml-*`;
build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib;
parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml
The standalone setup-ggml.sh + patches/ tooling was dropped from
qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the
CMakeLists.txt still:
  * defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
  * unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-"
    on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the
filename-prefix patch was no longer applied, so libspeech-ggml-*.so
files existed on disk but ggml's runtime loader still searched for
libggml-*.so under GGML_BACKEND_DL=ON.  Vulkan / OpenCL / CUDA
backends silently failed to load on Android.

Fix per reviewer guidance: converge the speech stack on a single ggml
source-of-truth.  Standalone-bundled-ggml is no longer a supported
build mode out of this in-tree subtree; the canonical path is
`-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack
`ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships
the patches pre-applied.

Edits:

- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this
  tree.  Docstring spells out the rationale + points users at the
  standalone github.com/gianni-cor/chatterbox.cpp repo if they need
  a bundled-ggml dev build with patches/ present.

- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to
  configure when patches/ is absent: a FATAL_ERROR points at the
  right consumption path (vcpkg ggml-speech) and the standalone
  fallback.  Doesn't break in-tree-with-patches builds (parakeet-cpp
  in this same repo still ships patches/, so its bundled path is
  unaffected by this guard inside tts-cpp).

- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors
  out at find_package(ggml CONFIG REQUIRED) with our new message
  pointing at the ggml-speech port; `cmake -S tts-cpp -B build
  -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard
  with the no-patches message.

- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/
  that no longer exist; running it would have errored out anyway.
  The standalone repo keeps its own setup-ggml.sh; only the in-tree
  subtree drops it.

The standalone chatterbox.cpp repo (the one tts-cpp/ was copied
from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/
folder + scripts/setup-ggml.sh.  This commit is therefore an
integration-time delta against that source, not a change to the
standalone build flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 and others added 6 commits May 7, 2026 11:07
…_token

Mirrors src/chatterbox_cli.cpp's MTL tokenisation path and the Python
ChatterboxMultilingualTTS.generate reference (chatterbox-ref/src/chatterbox/
mtl_tts.py:288-291).  The MTL T3 prompt graph anchors position 0 on
start_text_token (255); without it the autoregressive decode drops the
first speech tokens, audible as a missing leading syllable
("Hello" -> "lo from the multilingual").

Turbo (gpt2_bpe) is unaffected and keeps the existing single-line tokenise
+ punc_norm path.

Co-authored-by: Cursor <cursoragent@cursor.com>
…gine::run_t3

MTL T3 occasionally emits a plausible end-of-speech silence cadence
(three identical tokens in a row) mid-utterance and then hallucinates
low-energy content -- silence, hissing, garbage tokens -- until
n_predict (1000) is reached, producing ~40 s of trailing junk on a
short input.  chatterbox_cli.cpp already guards against this via the
AlignmentStreamAnalyzer token_repetition port, but Engine::run_t3 was
missing the same check, so the addon path (which doesn't go through the
CLI) saw the regression on whichever language/seed combinations happen
to hit the cadence (most reliably reproduced on German with the default
seed=42).

Mirrors the CLI's existing guard 1:1, gated on is_mtl since the Turbo
codebook has a different cadence signature.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add tts-cpp/ subtree (Chatterbox Turbo + Multilingual + Supertonic TTS) + integration fixes
Two interleaved chatterbox concurrency fixes for iOS, collapsed into
one history-preserving commit on top of the upstream merge so the
qvac-registry-vcpkg tts-cpp port still pins to a single GustavoA1604
SHA (no chained port-version bumps).

1) gguf_init_from_file race (the SIGABRT seen before this commit):
   bake_voice_conditioning() must run BEFORE we spawn the s3gen preload
   thread.  Both paths funnel into gguf_init_from_file() (voice_encoder
   opens the T3 GGUF, s3gen_preload opens the s3gen GGUF), and the
   ggml_init / gguf_init_from_file pair underneath is not safe to
   invoke concurrently from two threads against ggml's process-global
   state.  Empirically races on Apple Silicon with a fast SIGABRT
   inside ggml_abort coming from the preload thread's ggml_init while
   the main thread is still executing voice_encoder_load.

2) Metal shared-buffer-type init race (the SIGSEGV in
   ggml_metal_buffer_is_shared, surfaced once #1 was fixed): after the
   preload thread spawns we now block on wait_for_preload() before
   the constructor returns, so the SDK e2e bootstrap's
   load -> immediate unload pattern ("preLoadUnload") can no longer
   tear down the engine while s3gen_preload is still inside
   ggml_backend_metal_buffer_type_shared_alloc_buffer ->
   ggml_metal_buffer_is_shared.  Defeats the parallel-preload
   optimisation (s3gen_preload no longer overlaps with first T3
   inference inside synthesize()); revisit once ggml-metal's shared
   buffer-type init is safe to use from a preload thread concurrent
   with construction-time teardown.

Together these two changes unblock chatterbox load on iPhone 16e
(iOS 18.6.2) through the QVAC SDK e2e test consumer with Metal
enabled — qvac/pull/1992.

Co-authored-by: Cursor <cursoragent@cursor.com>
…and Cangjie for Chinese (need more review)

Add runtime MeCab/Cangjie dictionary paths for cross-platform multilingual TTS
Replace compile-time hardcoded Homebrew paths with runtime dictionary
resolution via CLI flags, EngineOptions, and env var fallbacks
(CHATTERBOX_MECAB_DICT, CHATTERBOX_CANGJIE_TSV).
@freddy311082 freddy311082 requested review from a team as code owners May 14, 2026 23:35
@freddy311082 freddy311082 requested a review from GustavoA1604 May 14, 2026 23:35
Copy link
Copy Markdown

@GustavoA1604 GustavoA1604 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check the following. Probably can just feed to cursor and it will be able to take care

  1. Should zh indeed be in supported_languages()? The PR description states "zh fails (CER ~97% — Cangjie approach needs rework". Also on windows the zh test crashed on windows (see blocker #3)

  2. generated.resize(n - 2) in the repetition early-stop trims too aggressively
    In chatterbox_cli.cpp, the 3-token early-stop was changed to 2-token with a kMtlMinTokensBeforeCadence = 60 guard. But generated.resize(n - 2) removes both the repeated token and the legitimate one before it. The old code didn't resize at all — the downstream pop_back() handled the trailing token. This should be resize(n - 1). Add a comment explaining the intent.

  3. preprocess_japanese/preprocess_chinese exceptions cause std::terminate instead of a clean error exit. Both ja and zh crash rather than printing an error and returning 1. Root cause: s3gen_preload_thread (line 1031 of chatterbox_cli.cpp) is spawned when --s3gen-gguf is provided. When a preprocessing exception propagates out of the try block during stack unwind, this std::thread's destructor is called while it is still joinable, which unconditionally calls std::terminate(). The codebase has a comment at line 1184 explicitly noting this hazard, and guards against it in the --input-file path — but the normal synthesis path is unguarded. Fix: wrap s3gen_preload_thread in a RAII join guard, or join it in the catch block before rethrowing.

  4. --output flag in test_multilingual_synth.cpp::run_synthesis() does not exist in the CLI

if (!args.tokens_dump_path.empty()) {
cli_args.push_back("--output"); // no such CLI arg
cli_args.push_back(args.tokens_dump_path);
}
chatterbox_cli.cpp has --out for the wav file, no --output. If --dump-tokens is passed to the test binary, tts_cpp_cli_main hits the unknown-argument handler and returns failure. Either remove the dead branch or add the matching CLI flag.

  1. create_mecab_tagger does not pass -r, so a broken system mecabrc silently kills ja synthesis

Confirmed by local testing: MeCab reads the system mecabrc before honoring -d. On any host where the system rc points to a non-existent dictionary (e.g. MSYS2's default naist-jdic that isn't installed), mecab_new returns nullptr and MeCabTagger::load throws with an empty error message. Fix by passing the materialized mecabrc:

mecab_t * create_mecab_tagger(const std::string & dic_str) {
std::string a0 = "mecab";
std::string a1 = "-d";
std::string a2 = dic_str;
std::string a3 = "-r";
std::string a4 = (std::filesystem::path(dic_str) / "mecabrc").string();
char * argv[] = { a0.data(), a1.data(), a2.data(), a3.data(), a4.data() };
return mecab_new(5, argv);
}
build_mecab_dict.py already copies a neutral mecabrc (; This file can be empty but it needs to exist), so this is self-contained.

  1. target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR}) must be PRIVATE

Confirmed by build: PUBLIC propagates the entire MinGW include/ directory to every consumer of tts-cpp. On a Windows Clang-cl build this causes a catastrophic cascade of MinGW vs MSVC header conflicts (hundreds of errors) across every translation unit that includes any tts-cpp header. mecab.h is an implementation detail of text_preprocess.cpp — no public header exposes MeCab types.

was:
target_include_directories(tts-cpp PUBLIC ${MECAB_INCLUDE_DIR})
fix:
target_include_directories(tts-cpp PRIVATE ${MECAB_INCLUDE_DIR})

  1. set_mecab_dict_path / set_cangjie_tsv_path — silent no-op if called after first synthesis

The std::call_once blocks in mtl_tokenizer.cpp capture the global path at first call. If set_mecab_dict_path is called after the first ja synthesis (e.g. constructing a second Engine with a different dict), the once_flag is spent and the new path is silently ignored. At minimum document this constraint on the setters; ideally assert or warn.

  1. Korean (ko) skips decompose_korean_to_jamo — intentional?

apply_language_preprocessing dispatches on "ja" and "zh" but not "ko". The new decompose_korean_to_jamo function exists but is never called. If ko already works without Jamo decomposition this is fine, but it needs a comment explaining why the function exists without being called here.

  1. Duplicate UTF-8 decoder

test_multilingual_asr.cpp contains its own utf8_to_codepoints (lines ~14–40) that duplicates decode_utf8 from text_preprocess.h, which is already a public header in the same library. The ASR test should #include "text_preprocess.h" and use the shared function.

Copy link
Copy Markdown

@GustavoA1604 GustavoA1604 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good now. The only thing left is that there is no acquisition script for the Cangjie TSV (unlike MeCab, which has build_mecab_dict.py). After some investigation the canonical source is Jackchows/Cangjie5 — the Cangjie5_TC.txt file is already TAB-separated, so it's largely compatible with CangjieTable::load() already. I've written a build_cangjie_tsv.py script following the same pattern as build_mecab_dict.py

Also Cangjie5_TC.txt has ~2,529 lines with a third annotation column, e.g.:

昸 ahem [u]
parse_tsv_line uses line.find('\t') to find only the first tab, then sets row.code = line.substr(tab + 1). For annotated lines this gives code = "ahem\t[u]", which would emit tokens like [cj_a][cj_h][cj_e][cj_m] followed by a literal \t and [u] — breaking those characters. Fix is one line in text_preprocess.cpp:

row.code = line.substr(tab + 1);
// strip optional annotation column
const auto tab2 = row.code.find('\t');
if (tab2 != std::string::npos) row.code = row.code.substr(0, tab2);

build_cangjie_tsv.py strips the annotation at conversion time (so the TSV written to disk is already clean), but parse_tsv_line should still be hardened for any raw files callers might point at.

GustavoA1604
GustavoA1604 previously approved these changes May 19, 2026
Resolves 37 add/add conflicts that accumulated since the last
master merge (May 7). Master moved 326 commits forward, mainly
landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend
registry refactor (`backend_selection.{h,cpp}`, registry-only
device walk replacing the per-backend `#ifdef GGML_USE_<X>`
cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the
`backends_dir` / `opencl_cache_dir` Engine knobs.

Resolution strategy:

- parakeet-cpp/ (19 files): taken from master verbatim. The PR
  branch only carried the original port (commits d7ab516 /
  c6c3fd7 / 761eca0, all <= May 7); master has 13 newer
  commits including TDT/EOU/Sortformer v2.1 + AOSC and the
  word-start signal already integrated. Nothing of the PR was
  lost on this side.

- .github/CODEOWNERS: taken from master (team reorg to
  `qvac-internal-dev` / `qvac-internal-merge`).

- tts-cpp/ stale-from-initial-drop (7 files: voice_encoder,
  t3_mtl, s3tokenizer, mel_extract_stft, main, campplus,
  campplus_forward.inc): taken from master. Their only PR
  commit is the original `ef840d5c Add tts-cpp files` drop;
  master has since rewritten them for the registry refactor.

- tts-cpp/ mirror-only (4 files: supertonic/engine.h,
  supertonic_engine, supertonic_gguf, chatterbox_tts): taken
  from master. The PR's only authored commits on these mirror
  pre-existing fixes from chatterbox.cpp that are already on
  master.

- tts-cpp/CMakeLists.txt: hybrid merge. Master's Android
  dynamic-backend stack, registry-only backend-defs interface
  (with `src/backend_selection.cpp` in the source list), and
  `target_compile_definitions(test-metal-ops PRIVATE
  GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp`
  source entry, MeCab/Cangjie find_library block (PRIVATE
  include per gianni-cor review), and 23-language multilingual
  test matrix retained.

- tts-cpp/include/tts-cpp/chatterbox/engine.h: master's
  updated `n_gpu_layers` doc (Adreno-tier policy) and new
  `backends_dir` / `opencl_cache_dir` fields retained. PR's
  `mecab_dict_path` / `cangjie_tsv_path` fields retained.

- tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` +
  `text_preprocess.h` includes, 23 supported_languages,
  preprocess_japanese / preprocess_chinese helpers with
  call_once-cached MeCab tagger + Cangjie table,
  apply_language_preprocessing dispatch, and
  `set_mecab_dict_path` / `set_cangjie_tsv_path` setters
  (with already-initialised warn) retained. Master's
  `// ---- Encode ----` divider kept.

- tts-cpp/src/chatterbox_engine.cpp: master's `#include
  "backend_selection.h"` and `backends_dir` /
  `opencl_cache_dir` wiring retained. PR's per-Engine
  `mtl_tokenizer::set_mecab_dict_path` /
  `set_cangjie_tsv_path` calls retained.

- tts-cpp/src/chatterbox_cli.cpp: master's removal of the
  per-backend `#include "ggml-{cuda,metal,vulkan}.h"`
  cascade (registry-only refactor) and the new
  voice-cloning backend comment retained. PR's
  `--mecab-dict` / `--cangjie-tsv` flags (declaration, help,
  parsing, and per-Engine setter call) retained. PR's RAII
  `thread_join_guard` on the s3gen preload thread retained
  (addresses GustavoA1604 review #3: std::terminate hazard
  during stack unwind). PR's 2-token MTL early-stop with
  `kMtlMinTokensBeforeCadence = 60` guard and
  `generated.resize(n - 1)` retained (addresses
  GustavoA1604 review #2: previous over-aggressive
  `resize(n - 2)` trimmed a legitimate token); the log line
  was updated to surface the repeated token id.

PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp},
tts-cpp/scripts/build_mecab_dict.py,
tts-cpp/scripts/build_cangjie_tsv.py,
tts-cpp/test/test_multilingual_{synth,asr}.cpp are all
preserved as-is by the merge.

Co-authored-by: Cursor <cursoragent@cursor.com>
@freddy311082
Copy link
Copy Markdown
Author

Copy link
Copy Markdown

@ogad-tether ogad-tether left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — omar.gad@tether.io

I reviewed the full diff (16 files, +2080/−17) against the prior review threads from @GustavoA1604. All 9 items from the first review and both items from the second review appear to be addressed — nice work. Below are new findings.


Blockers

1. Data race on global path strings (mtl_tokenizer.cpp)

g_mecab_dict_path and g_cangjie_tsv_path are bare std::string globals written by set_mecab_dict_path() / set_cangjie_tsv_path() and read by resolve_mecab_dict_path() / resolve_cangjie_tsv_path(). The std::call_once guards protect the tagger/table initialization, but the path strings themselves are read/written without synchronization. If two Engine instances are constructed on different threads (plausible in the addon), this is a data race — undefined behavior per the C++ standard.

Fix options:

  • Protect the globals with a std::mutex, or
  • Move the path into the std::call_once lambda's capture (snapshot at set time), or
  • Use std::atomic<std::string*> with a publish pattern.

The simplest fix is probably a std::mutex guarding both the path and the initialized flag together.

2. TTS_CPP_HAS_MECAB compile definition should be PRIVATE, not PUBLIC

target_compile_definitions(tts-cpp PUBLIC TTS_CPP_HAS_MECAB)

MeCabTagger lives in src/text_preprocess.h — an internal header, not under include/. No public header exposes MeCab types. Making this PUBLIC leaks an implementation detail to every consumer of tts-cpp and can cause surprising behavior if a consumer happens to #include "text_preprocess.h" transitively. Same rationale as GustavoA1604's review #6 (include dirs) — this should be PRIVATE.

3. MeCab link library should be PRIVATE, not PUBLIC

target_link_libraries(tts-cpp PUBLIC ${MECAB_LIBRARY})

MeCab is only used in text_preprocess.cpp. No public header exposes MeCab symbols. PUBLIC linkage means install(EXPORT) records the dependency, and every consumer doing find_package(tts-cpp) must also have libmecab installed — even if they never synthesize Japanese. This should be PRIVATE.


Medium

4. build_cangjie_tsv.py defaults to master — not reproducible

DEFAULT_TAG = "master"

The MeCab script (build_mecab_dict.py) pins to a specific PyPI package version via ipadic. The Cangjie script downloads from Jackchows/Cangjie5 at whatever master happens to be. If upstream changes column format or renames the file, builds silently break. Pin to a specific commit or tag (e.g., --tag v1.0.0 or a commit SHA) as the default, with master available as an override.

5. text_preprocess.h includes <fstream> unnecessarily

The header declares CangjieTable::read_tsv_entries(std::ifstream &, ...) which requires <fstream>. But this is a private method — it doesn't need to be in the header at all. Consider moving it to the .cpp file (or making it a free function in the anonymous namespace), which lets you drop <fstream> from the header and reduces include bloat for every TU that includes this header.


Nits

6. CER metric doesn't match standard definition

In test_multilingual_asr.cpp:

const size_t denom = std::max(exp_n.size(), got_n.size());

Standard CER uses ref_len (expected length) as the denominator, not max(ref, hyp). Using max caps the metric at 1.0, which is convenient for thresholding but makes the logged "CER" values incomparable with published benchmarks. Consider renaming to "normalized edit distance" or switching to exp_n.size() as denominator. Not a blocker since this is test-only, but the log output says "CER" which could confuse someone comparing against paper numbers.

7. sh_quote() in ASR test is POSIX-only

test_multilingual_asr.cpp uses POSIX single-quote escaping + std::system(). On Windows this won't work (cmd.exe quoting is different). Fine if the L3 tests are POSIX-only, but worth a comment or a #ifdef _WIN32 skip.

8. Parakeet patches in this PR

The diff includes 4 new files under parakeet-cpp/patches/ (+659 lines). These appear to have come through the merge commit with master. They're unrelated to the multilingual feature — just flagging in case the intent was a clean feature-only diff for review purposes. If they were already reviewed and merged via another PR, no action needed.


Summary

The core multilingual implementation is solid — the MeCab/Cangjie architecture, runtime path resolution with env-var fallbacks, the tiered test matrix, and the zh exclusion are all well-thought-out. The three blockers above are focused on thread safety and build hygiene; they should be straightforward to fix. Happy to re-review once those land.

Blockers:
- mtl_tokenizer: guard global MeCab/Cangjie path strings + initialized
  flags with a mutex. The std::call_once flags only serialized tagger/
  table init; the path globals were read by resolve_*()/written by
  set_*() unsynchronized, a data race when two Engines are constructed
  on different threads. set_*/resolve_* and the in-call_once flag write
  now take path_mutex(); the flag write is scoped so it releases before
  resolve_* re-locks (no deadlock on the non-recursive mutex).
- CMakeLists: TTS_CPP_HAS_MECAB compile definition PUBLIC -> PRIVATE.
- CMakeLists: MECAB_LIBRARY link PUBLIC -> PRIVATE. MeCab is an impl
  detail of text_preprocess.cpp; no public header exposes its types or
  symbols, so consumers of tts-cpp no longer inherit a libmecab dep.

Medium:
- build_cangjie_tsv.py: default ref master -> pinned commit
  b76ed7fd4b8365529c056096b6a3a2ba749f5a40 for reproducible builds
  (Jackchows/Cangjie5 has no tags); master still available via --tag.
- text_preprocess.h: drop <fstream> (-> <iosfwd>) and take the private
  read_tsv_entries() by std::istream& instead of std::ifstream& so the
  header no longer pulls <fstream> into every consumer TU.

Nits:
- test_multilingual_asr: CER denominator max(ref,hyp) -> ref_len so the
  metric matches the standard definition and stays comparable with
  published benchmarks.
- test_multilingual_asr: document that sh_quote()/run_asr() are POSIX-only.
Copy link
Copy Markdown

@ogad-tether ogad-tether left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — omar.gad@tether.io

Commit 63fc0c93 addresses all 7 items from my first review. Verified each fix:

# Issue Status
1 Data race on global path strings path_mutex() guards all reads/writes
2 TTS_CPP_HAS_MECAB PUBLIC → PRIVATE
3 MeCab link PUBLIC → PRIVATE
4 build_cangjie_tsv.py unpinned default ✅ Pinned to b76ed7fd
5 text_preprocess.h includes <fstream> ✅ Now <iosfwd> + std::istream &
6 CER denominator non-standard ✅ Now exp_n.size()
7 sh_quote POSIX-only undocumented ✅ Comment added

One remaining nit (non-blocking)

g_mecab_initialized is set before the tagger is actually initialized — misleading warning in a narrow race window

In mecab_tagger():

std::call_once(once, [] {
    {
        std::lock_guard<std::mutex> lk(path_mutex());
        g_mecab_initialized = true;   // ← set HERE
    }
    const std::string path = resolve_mecab_dict_path();  // ← but init happens AFTER
    ...
    t->load(path);
});

If Thread B calls set_mecab_dict_path("new_path") in the window between the flag being set and resolve_mecab_dict_path() being called:

  • Thread B sees g_mecab_initialized == true, prints "new path will be ignored"
  • Thread B then sets g_mecab_dict_path = "new_path"
  • Thread A calls resolve_mecab_dict_path(), reads the new path, and uses it

So the warning says "will be ignored" but the path IS used. The fix is to move g_mecab_initialized = true (and the Cangjie equivalent) to the end of the call_once lambda, after t->load() completes. That way:

  • set_*() before any mecab_tagger() call → sets path, no warning, tagger uses it ✅
  • set_*() after mecab_tagger() completes → flag is true, warning fires, path genuinely ignored ✅
  • set_*() during call_once (inherently racy by the caller) → no flag yet, no misleading warning ✅

Same pattern for cangjie_table().

This is a very narrow window and the practical impact is a spurious warning message (not UB), so it's non-blocking — but it's a quick one-line move that makes the semantics correct.


Verdict

Everything else looks good. The mutex pattern is correct (no deadlocks — the scoped lock releases before resolve_*() re-acquires, and std::call_once prevents re-entrant init). The thread_join_guard RAII is correct (safe on default-constructed thread). The 2-token early-stop with kMtlMinTokensBeforeCadence = 60 and resize(n-1) is correct. The Cangjie strip_trailing_tsv_columns hardening handles annotated lines properly.

Ship it once the flag-ordering nit above is addressed (or documented as intentional).

The g_mecab_initialized / g_cangjie_initialized flags were set at the
start of the call_once lambda, before resolve_*_path() + load() ran. In
the narrow window between setting the flag and resolving the path, a
concurrent set_*_path() would see the flag as true and warn that the new
path "will be ignored" -- yet resolve_*_path() could still pick that path
up and use it, making the warning incorrect.

Move the flag assignment to the end of the lambda, after the load attempt
completes, so the "ignored" warning only fires when the path is genuinely
too late to take effect. Also restructure to move the unique_ptr into the
static only on success instead of constructing-then-resetting on failure.

Addresses the remaining non-blocking nit from ogad-tether's re-review on
PR #19.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants