Merge supertonic_optimizations into master — QVAC-18605 rounds 1-13 + master reconcile #31

Open
ogad-tether wants to merge 116 commits into master from supertonic_optimizations

@ogad-tether ogad-tether commented May 22, 2026

Summary

  • Merge of QVAC-18605 supertonic Vulkan optimisation rounds 1-13 onto current master, reconciling with master's ggml-backend registry refactor + Android GGML_BACKEND_DL=ON dynamic-loader path.
  • Extends tts_cpp::detail::init_gpu_backend() with an optional vulkan_device arg (0 = first adapter, N > 0 = explicit index, -1 = free-VRAM auto-pick with UMA bias) so the round-3 / round-12 Vulkan device-selection policy survives master's registry-only refactor without bringing back direct ggml_backend_vk_* calls. Implemented via the public registry APIs (ggml_backend_dev_memory + ggml_backend_dev_type) so it works in both GGML_BACKEND_DL=ON and =OFF builds. Default value is 0, so chatterbox / s3gen / parakeet call sites are unaffected.
  • GGML_USE_VULKAN compile define is re-enabled on tts-cpp-backend-defs only when GGML_VULKAN AND NOT GGML_BACKEND_DL — the supertonic optimisation paths (F16 K/V flash-attention, pinned-host upload buffers, ggml_backend_vk_host_buffer_type() per-step uploads, backend_name() device-description annotation) call direct ggml-vulkan symbols that are only linkable when Vulkan is statically linked. On the Android DL build those paths fall back to the registry-walked non-Vulkan code, matching master's design intent.
  • Fixes a pre-existing build issue in master where 8 test executables (test-voice-features / -resample / -voice-encoder / -fbank / -voice-embedding / -s3tokenizer / -streaming / -cpu-caches) compile internal sources directly without linking libtts-cpp.a, leaving them with undefined references to init_gpu_backend / init_cpu_backend. Adds src/backend_selection.cpp to each.
  • Fixes a pre-existing bit-exactness bug in QVAC-18605 cache work (commits ccec5924, round 10/12): leaf input tensors uploaded once at build / once per synth via the round-10 upload-skip tracker had their backend buffers released by ggml-alloc's free pass once their last consumer in the graph ran. On the second compute pass through the same cache, intermediates aliased into the freed offsets and silently overwrote the "stable" upload. Fix is to mark each affected tensor as INPUT + OUTPUT (the OUTPUT flag is what gallocr's ggml_gallocr_free_node checks before releasing). Affects: relpos attention masks, per-group RoPE cos/sin tables, front-block RoPE cos/sin tables, style_v_in / kctx_in in build_res_style_qkv_cache, text_in_t in supertonic_vector_trace_proj_ggml, and text_in in build_group_graph_cache.

Conflict resolution notes

Three conflict files, all in tts-cpp/supertonic_*:

| File | Resolution |
| --- | --- |
| include/tts-cpp/supertonic/engine.h | Kept both sets of EngineOptions fields (HEAD's precision/f16_attn/vulkan_device/f16_weights/kv_attn_type/vulkan_env_overrides + master's backends_dir/opencl_cache_dir). |
| src/supertonic_engine.cpp | Combined: backends_dir/opencl_cache_dir setters → precision mapping → apply_vulkan_env_overrides() → load_supertonic_gguf() with HEAD's extra args. Order matters: all setters must precede init_supertonic_backend(). |
| src/supertonic_gguf.cpp | Replaced HEAD's hand-rolled #ifdef GGML_USE_VULKAN cascade with delegation to tts_cpp::detail::init_gpu_backend(), threading vulkan_device through. Kept convert_supertonic_tensor_data (HEAD-only addition). |

Test plan

  • cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF configures cleanly (bundled qvac-ext-ggml@speech pin 60a172e)
  • cmake --build build -j builds 100% clean (library + supertonic-cli + tts-cli + all unit/integration test binaries on macOS arm64 + Metal)
  • ctest -L unit -j 4: 25/25 passing, including every QVAC-18605 logic harness: test-supertonic-vulkan-device-select, vulkan-env-overrides, kv-attn-type (+ -api), capability-cache, pinned-host-buffer, text-encoder-gpu-bridge, upload-skip-tracker, voice-host-cache, f16-deny-list-api, f16-attn-parity, warm-up-api, input-scratchpad, backend-dispatch, portable-ops, vulkan-dispatch, in-graph-transpose, graph-to-graph-blit, rope-in-graph, rope-packed-qk, profile-csv, convnext-block-fused
  • ctest -L fixture: 16/16 passing (supertonic-ref-quick fixture, pointed via -DTTS_CPP_TEST_MODEL_DIR + -DTTS_CPP_TEST_REF_DIR). This includes test-supertonic-pipeline (end-to-end vs ONNX reference WAV, max_abs_err = 1.1e-04 against 1e-3 threshold), test-supertonic-graph-rewrites (F3/F8/F11 5/5), and test-supertonic-audit3-caches (F17/F18/F19 8/8)
  • End-to-end Supertonic synthesis via supertonic-cli against supertonic2.gguf on Metal: 8.15 s of 44.1 kHz mono PCM produced; the Metal pipeline log shows the QVAC custom kernels (e.g. kernel_supertonic_edge_pad_1d_f32) compiling and running. WAV length matches the ONNX reference to within 2 samples (136 970 vs 136 972 samples, attributed to EOF rounding).
  • supertonic-bench on Apple M-series + Metal: 43.5× realtime (RTF 0.023, median over 3 runs). All QVAC-18605 auto-policies engaged: f16_attn=on / f16_weights=on / native_leaky_relu=on / kv_attn_type=f16 / q8_0_kv_attn=available / bf16_kv_attn=available.
  • Vulkan validation on a multi-adapter desktop (auto-pick + UMA bias path). The macOS reviewer environment can't exercise the Vulkan branch; recommend a desktop reviewer with > 1 Vulkan adapter run supertonic-cli --n-gpu-layers 99 --vulkan-device -1 --vulkan-perf-logger against supertonic2.gguf and confirm the auto-pick log line + steady-state perf numbers match round 12.
  • Android GGML_BACKEND_DL=ON smoke test. The merge accepts that the supertonic Vulkan-specific code paths compile out under GGML_USE_VULKAN-disabled (Android DL); registry-walked fallback should remain functional. Recommend a smoke test on a Snapdragon / non-Apple Android target before tagging.
  • Test fixture regeneration for chatterbox harnesses. The 14 fixture tests that need chatterbox-s3gen.gguf / chatterbox-t3-mtl.gguf / s3gen-ref/ / streaming-ref/ / t3-mtl-ref/ etc. are still auto-disabled because those fixtures aren't shipped in-tree. Out of scope for this PR but worth tracking.

Known follow-ups (not blocking merge)

  • supertonic_engine.cpp:backend_name() Vulkan device-description annotation is inert under GGML_BACKEND_DL=ON (depends on ggml_backend_vk_get_device_description). Cheap fix: route through ggml_backend_dev_description(ggml_backend_get_device(backend)).
  • supertonic_gguf.cpp:backend_supports_pinned_host_buffer_uncached and the F16-KV flash-attn capability probes similarly use direct ggml-vulkan entries. Same registry-API fix would let those optimisations stay active on Android DL too.
  • A trailing GGML_ASSERT([rsets->data count] == 0) fires on Metal device shutdown at process exit (post-synth, doesn't affect output). Tracked separately; appears to live in qvac-ext-ggml@speech (ggml-metal-device.m:612), not in this merge.
  • Audit the remaining 142 ggml_set_input call sites in tts-cpp for the same cache-state-leak pattern. Only sites with constant inputs OR upload-skip trackers are at risk; no other tests are failing today, so any latent same-shape bugs there don't surface in the current harness.

🤖 Generated with Claude Code

nik and others added 30 commits November 10, 2025 13:02
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
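The per-decoder seeding rule in the bullets above can be illustrated with a minimal sketch. The helper names `decoder_seed` and `make_decoder_rngs` are hypothetical, and using `std::mt19937` as the per-decoder generator is an assumption made for illustration; only the `seed + decoder_index` rule comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: derive a per-decoder seed from the base seed.
// Unsigned addition wraps around, so this is well defined for any input.
inline std::uint32_t decoder_seed(std::uint32_t base_seed, std::uint32_t decoder_index) {
    return base_seed + decoder_index;
}

// Hypothetical helper: build one independently seeded RNG per decoder.
inline std::vector<std::mt19937> make_decoder_rngs(std::uint32_t base_seed, std::size_t n_decoders) {
    assert(n_decoders > 0);
    std::vector<std::mt19937> rngs;
    rngs.reserve(n_decoders);
    for (std::size_t i = 0; i < n_decoders; ++i) {
        rngs.emplace_back(decoder_seed(base_seed, static_cast<std::uint32_t>(i)));
    }
    return rngs;
}
```

Two calls with the same base seed produce generators that emit the same sequence per decoder, which is what makes sampling at temperature > 0 repeatable, while distinct decoder indices keep the decoders from drawing identical samples.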
QVAC-7457: Add seed parameter for reproducible sampling
DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models
to use a non-standard first convolution kernel size. Standard
whisper models default to kernel size 3.

Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window
  for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does
  not support custom F32 masks); flash attention remains enabled for
  non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI
  models

Backward-compatible: n_audio_window_size defaults to 0 and
n_audio_last_window_layer defaults to -1, disabling windowed
attention entirely for standard whisper models.

Made-with: Cursor
Address review feedback:

1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel,
   n_audio_window_size, n_audio_last_window_layer) behind a
   n_mels > 256 check. Standard whisper models have n_mels <= 128
   and do not contain these fields — reading them unconditionally
   would corrupt the file position and break model loading.

2. Add explicit is_bci flag to hparams struct, set when BCI fields
   are detected during loading.

3. Use is_bci flag (instead of n_audio_window_size > 0) to guard
   the BCI-specific decoder SOS token initialization.

4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and
window_size, both fixed after model load. Move the O(n_ctx^2)
computation from whisper_encode_internal (called every encode)
to whisper_init_state (called once). The encode path now just
copies the precomputed data to the graph tensor.
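As a rough sketch of the O(n_ctx^2) fill that this commit moves to init time: the free-function signature below is hypothetical, and the visibility rule (position j is visible from i when |i - j| <= window_size, encoded as 0 versus negative infinity) is an assumption, since the commit message does not spell out the exact window convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical sketch: row-major n_ctx x n_ctx additive attention mask.
// ASSUMPTION: j is visible from i when |i - j| <= window_size; the real
// half-width convention may differ.
inline std::vector<float> compute_window_mask(int n_ctx, int window_size) {
    assert(n_ctx > 0 && window_size >= 0);
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(static_cast<std::size_t>(n_ctx) * n_ctx, neg_inf);
    for (int i = 0; i < n_ctx; ++i) {
        for (int j = 0; j < n_ctx; ++j) {
            if (std::abs(i - j) <= window_size) {
                mask[static_cast<std::size_t>(i) * n_ctx + j] = 0.0f; // inside the window
            }
        }
    }
    return mask;
}
```

Because the result depends only on the two integers, computing it once and copying it into the graph tensor on each encode gives the same values as recomputing it every call.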

Made-with: Cursor
…, Threads

1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed
   mask uses hparams.n_audio_ctx, but the graph tensor is sized from
   exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back
   to recomputing the mask at the effective n_ctx when sizes differ,
   preventing a buffer overflow into the smaller tensor.

2. Update whisper.pc.in: the install interface was changed to
   include/whisper but the pkg-config includedir still pointed to
   include/. Consumers using pkg-config would not find whisper.h.

3. Fix whisper-config.cmake.in: the whisper target publicly links
   Threads::Threads but find_dependency(Threads) was skipped on
   Windows, leaving downstream find_package(whisper) with an
   unresolved imported target. Now always resolve Threads.
…ash attention

1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the
   default n_audio_ctx, the window mask is now recomputed once and
   cached in wstate (keyed on window_mask_n_ctx) instead of allocating
   a new std::vector on every whisper_encode_internal call.

2. Per-layer flash attention: layers above last_window_layer no longer
   need the windowed attention mask. The flash attention path is now
   used for those layers even when BCI windowed attention is active,
   instead of globally falling back to the softmax path for the entire
   encoder.

3. Use std::abs instead of C abs in both init-time and encode-time
   mask computation paths.
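The "recompute once and cache, keyed on window_mask_n_ctx" idea from item 1 could look roughly like the following. The `WindowMaskCache` struct and its `rebuilds` counter are hypothetical and exist only for this sketch; the mask rule is the same assumed |i - j| <= window_size convention, and window_size is treated as fixed after model load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical sketch of a mask cache keyed on the effective n_ctx.
struct WindowMaskCache {
    int                window_mask_n_ctx = -1; // n_ctx the cached mask was built for (-1 = empty)
    std::vector<float> data;                   // row-major n_ctx x n_ctx mask
    int                rebuilds = 0;           // instrumentation for this sketch only

    const std::vector<float>& get(int n_ctx, int window_size) {
        assert(n_ctx > 0 && window_size >= 0);
        if (n_ctx != window_mask_n_ctx) {      // only rebuild when the size changed
            const float neg_inf = -std::numeric_limits<float>::infinity();
            data.assign(static_cast<std::size_t>(n_ctx) * n_ctx, neg_inf);
            for (int i = 0; i < n_ctx; ++i)
                for (int j = 0; j < n_ctx; ++j)
                    if (std::abs(i - j) <= window_size)
                        data[static_cast<std::size_t>(i) * n_ctx + j] = 0.0f;
            window_mask_n_ctx = n_ctx;
            ++rebuilds;
        }
        return data;
    }
};
```

Repeated encodes at the same overridden context size hit the cached vector, so the per-call allocation described in the commit disappears; a different size triggers exactly one rebuild.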
…alidation

1. Extract compute_window_mask() helper on whisper_state to eliminate
   the duplicated O(n_ctx^2) mask fill loop that appeared in both
   whisper_init_state and whisper_encode_internal. Both call sites
   now use the single helper, preventing future drift.

2. Guard the encode-time mask block with hparams.is_bci before doing
   the ggml_graph_get_tensor lookup. Cheaper and more explicit than
   relying on the tensor name string to determine whether BCI
   windowed attention is active.

3. Add hparams.is_bci to the graph builder guard for window_mask
   tensor creation, aligning it with the other BCI code paths.

4. Add validation for BCI hparams after reading from file:
   n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0.
   Log an error and return false on invalid values instead of
   proceeding with garbage.

5. Add comment explaining the n_mels > 256 threshold used to
   discriminate BCI models from standard whisper models, and noting
   that a dedicated file-format marker should be introduced if this
   assumption ever breaks.

Made-with: Cursor
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
Two latent bugs surfaced together when whisper.cpp is built with
-DWHISPER_COREML=ON, both reproducible at CMake configure time:

1. install(TARGETS whisper.coreml) did not join the whisper-targets
   export set. Since whisper PRIVATE-links to whisper.coreml and is
   itself in whisper-targets, CMake refuses to generate with
       install(EXPORT "whisper-targets" ...) includes target "whisper"
       which requires target "whisper.coreml" that is not in any
       export set.
   Add EXPORT whisper-targets to the install (must come before LIBRARY
   in CMake's install(TARGETS ...) signature).

2. Once whisper.coreml is in the export set, its PUBLIC include dirs
   are validated against the install interface. The current "."
   include dir is a raw source-tree path with no
   $<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses
   with
       INTERFACE_INCLUDE_DIRECTORIES property contains path "..."
       which is prefixed in the source directory.
   The headers under coreml/ are internal implementation details only
   consumed by whisper.cpp (in the same directory), so the correct fix
   is to mark them PRIVATE rather than wrapping them in install/build
   generator expressions.

Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure
clean, whisper.coreml + libwhisper.dylib build end-to-end.

This unblocks the ios-xcode-build CI job on PR #12.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
…review)

Address @gianni-cor review on PR #11: switch the bundled ggml filename
prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech
stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a
single ggml file set instead of each library shipping its own copy.

  - parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`,
    GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`,
    option blurb + status message updated.
  - parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh,
    patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example
    updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure
prints `bundled ggml libraries will be emitted as libspeech-ggml-*`;
build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib;
parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
The bindings-java tests testGetDefaultFullParams_Greedy /
testGetDefaultFullParams_BeamSearch on PR #12 fail with

    expected: <5> but was: <0>     (greedy.best_of)
    expected: <5> but was: <-1>    (beam_search.beam_size)

while whisper_full_default_params() still returns 5 for both — the
actual transcription test (testFullTranscribe) produces correct text.

Diagnosis: the Java JNA WhisperFullParams Structure is missing fields
that exist in the C whisper_full_params struct, so JNA computes wrong
offsets and reads garbage at greedy.best_of / beam_search.beam_size.

Specifically the Java layout was missing:

  1. int32_t seed           — added by tetherto's local seed patch
                              between no_speech_thold and greedy
                              (include/whisper.h:553). This single
                              omission shifts every subsequent field
                              by 4 bytes and is the proximate cause of
                              both failing assertions.
  2. bool vad               — added by upstream
  3. const char * vad_model_path
  4. whisper_vad_params vad_params (struct)

Fix:

* New WhisperVadParams.java JNA Structure mirroring
  whisper_vad_params {threshold, min_speech_duration_ms,
  min_silence_duration_ms, max_speech_duration_s, speech_pad_ms,
  samples_overlap}.
* Add `public int seed`, `public CBool vad`, `public String
  vad_model_path`, `public WhisperVadParams vad_params` fields and
  thread them into getFieldOrder() at the matching positions.

Field order in WhisperFullParams.getFieldOrder() now matches the C
struct in include/whisper.h field-for-field, so JNA-computed offsets
agree with the native side.
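The offset shift behind the diagnosis can be reproduced in a few lines of C++. The structs here are heavily reduced, hypothetical stand-ins for whisper_full_params (only three fields are modelled), so this is a sketch of the failure mode rather than the real layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct greedy_params { std::int32_t best_of; };

// Layout a stale binding would assume: no seed field.
struct params_without_seed {
    float         no_speech_thold;
    greedy_params greedy;
};

// Layout the native side has: seed sits between no_speech_thold and greedy.
struct params_with_seed {
    float         no_speech_thold;
    std::int32_t  seed;
    greedy_params greedy;
};

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// How far the missing field moves every later member.
inline std::size_t seed_shift() {
    return offsetof(params_with_seed, greedy) - offsetof(params_without_seed, greedy);
}

// Read an int32 at a byte offset, the way a binding with precomputed offsets would.
inline std::int32_t read_i32_at(const void* base, std::size_t offset) {
    assert(base != nullptr);
    std::int32_t v;
    std::memcpy(&v, static_cast<const unsigned char*>(base) + offset, sizeof v);
    return v;
}
```

Reading `best_of` from a `params_with_seed{0.6f, 0, {5}}` through the stale offset lands on `seed` and yields 0 instead of 5, which mirrors the "expected: <5> but was: <0>" assertion quoted in the commit.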

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 and others added 15 commits May 19, 2026 10:42
… + voice cache threading + round-5 gap

Pure docs / comments change.  No production-logic surface
modified.  CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit`
25 / 25; CPU + Vulkan end-to-end synth produce valid speech
WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR #18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md).
   Adds an explicit "Note on the round 5 gap" section between
   round 4 and round 7 documenting that the round-4 plan
   reserved the name "Round 5 = pinned-host-buffer per-step
   uploads" as a placeholder, that the actual implementation
   was deferred behind round-7's bench observability
   prerequisite, and that it ultimately landed as round 12 #5.
   No code was dropped; round numbers stay contiguous so PR
   descriptions and CI logs match the round labels in this log
   without rebase churn.

2. UMA-bias assumption (supertonic_gguf.cpp —
   resolve_vulkan_device_index).  Adds a long comment in the
   requested == -1 auto-pick branch documenting the assumption
   that is_uma_per_device[i] is sourced from
   ggml_backend_dev_get_props().type and the failure mode when
   a discrete adapter's driver mis-reports its type as _IGPU
   (some Thunderbolt eGPU configs; some ARM SoC dGPU paths).
   Three sub-cases enumerated: (a) discrete-only with
   mis-classification falls through to round-3 all-device
   argmax and still picks discrete by free-VRAM (coincidentally
   correct), (b) mixed UMA-iGPU + mis-classified-discrete picks
   iGPU silently (regression vs. round 3 — operator escape
   hatch: --vulkan-device N is UMA-agnostic and
   --vulkan-perf-logger exposes the choice).  Future-work
   pointer to a "free-VRAM ceiling" heuristic (UMA reports
   system-RAM-scale; a discrete reporting > 256 GB is
   implausible and can be re-classified) tracked in
   aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.

3. voice_host_cache threading model (supertonic_internal.h).
   Tightens the reference-stability docstring from "must NOT
   call clear() while holding the reference" to a full
   thread-safety section explicitly calling out single-threaded
   -per-Engine as the supported model (matches what the iOS
   load/unload race fix 36a2c56 enforces for s3gen).  Explains
   why no internal lock today (cache exists to eliminate per
   -call GPU downloads; internal locking would give back the
   saving) and what a future thread-pool refactor must do
   (external mutex around get_or_load + downstream .data()
   capture, OR switch to a std::shared_mutex-guarded internal
   lock).  Also clarifies the unordered_map guarantee: element
   references survive insert even when the table rehashes;
   only iterators are invalidated.
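The unordered_map guarantee mentioned in item 3 is a standard-library property and can be checked directly. The key and value types below are placeholders chosen for the sketch, not the real voice_host_cache types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Returns true if a reference taken before many inserts still refers to the
// same element (same address, same payload buffer) after the table rehashed.
inline bool reference_survives_rehash() {
    std::unordered_map<std::string, std::vector<float>> cache;
    std::vector<float>& first = cache["voice-0"];
    first.assign(4, 1.0f);
    const std::vector<float>* addr    = &first;
    const float*              payload = first.data();
    const std::size_t buckets_before  = cache.bucket_count();

    for (int i = 1; i <= 10000; ++i) {
        cache["voice-" + std::to_string(i)].assign(4, static_cast<float>(i));
    }
    assert(cache.bucket_count() > buckets_before); // at least one rehash happened

    return addr == &cache.at("voice-0")
        && payload == cache.at("voice-0").data()
        && first[0] == 1.0f;
}
```

This only covers insertion. Calling clear() or erase() on the element does end its lifetime, which is why the docstring still forbids holding the reference across clear().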

Reviewer's fourth ask — "the round-11 fix is redone in PR
#21" — was resolved by the rebase landing in this same branch
state.  After rebasing onto upstream/supertonic_optimizations
(which now contains PR #21's QVAC-18966 narrower 2-site fix),
this branch's round-11 commit is a delta of only the 2
Vulkan-only V-transpose sites needed for round 8's front-block
GPU bridge + round 9's style GPU bridge.  No double-application;
the QVAC-18966 fix is applied exactly once via PR #21 in the
new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
… surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to
nullptr; two test files extended; two doc-comment blocks added).
No production-logic surface modified for existing callers.

Regression status:
- CPU `ctest -L unit`: 25 / 25, 256 individual checks
  (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV
  (rms=285.6, abs_max=4703 on both backends, same seed +
  text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR #18:

1. test_resolver_returns_concrete_only (kv_attn_type).  The
   original exhaustive 5 x 2 x 8 sweep only asserted dt !=
   autoselect, so a typo returning f16 when bf16 was
   requested+supported would pass silently.  Rewritten with a
   second pure-function `expected()` mirror of the resolver's
   matrix; every one of the 80 grid points now CHECKs the
   resolver's return value against the expected concrete
   dtype.  Added cross-contamination spot checks (requesting
   bf16 with f16+q8_0 supported but bf16 NOT supported must
   fall to f32, not silently to f16 or q8_0).  Now 205 checks
   passed in test-supertonic-kv-attn-type.

2. test_cpu_fallback_returns_valid_buffer (input_scratchpad).
   Original only round-tripped x_in (one of two allocated
   tensors).  Now round-trips BOTH x_in and temb_in with
   distinct payload patterns (1.0f vs 2.5f), plus a
   cross-aliasing recheck (after writing temb_in, x_in must
   still read back its original 1.0f) — a binding-collision
   bug where both tensors share memory would now fail this
   check.

3. resolve_kv_attn_type silent fallback on explicit operator
   request.  Added optional `bool * out_was_downgraded` output
   parameter to the resolver — set to true IFF the operator
   explicitly requested f16/bf16/q8_0 AND the corresponding
   backend probe returned false AND we therefore returned f32.
   The auto path (-1) leaves the flag false (no operator
   surprise — auto-policy is doing its job).  Engine ctor +
   supertonic-bench wired to emit a one-line
   `fprintf(stderr, "warning: requested --kv-attn-type %s but
   the resolved backend's flash-attn probe rejected it;
   falling back to f32 (set --kv-attn-type auto to silence)")`
   on a downgrade.  Defaulted nullptr keeps the pure-logic
   unit tests stderr-clean.  New test_downgrade_flag_signal
   pins the contract on every relevant path (auto + missing
   probe -> flag false; explicit + matching probe -> flag
   false; explicit + missing probe -> flag true; nullptr out-
   ptr safe).

4. test_uma_aware_tiebreak_equal_vram_discretes
   (vulkan_device_select).  Added a dedicated UMA-bias-active
   test case: two discrete cards with EQUAL VRAM (32 GB each)
   alongside a UMA iGPU.  Pins three sub-cases: interleaved
   UMA in the middle, adjacent discretes with no UMA, three-
   way all-discrete tie.  Lower index wins in every case.
   The existing test 11's second CHECK already covered the
   interleaved-UMA case; this hoists the contract into its
   own named test so a future refactor reading the test
   names knows the tiebreak case is pinned.

5. cached_backend_capabilities UaF risk under test-only
   clear().  Added a long comment on the function documenting
   the four invariants:
   (a) production callers may hold the returned ref across
       subsequent calls for OTHER backends (unordered_map's
       insert-doesn't-invalidate-references guarantee);
   (b) production callers MUST NOT keep the ref alive across
       a clear() call (test code's responsibility);
   (c) multi-threaded callers must externally synchronise
       deref vs. clear (the cache's lock protects map
       structure, NOT element lifetime);
   (d) if a future refactor adds a production-reachable
       erase / clear path, this function must switch to
       return-by-value or std::shared_ptr<const T>.
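The downgrade-flag contract from item 3 can be sketched as a pure function. The enum, the `Probes` struct, and this exact signature are hypothetical; in particular the auto-policy shown (prefer f16 when its probe passes, else f32) is an assumption, because the commit message does not describe what auto resolves to.

```cpp
#include <cassert>

enum class KvType { Auto, F32, F16, BF16, Q8_0 }; // hypothetical names

struct Probes { bool f16 = false, bf16 = false, q8_0 = false; };

inline bool probe_ok(KvType t, const Probes& p) {
    switch (t) {
        case KvType::F16:  return p.f16;
        case KvType::BF16: return p.bf16;
        case KvType::Q8_0: return p.q8_0;
        default:           return false;
    }
}

// Always returns a concrete dtype. The flag is set only when an explicit
// f16/bf16/q8_0 request was rejected by its probe and f32 was returned instead.
inline KvType resolve_kv_attn_type(KvType requested, const Probes& p,
                                   bool* out_was_downgraded = nullptr) {
    if (out_was_downgraded) *out_was_downgraded = false;
    KvType result = KvType::F32;
    if (requested == KvType::Auto) {
        result = p.f16 ? KvType::F16 : KvType::F32; // ASSUMED auto-policy
    } else if (requested == KvType::F32 || probe_ok(requested, p)) {
        result = requested;
    } else if (out_was_downgraded) {
        *out_was_downgraded = true;                  // explicit request, missing probe
    }
    assert(result != KvType::Auto);                  // resolver returns concrete types only
    return result;
}
```

A caller such as an engine constructor would pass a pointer and print the warning when the flag comes back true, while unit tests can pass nothing and stay quiet.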

Co-authored-by: Cursor <cursoragent@cursor.com>
…605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic

Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic
…ew-comments

parakeet-cpp: address PR #22 AOSC v2.1 review comments
g_s3gen_cache_refcount (line ~189) is declared as
`static std::atomic<int>` and is later used with `.fetch_add()`,
`.fetch_sub()`, `.store()`, but the translation unit only pulls in
ggml/cstring/stl headers — never `<atomic>` directly. libstdc++
happens to expose `std::atomic` transitively via `<mutex>` on most
hosts so the build appears clean, but on the ggml-speech sync path
where header transitivity changes (the qvac-ext-ggml@speech merge
of ggml-org v0.10.2 cuts a few of those transitive paths) the
translation unit fails with:

    chatterbox_tts.cpp:189: variable `std::atomic<int>
    g_s3gen_cache_refcount' has initializer but incomplete type

Reproduces on the pre-merge speech HEAD too -- it was previously
hidden by header transitivity. Add `#include <atomic>` explicitly.
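A reduced illustration of the pattern, with a hypothetical counter standing in for g_s3gen_cache_refcount: the translation unit includes `<atomic>` itself instead of relying on another header to pull it in.

```cpp
#include <atomic>  // included directly; do not rely on <mutex> exposing std::atomic
#include <cassert>

// Hypothetical stand-in for the real refcount variable.
static std::atomic<int> g_cache_refcount{0};

// Returns the refcount after taking a reference.
inline int cache_acquire() {
    return g_cache_refcount.fetch_add(1) + 1;
}

// Returns the refcount after dropping a reference.
inline int cache_release() {
    const int prev = g_cache_refcount.fetch_sub(1);
    assert(prev > 0); // releasing more often than acquiring is a bug
    return prev - 1;
}
```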

Verified by a clean rebuild of tts-cpp against an `-DBUILD_SHARED_LIBS=ON`
install of qvac-ext-ggml@speech HEAD (45dbdecd, day-2 ggml-speech).

Co-authored-by: Cursor <cursoragent@cursor.com>
When vcpkg's arm64-android triplet forces VCPKG_LIBRARY_LINKAGE=static
(=> BUILD_SHARED_LIBS=OFF) the bundled ggml unconditionally aborts at
CMake configure time with:

    FATAL_ERROR: GGML_BACKEND_DL requires BUILD_SHARED_LIBS

even though the static dispatcher + MODULE backend .so files combo
actually works: the dispatcher just needs PIC (already gated by the
same BUILD_SHARED_LIBS branch below) so it can be dlsym'd from the
MODULE-built backend libraries.

Three guards changed from `BUILD_SHARED_LIBS` to
`BUILD_SHARED_LIBS OR GGML_BACKEND_DL` (FATAL_ERROR removed,
GGML_BACKEND_BUILD/SHARED defs on each backend, PIC + GGML_BUILD on
the core targets), so the Android dynamic-backend recipe used by
qvac-registry-vcpkg's whisper-cpp port (-DGGML_BACKEND_DL=ON
-DGGML_CPU_ALL_VARIANTS=ON -DGGML_CPU_REPACK=ON) now configures.

Mirrors the equivalent change carried in qvac-ext-ggml@speech for the
parallel speech-stack consumers (parakeet-cpp / tts-cpp).

Validated by an NDK r29 cross-compile of bundled ggml + whisper.cpp
with the flags above (all 7 per-arch libggml-cpu-android_armv*_*.so
produced clean).

Co-authored-by: Cursor <cursoragent@cursor.com>
Android app packaging keeps native libraries compressed inside the APK
with no on-disk directory to scan (AGP's `useLegacyPackaging=false`
default since 3.6). The directory-iterator pass in
`ggml_backend_load_best` therefore finds nothing on Android and the
existing per-search_path `fs::exists` filename fallback also returns
false, leaving the loader to return nullptr and the consumer to fail
`init_cpu_backend()`.

For backends that ship as a single library (Vulkan / OpenCL / ...)
the bare `lib<prefix>ggml-<name>.so` filename is enough to resolve
via Android's in-APK linker lookup, but with
`GGML_CPU_ALL_VARIANTS=ON` (the qvac-registry-vcpkg whisper-cpp port
default for Android per QVAC-18993) the CPU backend ships only as
per-arch variants -- there is no plain `libggml-cpu.so` for the
fallback to compose, so the CPU backend silently never registers.

Enumerate the known per-arch Android variants as additional candidate
names for the "cpu" backend and run each through the standard
`ggml_backend_score` selection so the device's HWCAP picks the right
tier (armv8.0 baseline through armv9.2_2; matches the variants list
emitted by `ggml_add_cpu_backend_variant()` in ggml/src/CMakeLists.txt
around lines 410-416).

Fast-path for the size-1 candidate case (every backend on every
non-Android platform, plus Vulkan / OpenCL / Metal / ... on Android):
single load_backend call, identical cost to the previous code path.
The score-then-reload loop only runs when there's an actual choice
to make.
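The candidate-selection shape described above (single-candidate fast path, otherwise score every candidate and keep the best) might be sketched like this. `pick_backend_library` and the scoring callback are hypothetical; the real loader works on loaded library handles and `ggml_backend_score`, which this sketch does not model.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch. score(name) is assumed to return 0 when the variant is
// unusable on this device and a larger value for a better match.
inline std::string pick_backend_library(
        const std::vector<std::string>& candidates,
        const std::function<int(const std::string&)>& score) {
    assert(!candidates.empty());
    if (candidates.size() == 1) {
        return candidates.front();  // fast path: nothing to choose, skip scoring
    }
    std::string best;
    int best_score = 0;
    for (const auto& name : candidates) {
        const int s = score(name);
        if (s > best_score) {       // strict '>' keeps the earlier candidate on ties
            best_score = s;
            best = name;
        }
    }
    return best;                    // empty when no candidate scored above 0
}
```

With one candidate the scoring callback is never invoked, which matches the claim that single-library backends pay the same cost as before.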

Mirrors qvac-ext-ggml@speech commit 9562ed04 ("ggml-backend: android
per-arch CPU variant dlopen fallback", @GustavoA1604, PR #11). Carried
here as a separate commit on top of the v1.8.4.3 upstream-sync branch
so the whisper-cpp vcpkg port can ship Android dynamic-backend mode
without a port-level patch (`patches/0002-...`).

Validated by an NDK r29 cross-compile of bundled ggml + whisper.cpp
with -DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=OFF
-DGGML_CPU_ALL_VARIANTS=ON -DGGML_CPU_REPACK=ON:
  - all 7 per-arch libggml-cpu-android_armv*_*.so produced clean;
  - `strings ggml-backend-reg.cpp.o | grep cpu-android_armv`
    confirms the __ANDROID__ block compiles into the dispatcher
    object.

Co-authored-by: Cursor <cursoragent@cursor.com>
…omic-include

QVAC-18966: tts-cpp — add missing <atomic> include in chatterbox_tts.cpp
…pp-upstream

QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test)
tts-cpp: Add dynamic backend selection for Android
…backend

QVAC-18993: bundled-ggml — Android dynamic backend + per-arch CPU dlopen fallback
Reconcile QVAC-18605 supertonic Vulkan optimisation rounds 1-13 with master's
ggml-backend registry refactor + Android GGML_BACKEND_DL=ON dynamic-loader
path. Three conflict files (all in tts-cpp/supertonic_*):

- supertonic_engine.cpp + engine.h: combine both sets of EngineOptions
  fields (HEAD's precision/f16_attn/vulkan_device/f16_weights/kv_attn_type/
  vulkan_env_overrides + master's backends_dir/opencl_cache_dir). Order the
  ctor body so backends_dir/opencl_cache_dir setters run before the
  precision mapping + apply_vulkan_env_overrides + load_supertonic_gguf
  call, all of which must precede backend init.

- supertonic_gguf.cpp: replace the HEAD hand-rolled CUDA/Vulkan/OpenCL/CPU
  cascade in init_supertonic_backend with delegation to master's
  tts_cpp::detail::init_gpu_backend + init_cpu_backend. The HEAD cascade
  cannot survive: master dropped the GGML_USE_VULKAN/CUDA/OPENCL compile
  defines and the direct ggml_backend_<backend>_init symbols are not
  linkable under GGML_BACKEND_DL=ON (Android). Keep convert_supertonic_
  tensor_data (HEAD-only addition).

To preserve the QVAC-18605 round-3 / round-12 Vulkan device-selection work,
extend tts_cpp::detail::init_gpu_backend with an optional vulkan_device
parameter:
- vulkan_device == 0 (default): first Vulkan adapter, registry order
  (unchanged behaviour for chatterbox/s3gen call sites).
- vulkan_device == N > 0: that index in the Vulkan-only subset, range-
  checked.
- vulkan_device == -1: free-VRAM argmax with UMA bias (excludes integrated
  GPUs whenever a discrete adapter is also visible).
The policy is implemented via the public registry APIs (ggml_backend_dev_
memory + ggml_backend_dev_type) so it works in both DL=ON and DL=OFF builds.
init_supertonic_backend threads opts.vulkan_device through unchanged.
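A minimal sketch of the three-way vulkan_device policy as a pure function over pre-collected adapter info. `VkDeviceInfo` and `resolve_vulkan_device` are hypothetical names, returning -1 for an unsatisfiable request is an assumption, and the real code gathers memory and device type through the registry APIs rather than taking a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct VkDeviceInfo {          // hypothetical: one entry per Vulkan adapter, registry order
    std::size_t free_vram;     // free memory reported for the adapter
    bool        is_uma;        // true for integrated (unified-memory) GPUs
};

// Returns an index into `devs`, or -1 when the request cannot be satisfied (assumed).
inline int resolve_vulkan_device(const std::vector<VkDeviceInfo>& devs, int requested) {
    assert(requested >= -1);
    const int n = static_cast<int>(devs.size());
    if (n == 0) return -1;
    if (requested >= 0) {
        return requested < n ? requested : -1;   // 0 = first adapter, N = range-checked
    }
    // requested == -1: free-VRAM argmax, skipping UMA adapters if any discrete one exists.
    bool any_discrete = false;
    for (const auto& d : devs) any_discrete = any_discrete || !d.is_uma;
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (any_discrete && devs[i].is_uma) continue;
        if (best < 0 || devs[i].free_vram > devs[best].free_vram) {
            best = i;                            // strict '>' so the lower index wins ties
        }
    }
    return best;
}
```

Note how an integrated GPU reporting a large system-RAM-scale figure loses to a smaller discrete card under auto-pick, and how two equal discrete cards resolve to the lower index.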

Restore GGML_USE_VULKAN on tts-cpp-backend-defs only when GGML_VULKAN AND
NOT GGML_BACKEND_DL. The supertonic optimisation paths (F16 K/V flash-
attention, pinned-host upload buffers, backend_name() device-description
annotation) still call direct ggml-vulkan symbols that are only linkable
when Vulkan is statically linked. On the Android DL build those code paths
fall back to the registry-walked non-Vulkan branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es using init_*_backend

Eight test executables compile chatterbox_tts.cpp / mel_extract_stft.cpp /
voice_encoder.cpp / s3tokenizer.cpp / campplus_link_chain directly (rather
than linking libtts-cpp.a) so they can reach internal test-hook entry
points. Master's registry-refactor moved init_gpu_backend / init_cpu_backend
into src/backend_selection.cpp without bumping these test targets, leaving
them with undefined references after the supertonic_optimizations merge
exposed the 4-arg init_gpu_backend overload.

Add src/backend_selection.cpp to each affected test target:
test-voice-features, test-resample, test-voice-encoder, test-fbank,
test-voice-embedding, test-s3tokenizer, test-streaming, test-cpu-caches.

Verified: cmake --build (default) succeeds end-to-end; ctest -L unit
reports 25/25 passing including the QVAC-18605 Vulkan-specific harnesses
(test-supertonic-vulkan-device-select, vulkan-env-overrides, kv-attn-type,
capability-cache, pinned-host-buffer, text-encoder-gpu-bridge, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether ogad-tether requested review from a team as code owners May 22, 2026 14:07
ogad-tether and others added 3 commits May 22, 2026 15:59
…t" inputs

Three pre-existing bit-exactness regressions in the QVAC-18605 cache work
(F8 style-residual cached-graph parity, F18 text-encoder convnext-front
graph cache, F19 vector-estimator front-block cache) shared one root
cause: leaf input tensors uploaded ONLY at build time (because their
contents depend solely on cache-key fields like L / text_len / θ) had
their backend buffers released by ggml-alloc's free pass once their last
consumer in the graph ran. On the second compute pass through the same
cache, intermediates aliased into the freed offsets and silently
overwrote the "stable" upload — every downstream tensor went stale.

The freed-leaf-input behaviour is documented inside ggml-alloc.c:
`ggml_gallocr_free_node` exits early only when the tensor has
`GGML_TENSOR_FLAG_OUTPUT` — the input flag does not extend that
guarantee. Marking each affected tensor as INPUT and OUTPUT keeps its
buffer alive across compute passes, so the one-shot upload at build
remains valid for the cache's full lifetime.
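
The failure mode can be illustrated with a toy model. The sketch below is
NOT ggml: `ToyCache` is a hypothetical two-slot buffer, and its
`leaf_is_output` flag only stands in for `GGML_TENSOR_FLAG_OUTPUT` (which
ggml.h lets callers set with `ggml_set_output`, next to `ggml_set_input`).
It shows why a one-shot upload goes stale on the second pass when the
leaf's slot is released after its last consumer.

```cpp
#include <vector>

// Toy model of a cached graph: one backing buffer with two float slots.
// Slot 0 holds the "stable" leaf, uploaded once when the cache is built.
struct ToyCache {
    std::vector<float> buf = std::vector<float>(2, 0.0f);
    bool leaf_is_output;  // stands in for GGML_TENSOR_FLAG_OUTPUT on the leaf

    explicit ToyCache(bool keep_leaf) : leaf_is_output(keep_leaf) {
        buf[0] = 3.0f;  // one-shot upload at build time, never repeated
    }

    // One compute pass: y = leaf * x, then tmp = y + 1.
    // If the leaf is not protected, its slot is considered free once y has
    // been computed, so the intermediate tmp is placed on top of it.
    float compute(float x) {
        const float y = buf[0] * x;                   // last consumer of the leaf
        const int tmp_slot = leaf_is_output ? 1 : 0;  // freed slot gets reused
        buf[tmp_slot] = y + 1.0f;
        return buf[tmp_slot];
    }
};
```

In this model an unprotected cache returns 7 on the first pass and 15 on
the second for the same input 2, because the second pass reads the
overwritten leaf. The protected cache returns 7 both times.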

Affected tensors:
- supertonic_text_encoder.cpp:build_relpos_cache — `masks[9]` relpos
  attention masks (9 × L×L floats, encode integer position deltas
  −4..+4).
- supertonic_vector_estimator.cpp:build_group_graph_cache — RoPE
  cos/sin tables (q_cos_in / q_sin_in / k_cos_in / k_sin_in).
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
  front_cache RoPE cos/sin tables (same shape, separate cache).
- supertonic_vector_estimator.cpp:build_res_style_qkv_cache —
  `style_v_in` / `kctx_in`. Both use the F4 pointer-compare upload-
  skip; without OUTPUT the skip preserved a host pointer to a
  backend buffer that gallocr had already released.

Test fallout on tts-cpp/test (with bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick fixture):

  before  test-supertonic-audit3-caches  6/8 checks pass  (F18, F19 fail)
  after   test-supertonic-audit3-caches  8/8 checks pass

  before  test-supertonic-graph-rewrites  4/5 checks pass  (F8 fails)
  after   test-supertonic-graph-rewrites  5/5 checks pass

  fixture suite:  9/16 → 15/16  (only `test-supertonic-pipeline` still
  fails — that's a separate ONNX-vs-GGUF reference drift, not a cache
  bug; the per-stage tests that take ref inputs directly all pass).

  unit suite:  25/25 (unchanged).

Verified on the supertonic_optimizations branch pre-merge (`184c6410`)
that the failures are identical in magnitude — this is a pre-existing
bug in QVAC-18605 rounds 3+ cache work, not a regression from the
master merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peline test mask

Same root cause as the previous F8/F18/F19 fix: leaf input tensors that
the round-10 upload-skip tracker treats as "stable across denoise steps
within one synth" (uploaded only on `current_step == 0`, skipped on
steps 1..N-1) need INPUT + OUTPUT flags so ggml-alloc's free pass doesn't
release the buffer after step 0 and silently corrupt the skipped uploads
on subsequent steps.

Two more affected tensors found by tracing the pipeline parity test's
per-step divergence:

- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
  front_cache.text_in_t  (vector-estimator front-block text input)

- supertonic_vector_estimator.cpp:build_group_graph_cache
  cache.text_in  (vector-estimator group 1/2/3 text input)

Pipeline test (`test-supertonic-pipeline`) per-step max_abs_err:
  before:  step0 1.4e-05, step1 8.5e-01, step2 1.7e+00, … final 3.28e-01
  after:   step0 1.4e-05, step1 3.9e-05, step2 6.8e-05, … final 1.11e-04
The step-by-step error is now pure floating-point round-off
accumulation (on the order of 1e-5 per step), roughly 4 orders of
magnitude below the pre-fix per-step error; the final 1.11e-04 is about
9x under the test's 1e-3 threshold.

Also: align the pipeline test's input prep with the
`dump-supertonic-reference.py` harness — the Python script feeds the
ONNX vector_step a pre-masked input (`xt = noise * latent_mask`) and
the vocoder a pre-masked latent (`vocoder({"latent": xt * latent_mask})`).
For the supertonic-ref-quick fixture the mask is all 1.0 so this is a
no-op today, but a fixture with padded tail latents would otherwise
diverge from the reference at every padded position.
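
A minimal sketch of the pre-masking step, assuming a row-major
[channels x frames] latent and a per-frame mask. The layout and the name
`apply_latent_mask` are assumptions made for illustration, not the test's
actual helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies every channel of a row-major [channels x frames] latent by a
// per-frame mask, mirroring `xt = noise * latent_mask` in the Python harness.
inline std::vector<float> apply_latent_mask(const std::vector<float> & latent,
                                            const std::vector<float> & mask,
                                            std::size_t channels) {
    const std::size_t frames = mask.size();
    assert(latent.size() == channels * frames);
    std::vector<float> out(latent.size());
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t t = 0; t < frames; ++t) {
            out[c * frames + t] = latent[c * frames + t] * mask[t];
        }
    }
    return out;
}
```

With an all-ones mask the output equals the input, which is the no-op case
described above. A mask with a zeroed tail clears the padded frames in
every channel.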

Fixture suite on tts-cpp/build (bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick):

  before:  15/16 fixture tests passing (test-supertonic-pipeline FAIL)
  after:   16/16 fixture tests passing

Unit suite unchanged (25/25).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing for ggml_reshape_2d

CodeQL cpp/integer-multiplication-cast-to-long flagged
`n_heads * head_dim` (both `int`, multiplied as `int` and then implicitly
converted to `int64_t` for `ggml_reshape_2d`'s shape argument). For
Supertonic's vector-estimator the values are 4 × 64 = 256 so there is
no actual overflow risk today, but a tts-cpp callsite that ever uses
larger n_heads / head_dim would silently truncate. Cast first to make
the multiplication 64-bit. No behaviour change for any current caller.
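
The shape of the fix, as a minimal sketch. `flattened_dim` is a
hypothetical helper, not a function in tts-cpp; the real change is applied
inline at the `ggml_reshape_2d` call site.

```cpp
#include <cassert>
#include <cstdint>

// Returns n_heads * head_dim as a 64-bit value.
// Casting one operand first makes the multiplication itself 64-bit.
// Writing int64_t(n_heads * head_dim) instead would still multiply in
// `int` and only widen the (possibly overflowed) result.
inline std::int64_t flattened_dim(int n_heads, int head_dim) {
    assert(n_heads >= 0 && head_dim >= 0);  // shape dimensions are non-negative
    return static_cast<std::int64_t>(n_heads) * head_dim;
}
```

For the current 4 x 64 case this yields 256, the same value as before; for
operands whose product exceeds the range of `int` it still yields the
exact product.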

Alert was not introduced by this PR (line dates back to the original
tts-cpp add `ef840d5c3`) but surfaces on PR #31 because the surrounding
file was touched. Fixing here keeps the PR's CodeQL gate green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@GustavoA1604 GustavoA1604 left a comment

  1. backend_selection.cpp — missing #include <stdexcept>

The file throws std::runtime_error in 4 places. It compiles on macOS libc++ via a transitive include but fails on libstdc++ (Linux / MSYS2-GCC). One-line fix:

 #include <mutex>
+#include <stdexcept>
 #include <string>
  2. Android GGML_BACKEND_DL=ON must keep the supertonic Vulkan optimisations; please don't ship them gated off

The PR currently lists this as a known follow-up, but Mali / non-Adreno-700+ Snapdragon / Exynos Xclipse are exactly the targets where the round-10 pinned-host-buffer + round-12 F16-KV bandwidth wins matter most; silently turning them off on DL undoes the QVAC-18605 business case on mobile.

Every direct ggml_backend_vk_* call in this PR has a public registry-API equivalent today at the 60a172e4 ggml pin:

  • ggml_backend_is_vk(backend) → strcmp(ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(backend))), "Vulkan") == 0
  • ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(backend))
  • ggml_backend_vk_get_device_description(...) → ggml_backend_dev_description(ggml_backend_get_device(backend))
  • F16-KV / Q8_0-KV / BF16-KV FA capability predicates → build a probe tensor and call ggml_backend_dev_supports_op(dev, op)

Please migrate the four call-site classes in this PR, drop the NOT GGML_BACKEND_DL clause from the GGML_USE_VULKAN define in tts-cpp/CMakeLists.txt:180-181, and add a Snapdragon DL smoke test confirming the round-10 / 12 logs fire on the dynamic-loader build. init_gpu_backend already proves the registry-only pattern works — extending it the rest of the way is mechanical and keeps tts-cpp's source under the same "no direct backend symbols" invariant parakeet-cpp ships today.

#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

  1. backend_selection.cpp — missing `#include <stdexcept>`.  Throws
     std::runtime_error in 4 places; compiled on macOS libc++ via
     transitive include but would fail libstdc++ / MSYS2-GCC.

  2. Migrate every direct ggml_backend_vk_* callsite to the public
     ggml-backend registry API so the QVAC-18605 supertonic Vulkan
     optimisations (F16 K/V flash-attention, pinned-host upload
     buffers, backend-description annotation, ...) stay active on the
     Android GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:

  - ggml_backend_is_vk(b)
      → tts_cpp::detail::backend_is_vulkan(b) — strcmp against
        ggml_backend_reg_name(ggml_backend_dev_backend_reg(
        ggml_backend_get_device(b))).  Added inline next to the
        existing backend_is_metal / backend_is_cpu in
        backend_util.h (mirrors parakeet-cpp's helper module).

  - ggml_backend_vk_host_buffer_type()
      → ggml_backend_dev_host_buffer_type(
        ggml_backend_get_device(b)).  Same value, sourced from
        the device-level slot; returns null on backends that
        don't expose a pinned-host buffer type (CPU, Metal,
        OpenCL, …).  Affects:
          * backend_supports_pinned_host_buffer_uncached
          * try_alloc_inputs_in_pinned_host_buffer

  - ggml_backend_vk_get_device_description(idx, buf, len)
      → ggml_backend_dev_description(
        ggml_backend_get_device(b)).  Same string, no host buf
        round-trip.  Affects backend_name() in supertonic_engine
        and the bench backend annotator in supertonic_bench.

Drop:

  - The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp
    and supertonic_bench.cpp (no longer needed; registry API lives
    in ggml-backend.h).
  - Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all
    paths now compile unconditionally).
  - The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in
    tts-cpp/CMakeLists.txt — no code references it any more.  tts-cpp
    now mirrors parakeet-cpp's "no direct backend symbols" invariant.

The F16/Q8_0/BF16 KV-FA capability probes were already routed through
`ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change
needed there.

Verified on macOS arm64 + Metal:
  - cmake --build completes cleanly
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass
  - supertonic-cli end-to-end synth produces audible WAV
  - The `backend_is_vk` engine field still flips correctly via the
    registry path (bench reports `backend: Vulkan (device N: <name>)`
    on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon
smoke test from a hardware-owning reviewer — `init_gpu_backend`
already proved the registry-only pattern works on DL builds, so this
change extends the same invariant to the remaining four callsite
classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether

@GustavoA1604 thanks for the review — both items addressed in 00eb3f36:

1. #include <stdexcept> in backend_selection.cpp — added alongside the existing <mutex> / <string> imports.

2. Direct ggml_backend_vk_* calls migrated to the registry API. All four callsite classes you flagged are now registry-routed; GGML_USE_VULKAN is gone from tts-cpp/CMakeLists.txt entirely (no source references it any more). Diff: tts-cpp/CMakeLists.txt | 34 +++++++--------------.

Concrete swaps:

  • ggml_backend_is_vk(b) → new tts_cpp::detail::backend_is_vulkan(b) in backend_util.h, parallel to the existing backend_is_metal (parakeet-cpp pattern).
  • ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(b)) (backend_supports_pinned_host_buffer_uncached + try_alloc_inputs_in_pinned_host_buffer).
  • ggml_backend_vk_get_device_description(idx, buf, len) → ggml_backend_dev_description(ggml_backend_get_device(b)) (engine + bench backend annotators).
  • F16/Q8_0/BF16 KV-FA capability probes were already on ggml_backend_supports_op(backend, op) (added in ccec5924), so no change there.

#include "ggml-vulkan.h" is gone from both supertonic_engine.cpp and supertonic_bench.cpp. Every #ifdef GGML_USE_VULKAN guard in tts-cpp source is removed — all paths compile unconditionally now.

Local verification on macOS arm64 + Metal:

  • cmake --build clean
  • ctest -L unit 25/25
  • ctest -L fixture 16/16 (incl. test-supertonic-pipeline end-to-end vs ONNX reference)
  • supertonic-cli end-to-end synth produces an audible 3.0s WAV

Android GGML_BACKEND_DL=ON smoke test on Snapdragon is still flagged as a TODO in the PR body — I don't have hardware here, but the registry-only invariant matches what init_gpu_backend already proved works on DL builds.

Heads-up: branch was DIRTY against master (the v1.8.5 sync + EOU work merged in while this PR was open). Resolving that next, then will re-request review.

Pulls in the master-side activity since PR #31 opened:

  - QVAC-19386: v1.8.5 + sync vendored whisper.cpp + ggml to ggml-org
    upstream (#33).  Bumps whisper version, refreshes the in-tree ggml,
    re-adds tts-cpp from a fresh snapshot of chatterbox.cpp's port.
  - QVAC-19270: parakeet EOU streaming mid-stream-boundary handling.
  - QVAC-19213: Adreno Vulkan fixes (mul_mat_vec subgroup->shmem,
    get_max_size cap scoped to Qualcomm/Adreno).

Conflict resolution (all 24 conflicts were `add/add` because the
merge-base — `4bf733672` `talk-llama : sync llama.cpp` — predates QVAC
adding `tts-cpp/` and `parakeet-cpp/`):

  - tts-cpp/* → kept HEAD (`--ours`).  This branch is the canonical
    home of the QVAC-18605 supertonic Vulkan optimisation rounds 1-13
    + the registry-API migration + the cache-state-leak fixes.  The
    chatterbox.cpp-mirrored fixes that master's `fce9d211 Add tts-cpp
    files` brought in (N1-N7 docstrings, ggml-quants.h fix,
    backend_device() public API) are already present in HEAD's
    starting point and surface as no-op diffs.

  - parakeet-cpp/* → took master (`--theirs`).  Master is the
    canonical home of QVAC-19270 EOU streaming work; this branch has
    no parakeet-cpp changes to defend.

  - .github/CODEOWNERS → took master (team rename to
    `qvac-internal-dev` / `qvac-internal-merge`).

Verified on macOS arm64 + Metal:
  - cmake --build completes cleanly
  - ctest -L unit   → 25/25 pass
  - ctest -L fixture → 16/16 pass (incl. test-supertonic-pipeline
    end-to-end vs ONNX reference, max_abs_err = 1.1e-04 ≪ 1e-3
    threshold)

The branch is now in sync with origin/master at `eabcf6da`; the
mergeStateStatus on PR #31 should flip from DIRTY back to UNSTABLE
(then green, once the pre-existing master CI failures are resolved too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ogad-tether ogad-tether requested a review from GustavoA1604 June 1, 2026 14:34