Merge supertonic_optimizations into master — QVAC-18605 rounds 1-13 + master reconcile#31
ogad-tether wants to merge 116 commits
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
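The "seed + decoder_index" rule above can be sketched as follows. This is a minimal illustration, not the whisper.cpp implementation: the helper name `make_decoder_rngs` and the choice of `std::mt19937` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of whisper.cpp): build one RNG per decoder,
// seeding decoder i with seed + i so that every decoder gets a distinct but
// reproducible random stream.
std::vector<std::mt19937> make_decoder_rngs(int32_t seed, int n_decoders) {
    assert(n_decoders >= 0);
    std::vector<std::mt19937> rngs;
    rngs.reserve(static_cast<std::size_t>(n_decoders));
    for (int i = 0; i < n_decoders; ++i) {
        // Unsigned arithmetic keeps the addition well defined on wrap-around.
        rngs.emplace_back(static_cast<uint32_t>(seed) + static_cast<uint32_t>(i));
    }
    return rngs;
}
```

Two calls with the same base seed yield identical per-decoder streams, which is the property that makes sampling at temperature > 0 repeatable.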
QVAC-7457: Add seed parameter for reproducible sampling
add_codeowners file
added approval check worker
DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI models

Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models.

Made-with: Cursor
Address review feedback:

1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind a n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields; reading them unconditionally would corrupt the file position and break model loading.
2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading.
3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization.
4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor
…, Threads

1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor.
2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h.
3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.
…ash attention

1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call.
2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder.
3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.
…alidation

1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift.
2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active.
3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths.
4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage.
5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks.

Made-with: Cursor
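The mask precompute and the keyed cache described in the commits above might look roughly like this. It is a sketch under stated assumptions: the exact window rule (position i may attend to j when |i - j| <= window_size) and the -inf fill value are guesses at the semantics, and both names carry a `_sketch` suffix because they are not the functions in whisper.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Assumed semantics: entries outside the local window are set to -inf so a
// softmax over the masked scores ignores them; entries inside stay at 0.
std::vector<float> compute_window_mask_sketch(int n_ctx, int window_size) {
    assert(n_ctx >= 0 && window_size >= 0);
    const std::size_t n = static_cast<std::size_t>(n_ctx);
    std::vector<float> mask(n * n, 0.0f);
    for (int i = 0; i < n_ctx; ++i) {
        for (int j = 0; j < n_ctx; ++j) {
            if (std::abs(i - j) > window_size) {
                mask[static_cast<std::size_t>(i) * n + static_cast<std::size_t>(j)] = -INFINITY;
            }
        }
    }
    return mask;
}

// Hypothetical cache: recompute only when the effective context size (or the
// window) differs from what was computed last time.
struct window_mask_cache_sketch {
    int n_ctx = -1;
    int window_size = -1;
    std::vector<float> data;

    const std::vector<float> & get(int n_ctx_eff, int window) {
        if (n_ctx_eff != n_ctx || window != window_size) {
            data        = compute_window_mask_sketch(n_ctx_eff, window);
            n_ctx       = n_ctx_eff;
            window_size = window;
        }
        return data;
    }
};
```

Keying the cache on the effective context size means the O(n_ctx^2) fill runs once per distinct size rather than once per encode call.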
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
Two latent bugs surfaced together when whisper.cpp is built with
-DWHISPER_COREML=ON, both reproducible at CMake configure time:
1. install(TARGETS whisper.coreml) did not join the whisper-targets
export set. Since whisper PRIVATE-links to whisper.coreml and is
itself in whisper-targets, CMake refuses to generate with
install(EXPORT "whisper-targets" ...) includes target "whisper"
which requires target "whisper.coreml" that is not in any
export set.
Add EXPORT whisper-targets to the install (must come before LIBRARY
in CMake's install(TARGETS ...) signature).
2. Once whisper.coreml is in the export set, its PUBLIC include dirs
are validated against the install interface. The current "."
include dir is a raw source-tree path with no
$<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses
with
INTERFACE_INCLUDE_DIRECTORIES property contains path "..."
which is prefixed in the source directory.
The headers under coreml/ are internal implementation details only
consumed by whisper.cpp (in the same directory), so the correct fix
is to mark them PRIVATE rather than wrapping them in install/build
generator expressions.
Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure
clean, whisper.coreml + libwhisper.dylib build end-to-end.
This unblocks the ios-xcode-build CI job on PR #12.
QVAC-18300
Co-authored-by: Cursor <cursoragent@cursor.com>
…review)

Address @gianni-cor review on PR #11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy.

- parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated.
- parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
The bindings-java tests testGetDefaultFullParams_Greedy / testGetDefaultFullParams_BeamSearch on PR #12 fail with

    expected: <5> but was: <0> (greedy.best_of)
    expected: <5> but was: <-1> (beam_search.beam_size)

while whisper_full_default_params() still returns 5 for both; the actual transcription test (testFullTranscribe) produces correct text.

Diagnosis: the Java JNA WhisperFullParams Structure is missing fields that exist in the C whisper_full_params struct, so JNA computes wrong offsets and reads garbage at greedy.best_of / beam_search.beam_size. Specifically the Java layout was missing:

1. int32_t seed: added by tetherto's local seed patch between no_speech_thold and greedy (include/whisper.h:553). This single omission shifts every subsequent field by 4 bytes and is the proximate cause of both failing assertions.
2. bool vad: added by upstream
3. const char * vad_model_path
4. whisper_vad_params vad_params (struct)

Fix:

* New WhisperVadParams.java JNA Structure mirroring whisper_vad_params {threshold, min_speech_duration_ms, min_silence_duration_ms, max_speech_duration_s, speech_pad_ms, samples_overlap}.
* Add `public int seed`, `public CBool vad`, `public String vad_model_path`, `public WhisperVadParams vad_params` fields and thread them into getFieldOrder() at the matching positions.

Field order in WhisperFullParams.getFieldOrder() now matches the C struct in include/whisper.h field-for-field, so JNA-computed offsets agree with the native side.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
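The 4-byte shift from the diagnosis can be reproduced in isolation. The two structs below are toy stand-ins invented for this example (they are not the real whisper_full_params layout); the point is only that a binding which omits one int32_t field reads the wrong bytes for every field after it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy native layout: seed sits between no_speech_thold and best_of.
struct params_native_sketch {
    float   no_speech_thold;
    int32_t seed;
    int32_t best_of;
};

// Toy stale binding layout: identical, but without the seed field.
struct params_binding_sketch {
    float   no_speech_thold;
    int32_t best_of;
};

constexpr std::size_t kNativeOffset  = offsetof(params_native_sketch, best_of);
constexpr std::size_t kBindingOffset = offsetof(params_binding_sketch, best_of);
static_assert(kNativeOffset == kBindingOffset + sizeof(int32_t),
              "omitting one int32_t shifts the next field by 4 bytes");

// Read best_of the way a binding with the stale layout would: at the stale
// offset inside the native struct. This actually returns the seed field.
int32_t read_best_of_with_stale_layout(const params_native_sketch & p) {
    assert(kBindingOffset + sizeof(int32_t) <= sizeof(params_native_sketch));
    int32_t value = 0;
    std::memcpy(&value, reinterpret_cast<const unsigned char *>(&p) + kBindingOffset, sizeof(value));
    return value;
}
```

With seed = 0 and best_of = 5 in the toy native struct, the stale read returns 0, the same shape as the "expected: <5> but was: <0>" failure.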
… + voice cache threading + round-5 gap

Pure docs / comments change. No production-logic surface modified. CPU `ctest -L unit` 25 / 25; Vulkan `ctest -L unit` 25 / 25; CPU + Vulkan end-to-end synth produce valid speech WAVs (99.7% non-zero samples, healthy rms).

Addresses three reviewer asks on PR #18:

1. Round-5 gap explanation (PROGRESS_SUPERTONIC.md). Adds an explicit "Note on the round 5 gap" section between round 4 and round 7 documenting that the round-4 plan reserved the name "Round 5 = pinned-host-buffer per-step uploads" as a placeholder, that the actual implementation was deferred behind round-7's bench observability prerequisite, and that it ultimately landed as round 12 #5. No code was dropped; round numbers stay contiguous so PR descriptions and CI logs match the round labels in this log without rebase churn.
2. UMA-bias assumption (supertonic_gguf.cpp, resolve_vulkan_device_index). Adds a long comment in the requested == -1 auto-pick branch documenting the assumption that is_uma_per_device[i] is sourced from ggml_backend_dev_get_props().type and the failure mode when a discrete adapter's driver mis-reports its type as _IGPU (some Thunderbolt eGPU configs; some ARM SoC dGPU paths). Three sub-cases enumerated: (a) discrete-only with mis-classification falls through to round-3 all-device argmax and still picks discrete by free-VRAM (coincidentally correct), (b) mixed UMA-iGPU + mis-classified-discrete picks iGPU silently (regression vs. round 3; operator escape hatch: --vulkan-device N is UMA-agnostic and --vulkan-perf-logger exposes the choice). Future-work pointer to a "free-VRAM ceiling" heuristic (UMA reports system-RAM-scale; a discrete reporting > 256 GB is implausible and can be re-classified) tracked in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.
3. voice_host_cache threading model (supertonic_internal.h). Tightens the reference-stability docstring from "must NOT call clear() while holding the reference" to a full thread-safety section explicitly calling out single-threaded-per-Engine as the supported model (matches what the iOS load/unload race fix 36a2c56 enforces for s3gen). Explains why no internal lock today (cache exists to eliminate per-call GPU downloads; internal locking would give back the saving) and what a future thread-pool refactor must do (external mutex around get_or_load + downstream .data() capture, OR switch to a std::shared_mutex-guarded internal lock). Also clarifies the unordered_map guarantee: element references survive insert even when the table rehashes; only iterators are invalidated.

Reviewer's fourth ask ("the round-11 fix is redone in PR #21") was resolved by the rebase landing in this same branch state. After rebasing onto upstream/supertonic_optimizations (which now contains PR #21's QVAC-18966 narrower 2-site fix), this branch's round-11 commit is a delta of only the 2 Vulkan-only V-transpose sites needed for round 8's front-block GPU bridge + round 9's style GPU bridge. No double-application; the QVAC-18966 fix is applied exactly once via PR #21 in the new base.

Co-authored-by: Cursor <cursoragent@cursor.com>
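The unordered_map guarantee mentioned in item 3 can be checked with a small stand-alone cache. This is a sketch only: `voice_cache_sketch` and its payload are invented for the example and do not mirror the real voice_host_cache.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical single-threaded cache: get_or_load returns a reference that
// stays valid across later inserts, because std::unordered_map never moves
// its elements on rehash (only iterators are invalidated).
struct voice_cache_sketch {
    std::unordered_map<std::string, std::vector<float>> entries;

    const std::vector<float> & get_or_load(const std::string & key) {
        assert(!key.empty());
        auto it = entries.find(key);
        if (it == entries.end()) {
            // Stand-in for the expensive load (e.g. a GPU download).
            it = entries.emplace(key, std::vector<float>(4, 1.0f)).first;
        }
        return it->second;
    }
};
```

A reference obtained before many further inserts still points at the same element afterwards; calling clear() or erase() on the map would break that, which is why the docstring forbids it while a reference is held.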
… surface explicit-dtype downgrades

Pure additive change (one new resolver out-param defaulting to nullptr; two test files extended; two doc-comment blocks added). No production-logic surface modified for existing callers.

Regression status:

- CPU `ctest -L unit`: 25 / 25, 256 individual checks (was 25 / 25, ~209 checks pre-change).
- Vulkan `ctest -L unit`: 25 / 25.
- CPU + Vulkan end-to-end synth: bit-identical 10.10 s WAV (rms=285.6, abs_max=4703 on both backends, same seed + text), confirming no rounds-1..13 optimisation regressed.

Addresses Omar's five non-blocker findings on PR #18:

1. test_resolver_returns_concrete_only (kv_attn_type). The original exhaustive 5 x 2 x 8 sweep only asserted dt != autoselect, so a typo returning f16 when bf16 was requested+supported would pass silently. Rewritten with a second pure-function `expected()` mirror of the resolver's matrix; every one of the 80 grid points now CHECKs the resolver's return value against the expected concrete dtype. Added cross-contamination spot checks (requesting bf16 with f16+q8_0 supported but bf16 NOT supported must fall to f32, not silently to f16 or q8_0). Now 205 checks passed in test-supertonic-kv-attn-type.
2. test_cpu_fallback_returns_valid_buffer (input_scratchpad). Original only round-tripped x_in (one of two allocated tensors). Now round-trips BOTH x_in and temb_in with distinct payload patterns (1.0f vs 2.5f), plus a cross-aliasing recheck (after writing temb_in, x_in must still read back its original 1.0f); a binding-collision bug where both tensors share memory would now fail this check.
3. resolve_kv_attn_type silent fallback on explicit operator request. Added optional `bool * out_was_downgraded` output parameter to the resolver, set to true IFF the operator explicitly requested f16/bf16/q8_0 AND the corresponding backend probe returned false AND we therefore returned f32. The auto path (-1) leaves the flag false (no operator surprise: auto-policy is doing its job). Engine ctor + supertonic-bench wired to emit a one-line `fprintf(stderr, "warning: requested --kv-attn-type %s but the resolved backend's flash-attn probe rejected it; falling back to f32 (set --kv-attn-type auto to silence)")` on a downgrade. Defaulted nullptr keeps the pure-logic unit tests stderr-clean. New test_downgrade_flag_signal pins the contract on every relevant path (auto + missing probe -> flag false; explicit + matching probe -> flag false; explicit + missing probe -> flag true; nullptr out-ptr safe).
4. test_uma_aware_tiebreak_equal_vram_discretes (vulkan_device_select). Added a dedicated UMA-bias-active test case: two discrete cards with EQUAL VRAM (32 GB each) alongside a UMA iGPU. Pins three sub-cases: interleaved UMA in the middle, adjacent discretes with no UMA, three-way all-discrete tie. Lower index wins in every case. The existing test 11's second CHECK already covered the interleaved-UMA case; this hoists the contract into its own named test so a future refactor reading the test names knows the tiebreak case is pinned.
5. cached_backend_capabilities UaF risk under test-only clear(). Added a long comment on the function documenting the four invariants: (a) production callers may hold the returned ref across subsequent calls for OTHER backends (unordered_map's insert-doesn't-invalidate-references guarantee); (b) production callers MUST NOT keep the ref alive across a clear() call (test code's responsibility); (c) multi-threaded callers must externally synchronise deref vs. clear (the cache's lock protects map structure, NOT element lifetime); (d) if a future refactor adds a production-reachable erase / clear path, this function must switch to return-by-value or std::shared_ptr<const T>.

Co-authored-by: Cursor <cursoragent@cursor.com>
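The downgrade-flag contract from item 3 could be sketched as below. The enum, the probe struct and the function name are illustrative, and the policy on the auto path (prefer f16 when its probe passes, otherwise f32) is an assumption; the commit message only specifies that auto never sets the flag.

```cpp
#include <cassert>

// Illustrative types; autoselect mirrors the "-1 = auto" convention.
enum class kv_dtype { autoselect = -1, f32 = 0, f16, bf16, q8_0 };

struct fa_probe_sketch {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

kv_dtype resolve_kv_attn_type_sketch(kv_dtype requested,
                                     const fa_probe_sketch & probe,
                                     bool * out_was_downgraded = nullptr) {
    if (out_was_downgraded) {
        *out_was_downgraded = false;
    }
    bool supported = false;
    switch (requested) {
        case kv_dtype::autoselect:
            // Assumed auto policy; never reported as a downgrade.
            return probe.f16 ? kv_dtype::f16 : kv_dtype::f32;
        case kv_dtype::f32:  return kv_dtype::f32;
        case kv_dtype::f16:  supported = probe.f16;  break;
        case kv_dtype::bf16: supported = probe.bf16; break;
        case kv_dtype::q8_0: supported = probe.q8_0; break;
    }
    assert(requested != kv_dtype::autoselect);
    if (supported) {
        return requested;
    }
    // Explicit request rejected by the probe: fall back to f32 and say so.
    if (out_was_downgraded) {
        *out_was_downgraded = true;
    }
    return kv_dtype::f32;
}
```

A caller that cares about the fallback passes a `bool` and prints its warning when the flag comes back true; pure-logic callers omit the argument and stay silent.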
…605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic Supertonic optimizations qvac 18605 tts ggml add and optimize vulkan for supertonic
…ew-comments parakeet-cpp: address PR #22 AOSC v2.1 review comments
g_s3gen_cache_refcount (line ~189) is declared as
`static std::atomic<int>` and is later used with `.fetch_add()`,
`.fetch_sub()`, `.store()`, but the translation unit only pulls in
ggml/cstring/stl headers — never `<atomic>` directly. libstdc++
happens to expose `std::atomic` transitively via `<mutex>` on most
hosts so the build appears clean, but on the ggml-speech sync path
where header transitivity changes (the qvac-ext-ggml@speech merge
of ggml-org v0.10.2 cuts a few of those transitive paths) the
translation unit fails with:
chatterbox_tts.cpp:189: variable `std::atomic<int>
g_s3gen_cache_refcount' has initializer but incomplete type
Reproduces on the pre-merge speech HEAD too -- it was previously
hidden by header transitivity. Add `#include <atomic>` explicitly.
Verified by a clean rebuild of tts-cpp against an `-DBUILD_SHARED_LIBS=ON`
install of qvac-ext-ggml@speech HEAD (45dbdecd, day-2 ggml-speech).
Co-authored-by: Cursor <cursoragent@cursor.com>
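For reference, the shape of the fix is simply to include the header that declares what the translation unit uses. The counter and helper names below are hypothetical stand-ins, not the identifiers in chatterbox_tts.cpp.

```cpp
#include <atomic>   // explicit: std::atomic must not depend on what <mutex> happens to pull in
#include <cassert>

// Hypothetical refcount, analogous in shape to the one described above.
static std::atomic<int> g_cache_refcount_sketch{0};

// Returns the count after the increment.
int cache_retain_sketch() {
    return g_cache_refcount_sketch.fetch_add(1) + 1;
}

// Returns the count after the decrement.
int cache_release_sketch() {
    const int prev = g_cache_refcount_sketch.fetch_sub(1);
    assert(prev > 0);
    return prev - 1;
}
```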
When vcpkg's arm64-android triplet forces VCPKG_LIBRARY_LINKAGE=static
(=> BUILD_SHARED_LIBS=OFF) the bundled ggml unconditionally aborts at
CMake configure time with:
FATAL_ERROR: GGML_BACKEND_DL requires BUILD_SHARED_LIBS
even though the static dispatcher + MODULE backend .so files combo
actually works: the dispatcher just needs PIC (already gated by the
same BUILD_SHARED_LIBS branch below) so it can be dlsym'd from the
MODULE-built backend libraries.
Three guards changed from `BUILD_SHARED_LIBS` to
`BUILD_SHARED_LIBS OR GGML_BACKEND_DL` (FATAL_ERROR removed,
GGML_BACKEND_BUILD/SHARED defs on each backend, PIC + GGML_BUILD on
the core targets), so the Android dynamic-backend recipe used by
qvac-registry-vcpkg's whisper-cpp port (-DGGML_BACKEND_DL=ON
-DGGML_CPU_ALL_VARIANTS=ON -DGGML_CPU_REPACK=ON) now configures.
Mirrors the equivalent change carried in qvac-ext-ggml@speech for the
parallel speech-stack consumers (parakeet-cpp / tts-cpp).
Validated by an NDK r29 cross-compile of bundled ggml + whisper.cpp
with the flags above (all 7 per-arch libggml-cpu-android_armv*_*.so
produced clean).
Co-authored-by: Cursor <cursoragent@cursor.com>
Android app packaging keeps native libraries compressed inside the APK
with no on-disk directory to scan (AGP's `useLegacyPackaging=false`
default since 3.6). The directory-iterator pass in
`ggml_backend_load_best` therefore finds nothing on Android and the
existing per-search_path `fs::exists` filename fallback also returns
false, leaving the loader to return nullptr and the consumer to fail
`init_cpu_backend()`.
For backends that ship as a single library (Vulkan / OpenCL / ...)
the bare `lib<prefix>ggml-<name>.so` filename is enough to resolve
via Android's in-APK linker lookup, but with
`GGML_CPU_ALL_VARIANTS=ON` (the qvac-registry-vcpkg whisper-cpp port
default for Android per QVAC-18993) the CPU backend ships only as
per-arch variants -- there is no plain `libggml-cpu.so` for the
fallback to compose, so the CPU backend silently never registers.
Enumerate the known per-arch Android variants as additional candidate
names for the "cpu" backend and run each through the standard
`ggml_backend_score` selection so the device's HWCAP picks the right
tier (armv8.0 baseline through armv9.2_2; matches the variants list
emitted by `ggml_add_cpu_backend_variant()` in ggml/src/CMakeLists.txt
around lines 410-416).
Fast-path for the size-1 candidate case (every backend on every
non-Android platform, plus Vulkan / OpenCL / Metal / ... on Android):
single load_backend call, identical cost to the previous code path.
The score-then-reload loop only runs when there's an actual choice
to make.
Mirrors qvac-ext-ggml@speech commit 9562ed04 ("ggml-backend: android
per-arch CPU variant dlopen fallback", @GustavoA1604, PR #11). Carried
here as a separate commit on top of the v1.8.4.3 upstream-sync branch
so the whisper-cpp vcpkg port can ship Android dynamic-backend mode
without a port-level patch (`patches/0002-...`).
Validated by an NDK r29 cross-compile of bundled ggml + whisper.cpp
with -DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=OFF
-DGGML_CPU_ALL_VARIANTS=ON -DGGML_CPU_REPACK=ON:
- all 7 per-arch libggml-cpu-android_armv*_*.so produced clean;
- `strings ggml-backend-reg.cpp.o | grep cpu-android_armv`
confirms the __ANDROID__ block compiles into the dispatcher
object.
Co-authored-by: Cursor <cursoragent@cursor.com>
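The candidate-selection logic described in this commit (fast path for a single candidate, score-then-pick otherwise) can be sketched independently of ggml. Everything here is illustrative: the function name is invented, the scoring callback stands in for the real per-library score lookup, and the convention that a score of 0 means "not usable on this device" is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Returns the chosen library name, or an empty string if nothing is usable.
std::string pick_backend_candidate_sketch(
        const std::vector<std::string> & candidates,
        const std::function<int(const std::string &)> & score) {
    if (candidates.empty()) {
        return "";
    }
    // Fast path: a single candidate is returned as-is, with no scoring pass.
    if (candidates.size() == 1) {
        return candidates.front();
    }
    assert(static_cast<bool>(score));
    std::string best;
    int best_score = 0;  // assumed: 0 means unsupported on this device
    for (const auto & name : candidates) {
        const int s = score(name);
        if (s > best_score) {
            best_score = s;
            best       = name;
        }
    }
    return best;
}
```

On platforms where each backend has exactly one candidate name, the scoring callback is never invoked, which matches the "identical cost to the previous code path" claim above.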
…omic-include QVAC-18966: tts-cpp — add missing <atomic> include in chatterbox_tts.cpp
…pp-upstream QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test)
tts-cpp: Add dynamic backend selection for Android
…backend QVAC-18993: bundled-ggml — Android dynamic backend + per-arch CPU dlopen fallback
Reconcile QVAC-18605 supertonic Vulkan optimisation rounds 1-13 with master's ggml-backend registry refactor + Android GGML_BACKEND_DL=ON dynamic-loader path.

Three conflict files (all in tts-cpp/supertonic_*):

- supertonic_engine.cpp + engine.h: combine both sets of EngineOptions fields (HEAD's precision/f16_attn/vulkan_device/f16_weights/kv_attn_type/vulkan_env_overrides + master's backends_dir/opencl_cache_dir). Order the ctor body so backends_dir/opencl_cache_dir setters run before the precision mapping + apply_vulkan_env_overrides + load_supertonic_gguf call, all of which must precede backend init.
- supertonic_gguf.cpp: replace the HEAD hand-rolled CUDA/Vulkan/OpenCL/CPU cascade in init_supertonic_backend with delegation to master's tts_cpp::detail::init_gpu_backend + init_cpu_backend. The HEAD cascade cannot survive: master dropped the GGML_USE_VULKAN/CUDA/OPENCL compile defines and the direct ggml_backend_<backend>_init symbols are not linkable under GGML_BACKEND_DL=ON (Android). Keep convert_supertonic_tensor_data (HEAD-only addition).

To preserve the QVAC-18605 round-3 / round-12 Vulkan device-selection work, extend tts_cpp::detail::init_gpu_backend with an optional vulkan_device parameter:

- vulkan_device == 0 (default): first Vulkan adapter, registry order (unchanged behaviour for chatterbox/s3gen call sites).
- vulkan_device == N > 0: that index in the Vulkan-only subset, range-checked.
- vulkan_device == -1: free-VRAM argmax with UMA bias (excludes integrated GPUs whenever a discrete adapter is also visible).

The policy is implemented via the public registry APIs (ggml_backend_dev_memory + ggml_backend_dev_type) so it works in both DL=ON and DL=OFF builds. init_supertonic_backend threads opts.vulkan_device through unchanged.

Restore GGML_USE_VULKAN on tts-cpp-backend-defs only when GGML_VULKAN AND NOT GGML_BACKEND_DL. The supertonic optimisation paths (F16 K/V flash-attention, pinned-host upload buffers, backend_name() device-description annotation) still call direct ggml-vulkan symbols that are only linkable when Vulkan is statically linked. On the Android DL build those code paths fall back to the registry-walked non-Vulkan branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
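The three vulkan_device modes can be expressed as a pure function over a device list, which is roughly how the unit tests mentioned elsewhere in this PR can exercise the policy without a GPU. The sketch below is an assumption-laden illustration: the struct, the function name, and returning -1 for an out-of-range index (the real code may report the error differently) are not taken from the source. The lower-index tiebreak follows the equal-VRAM test description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative per-device facts; in the real code these would come from the
// backend registry (free memory and device type).
struct vk_device_info_sketch {
    std::size_t free_vram_mb;
    bool        is_uma;  // integrated GPU sharing system memory
};

// requested >= 0: explicit index, range-checked (-1 when out of range, an
//                 assumption made for this sketch).
// requested == -1: free-VRAM argmax, skipping UMA devices whenever at least
//                  one discrete device is visible; lower index wins ties.
int pick_vulkan_device_sketch(const std::vector<vk_device_info_sketch> & devs, int requested) {
    assert(requested >= -1);
    const int n = static_cast<int>(devs.size());
    if (n == 0) {
        return -1;
    }
    if (requested >= 0) {
        return requested < n ? requested : -1;
    }
    bool any_discrete = false;
    for (const auto & d : devs) {
        any_discrete = any_discrete || !d.is_uma;
    }
    int best = -1;
    for (int i = 0; i < n; ++i) {
        const auto & d = devs[static_cast<std::size_t>(i)];
        if (any_discrete && d.is_uma) {
            continue;  // UMA bias: ignore integrated GPUs when a discrete exists
        }
        // Strict '>' keeps the lower index on equal free VRAM.
        if (best < 0 || d.free_vram_mb > devs[static_cast<std::size_t>(best)].free_vram_mb) {
            best = i;
        }
    }
    return best;
}
```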
…es using init_*_backend Eight test executables compile chatterbox_tts.cpp / mel_extract_stft.cpp / voice_encoder.cpp / s3tokenizer.cpp / campplus_link_chain directly (rather than linking libtts-cpp.a) so they can reach internal test-hook entry points. Master's registry-refactor moved init_gpu_backend / init_cpu_backend into src/backend_selection.cpp without bumping these test targets, leaving them with undefined references after the supertonic_optimizations merge exposed the 4-arg init_gpu_backend overload. Add src/backend_selection.cpp to each affected test target: test-voice-features, test-resample, test-voice-encoder, test-fbank, test-voice-embedding, test-s3tokenizer, test-streaming, test-cpu-caches. Verified: cmake --build (default) succeeds end-to-end; ctest -L unit reports 25/25 passing including the QVAC-18605 Vulkan-specific harnesses (test-supertonic-vulkan-device-select, vulkan-env-overrides, kv-attn-type, capability-cache, pinned-host-buffer, text-encoder-gpu-bridge, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t" inputs

Three pre-existing bit-exactness regressions in the QVAC-18605 cache work (F8 style-residual cached-graph parity, F18 text-encoder convnext-front graph cache, F19 vector-estimator front-block cache) shared one root cause: leaf input tensors uploaded ONLY at build time (because their contents depend solely on cache-key fields like L / text_len / θ) had their backend buffers released by ggml-alloc's free pass once their last consumer in the graph ran. On the second compute pass through the same cache, intermediates aliased into the freed offsets and silently overwrote the "stable" upload; every downstream tensor went stale.

The freed-leaf-input behaviour is documented inside ggml-alloc.c: `ggml_gallocr_free_node` exits early only when the tensor has `GGML_TENSOR_FLAG_OUTPUT`; the input flag does not extend that guarantee. Marking each affected tensor as INPUT and OUTPUT keeps its buffer alive across compute passes, so the one-shot upload at build remains valid for the cache's full lifetime.

Affected tensors:

- supertonic_text_encoder.cpp:build_relpos_cache: `masks[9]` relpos attention masks (9 × L×L floats, encode integer position deltas −4..+4).
- supertonic_vector_estimator.cpp:build_group_graph_cache: RoPE cos/sin tables (q_cos_in / q_sin_in / k_cos_in / k_sin_in).
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml front_cache RoPE cos/sin tables (same shape, separate cache).
- supertonic_vector_estimator.cpp:build_res_style_qkv_cache: `style_v_in` / `kctx_in`. Both use the F4 pointer-compare upload-skip; without OUTPUT the skip preserved a host pointer to a backend buffer that gallocr had already released.

Test fallout on tts-cpp/test (with bundled qvac-ext-ggml@speech 60a172e, supertonic2.gguf + supertonic-ref-quick fixture):

    before  test-supertonic-audit3-caches   6/8 checks pass (F18, F19 fail)
    after   test-supertonic-audit3-caches   8/8 checks pass
    before  test-supertonic-graph-rewrites  4/5 checks pass (F8 fails)
    after   test-supertonic-graph-rewrites  5/5 checks pass

fixture suite: 9/16 → 15/16 (only `test-supertonic-pipeline` still fails; that's a separate ONNX-vs-GGUF reference drift, not a cache bug; the per-stage tests that take ref inputs directly all pass).
unit suite: 25/25 (unchanged).

Verified on the supertonic_optimizations branch pre-merge (`184c6410`) that the failures are identical in magnitude: this is a pre-existing bug in QVAC-18605 rounds 3+ cache work, not a regression from the master merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peline test mask
Same root cause as the previous F8/F18/F19 fix: leaf input tensors that
the round-10 upload-skip tracker treats as "stable across denoise steps
within one synth" (uploaded only on `current_step == 0`, skipped on
steps 1..N-1) need INPUT + OUTPUT flags so ggml-alloc's free pass doesn't
release the buffer after step 0 and silently corrupt the skipped uploads
on subsequent steps.
Two more affected tensors found by tracing the pipeline parity test's
per-step divergence:
- supertonic_vector_estimator.cpp:supertonic_vector_trace_proj_ggml
front_cache.text_in_t (vector-estimator front-block text input)
- supertonic_vector_estimator.cpp:build_group_graph_cache
cache.text_in (vector-estimator group 1/2/3 text input)
Pipeline test (`test-supertonic-pipeline`) per-step max_abs_err:
before: step0 1.4e-05, step1 8.5e-01, step2 1.7e+00, … final 3.28e-01
after: step0 1.4e-05, step1 3.9e-05, step2 6.8e-05, … final 1.11e-04
The step-by-step error is now pure floating-point round-off
accumulation (~1e-5 per step), 4 orders of magnitude under the test's
1e-3 threshold.
Also: align the pipeline test's input prep with the
`dump-supertonic-reference.py` harness — the Python script feeds the
ONNX vector_step a pre-masked input (`xt = noise * latent_mask`) and
the vocoder a pre-masked latent (`vocoder({"latent": xt * latent_mask})`).
For the supertonic-ref-quick fixture the mask is all 1.0 so this is a
no-op today, but a fixture with padded tail latents would otherwise
diverge from the reference at every padded position.
Fixture suite on tts-cpp/build (bundled qvac-ext-ggml@speech 60a172e,
supertonic2.gguf + supertonic-ref-quick):
before: 15/16 fixture tests passing (test-supertonic-pipeline FAIL)
after: 16/16 fixture tests passing
Unit suite unchanged (25/25).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing for ggml_reshape_2d CodeQL cpp/integer-multiplication-cast-to-long flagged `n_heads * head_dim` (both `int`, multiplied as `int` and then implicitly converted to `int64_t` for `ggml_reshape_2d`'s shape argument). For Supertonic's vector-estimator the values are 4 × 64 = 256 so there is no actual overflow risk today, but a tts-cpp callsite that ever uses larger n_heads / head_dim would silently truncate. Cast first to make the multiplication 64-bit. No behaviour change for any current caller. Alert was not introduced by this PR (line dates back to the original tts-cpp add `ef840d5c3`) but surfaces on PR #31 because the surrounding file was touched. Fixing here keeps the PR's CodeQL gate green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
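The cast-before-multiply pattern the CodeQL rule asks for looks like this in isolation. The helper name is hypothetical; only the ordering of the cast and the multiplication matters.

```cpp
#include <cassert>
#include <cstdint>

// Widen one operand first so the multiplication itself is carried out in
// 64 bits. Writing (int64_t)(n_heads * head_dim) instead would multiply in
// int and only then widen the (possibly overflowed) result.
int64_t attn_row_elems_sketch(int n_heads, int head_dim) {
    assert(n_heads >= 0 && head_dim >= 0);
    return static_cast<int64_t>(n_heads) * head_dim;
}
```

For the values quoted above (4 × 64) both spellings agree; the difference only shows once the product no longer fits in a 32-bit int.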
1. backend_selection.cpp: missing #include <stdexcept>

Throws std::runtime_error in 4 places, compiles on macOS libc++ via transitive include, fails on libstdc++ (Linux / MSYS2-GCC). One line:

     #include <mutex>
    +#include <stdexcept>
     #include <string>

2. Android GGML_BACKEND_DL=ON must keep the supertonic Vulkan optimisations; please don't ship them gated off

The PR currently lists this as a known follow-up, but Mali / non-Adreno-700+ Snapdragon / Exynos Xclipse are exactly the targets where the round-10 pinned-host-buffer + round-12 F16-KV bandwidth wins matter most; silently turning them off on DL undoes the QVAC-18605 business case on mobile.

Every direct ggml_backend_vk_* call in this PR has a public registry-API equivalent today at the 60a172e4 ggml pin:

- ggml_backend_is_vk(backend) → strcmp(ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(backend))), "Vulkan") == 0
- ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(backend))
- ggml_backend_vk_get_device_description(...) → ggml_backend_dev_description(ggml_backend_get_device(backend))
- F16-KV / Q8_0-KV / BF16-KV FA capability predicates → build a probe tensor and call ggml_backend_dev_supports_op(dev, op)

Please migrate the four call-site classes in this PR, drop the NOT GGML_BACKEND_DL clause from the GGML_USE_VULKAN define in tts-cpp/CMakeLists.txt:180-181, and add a Snapdragon DL smoke test confirming the round-10 / 12 logs fire on the dynamic-loader build. init_gpu_backend already proves the registry-only pattern works; extending it the rest of the way is mechanical and keeps tts-cpp's source under the same "no direct backend symbols" invariant parakeet-cpp ships today.
#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

1. backend_selection.cpp: missing `#include <stdexcept>`. Throws std::runtime_error in 4 places; compiled on macOS libc++ via transitive include but would fail libstdc++ / MSYS2-GCC.
2. Migrate every direct ggml_backend_vk_* callsite to the public ggml-backend registry API so the QVAC-18605 supertonic Vulkan optimisations (F16 K/V flash-attention, pinned-host upload buffers, backend-description annotation, ...) stay active on the Android GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:
- ggml_backend_is_vk(b) → tts_cpp::detail::backend_is_vulkan(b): strcmp against ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(b))). Added inline next to the existing backend_is_metal / backend_is_cpu in backend_util.h (mirrors parakeet-cpp's helper module).
- ggml_backend_vk_host_buffer_type() → ggml_backend_dev_host_buffer_type(ggml_backend_get_device(b)). Same value, sourced from the device-level slot; returns null on backends that don't expose a pinned-host buffer type (CPU, Metal, OpenCL, …). Affects:
  * backend_supports_pinned_host_buffer_uncached
  * try_alloc_inputs_in_pinned_host_buffer
- ggml_backend_vk_get_device_description(idx, buf, len) → ggml_backend_dev_description(ggml_backend_get_device(b)). Same string, no host buf round-trip. Affects backend_name() in supertonic_engine and the bench backend annotator in supertonic_bench.

Drop:
- The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp and supertonic_bench.cpp (no longer needed; registry API lives in ggml-backend.h).
- Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all paths now compile unconditionally).
- The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in tts-cpp/CMakeLists.txt (no code references it any more).

tts-cpp now mirrors parakeet-cpp's "no direct backend symbols" invariant.

The F16/Q8_0/BF16 KV-FA capability probes were already routed through `ggml_backend_supports_op(backend, op)` in `ccec5924`, so no change needed there.

Verified on macOS arm64 + Metal:
- cmake --build builds 100% clean
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass
- supertonic-cli end-to-end synth produces audible WAV
- The `backend_is_vk` engine field still flips correctly via the registry path (bench reports `backend: Vulkan (device N: <name>)` on a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon smoke test from a hardware-owning reviewer. `init_gpu_backend` already proved the registry-only pattern works on DL builds, so this change extends the same invariant to the remaining four callsite classes mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
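The `backend_is_vulkan` migration described in this commit comes down to a string comparison on the backend's registry name. Below is a minimal, self-contained sketch of that comparison. `reg_name_is_vulkan` is a hypothetical stand-in (not the helper in backend_util.h), and it assumes the Vulkan backend registers itself under the name "Vulkan". In the real helper the string would come from `ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(b)))`.

```cpp
#include <cstring>

// Hypothetical stand-in for tts_cpp::detail::backend_is_vulkan(b).
// ASSUMPTION: the Vulkan backend's registry name is exactly "Vulkan".
// Taking the name as a parameter keeps this sketch free of ggml headers.
inline bool reg_name_is_vulkan(const char * reg_name) {
    // A backend with no registry entry yields no name: treat it as "not Vulkan".
    return reg_name != nullptr && std::strcmp(reg_name, "Vulkan") == 0;
}
```

Because the check only reads a string from the registry, it needs no link-time Vulkan symbol, which is what lets the same source compile for both static and `GGML_BACKEND_DL=ON` builds.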
@GustavoA1604 thanks for the review — both items addressed in 1. 2. Direct Concrete swaps:
Local verification on macOS arm64 + Metal:
Android
Heads-up: branch was DIRTY against master (the v1.8.5 sync + EOU work merged in while this PR was open). Resolving that next, then will re-request review.
Pulls in the master-side activity since PR #31 opened:
- QVAC-19386: v1.8.5 + sync vendored whisper.cpp + ggml to ggml-org upstream (#33). Bumps whisper version, refreshes the in-tree ggml, re-adds tts-cpp from a fresh snapshot of chatterbox.cpp's port.
- QVAC-19270: parakeet EOU streaming mid-stream-boundary handling.
- QVAC-19213: Adreno Vulkan fixes (mul_mat_vec subgroup->shmem, get_max_size cap scoped to Qualcomm/Adreno).

Conflict resolution (all 24 conflicts were `add/add` because the merge-base, `4bf733672` `talk-llama : sync llama.cpp`, predates QVAC adding `tts-cpp/` and `parakeet-cpp/`):
- tts-cpp/* → kept HEAD (`--ours`). This branch is the canonical home of the QVAC-18605 supertonic Vulkan optimisation rounds 1-13 + the registry-API migration + the cache-state-leak fixes. The chatterbox.cpp-mirrored fixes that master's `fce9d211 Add tts-cpp files` brought in (N1-N7 docstrings, ggml-quants.h fix, backend_device() public API) are already present in HEAD's starting point and surface as no-op diffs.
- parakeet-cpp/* → took master (`--theirs`). Master is the canonical home of QVAC-19270 EOU streaming work; this branch has no parakeet-cpp changes to defend.
- .github/CODEOWNERS → took master (team rename to `qvac-internal-dev` / `qvac-internal-merge`).

Verified on macOS arm64 + Metal:
- cmake --build builds cleanly
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass (incl. test-supertonic-pipeline end-to-end vs ONNX reference, max_abs_err = 1.1e-04 ≪ 1e-3 threshold)

The branch is now in sync with origin/master at `eabcf6da`; the mergeStateStatus on PR #31 should flip from DIRTY back to UNSTABLE (then green, once the pre-existing master CI failures resolve too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
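For readers unfamiliar with the `max_abs_err` figure quoted in the verification list, here is a small sketch of the kind of check it implies: the largest per-sample absolute difference between two PCM buffers, compared against a tolerance. The function below is illustrative only. How test-supertonic-pipeline actually computes the metric (for example, how it treats buffers of different length) is not stated here, so comparing only the overlapping prefix is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative metric: largest absolute per-sample difference.
// ASSUMPTION: only the overlapping prefix of the two buffers is compared.
inline double max_abs_err(const std::vector<float> & got,
                          const std::vector<float> & ref) {
    const std::size_t n = std::min(got.size(), ref.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double diff = std::fabs(static_cast<double>(got[i]) -
                                      static_cast<double>(ref[i]));
        worst = std::max(worst, diff);
    }
    return worst;
}
```

A fixture test in this style would pass when `max_abs_err(got, ref) < 1e-3`, which the reported 1.1e-04 satisfies with roughly an order of magnitude of headroom.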
Summary
- `GGML_BACKEND_DL=ON` dynamic-loader path.
- `tts_cpp::detail::init_gpu_backend()` with an optional `vulkan_device` arg (0 = first adapter, N > 0 = explicit index, -1 = free-VRAM auto-pick with UMA bias) so the round-3 / round-12 Vulkan device-selection policy survives master's registry-only refactor without bringing back direct `ggml_backend_vk_*` calls. Implemented via the public registry APIs (`ggml_backend_dev_memory` + `ggml_backend_dev_type`) so it works in both `GGML_BACKEND_DL=ON` and `=OFF` builds. Default value is 0, so chatterbox / s3gen / parakeet call sites are unaffected.
- `GGML_USE_VULKAN` compile define is re-enabled on `tts-cpp-backend-defs` only when `GGML_VULKAN AND NOT GGML_BACKEND_DL`: the supertonic optimisation paths (F16 K/V flash-attention, pinned-host upload buffers, `ggml_backend_vk_host_buffer_type()` per-step uploads, `backend_name()` device-description annotation) call direct ggml-vulkan symbols that are only linkable when Vulkan is statically linked. On the Android DL build those paths fall back to the registry-walked non-Vulkan code, matching master's design intent.
- `init_gpu_backend` / `init_cpu_backend`. Adds `src/backend_selection.cpp` to each.
- `ccec5924`, round 10/12): leaf input tensors uploaded once at build / once per synth via the round-10 upload-skip tracker had their backend buffers released by ggml-alloc's free pass once their last consumer in the graph ran. On the second compute pass through the same cache, intermediates aliased into the freed offsets and silently overwrote the "stable" upload. Fix is to mark each affected tensor as `INPUT + OUTPUT` (the OUTPUT flag is what gallocr's `ggml_gallocr_free_node` checks before releasing). Affects: relpos attention masks, per-group RoPE cos/sin tables, front-block RoPE cos/sin tables, `style_v_in` / `kctx_in` in `build_res_style_qkv_cache`, `text_in_t` in `supertonic_vector_trace_proj_ggml`, and `text_in` in `build_group_graph_cache`.

Conflict resolution notes
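The `vulkan_device` convention from the Summary (0 = first adapter, N > 0 = explicit index, -1 = auto-pick) can be sketched as a pure function over a device list. This is a hedged illustration, not the code in `backend_selection.cpp`: `GpuInfo` and `pick_gpu` are hypothetical names, the real values would be read through `ggml_backend_dev_memory()` and `ggml_backend_dev_type()`, and the exact form of the "UMA bias" is not spelled out in this PR, so the tie-break in favour of integrated devices is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-device snapshot. In the real code these values would come
// from ggml_backend_dev_memory() and ggml_backend_dev_type().
struct GpuInfo {
    std::size_t free_bytes;  // free device memory
    bool        integrated;  // true for UMA / integrated GPUs
};

// Returns the index into `gpus` to use, or -1 if no device qualifies.
//   vulkan_device == 0  -> first adapter
//   vulkan_device  > 0  -> explicit index (rejected if out of range)
//   vulkan_device == -1 -> most free memory
//                          ASSUMPTION: on a tie, the UMA device wins
inline int pick_gpu(const std::vector<GpuInfo> & gpus, int vulkan_device) {
    if (gpus.empty()) return -1;
    if (vulkan_device >= 0) {
        return vulkan_device < static_cast<int>(gpus.size()) ? vulkan_device : -1;
    }
    if (vulkan_device != -1) return -1;  // other negative values are rejected in this sketch
    int best = 0;
    for (int i = 1; i < static_cast<int>(gpus.size()); ++i) {
        const GpuInfo & cand = gpus[i];
        const GpuInfo & cur  = gpus[best];
        const bool more_free = cand.free_bytes > cur.free_bytes;
        const bool uma_tie   = cand.free_bytes == cur.free_bytes &&
                               cand.integrated && !cur.integrated;
        if (more_free || uma_tie) best = i;
    }
    return best;
}
```

Keeping the policy separate from the registry calls is also what makes it unit-testable without a GPU, in the spirit of the `test-supertonic-vulkan-device-select` harness listed in the test plan.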
Three conflict files, all in `tts-cpp/supertonic_*`:
- `include/tts-cpp/supertonic/engine.h`: `EngineOptions` fields (HEAD's `precision` / `f16_attn` / `vulkan_device` / `f16_weights` / `kv_attn_type` / `vulkan_env_overrides` + master's `backends_dir` / `opencl_cache_dir`).
- `src/supertonic_engine.cpp`: `apply_vulkan_env_overrides()` → `load_supertonic_gguf()` with HEAD's extra args. Order matters: all setters must precede `init_supertonic_backend()`.
- `src/supertonic_gguf.cpp`: `#ifdef GGML_USE_VULKAN` cascade with delegation to `tts_cpp::detail::init_gpu_backend()`, threading `vulkan_device` through. Kept `convert_supertonic_tensor_data` (HEAD-only addition).

Test plan
- `cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF` configures cleanly (bundled `qvac-ext-ggml@speech` pin `60a172e`)
- `cmake --build build -j` builds 100% clean (library + supertonic-cli + tts-cli + all unit/integration test binaries on macOS arm64 + Metal)
- `ctest -L unit -j 4` → 25/25 passing, including every QVAC-18605 logic harness: `test-supertonic-vulkan-device-select`, `vulkan-env-overrides`, `kv-attn-type` (+ `-api`), `capability-cache`, `pinned-host-buffer`, `text-encoder-gpu-bridge`, `upload-skip-tracker`, `voice-host-cache`, `f16-deny-list-api`, `f16-attn-parity`, `warm-up-api`, `input-scratchpad`, `backend-dispatch`, `portable-ops`, `vulkan-dispatch`, `in-graph-transpose`, `graph-to-graph-blit`, `rope-in-graph`, `rope-packed-qk`, `profile-csv`, `convnext-block-fused`
- `ctest -L fixture` → 16/16 passing (supertonic-ref-quick fixture, pointed via `-DTTS_CPP_TEST_MODEL_DIR` + `-DTTS_CPP_TEST_REF_DIR`). Including `test-supertonic-pipeline` (end-to-end vs ONNX reference WAV, max_abs_err = 1.1e-04 against 1e-3 threshold), `test-supertonic-graph-rewrites` (F3/F8/F11 5/5), `test-supertonic-audit3-caches` (F17/F18/F19 8/8)
- `supertonic-cli` against `supertonic2.gguf` on Metal: 8.15 s of 44.1 kHz mono PCM produced; Metal pipeline log shows the QVAC custom kernels (e.g. `kernel_supertonic_edge_pad_1d_f32`) compiling and running. WAV length matches the ONNX reference to within 2 samples (136 970 vs 136 972 samples; EOF rounding).
- `supertonic-bench` on Apple M-series + Metal: 43.5× realtime (RTF 0.023, median over 3 runs). All QVAC-18605 auto-policies engaged: `f16_attn=on / f16_weights=on / native_leaky_relu=on / kv_attn_type=f16 / q8_0_kv_attn=available / bf16_kv_attn=available`.
- `supertonic-cli --n-gpu-layers 99 --vulkan-device -1 --vulkan-perf-logger` against `supertonic2.gguf` and confirm the auto-pick log line + steady-state perf numbers match round 12.
- `GGML_BACKEND_DL=ON` smoke test. The merge accepts that the supertonic Vulkan-specific code paths compile out under `GGML_USE_VULKAN`-disabled (Android DL); registry-walked fallback should remain functional. Recommend a smoke test on a Snapdragon / non-Apple Android target before tagging.
- `chatterbox-s3gen.gguf` / `chatterbox-t3-mtl.gguf` / `s3gen-ref/` / `streaming-ref/` / `t3-mtl-ref/` etc. are still auto-disabled because those fixtures aren't shipped in-tree. Out of scope for this PR but worth tracking.

Known follow-ups (not blocking merge)
- `supertonic_engine.cpp:backend_name()` Vulkan device-description annotation is inert under `GGML_BACKEND_DL=ON` (depends on `ggml_backend_vk_get_device_description`). Cheap fix: route through `ggml_backend_dev_description(ggml_backend_get_device(backend))`.
- `supertonic_gguf.cpp:backend_supports_pinned_host_buffer_uncached` and the F16-KV flash-attn capability probes similarly use direct ggml-vulkan entries. Same registry-API fix would let those optimisations stay active on Android DL too.
- `GGML_ASSERT([rsets->data count] == 0)` fires on Metal device shutdown at process exit (post-synth, doesn't affect output). Tracked separately; appears to live in `qvac-ext-ggml@speech` (`ggml-metal-device.m:612`), not in this merge.
- `ggml_set_input` call sites in tts-cpp for the same cache-state-leak pattern. Only sites with constant inputs OR upload-skip trackers are at risk; no other tests are failing today, so any latent same-shape bugs there don't surface in the current harness.

🤖 Generated with Claude Code
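For anyone auditing the remaining `ggml_set_input` call sites, the cache-state-leak pattern can be reproduced in a few lines. The sketch below is a toy model and deliberately not ggml's allocator: `ToyGraph`, its three buffer slots, and the `pin_input` flag are all hypothetical. It only mimics the two ingredients named in this PR, an upload that is skipped after the first pass and a planner that reuses the input's slot once its last consumer has run.

```cpp
#include <array>

// Toy model (NOT ggml's allocator) of a planned 3-tensor graph:
//   in -> t1 = in + 1 -> t2 = t1 * 2
// `in` lives in slot 0 and t1 in slot 1. If `in` is not pinned, the planner
// treats slot 0 as free after t1 has consumed it and places t2 there.
struct ToyGraph {
    std::array<int, 3> buf{};   // backing buffer with three slots
    bool pin_input;             // stands in for flagging the tensor INPUT + OUTPUT
    bool uploaded = false;      // stands in for the upload-skip tracker

    explicit ToyGraph(bool pin) : pin_input(pin) {}

    int compute(int host_value) {
        if (!uploaded) {        // upload once, skip on every later pass
            buf[0] = host_value;
            uploaded = true;
        }
        const int t2_slot = pin_input ? 2 : 0;  // aliasing only when not pinned
        buf[1] = buf[0] + 1;                    // t1
        buf[t2_slot] = buf[1] * 2;              // t2 (may overwrite the input)
        return buf[t2_slot];
    }
};
```

With `ToyGraph leaky(false)`, two calls to `compute(10)` return 22 and then 46, because the second pass reads the overwritten input. With `ToyGraph fixed(true)`, both calls return 22. The first pass is correct in either case, which is consistent with the bug only surfacing on a second compute pass through the same cache.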