Skip to content

Qvac 18605 tts ggml add and optimize vulkan for supertonic#17

Open
Zbig9000 wants to merge 66 commits into
tetherto:masterfrom
Zbig9000:QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic
Open

Qvac 18605 tts ggml add and optimize vulkan for supertonic#17
Zbig9000 wants to merge 66 commits into
tetherto:masterfrom
Zbig9000:QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic

Conversation

@Zbig9000
Copy link
Copy Markdown

@Zbig9000 Zbig9000 commented May 12, 2026

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds eleven rounds of Vulkan-specific deltas — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability contract for future regressions.

Rounds 1–6 are dispatch + capability infrastructure (probes, flags, multi-device auto-pick, deny-list, multi-dtype K/V). Rounds 8–10 are observability + per-step sync-point elimination on the GPU bridges. Round 11 is a critical correctness fix that turns the prior 10 rounds from "passes CI" into "actually runs end-to-end on every Vulkan adapter we have." Without round 11, every prior round was hitting a latent assertion-failure during the first real synth call.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All Vulkan-specific deltas are restated here; the OpenCL audit work is not. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

End-to-end validation (on real hardware)

Tested on three Vulkan adapters in one machine — the gold-standard hybrid dev-rig setup:

Adapter Driver Result Per-synth (5-step denoise)
NVIDIA RTX 5090 (discrete, KHR_coopmat, FP16, no BF16) NVIDIA 590.48.01, Vulkan 1.4.325 ✅ 6.53s WAV 44 ms total, 74× realtime short prompt / 76 ms, 123× realtime long prompt
AMD Ryzen 9 9950X3D iGPU (UMA, RADV, FP16) Mesa 25.2.8 RADV, Vulkan 1.4.318 ✅ 3.64s WAV 178 ms total, 7× realtime
Mesa lavapipe (CPU-Vulkan correctness baseline) Mesa 25.2.8 lavapipe (LLVM 20.1.2) ✅ 1.21s WAV — (correctness baseline only)
CPU baseline (16-thread Ryzen 9 9950X3D) ✅ 3.89s WAV 121 ms total, 10× realtime

RTX 5090 per-step breakdown (median over 5 runs, F16 K/V default, post-prewarm):

preprocess             med=  0.00  ms
duration               med=  0.97  ms
text_encoder           med=  2.94  ms
vector_estimator       med= 37.70  ms (5 steps)
  vector_step[0]       med=  7.44  ms   (cold pipeline)
  vector_step[1..4]    med=  7.01–7.05  ms   (steady state)
vocoder                med=  2.47  ms
total                  med= 44.08  ms

The round-3/4/7/8/9/10 wins are all in those numbers — round 7's prewarm hides the ~2.3s cold shader-compile, round 8/9/10 eliminate ~166 sync points/synth so the steady-state per-step time is dominated by actual compute rather than host↔GPU bookkeeping.

Net new surface (against PR #16):

Category Delta
Vulkan-specific commits 11 (rounds 1–11)
New backend-capability probes 5 (native_leaky_relu, f16_kv_flash_attn, f16_mul_mat, q8_0_kv_flash_attn, bf16_kv_flash_attn, pinned_host_buffer)
New thread-local dispatch flags 2 (use_native_leaky_relu, kv_attn_type) — joins the round-1 use_f16_attn
New EngineOptions knobs 8 (vulkan_device, prewarm_text, f16_weights_deny_list, kv_attn_type + 4 Vulkan env-var passthroughs)
New CLI flags (× 3 binaries) --vulkan-device, --prewarm, --f16-weights-deny, --kv-attn-type, --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE, --bench-per-step, --bench-sync, --json-out
New unit tests (ctest -L unit) 9 new + 3 extended (vulkan-dispatch, capability-cache, warm-up-api, vulkan-device-select, f16-deny-list-api, kv-attn-type, kv-attn-type-api, vulkan-env-overrides, upload-skip-tracker; rope-packed-qk rewritten for correct contract)
Whole ctest -L unit 22 / 22 PASS, 0 regressions, 0 flakes (CPU build + Vulkan build)
Sync-points eliminated per synth (vs. PR #16 baseline) ~166 (30 from round 8 + 120 from round 9 + 16 from round 10)

Investigation methodology (TDD throughout)

Every round followed the same workflow:

  1. Audit: identify a Vulkan-specific gap (capability probe, multi-GPU support, drift recovery, per-step sync hotspot, observability gap, etc.).
  2. Test first: write the CPU-only unit gate that pins the new contract (resolver behaviour matrix, API surface, parity bound, layout contract). Commit + observe failure on the missing symbol (compile error or assertion).
  3. Implement: minimal-surgery production change. Pure-logic helpers split out so the policy is testable on CPU without a Vulkan device.
  4. Re-run: every new test + every existing test must pass before commit.
  5. Update PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer.

Commit-by-commit walkthrough

33fd5c34 — Round 1: Vulkan bring-up

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

  • Two new supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node).
  • New backend-capability probe supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy.
  • EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Range-checked at load (out-of-range = hard error).
  • Verbose mode + bench output append ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
  • New CPU-only TDD harness test-supertonic-vulkan-dispatch (29 checks).

d080a1e4 — Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>. One-line fix kept as a separate commit so it's trivially revertable.

e09d4278 — Round 2: capability-cache + 3 probes + prewarm

  • Process-wide cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Eliminates 3× redundant probe calls per backend.
  • 3 new probes: supertonic_backend_supports_f16_mul_mat (gates use_f16_weights auto-policy), supertonic_backend_supports_q8_0_kv_flash_attn (forward-compat), supertonic_backend_supports_native_leaky_relu (wraps round 1).
  • Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines compile up-front; operator-visible first synthesize() hits steady-state latency. No-op on CPU.
  • New tests: test-supertonic-capability-cache, test-supertonic-warm-up-api.

8ae15996 — Round 3: multi-device auto-pick + 2 forward-compat probes

  • --vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks argmax(free_vram) via ggml_backend_vk_get_device_memory(). Tie-break = lower index.
  • 2 new forward-compat probes: supertonic_backend_supports_bf16_kv_flash_attn (for coopmat2 on Ampere+ / RDNA3+), supertonic_backend_supports_pinned_host_buffer (for future per-engine input-scratchpad refactor).
  • New test test-supertonic-vulkan-device-select (23 checks).

⚠️ Known issue (pre-existing on this round's policy): on heterogeneous discrete+iGPU machines, UMA iGPUs report system RAM as "free VRAM" and win the argmax even when a discrete GPU is available. On the test machine, --vulkan-device -1 picks the AMD iGPU (178 ms) over the RTX 5090 (44 ms) — a 4× regression for users who follow the help text. Trivially worked around by explicit --vulkan-device 0. Tracked for a follow-up: bias against UMA when a discrete is present.

32703fcd — Round 6: F16-weights operator deny-list

  • 2-arg should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; any match keeps that tensor at its native storage type.
  • EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between all three binaries).
  • Tests: test-supertonic-f16-weights extended (+29 checks), test-supertonic-f16-deny-list-api (NEW, 9 checks).

2e1c9468 — Round 4: multi-dtype K/V flash-attention dispatch

Generalises the round-1 F16-only K/V path into a multi-dtype dispatch.

  • kv_attn_dtype enum (autoselect, f32, f16, bf16, q8_0) + EngineOptions::kv_attn_type field.
  • resolve_kv_attn_type pure-logic helper with full {requested × legacy × probe-mask} behaviour matrix.
  • --kv-attn-type CLI flag on all three binaries with parse hardening.
  • Tests: test-supertonic-kv-attn-type (106 checks), test-supertonic-kv-attn-type-api (18 checks), test-supertonic-f16-attn-parity extended for BF16.

ba6d1749 — Round 7: bench observability + voice cache + Vulkan env-var passthrough

Three independent observability/UX wins shipped together:

  • --bench-per-step + --bench-sync + --prewarm (already from round 2) + --json-out FILE: per-denoise-step timings on a single timeline (cold pipeline step[0] distinguishable from steady-state step[1..4]); operator can attribute Vulkan stalls to a specific stage on real hardware without GPU-side profilers.
  • Voice cache: precomputed style buffers reused across synths.
  • Vulkan env-var CLI passthrough: --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE — sets the corresponding GGML_VK_* env var before backend init. Operator-set shell env STILL wins over the CLI override (audit-friendly).
  • New test test-supertonic-vulkan-env-overrides (29 checks).

e8bbc728 — Round 8: front-block attn0 GPU bridge

The single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1/g2/g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0 — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth.

df895fd6 — Round 9: style flash-attn GPU bridge

Same pattern as round 8, applied to the 4 style attention sites (front-block style0 + style attentions in g1/g2/g3 caches). Gated Q/K/V host downloads on trace mode in run_res_style_qkv_cache (production path skips them entirely).

Eliminates 3 sync points × 4 sites × 5 denoise steps = 60 GPU→host downloads / synth.

358d7aa8 — Round 10: per-step text-input upload-skip

Generalised the F4 pointer-compare upload-skip pattern (style_v_in / kctx_in in vector_res_style_qkv_cache) into a reusable upload_skip_tracker helper.

Applied to text_in_t on front-block cache + text_in on 3 group caches. Caught and documented a cross-synth pointer-reuse hazard: stack-local text_emb vectors very often re-issue the same address (allocator size-class reuse); the tracker.reset() at synth boundaries prevents the naive pointer-compare from leaking prior-synth GPU data into next-synth attention.

New test test-supertonic-upload-skip-tracker (7 functions, 41 checks) explicitly simulates the cross-synth hazard.

Eliminates 16 redundant uploads / synth (~512 KB at text_len=32, linear in prompt length).

c383e70d — Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

After the IDE-freeze recovery, the first end-to-end synth attempt on real hardware crashed at:

supertonic_internal.h:1154: GGML_ASSERT(HD == n_heads * head_dim) failed

on every backend (CPU + Vulkan RTX 5090 + RADV + lavapipe).

Root cause: apply_rope_to_packed_qk (introduced in PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] channel-fastest-in-memory tensor. In fact, the matmul (both the CPU cblas_sgemm fast path and the GPU conv1d_f32(K=1) fallback) produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]) — the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI — and rounds 8/9/10 were ALSO broken (the GPU bridge ggml_backend_tensor_copy(q_src, q_tc_in) would have aborted at ggml_are_same_layout because V (and the style sq/sk/sv which have no RoPE to mask the layout flip) flowed into the GPU bridge from matmul → channel-major-flat bytes → mismatched layout against q_tc_in time-major-flat).

The fix (strict TDD):

  1. Test rewritten under production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then GREEN (14 / 14 checks).
  2. apply_rope_to_packed_qk head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects).
  3. V (and style sq/sk/sv) graph-side transpose: V has no RoPE to hide behind — open-coded the same ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache × all three sq/sk/sv outputs so all four GPU-bridge attention sites get bit-for-bit matching layouts.
  4. Legacy host-bridge downloads switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references consume, so the raw download is the correct call.
Backend Pre-fix Post-fix
CPU abort on first denoise step writes 3.89s 44.1 kHz WAV
Vulkan RTX 5090 abort writes 6.53s WAV; 44 ms / 5 steps; 74× realtime
Vulkan AMD RADV iGPU abort writes 3.64s WAV; 178 ms; 7× realtime
Vulkan Mesa lavapipe abort writes 1.21s WAV

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.

Test plan

CPU-only — a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

cmake -S tts-cpp -B build-tts
cmake --build build-tts --parallel
ctest --test-dir build-tts -L unit --output-on-failure

Expected: 22 / 22 tests, 0 failures, 0 regressions.

Test Purpose Round Checks
test-supertonic-vulkan-dispatch Backend-flag dispatch + F16-K/V probe smoke 1 29
test-supertonic-portable-ops (UPDATED) LEAKY_RELU decomposition path stays exercised 1
test-supertonic-capability-cache Probe-counter regression + new-probe coverage 2 + 3
test-supertonic-warm-up-api SFINAE gate for Engine::warm_up 2
test-supertonic-vulkan-device-select resolve_vulkan_device_index behaviour matrix 3 23
test-supertonic-f16-weights (UPDATED) Deny-list overload 6 65
test-supertonic-f16-deny-list-api SFINAE gate for the deny-list field 6 9
test-supertonic-kv-attn-type resolve_kv_attn_type behaviour matrix 4 106
test-supertonic-kv-attn-type-api SFINAE gate for the enum + EngineOptions field 4 18
test-supertonic-f16-attn-parity (UPDATED) F16 + BF16 K/V parity vs F32 reference 4 8
test-supertonic-vulkan-env-overrides Env-var CLI passthrough; operator-set env wins 7 29
test-supertonic-upload-skip-tracker (NEW) Pointer-compare upload-skip + cross-synth pointer-reuse hazard 10 41
test-supertonic-rope-packed-qk (REWRITTEN) Production matmul shape contract + output layout pin 11 14
Every other unit test Zero-regression gate unchanged

Smoke testing the CLIs

./build-tts/supertonic-cli --help 2>&1 | grep -A 6 kv-attn-type
./build-tts/supertonic-bench --help 2>&1 | grep -A 5 bench-per-step

# Real-Vulkan validation on RTX 5090 (74× realtime)
./build-tts/supertonic-cli --model models/supertonic2.gguf --text "Hello world" \
  --out /tmp/out.wav --voice M1 --n-gpu-layers 99 --vulkan-device 0 --prewarm "warm up"

./build-tts/supertonic-bench --model models/supertonic2.gguf --text "Hello world" \
  --voice M1 --n-gpu-layers 99 --vulkan-device 0 --runs 5 --warmup 1 \
  --prewarm "warm" --bench-per-step --json-out /tmp/bench.json

Bench JSON includes "kv_attn_type" (resolved), "kv_attn_type_requested" (raw int), and per-step timings so probe misses and per-step variance are attributable in CI/operator triage.

Backwards compatibility

  • --vulkan-device 0 semantics unchanged — round 1 introduced the flag; round 3's -1 is opt-in only.
  • --f16-weights 0|1 semantics unchanged — round 6's --f16-weights-deny is opt-in only.
  • --prewarm defaults to empty (no-op).
  • --kv-attn-type defaults to auto which falls back to round-1's use_f16_attn boolean — every existing config keeps the round-1 behaviour.
  • model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
  • All round-1 / round-3 probes throw on out-of-range CLI input (loud failure for actual config errors); all probe-gated dispatches fall back to F32 silently (advisory-probe contract — visible in bench output).
  • Round 11 fix: the new apply_rope_to_packed_qk contract is backwards-incompatible with the old (broken) one, but the old contract never actually worked in production — pre-fix it crashed on every backend. The 14-check test now pins both the input and output contracts so a future regression fails at compile time on the shape check.

File-by-file change summary

38 files changed, 13713 insertions(+), 692 deletions(-)
File Δ Notes
tts-cpp/PROGRESS_SUPERTONIC.md +1219 11 round writeups + cross-references
tts-cpp/CMakeLists.txt +252 New test targets + Vulkan-build wiring
tts-cpp/include/tts-cpp/supertonic/engine.h +155 New EngineOptions fields + Engine::warm_up()
tts-cpp/src/supertonic_internal.h +1254 kv_attn_dtype enum, 5 new probes, resolvers, upload_skip_tracker helper, apply_rope_to_packed_qk (round-11 fix)
tts-cpp/src/supertonic_gguf.cpp +1509 Capability cache, multi-device auto-pick, dispatch-scope plumbing, deny-list, env-var passthrough
tts-cpp/src/supertonic_vector_estimator.cpp +1781 Round-4 enum dispatch, round-8/9 GPU bridges, round-10 upload-skip, round-11 V/QKV transposes + helper rewrites
tts-cpp/src/supertonic_engine.cpp +147 Probe-gated auto-policy, multi-device auto-pick, warm_up impl
tts-cpp/src/supertonic_bench.cpp +406 All round flags + bench surface (per-step, sync, JSON, env passthrough)
tts-cpp/src/supertonic_cli.cpp +80 Round flags + try/catch arg-parse hardening
tts-cpp/src/chatterbox_cli.cpp +139 Round flags mirrored on the tts-cli alias
tts-cpp/src/chatterbox_tts.cpp +1 #include <atomic> (pre-existing missing-include fix)
13 new test files +3640 Rounds 1, 2, 3, 4, 6, 7, 10, 11 + audit-follow-up parity harnesses
3 updated test files +900 Round 1, 4, 6, 11 extensions

Deferred follow-ups (intentionally out of scope; pre-existing on master)

Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section.

  1. Auto-pick on hybrid discrete+iGPU machines — round 3's argmax(free_vram) policy picks the iGPU on machines like the one we tested (RTX 5090 + AMD RADV) because UMA reports system RAM as free VRAM. Pre-existing in this PR; fix candidate: bias against UMA when a discrete is present. Workaround: explicit --vulkan-device 0.
  2. test-supertonic-audit3-caches F18 + F19 cache-reuse failures — these pre-existed on master (verified pairwise). Pre-round-11 they were hidden by the rope crash; post-round-11 they're newly observable but neither introduced nor fixable by this PR's content (text encoder for F18; cross-cache state-leak for F19). Both should be wired into CI as a separate ticket; F18/F19 affect the OpenCL build identically.
  3. Persistent VkPipelineCache (chatterbox PROGRESS.md §3.32): recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by <vendorID>-<deviceID>-<driverVersion>. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads. Round 7's --prewarm is an in-process workaround.
  4. Pinned-host-buffer per-step uploads: round 3 added the capability probe so the cache + bench surface know whether the path is available. The actual per-engine input-scratchpad refactor is deferred until measured on a real Vulkan adapter so we can quantify the reduction in latent upload latency.

Linked

nik and others added 30 commits November 10, 2025 13:02
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
QVAC-7457: Add seed parameter for reproducible sampling
…ners

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models
to use a non-standard first convolution kernel size. Standard
whisper models default to kernel size 3.

Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window
  for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does
  not support custom F32 masks); flash attention remains enabled for
  non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI
  models

Backward-compatible: n_audio_window_size defaults to 0 and
n_audio_last_window_layer defaults to -1, disabling windowed
attention entirely for standard whisper models.

Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Address review feedback:

1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel,
   n_audio_window_size, n_audio_last_window_layer) behind a
   n_mels > 256 check. Standard whisper models have n_mels <= 128
   and do not contain these fields — reading them unconditionally
   would corrupt the file position and break model loading.

2. Add explicit is_bci flag to hparams struct, set when BCI fields
   are detected during loading.

3. Use is_bci flag (instead of n_audio_window_size > 0) to guard
   the BCI-specific decoder SOS token initialization.

4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and
window_size, both fixed after model load. Move the O(n_ctx^2)
computation from whisper_encode_internal (called every encode)
to whisper_init_state (called once). The encode path now just
copies the precomputed data to the graph tensor.

Made-with: Cursor
…, Threads

1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed
   mask uses hparams.n_audio_ctx, but the graph tensor is sized from
   exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back
   to recomputing the mask at the effective n_ctx when sizes differ,
   preventing a buffer overflow into the smaller tensor.

2. Update whisper.pc.in: the install interface was changed to
   include/whisper but the pkg-config includedir still pointed to
   include/. Consumers using pkg-config would not find whisper.h.

3. Fix whisper-config.cmake.in: the whisper target publicly links
   Threads::Threads but find_dependency(Threads) was skipped on
   Windows, leaving downstream find_package(whisper) with an
   unresolved imported target. Now always resolve Threads.
…ash attention

1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the
   default n_audio_ctx, the window mask is now recomputed once and
   cached in wstate (keyed on window_mask_n_ctx) instead of allocating
   a new std::vector on every whisper_encode_internal call.

2. Per-layer flash attention: layers above last_window_layer no longer
   need the windowed attention mask. The flash attention path is now
   used for those layers even when BCI windowed attention is active,
   instead of globally falling back to the softmax path for the entire
   encoder.

3. Use std::abs instead of C abs in both init-time and encode-time
   mask computation paths.
…alidation

1. Extract compute_window_mask() helper on whisper_state to eliminate
   the duplicated O(n_ctx^2) mask fill loop that appeared in both
   whisper_init_state and whisper_encode_internal. Both call sites
   now use the single helper, preventing future drift.

2. Guard the encode-time mask block with hparams.is_bci before doing
   the ggml_graph_get_tensor lookup. Cheaper and more explicit than
   relying on the tensor name string to determine whether BCI
   windowed attention is active.

3. Add hparams.is_bci to the graph builder guard for window_mask
   tensor creation, aligning it with the other BCI code paths.

4. Add validation for BCI hparams after reading from file:
   n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0.
   Log an error and return false on invalid values instead of
   proceeding with garbage.

5. Add comment explaining the n_mels > 256 threshold used to
   discriminate BCI models from standard whisper models, and noting
   that a dedicated file-format marker should be introduced if this
   assumption ever breaks.

Made-with: Cursor
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
…review)

Address @gianni-cor review on PR tetherto#11: switch the bundled ggml filename
prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech
stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a
single ggml file set instead of each library shipping its own copy.

  - parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`,
    GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`,
    option blurb + status message updated.
  - parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh,
    patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example
    updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure
prints `bundled ggml libraries will be emitted as libspeech-ggml-*`;
build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib;
parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml
…herto#6)

The standalone setup-ggml.sh + patches/ tooling was dropped from
qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the
CMakeLists.txt still:
  * defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
  * unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-"
    on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the
filename-prefix patch was no longer applied, so libspeech-ggml-*.so
files existed on disk but ggml's runtime loader still searched for
libggml-*.so under GGML_BACKEND_DL=ON.  Vulkan / OpenCL / CUDA
backends silently failed to load on Android.

Fix per reviewer guidance: converge the speech stack on a single ggml
source-of-truth.  Standalone-bundled-ggml is no longer a supported
build mode out of this in-tree subtree; the canonical path is
`-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack
`ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships
the patches pre-applied.

Edits:

- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this
  tree.  Docstring spells out the rationale + points users at the
  standalone github.com/gianni-cor/chatterbox.cpp repo if they need
  a bundled-ggml dev build with patches/ present.

- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to
  configure when patches/ is absent: a FATAL_ERROR points at the
  right consumption path (vcpkg ggml-speech) and the standalone
  fallback.  Doesn't break in-tree-with-patches builds (parakeet-cpp
  in this same repo still ships patches/, so its bundled path is
  unaffected by this guard inside tts-cpp).

- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors
  out at find_package(ggml CONFIG REQUIRED) with our new message
  pointing at the ggml-speech port; `cmake -S tts-cpp -B build
  -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard
  with the no-patches message.

- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/
  that no longer exist; running it would have errored out anyway.
  The standalone repo keeps its own setup-ggml.sh; only the in-tree
  subtree drops it.

The standalone chatterbox.cpp repo (the one tts-cpp/ was copied
from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/
folder + scripts/setup-ggml.sh.  This commit is therefore an
integration-time delta against that source, not a change to the
standalone build flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
Zbig9000 and others added 5 commits May 12, 2026 10:53
… / vector graph caches

QVAC-18607 follow-up tetherto#3.  Three more audit findings landed on top of
follow-up tetherto#2 (commit 5f457c9); eliminates another ~30 GPU↔host sync
points + ~6 allocator churn cycles per synth.

  F17  Duration scalar-continuation `read_f32` cache.
       Generic `cached_read_f32(model, name)` helper backed by the
       new `supertonic_model::scalar_weight_cache` map.  Replaces
       ~30 backend tensor reads per synth across
       `self_attention`, `ffn_block`, and the
       `duration_sentence_proj_ggml_impl` scalar continuation
       (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out,
       predictor layers + activation).  Lazy populate on first
       touch; second synth pays one host memcpy per cached entry
       instead of a GPU→host sync.

  F18  Text-encoder convnext-front graph cached across synths.
       `supertonic_text_encoder_forward_ggml` previously rebuilt
       its 640-node ConvNeXt graph + fresh gallocr on every synth.
       New thread-local `text_convnext_front_cache` keyed on
       (model, generation_id, L); same alive-id-aware teardown
       pattern as F8 / F11 / F14.

  F19  Vector-estimator front-block graph cached across denoise
       steps.  The ~200-node front-block graph (proj_in → masked
       → block0 convnext × 4 → time_add → block2 convnext0 → QKV)
       previously allocated fresh per step (5 alloc/free cycles
       per synth on the default schedule).  Cached by (L, text_len,
       trace_outputs); trace flag is part of the key because the
       graph wires extra ggml_set_output markers for the
       per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound):

  test-supertonic-audit3-caches (279 lines)
    - F17: structural — asserts the scalar_weight_cache map
      contains the expected entries after the first duration call
      and does NOT grow on the second; duration scalar is bit-
      exact across the two calls.
    - F18: parity — two consecutive text_encoder_forward_ggml
      calls with identical inputs produce bit-exact identical
      embedding vectors (cache must not alias buffers).
    - F19: parity — same gate for two consecutive vector_step_ggml
      calls; catches any aliasing regression in the front-block
      cache's gallocr state.

Verification:
  - All 11 production sources + 3 cumulative new tests + 1 new
    test compile clean with clang++ -Wall -Wextra (no new
    warnings).
  - Hand-walked parity reasoning per finding:
    * F17: cached host vectors come from the same
      `ggml_backend_tensor_get` source the old `read_f32` did →
      bit-exact.
    * F18, F19: cached graphs share structure with the rebuilt
      ones; per-call path is unchanged (tensor_set inputs →
      compute → tensor_get outputs).  Bit-exact across calls.
  - Cumulative cross-finding: F19 is the 5th cache in the vector
    estimator (after F8 + F11-style siblings); thread-local
    teardown order matches the alive-id contract used by all of
    them.

Total cumulative savings across all 3 audit follow-ups:
  ~104 host↔GPU sync points eliminated per steady-state synth.

Diff:
  6 sources changed, 1 new test, 1 CMakeLists update.
  +327 / -172 in src/ + CMakeLists + internal header.
  +279 new test.

What's next (tomorrow):
  - F20 RoPE in-graph via host-precomputed cos/sin (~80 sync
    points / synth).  Needs device parity gate.
  - Smoke-run Phase 2D against a real synth on OpenCL; steer F7
    vocoder layout flip vs remaining audit candidates from the
    CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
… helper (F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side
`make_rope_cos_sin_tables(theta, L, half)` precompute helper in
supertonic_internal.h. Both use only universally-supported GGML ops
(reshape / view / permute / mul / add) so the rotation can later run
on the OpenCL / Metal / Vulkan backends without per-element scalar
CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this
change small and reviewable — the existing scalar `apply_rope` path
is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
  - parity vs scalar apply_rope on a synthetic Q tensor
  - identity behaviour when cos=1 / sin=0
Wired into CMakeLists.txt with the "unit" label.

Co-authored-by: Cursor <cursoragent@cursor.com>
… integration (F20+F23)

Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.

Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.

Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries.  cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).

Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.

Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire.  Bit-exact (max_abs_err=0.0).  Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).

Full sweep verification:
  - 9 / 9 supertonic source files: clean syntax-check
  - 21 / 21 test files: clean syntax-check
  - 98 / 98 CPU-only unit-test checks pass across
    test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
    backend-dispatch, f16-attn-parity, profile-csv}.

Audit pass tetherto#5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
…on, in-graph transpose, Q/K/V GPU bridge

Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).

F7 — Vocoder ConvNeXt block fusion:
  * convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
    [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
    ggml_mul_mat against that layout, eliminating the layer-norm back-permute
    and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
    across the 10 blocks).
  * test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
    max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.

F12 — In-graph time/channel transpose:
  * transpose_time_channel_ggml (supertonic_internal.h) replaces the
    pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
    in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
    / tail).  Cache inputs now declare ne=[C, L]; callers upload CPU-native
    x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
  * Also drops a redundant double-transpose on the tail-graph noisy_latent path.
  * test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
    = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.

F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
  * vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
    handles harvested from the group cache's graph.
  * run_text_attention_cache_gpu — new overload that consumes those handles
    via ggml_backend_tensor_copy (same-backend device→device blit) instead of
    the historical tensor_get + tensor_set pair.
  * Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
    gated on (trace != nullptr || !apply_rope); production runs with in-graph
    RoPE skip them entirely.
  * g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
    GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
    vector_rope_theta).  Net: 90 sync points / synth eliminated.  Front-block
    and the four style attention sites still pay the round-trip; targeting
    them is the next iteration.
  * test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
    five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
The plan document is an AI-authored R&D scratchpad that doesn't belong in
the committed source tree alongside production code.  Move it out of
tts-cpp/ so the subtree only ships the implementation; the file continues
to live locally under aiDocs/ for ongoing iteration.

No code or build changes; documentation-only.

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 and others added 12 commits May 12, 2026 18:45
…and-optimize-OpenCL-for-supertonic

Qvac 18607 tts ggml add and optimize open cl for supertonic
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups
(PR tetherto#16): the audit-driven optimisations there are backend-portable by
construction (every host-sync / bandwidth / fusion win uses the same
GPU dispatch path Vulkan walks), so this PR only adds the
Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both
  resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine
    backend_name() can annotate the device with
    ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against
    a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml
    to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched
    OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for
    plain upstream OpenCL.  Dynamic probe self-adapts to whichever
    ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml
    ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic
  Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the
  use_f16_attn auto-policy so a backend that ships flash_attn_ext but
  rejects the F16-K/V variant for Supertonic shapes keeps the F32 path
  instead of crashing at first synth call.  Manual --f16-attn 1 still
  forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded
  ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed
  through EngineOptions::vulkan_device, range-checked against
  ggml_backend_vk_get_device_count() at load (out-of-range index is a
  hard error — surfaces operator typos / wrong-machine config loud
  rather than silently falling back to CPU).  Verbose mode + bench
  output append the Vulkan device description so multi-GPU / multi-ICD
  machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu
  slot so the scope correctly mirrors the new model field through
  thread-local dispatch.

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness
  covering the new flags through supertonic_op_dispatch_scope plus a
  smoke test for the F16-K/V flash-attn probe.  29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests
  use_native_leaky_relu = false explicitly so the GPU-decomposition
  correctness gate stays green now that the helper short-circuits on
  backends with native LEAKY_RELU.  10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN
  defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0
  (the historical hard-coded value), load_supertonic_gguf gains a new
  optional last argument with the same default; existing callers are
  source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"):
persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all
Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device
load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but
`<atomic>` was never included; the file relied on a transitive
include chain that broke once any consumer rearranged includes.
Surfaces as `error: variable 'std::atomic<int> ... has initializer
but incomplete type'` on a clean build.

Pre-existing bug, unrelated to QVAC-18605 itself but blocked
local CTest runs against the Vulkan-optimisation work.  Trivial
additive include with no behaviour change.

Co-authored-by: Cursor <cursoragent@cursor.com>
…s + prewarm

Layered on top of the QVAC-18605 Vulkan bring-up commit; the
round-2 changes generalise the bring-up's "load-time backend
probe" pattern into a process-wide capability cache and add
three more probes / dispatch hooks that fit the same shape.
Net effect on Vulkan: redundant supports_op traffic eliminated,
defensive auto-policy gating extended to F16 weights, forward-
compat Q8_0 K/V probe primed for a follow-up dispatch flip,
and an opt-in --prewarm hook that lets operators amortise the
~hundreds-of-ms cold-start shader-compile cost outside the
operator-visible first synth call.

1) Process-wide capability-probe cache keyed by ggml_backend_t

   The bring-up's three load sites (load_supertonic_gguf,
   Engine::Engine, supertonic_bench's main) each ran the
   LEAKY_RELU + F16-K/V flash-attn supports_op queries
   independently — 2-3x redundant probe traffic per backend.
   On Vulkan, supports_op may inspect the device's pipeline
   state (~50-200 us per query on Adreno / llvmpipe / RADV in
   microbenchmarks); the cache short-circuits 100 % of the
   duplicates.  Test seam (supertonic_clear_capability_cache +
   supertonic_capability_probe_call_count) lets the unit test
   verify the cache is hit on the second call by comparing the
   counter before / after.  Per-backend independence verified
   against two distinct CPU backend handles.

2) F16 mul_mat backend-capability probe

   Symmetric to the F16-K/V flash-attn probe.  The bring-up
   auto-enabled use_f16_weights on `!backend_is_cpu` blindly;
   a partial-port backend that ships F16 storage but rejects
   the hot vector-estimator W_query mul_mat shape would crash
   at first synth call.  Probe builds the live shape ([256,256]
   F16 weight x [256,16] F32 activation) and asks the backend;
   auto-policy refuses materialisation on a `false` answer
   (slower F32 path stays correct).  Manual --f16-weights 1
   still forces materialisation (debug-shim escape hatch).
   Probe cached; test verifies CPU returns true.

3) Q8_0 K/V flash-attn forward-compat probe

   Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0
   (and Q4_0) K/V types in scalar + coopmat2 paths.  Switching
   K/V from F16 to Q8_0 would halve the per-step upload
   bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape;
   ~1 MB / synth on the default 5-step x 4-site schedule) in
   exchange for a small (~0.5 %) drift on the attention output.
   This commit adds the probe + caches the result; live
   dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift
   measurement against the parity harness on a real Vulkan
   adapter.  Bench output annotates `(q8_0_kv_attn=available)`
   when the probe says yes so operators can confirm their
   hardware is ready for the follow-up.

4) Engine::warm_up(text) + EngineOptions::prewarm_text +
   --prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench)

   First-synth-latency reduction on Vulkan / OpenCL.  In-tree
   thread_local graph caches handle every subsequent call but
   can't avoid the first pipeline-compile cost (~hundreds of
   ms on Adreno / RADV per chatterbox PROGRESS.md).  warm_up
   runs one throwaway synth at construction time on a caller-
   supplied sample text so the operator-visible first synth
   sees steady-state latency.  Auto-no-op on CPU (no shader-
   compile cost).  Bench's --prewarm runs the cold-start synth
   BEFORE the timed loop (independent of --warmup N which only
   discards N timed runs from the median); cold-start latency
   logged as `[prewarm] cold-start synth on '...' took N.Nms`
   and emitted to --json-out as "prewarm_ms".

5) Bench output extended

   Backend log line surfaces every dispatch flag plus the
   cold-start prewarm latency:
     Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on)
       (native_leaky_relu=on) (q8_0_kv_attn=available)
   --json-out gains "f16_attn", "f16_weights",
   "native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms"
   keys for downstream analysis tooling.

Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe
  cache short-circuit + clear seam + per-backend independence
  + idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke.
  18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface
  contract for EngineOptions::prewarm_text + Engine::warm_up
  via SFINAE.  9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-
  dispatch, -portable-ops, -backend-dispatch, -rope-in-graph,
  -rope-packed-qk, -in-graph-transpose, -convnext-block-fused,
  -graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus
  resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ /
  184+ individual checks).

Build
- All changed source files compile clean with both
  -DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new
  optional field defaulting to empty (no-op), Engine::warm_up
  is a new method (existing callers don't have to invoke it).

Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"):
persistent VkPipelineCache (cross-process), BF16 K/V flash-attn,
Q8_0 K/V live dispatch wiring, multi-device load-balancing.

Co-authored-by: Cursor <cursoragent@cursor.com>
…vice auto-pick + 2 forward-compat probes

Three more Vulkan-specific deltas, all developed test-first.  New
tests were committed first, observed to fail on the missing
symbol, and only then was the implementation written and the
tests re-run to verify green.

1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities
   flag).  Symmetric to the round-2 Q8_0 K/V probe.  Vulkan's
   FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2-
   only path; BF16 has the same 2-byte per-element footprint as
   F16 (so identical upload bandwidth) but the wider 8-bit
   exponent range avoids the F16 underflow on small attention
   scores.  Forward-compat — the live --kv-attn-type bf16 dispatch
   wiring is deferred to a follow-up that measures drift against
   the parity harness on a real Vulkan adapter.

2. Multi-device auto-pick for --vulkan-device -1.  Wires the
   previously-reserved auto-pick API: walks every visible adapter,
   queries ggml_backend_vk_get_device_memory() to read free VRAM,
   and dispatches into a pure-logic helper
   resolve_vulkan_device_index(requested, free_vram_per_device)
   that picks argmax(free_vram); ties → lower index for stable
   per-run assignment on identical-spec multi-GPU machines.  The
   pure-logic helper is testable on CPU with synthetic inputs (8
   test functions, 23 checks).  Reserved-future negative values
   (-2, -100, ...) now throw instead of silently falling through
   to device 0.  Verbose mode logs the per-device VRAM table so
   operators can confirm the auto-pick chose the expected adapter.

3. Pinned-host-buffer-type capability probe (6th cache flag) +
   bench surface.  Probes whether ggml_backend_vk_host_buffer_type()
   is callable on the resolved backend (Vulkan + non-null buffer-
   type).  Forward-compat — primes the capability cache for a
   follow-up per-engine input-scratchpad refactor that skips
   ggml-vulkan's internal staging-buffer hop on per-step uploads.
   Bench output now shows bf16_kv_attn_available +
   pinned_host_buffer_available in both the human-readable backend
   tag and the JSON output so operators can pre-flight whether a
   future opt-in will be effective on their machine.

Test plan (TDD round 3):
- test-supertonic-capability-cache: 27 / 27 checks pass (was 18,
  +9 checks for round-3: BF16 K/V smoke + cache-slot share,
  pinned-host-buffer smoke + cache-slot share, null-backend
  defensive checks for both new probes).
- test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass
  (8 test functions: empty-list, single-device, argmax-VRAM, tie-
  break, explicit-index passthrough, out-of-range, reserved-
  negative, zero-VRAM handling).
- Whole CPU-only ctest -L unit reports 16 / 16 tests passing,
  zero regressions on round-1 / round-2 / audit-follow-up tests.

CLI surface:
- supertonic CLI + chatterbox CLI usage strings updated to
  document --vulkan-device -1 = auto-pick adapter with most free
  VRAM.
- supertonic-bench usage string updated likewise.

Co-authored-by: Cursor <cursoragent@cursor.com>
…hts operator deny-list

Round 6 layers a user-overridable extra deny-list on top of the
existing hand-curated should_materialise_f16_weight() allow-list.
The curated allow-list (Phase 2A) already excludes biases, norms,
embeddings, depthwise convs, and pre-transposed companions; the
round-6 deny-list lets operators force-keep specific additional
tensors as F32 even when --f16-weights is on.  Use cases:

- A/B testing: researcher excludes a specific tensor pattern
  temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic
  tensor to F32 via config rather than disabling F16 weights
  wholesale.
- Future-GGUF safety net: new tensor patterns added in future
  GGUFs that the curated allow-list inadvertently scoops in can
  be excluded via config without a code change.

Smallest blast radius of the four follow-up rounds — load-time
policy only, runtime dispatch unaffected, zero behaviour change
on the empty-deny-list default path.

Strict TDD discipline (per the user's "double check, don't break
anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols
  (predicate test: 'too many arguments to should_materialise_f16_weight';
  API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.

What changed:

1. 2-arg overload should_materialise_f16_weight(name,
   extra_deny_substrings) added alongside the existing 1-arg
   version (existing test + call sites unchanged).  Substring
   matching matches the curated predicate's audit-friendly style;
   no regex compile cost or invalid-pattern surface.  The deny-
   list can only flip true → false, never false → true.  Empty
   strings inside the deny-list are SKIPPED defensively, not
   treated as universal matches (config-typo guard).

2. EngineOptions::f16_weights_deny_list (vector<string>, default
   empty) — public API surface.  Wired through Engine::Impl →
   load_supertonic_gguf → the per-tensor allocation loop.

3. load_supertonic_gguf 7th parameter added at the end of the
   signature with a {} default — every existing call site keeps
   compiling without modification.

4. supertonic_model::f16_weights_excluded_count counter bumped at
   load time when a curated-hot tensor is excluded by the user's
   deny-list.  Surfaced in bench's human + JSON output so
   operators can confirm their config took effect.

5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on
   supertonic-cli, tts-cli (chatterbox), and supertonic-bench
   (comma-separated substring patterns).

6. Verbose-log line in load_supertonic_gguf when the deny-list is
   non-empty (silent on the default path — no visual noise on
   existing operator workflows).

Test plan (TDD round 6):

- test-supertonic-f16-weights (UPDATED): existing 36 checks
  (positives, negatives, edges) + 29 new round-6 checks across 7
  new test functions (empty-list passthrough, matching-deny-
  excludes, non-matching-no-op, cannot-promote-cold, multiple-
  patterns ANY-match, empty-string defensive skip, empty-name
  safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time
  gate for EngineOptions::f16_weights_deny_list +
  load_supertonic_gguf 7th param; runtime defaults check +
  assignability + regression guards on every other documented
  EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0
  failures, 0 regressions on round-1/2/3 + audit follow-up + the
  baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench
  binaries: --f16-weights-deny flag parses correctly, surfaces in
  --help output, and threads through to the load layer.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ype K/V flash-attention dispatch

Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a
four-valued enum + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag
so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth
as F16, no F16 underflow on small attention scores) or Q8_0 K/V
(Vulkan + half the K/V upload bandwidth) on adapters that advertise
the corresponding capability.  Default `auto` falls back to
`--f16-attn` so every existing operator config sees zero behaviour
change.

Strict TDD throughout: Prereq B extends the F16 parity harness to
cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both
hot shapes) BEFORE touching any production code; new pure-logic
resolver test (`test-supertonic-kv-attn-type`, 106 checks across the
full {-1, 0..3} × legacy × probe-mask matrix); new API-surface
SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks).
Tests committed first, observed to fail on missing symbols, then
implementation added.

Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch
site (same pattern as round-3's `resolve_vulkan_device_index`).
Probe-rejected explicit requests fall back to F32 silently
(advisory-probe contract); out-of-range int throws to surface CLI
typos loudly.  Vector-estimator dispatch site
(`build_text_attention_cache`) replaces the F16-only cast with a
switch on the enum; cache key promoted from `bool f16_kv_attn` to
`kv_attn_dtype kv_attn_type`.  Bench surface adds `(kv_attn_type=…)`
to the human-readable backend line and `"kv_attn_type"` +
`"kv_attn_type_requested"` to the JSON output so log-grep / CI
attribution works across machines.

Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch
so invalid values surface as a clean `error: ...` line + exit 2
(also fixes the pre-existing latent crash on `--vulkan-device abc` /
`--seed nonsense`).

Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0
regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…servability + voice cache + Vulkan env-var passthrough

Lowest impact-÷-risk round of the four planned in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  Four sub-features, none
touching the per-synth hot path beyond a single voice-cache
lookup.

1. Voice ttl/dp host cache (`detail::voice_host_cache`).  Eliminates
   2 sync points / synthesize() after the first per-voice call on
   Vulkan / OpenCL.  Extracted to a standalone helper so the
   lookup-or-load semantics are testable on CPU without
   instantiating a full Engine; reference-stability contract
   documented for the synthesis-pipeline call site.

2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)`
   public helper + `EngineOptions::vulkan_env_overrides` field +
   `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` /
   `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` /
   `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags
   on all three binaries).  ALL-OR-NOTHING validation: an
   operator-config typo throws cleanly BEFORE any env var is
   touched.  `set_env_if_unset` semantics so an operator-set env
   var still WINS over the EngineOptions override.

3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync`
   opt-out).  Inserts an explicit backend sync at every per-stage
   timing boundary so wall-clock attributes to the right stage on
   async backends.  Cheap on CPU; prerequisite for measuring
   round-5 / 8 / 9 wins on real hardware.

4. Bench per-denoise-step breakdown (`--bench-per-step`).  Times
   each `supertonic_vector_step_ggml` call individually so the
   first-step (cold pipeline) cost is distinguished from
   steady-state.  Empty array on the default-off path = identical
   legacy JSON shape.

Strict TDD throughout.  Two new test executables committed
first, observed to fail on missing symbols, then implementation
written.  TDD also caught a real bug: the original env-key
validator used `std::string()` empty-as-success sentinel which
collided with the empty-string-as-key edge case; the test pinned
the contract and forced a `bool / out-param` API fix BEFORE any
production wiring went in.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions (was 19; +2 new tests = 54 new checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
…PU bridge

Single largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md.  PR tetherto#16's audit follow-up tetherto#6
(2C-lite) shipped the GPU device→device blit infrastructure
(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
attentions to use it; the front-block attn0 site was deferred
because of cache-lifetime concerns at the time.  Round 8 picks
it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one
function.

Eliminates 6 sync points × 5 denoise steps = 30 sync points /
synth on the production path (3 GPU→host downloads + 3 host→GPU
uploads of post-RoPE Q / K / raw V at the front-block attn0
site).  Strict gating on `front_in_graph_rope &&
!include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace
mode falls back to the legacy host bridge so the trace harness
still captures pre-attention Q/K/V host vectors, and legacy
GGUFs without `vector_rope_theta` continue to take the host-
rotate path.

The blit primitive parity gate already shipped with PR tetherto#16
(`test-supertonic-graph-to-graph-blit`); round 8 extends it
with explicit coverage of the front-block K / V shapes
(text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`).

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…U bridge

Extends the round-8 GPU bridge pattern to the 4 style flash-attn
sites (style0 + g1_style + g2_style + g3_style).  Largest
bandwidth-style optimisation that ships from pure-Supertonic-side
code: 120 sync points / synth eliminated on the production
Vulkan / OpenCL path (4× the round-8 win).

- vector_res_style_qkv_result extended with `sq_gpu / sk_gpu /
  sv_gpu` GPU handles, populated unconditionally by
  `run_res_style_qkv_cache` (cheap — no GPU sync; just
  `ggml_graph_get_tensor` lookups).  Same shape as
  `vector_group_graph_result::q_rope_gpu` etc from the round-1
  2C-lite work.

- `run_res_style_qkv_cache` host-download gating: the 3
  `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv`
  are now gated on `trace != nullptr`.  Production path skips
  them entirely.  Mirrors the round-1 2C-lite
  `need_host_qkv = (trace != nullptr)` gate.  `post` stays
  unconditional — consumed by the next-stage
  `run_style_residual_cache` which still expects a host vector
  (cross-stage GPU bridge for `post` is deferred).

- 4 dispatch sites rewired with the same gating pattern as the
  round-8 front-block bridge: `!include_ggml_trace && sq_gpu &&
  sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge.
  Trace mode falls back to the legacy host bridge so the trace
  harness still gets all the host vectors.

Strict TDD: parity test
(`test-supertonic-graph-to-graph-blit`) extended with explicit
style-shape coverage (`style_sq_L1` trip-wire + clarified
`style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any
production wiring.  All 24 / 24 parity checks pass at bit-exact
`max_abs = 0.0`.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…t upload-skip

After rounds 8 + 9 wired the GPU bridge for the 5 attention
sites, the largest remaining per-step host upload is `text_emb`
(uploaded to 4 caches × 5 denoise steps = 20 times / synth, but
constant data within one synth).  Round 10 generalises the F4
pointer-compare upload-skip pattern (already used for
`style_v_in` / `kctx_in`) into a reusable
`upload_skip_tracker` helper and applies it to the front-block
+ 3 group caches.

CRITICAL CORRECTNESS HAZARD addressed:

`text_emb` is a stack-local `std::vector<float>` in
`Engine::Impl::synthesize()` (and bench loops).  Modern heap
allocators (jemalloc / tcmalloc / glibc) very often re-issue
the SAME address for the next stack-local vector of the same
size — so synth N+1 may have `text_emb.data() ==
synth_N.text_emb.data()` despite holding completely different
data.  A naive pointer-compare upload-skip would silently leak
prior synth's text-encoder embedding into the next synth's GPU
buffer.

Mitigation: caller MUST invoke `tracker.reset()` at every
synth boundary (`current_step == 0`).  The CPU-only TDD test
includes an explicit cross-synth pointer-reuse hazard
simulation that documents the bug and verifies the reset
prevents it.

Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in
  prompt length)

Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7
functions, 41 checks) committed first, observed to fail compile
(`upload_skip_tracker was not declared`), then implementation
added.

Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0
failures, 0 regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
…PU-bridge layout fix

Critical correctness fix.  Round 11 didn't add a new optimisation
— it made every prior round actually run end-to-end on real
hardware.  Rounds 8 + 9 + 10 had all shipped CPU-only unit-test
green, but the unit tests never exercised the production code
path with a real GGUF carrying `vector_rope_theta`.  The first
end-to-end synth attempt (CPU OR Vulkan) aborted at
`GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge
fast paths would have hit
`GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V
matmul outputs were the byte-for-byte transpose of what the
attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5)
was written under the assumption that `dense_matmul_time_ggml`
returns a `ne=[HD, L]` channel-fastest-in-memory tensor.  In
fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`)
produces `ne=[L, HD]` with channel-major-flat memory — the
bit-exact transpose of the helper's input contract.  The CPU
unit test that landed alongside the helper hand-built Q under
the wrong `[HD, L]` shape, so the failure mode was invisible
to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
   production matmul shape `ne=[L, HD]` (channel-major-flat
   memory).  Reference built in scalar `apply_rope`'s native
   time-major-flat layout; test verifies the helper's output
   bytes match bit-for-bit AND pins `y->ne[0] = HD,
   y->ne[1] = L` so the downstream `q_tc_in` blit cannot
   regress on layout.  Committed RED first, observed to abort
   at the same assertion the production crash hits, then
   landing the helper fix turned it GREEN (14 / 14 checks).

2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
   head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
   `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-
   flat (which IS the layout `q_tc_in` expects).  Rest of the
   pipeline unchanged.  Output ne=[HD, L] time-major-flat
   bytes match scalar `apply_rope`'s native layout AND
   `q_tc_in`'s blit target bit-for-bit.

3. V (and the style sq/sk/sv) have no RoPE to mask the layout
   flip — open-code the same `ggml_cont(ggml_transpose(...))`
   at the matmul output in `build_group_graph_cache`,
   `ve_front_block_proj_cache`, and `build_res_style_qkv_cache`
   so all four GPU-bridge attention sites get bit-for-bit
   matching layouts.

4. Legacy host-bridge fallbacks switched from
   `tensor_to_time_channel(<post-rope-or-v>)` to
   `tensor_raw_f32(...)`.  The new graph-side layout puts the
   bytes already in the time-major-flat shape scalar
   `apply_rope` / `flash_attention_qkv` host references read,
   so the raw download is the correct call;
   `tensor_to_time_channel` would now apply the transpose-of-
   the-transpose and feed wrong-orientation Q/K/V into the
   attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0
regressions.  Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V
dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm,
front-block + style + group GPU bridges, text-input upload-
skip) are now actually exercised end-to-end on every Vulkan
adapter we have — they just couldn't run before round 11
unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000 Zbig9000 force-pushed the QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 1b710d3 to c383e70 Compare May 13, 2026 16:01
@gianni-cor gianni-cor requested review from a team as code owners May 28, 2026 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants