Qvac 18605 tts ggml add and optimize vulkan for supertonic#17
Open
Zbig9000 wants to merge 66 commits into
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
QVAC-7457: Add seed parameter for reproducible sampling
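The per-decoder seeding rule (seed + decoder_index) can be sketched as follows. This is a minimal illustration only: `make_decoder_rngs` and the choice of `std::mt19937` are assumptions made for the example, not the actual whisper.cpp implementation.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not from the PR): one RNG per decoder, each seeded
// with seed + decoder_index so decoders get distinct but reproducible streams.
std::vector<std::mt19937> make_decoder_rngs(uint32_t seed, int n_decoders) {
    std::vector<std::mt19937> rngs;
    rngs.reserve(static_cast<size_t>(n_decoders));
    for (int i = 0; i < n_decoders; ++i) {
        rngs.emplace_back(seed + static_cast<uint32_t>(i));
    }
    return rngs;
}
```

Two runs with the same seed produce identical per-decoder streams, which is what makes sampling at temperature > 0 repeatable.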
add_codeowners file
…orker added approval check worker
…ners DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
chore: rebase fork to whisper.cpp v1.8.4
Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor
- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI models

Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models.

Made-with: Cursor
Made-with: Cursor
Address review feedback:
1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind an n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields; reading them unconditionally would corrupt the file position and break model loading.
2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading.
3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization.
4. Log BCI-specific hparams when a BCI model is detected.

Made-with: Cursor
The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor
…, Threads
1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor.
2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h.
3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.
…ash attention
1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call.
2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder.
3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.
…alidation
1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift.
2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active.
3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths.
4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage.
5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks.

Made-with: Cursor
[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)
…review)
Address @gianni-cor review on PR tetherto#11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy.
- parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated.
- parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix.

Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml
…herto#6)
The standalone setup-ggml.sh + patches/ tooling was dropped from qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the CMakeLists.txt still:
* defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
* unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-" on the bundled ggml target.

That combination quietly broke standalone bundled-ggml builds: the filename-prefix patch was no longer applied, so libspeech-ggml-*.so files existed on disk but ggml's runtime loader still searched for libggml-*.so under GGML_BACKEND_DL=ON. Vulkan / OpenCL / CUDA backends silently failed to load on Android.

Fix per reviewer guidance: converge the speech stack on a single ggml source-of-truth. Standalone-bundled-ggml is no longer a supported build mode out of this in-tree subtree; the canonical path is `-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack `ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships the patches pre-applied.

Edits:
- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this tree. Docstring spells out the rationale + points users at the standalone github.com/gianni-cor/chatterbox.cpp repo if they need a bundled-ggml dev build with patches/ present.
- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to configure when patches/ is absent: a FATAL_ERROR points at the right consumption path (vcpkg ggml-speech) and the standalone fallback. Doesn't break in-tree-with-patches builds (parakeet-cpp in this same repo still ships patches/, so its bundled path is unaffected by this guard inside tts-cpp).
- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors out at find_package(ggml CONFIG REQUIRED) with our new message pointing at the ggml-speech port; `cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard with the no-patches message.
- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/ that no longer exist; running it would have errored out anyway. The standalone repo keeps its own setup-ggml.sh; only the in-tree subtree drops it.

The standalone chatterbox.cpp repo (the one tts-cpp/ was copied from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/ folder + scripts/setup-ggml.sh. This commit is therefore an integration-time delta against that source, not a change to the standalone build flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
… / vector graph caches
QVAC-18607 follow-up tetherto#3. Three more audit findings landed on top of follow-up tetherto#2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth.

F17 Duration scalar-continuation `read_f32` cache. Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync.

F18 Text-encoder convnext-front graph cached across synths. `supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14.

F19 Vector-estimator front-block graph cached across denoise steps. The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound): test-supertonic-audit3-caches (279 lines)
- F17: structural; asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit-exact across the two calls.
- F18: parity; two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers).
- F19: parity; same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.

Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.

Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.

Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.

What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
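The lazy-populate pattern behind F17 can be sketched as follows. `model_sketch`, its `backend_read` callback, and the `backend_reads` counter are stand-ins invented for this example; the real helper reads through ggml's backend tensor API and its exact signature is not shown in the source.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical model stand-in: backend_read simulates an expensive
// device-to-host tensor read; backend_reads counts how often it runs.
struct model_sketch {
    std::function<std::vector<float>(const std::string &)> backend_read;
    std::unordered_map<std::string, std::vector<float>>    scalar_weight_cache;
    int backend_reads = 0;
};

// First touch pays the backend read; later calls return the cached host copy.
const std::vector<float> & cached_read_f32(model_sketch & model,
                                           const std::string & name) {
    auto it = model.scalar_weight_cache.find(name);
    if (it == model.scalar_weight_cache.end()) {
        ++model.backend_reads;
        it = model.scalar_weight_cache.emplace(name, model.backend_read(name)).first;
    }
    return it->second;
}
```

The structural check described for F17 corresponds to asserting that the map size and the read counter stay flat on the second call.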
… helper (F20 partial)
Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side `make_rope_cos_sin_tables(theta, L, half)` precompute helper in supertonic_internal.h. Both use only universally-supported GGML ops (reshape / view / permute / mul / add) so the rotation can later run on the OpenCL / Metal / Vulkan backends without per-element scalar CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this change small and reviewable; the existing scalar `apply_rope` path is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0

Wired into CMakeLists.txt with the "unit" label.

Co-authored-by: Cursor <cursoragent@cursor.com>
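A host-side sketch of precomputed cos/sin tables and the rotation they drive is given below. The angle formula (t * theta^(-i/half)), the rotate-half pairing of (x[i], x[i+half]), the row-major [L, 2*half] layout, and all names are assumptions for illustration; Supertonic's actual RoPE convention is not spelled out in the source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical table holder: both tables are [L, half], row-major.
struct rope_tables {
    std::vector<float> cos_tab;
    std::vector<float> sin_tab;
};

// Assumed frequency schedule: angle(t, i) = t * theta^(-i / half).
rope_tables make_rope_cos_sin_tables_sketch(float theta, int L, int half) {
    rope_tables t;
    t.cos_tab.resize(static_cast<size_t>(L) * half);
    t.sin_tab.resize(static_cast<size_t>(L) * half);
    for (int pos = 0; pos < L; ++pos) {
        for (int i = 0; i < half; ++i) {
            const float freq  = std::pow(theta, -static_cast<float>(i) / half);
            const float angle = static_cast<float>(pos) * freq;
            t.cos_tab[static_cast<size_t>(pos) * half + i] = std::cos(angle);
            t.sin_tab[static_cast<size_t>(pos) * half + i] = std::sin(angle);
        }
    }
    return t;
}

// Rotates x in place; x is [L, 2*half], pairing element i with i + half.
// Only multiplies and adds are used, mirroring the ops an in-graph version needs.
void apply_rope_with_tables(std::vector<float> & x, const rope_tables & t,
                            int L, int half) {
    for (int pos = 0; pos < L; ++pos) {
        for (int i = 0; i < half; ++i) {
            const size_t ia = static_cast<size_t>(pos) * 2 * half + i;
            const size_t ib = ia + half;
            const float  c  = t.cos_tab[static_cast<size_t>(pos) * half + i];
            const float  s  = t.sin_tab[static_cast<size_t>(pos) * half + i];
            const float  a  = x[ia];
            const float  b  = x[ib];
            x[ia] = a * c - b * s;
            x[ib] = a * s + b * c;
        }
    }
}
```

With cos = 1 and sin = 0 the rotation reduces to the identity, which is the second property the commit's test checks.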
… integration (F20+F23)
Bakes the per-step apply_rope rotation into the same GGML graphs
that produce Q/K (4 attention sites: front block + 3 group caches),
eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time)
plus the implicit "host can't dispatch next graph until rotation
completes" ordering constraint.
Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin,
n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout
adapter between the `[head_dim, n_heads, L]` contract of the
already-landed `apply_rope_in_graph` helper (F20-h) and the
`[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces.
Universally-supported ops only (view, cont, reshape, mul, sub,
add, repeat, concat) — green on baseline upstream OpenCL.
Graph wiring: each Q/K-producing cache (vector_group_graph_cache
+ ve_front_block_graph_cache) now owns four host-uploaded cos/sin
input tensors (Q's L + K's text_len) and emits `<q_name>_rope` /
`<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin
tables are populated once at cache build time (stable for the
cache's lifetime since they depend only on L / text_len / θ).
Call sites: the 4 RoPE-using sites in
`supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` /
`k_rope` outputs directly and only fall back to host apply_rope
when the GGUF didn't ship `vector_rope_theta` (legacy safety net).
The pre-RoPE Q/K trace entries remain unchanged so scalar-parity
harnesses keep their existing contract.
Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend
parity vs scalar apply_rope on the two hot vector-estimator
shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate
trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt
with LABEL "unit" (no GGUF required).
Full sweep verification:
- 9 / 9 supertonic source files: clean syntax-check
- 21 / 21 test files: clean syntax-check
- 98 / 98 CPU-only unit-test checks pass across
test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops,
backend-dispatch, f16-attn-parity, profile-csv}.
Audit pass tetherto#5 catalogued the remaining hot-path opportunities;
deferred items (F7 vocoder layout flip, F12 host transposes, 2C
full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in
aiDocs/AUDIT_SUPERTONIC_OPENCL.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
…on, in-graph transpose, Q/K/V GPU bridge
Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite),
each landed with a TDD unit test that runs CPU-only (no GGUF fixture required).
F7 — Vocoder ConvNeXt block fusion:
* convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in
[C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct
ggml_mul_mat against that layout, eliminating the layer-norm back-permute
and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass
across the 10 blocks).
* test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference,
max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape.
F12 — In-graph time/channel transpose:
* transpose_time_channel_ggml (supertonic_internal.h) replaces the
pack_time_channel_for_ggml host loops at every run_*_cache ingestion site
in supertonic_vector_estimator.cpp (group / res-style QKV / style residual
/ tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native
x_tc directly and the graph does ggml_cont(ggml_transpose(...)).
* Also drops a redundant double-transpose on the tail-graph noisy_latent path.
* test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err
= 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes.
F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph:
* vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor
handles harvested from the group cache's graph.
* run_text_attention_cache_gpu — new overload that consumes those handles
via ggml_backend_tensor_copy (same-backend device→device blit) instead of
the historical tensor_get + tensor_set pair.
* Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now
gated on (trace != nullptr || !apply_rope); production runs with in-graph
RoPE skip them entirely.
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the
GPU fast path (legacy host-RoPE fallback preserved for GGUFs without
vector_rope_theta). Net: 90 sync points / synth eliminated. Front-block
and the four style attention sites still pay the round-trip; targeting
them is the next iteration.
* test_supertonic_graph_to_graph_blit.cpp — 15 checks, bit-exact across the
five representative attn/style shapes plus L=1.
Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
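The F7 lowering rests on the fact that a K=1 convolution over time is a per-time-step matrix-vector product, so no im2col expansion is needed. The sketch below checks that equivalence on the host; the function names, the row-major [T, C_in] layout, and the zero-padded reference convolution are assumptions for illustration, not the ggml graph code.

```cpp
#include <cstddef>
#include <vector>

// Pointwise (K=1) conv written directly as a matmul.
// x: [T, C_in] row-major, w: [C_out, C_in], b: [C_out]; returns [T, C_out].
std::vector<float> pointwise_conv_as_matmul(const std::vector<float> & x,
                                            const std::vector<float> & w,
                                            const std::vector<float> & b,
                                            int T, int C_in, int C_out) {
    std::vector<float> y(static_cast<size_t>(T) * C_out);
    for (int t = 0; t < T; ++t) {
        for (int co = 0; co < C_out; ++co) {
            float acc = b[co];
            for (int ci = 0; ci < C_in; ++ci) {
                acc += w[static_cast<size_t>(co) * C_in + ci] *
                       x[static_cast<size_t>(t) * C_in + ci];
            }
            y[static_cast<size_t>(t) * C_out + co] = acc;
        }
    }
    return y;
}

// General conv1d reference with kernel size K and zero "same" padding.
// w: [C_out, C_in, K]. With K == 1 it visits exactly the same terms as above.
std::vector<float> conv1d_reference(const std::vector<float> & x,
                                    const std::vector<float> & w,
                                    const std::vector<float> & b,
                                    int T, int C_in, int C_out, int K) {
    std::vector<float> y(static_cast<size_t>(T) * C_out);
    for (int t = 0; t < T; ++t) {
        for (int co = 0; co < C_out; ++co) {
            float acc = b[co];
            for (int ci = 0; ci < C_in; ++ci) {
                for (int k = 0; k < K; ++k) {
                    const int tt = t + k - K / 2;
                    if (tt < 0 || tt >= T) continue;  // zero padding
                    acc += w[(static_cast<size_t>(co) * C_in + ci) * K + k] *
                           x[static_cast<size_t>(tt) * C_in + ci];
                }
            }
            y[static_cast<size_t>(t) * C_out + co] = acc;
        }
    }
    return y;
}
```

Because both loops accumulate the same products in the same order when K is 1, the two results agree exactly in this host sketch.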
The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>
…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic
Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need.

Vulkan-specific deltas
- supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time:
  - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description().
  - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node: short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in.
- supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob).
- Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (an out-of-range index is a hard error; this surfaces operator typos / wrong-machine config loudly rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch.

Tests
- test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass.
- test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass.
- test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass.
- All audit follow-up tests from tetherto#16 unchanged, all PASS.

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean.
- No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible.

Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick).

Co-authored-by: Cursor <cursoragent@cursor.com>
`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>
…s + prewarm
Layered on top of the QVAC-18605 Vulkan bring-up commit; the
round-2 changes generalise the bring-up's "load-time backend
probe" pattern into a process-wide capability cache and add
three more probes / dispatch hooks that fit the same shape.
Net effect on Vulkan: redundant supports_op traffic eliminated,
defensive auto-policy gating extended to F16 weights, forward-
compat Q8_0 K/V probe primed for a follow-up dispatch flip,
and an opt-in --prewarm hook that lets operators amortise the
~hundreds-of-ms cold-start shader-compile cost outside the
operator-visible first synth call.
1) Process-wide capability-probe cache keyed by ggml_backend_t
The bring-up's three load sites (load_supertonic_gguf,
Engine::Engine, supertonic_bench's main) each ran the
LEAKY_RELU + F16-K/V flash-attn supports_op queries
independently — 2-3x redundant probe traffic per backend.
On Vulkan, supports_op may inspect the device's pipeline
state (~50-200 us per query on Adreno / llvmpipe / RADV in
microbenchmarks); the cache short-circuits 100 % of the
duplicates. Test seam (supertonic_clear_capability_cache +
supertonic_capability_probe_call_count) lets the unit test
verify the cache is hit on the second call by comparing the
counter before / after. Per-backend independence verified
against two distinct CPU backend handles.
2) F16 mul_mat backend-capability probe
Symmetric to the F16-K/V flash-attn probe. The bring-up
auto-enabled use_f16_weights on `!backend_is_cpu` blindly;
a partial-port backend that ships F16 storage but rejects
the hot vector-estimator W_query mul_mat shape would crash
at first synth call. Probe builds the live shape ([256,256]
F16 weight x [256,16] F32 activation) and asks the backend;
auto-policy refuses materialisation on a `false` answer
(slower F32 path stays correct). Manual --f16-weights 1
still forces materialisation (debug-shim escape hatch).
Probe cached; test verifies CPU returns true.
3) Q8_0 K/V flash-attn forward-compat probe
Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0
(and Q4_0) K/V types in scalar + coopmat2 paths. Switching
K/V from F16 to Q8_0 would halve the per-step upload
bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape;
~1 MB / synth on the default 5-step x 4-site schedule) in
exchange for a small (~0.5 %) drift on the attention output.
This commit adds the probe + caches the result; live
dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift
measurement against the parity harness on a real Vulkan
adapter. Bench output annotates `(q8_0_kv_attn=available)`
when the probe says yes so operators can confirm their
hardware is ready for the follow-up.
4) Engine::warm_up(text) + EngineOptions::prewarm_text +
--prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench)
First-synth-latency reduction on Vulkan / OpenCL. In-tree
thread_local graph caches handle every subsequent call but
can't avoid the first pipeline-compile cost (~hundreds of
ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up
runs one throwaway synth at construction time on a caller-
supplied sample text so the operator-visible first synth
sees steady-state latency. Auto-no-op on CPU (no shader-
compile cost). Bench's --prewarm runs the cold-start synth
BEFORE the timed loop (independent of --warmup N which only
discards N timed runs from the median); cold-start latency
logged as `[prewarm] cold-start synth on '...' took N.Nms`
and emitted to --json-out as "prewarm_ms".
5) Bench output extended
Backend log line surfaces every dispatch flag plus the
cold-start prewarm latency:
Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on)
(native_leaky_relu=on) (q8_0_kv_attn=available)
--json-out gains "f16_attn", "f16_weights",
"native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms"
keys for downstream analysis tooling.
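The process-wide probe cache from item 1, together with its test seam, can be sketched as below. `backend_handle` stands in for ggml_backend_t, `run_probes` stands in for the real supports_op queries, and the shortened function names are hypothetical; only the overall shape (cache keyed by backend handle, clear hook, call counter) comes from the commit text.

```cpp
#include <map>
#include <mutex>

using backend_handle = const void *;   // stand-in for ggml_backend_t

struct capabilities_sketch {
    bool native_leaky_relu;
    bool f16_kv_flash_attn;
};

namespace {
std::mutex                                    g_mu;
std::map<backend_handle, capabilities_sketch> g_cache;
int                                           g_probe_calls = 0;

// Hypothetical expensive probe; the real code would query the backend.
capabilities_sketch run_probes(backend_handle /*backend*/) {
    ++g_probe_calls;
    return {true, true};
}
}  // namespace

// Probes run at most once per backend handle for the life of the process.
capabilities_sketch cached_capabilities(backend_handle backend) {
    std::lock_guard<std::mutex> lock(g_mu);
    auto it = g_cache.find(backend);
    if (it == g_cache.end()) {
        it = g_cache.emplace(backend, run_probes(backend)).first;
    }
    return it->second;
}

// Test seam: drop cached results so the next lookup probes again.
void clear_capability_cache() {
    std::lock_guard<std::mutex> lock(g_mu);
    g_cache.clear();
}

// Test seam: how many times the underlying probe actually ran.
int capability_probe_call_count() {
    std::lock_guard<std::mutex> lock(g_mu);
    return g_probe_calls;
}
```

Comparing the counter before and after a repeated lookup is the same check the unit test uses to prove the second call is a cache hit.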
Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe
cache short-circuit + clear seam + per-backend independence
+ idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke.
18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface
contract for EngineOptions::prewarm_text + Engine::warm_up
via SFINAE. 9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-
dispatch, -portable-ops, -backend-dispatch, -rope-in-graph,
-rope-packed-qk, -in-graph-transpose, -convnext-block-fused,
-graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus
resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ /
184+ individual checks).
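An API-surface contract "via SFINAE", as in test-supertonic-warm-up-api, can be sketched with member-detection traits. `EngineOptionsSketch` and `EngineSketch` are stand-ins for the real types, and the trait names are hypothetical; the pattern itself is standard C++.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Stand-ins for the real types; only the members under test are modelled.
struct EngineOptionsSketch { std::string prewarm_text; };
struct EngineSketch        { void warm_up(const std::string &) {} };

// Portable void_t so the sketch does not depend on C++17.
template <typename... Ts> struct make_void { using type = void; };
template <typename... Ts> using void_t_compat = typename make_void<Ts...>::type;

// True when T has a data member named prewarm_text.
template <typename T, typename = void>
struct has_prewarm_text : std::false_type {};
template <typename T>
struct has_prewarm_text<T, void_t_compat<decltype(std::declval<T &>().prewarm_text)>>
    : std::true_type {};

// True when T has a warm_up method callable with a const std::string &.
template <typename T, typename = void>
struct has_warm_up : std::false_type {};
template <typename T>
struct has_warm_up<T, void_t_compat<decltype(std::declval<T &>().warm_up(
                          std::declval<const std::string &>()))>>
    : std::true_type {};

static_assert(has_prewarm_text<EngineOptionsSketch>::value, "prewarm_text missing");
static_assert(has_warm_up<EngineSketch>::value, "warm_up missing");
```

A test built this way fails at compile time if the field or method is renamed or removed, without needing a model file at run time.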
Build
- All changed source files compile clean with both
-DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new
optional field defaulting to empty (no-op), Engine::warm_up
is a new method (existing callers don't have to invoke it).
Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"):
persistent VkPipelineCache (cross-process), BF16 K/V flash-attn,
Q8_0 K/V live dispatch wiring, multi-device load-balancing.
Co-authored-by: Cursor <cursoragent@cursor.com>
…vice auto-pick + 2 forward-compat probes
Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green.

1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2-only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat: the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter.
2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter.
3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer-type). Forward-compat: primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads.

Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine.

Test plan (TDD round 3):
- test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes).
- test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie-break, explicit-index passthrough, out-of-range, reserved-negative, zero-VRAM handling).
- Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests.

CLI surface:
- supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM.
- supertonic-bench usage string updated likewise.

Co-authored-by: Cursor <cursoragent@cursor.com>
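The pure-logic device picker can be sketched from the rules stated above: -1 picks the adapter with the most free VRAM, ties go to the lower index, other negative values throw, and explicit indices are range-checked. The `_sketch` suffix marks this as an illustration; the exception types and the decision to throw on an empty device list are assumptions, since the source does not specify them.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// requested: explicit device index, or -1 for auto-pick.
// free_vram_per_device: free bytes reported for each visible adapter.
int resolve_vulkan_device_index_sketch(int requested,
                                       const std::vector<uint64_t> & free_vram_per_device) {
    const size_t n = free_vram_per_device.size();
    if (n == 0) {
        // Assumption: with no visible adapters there is nothing to resolve.
        throw std::runtime_error("no Vulkan devices visible");
    }
    if (requested == -1) {
        size_t best = 0;
        for (size_t i = 1; i < n; ++i) {
            // Strict '>' keeps the lower index on ties for stable assignment.
            if (free_vram_per_device[i] > free_vram_per_device[best]) best = i;
        }
        return static_cast<int>(best);
    }
    if (requested < 0) {
        throw std::invalid_argument("negative device index is reserved");
    }
    if (static_cast<size_t>(requested) >= n) {
        throw std::out_of_range("Vulkan device index out of range");
    }
    return requested;
}
```

Keeping this logic free of Vulkan calls is what allows it to be tested on CPU-only machines with synthetic VRAM tables.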
…hts operator deny-list
Round 6 layers a user-overridable extra deny-list on top of the
existing hand-curated should_materialise_f16_weight() allow-list.
The curated allow-list (Phase 2A) already excludes biases, norms,
embeddings, depthwise convs, and pre-transposed companions; the
round-6 deny-list lets operators force-keep specific additional
tensors as F32 even when --f16-weights is on. Use cases:
- A/B testing: researcher excludes a specific tensor pattern
temporarily without recompiling.
- Hardware-specific drift mitigation: operator pins a problematic
tensor to F32 via config rather than disabling F16 weights
wholesale.
- Future-GGUF safety net: new tensor patterns added in future
GGUFs that the curated allow-list inadvertently scoops in can
be excluded via config without a code change.
Smallest blast radius of the four follow-up rounds — load-time
policy only, runtime dispatch unaffected, zero behaviour change
on the empty-deny-list default path.
Strict TDD discipline (per the user's "double check, don't break
anything" constraint):
- Both new tests committed FIRST.
- Both confirmed to fail to compile on the missing symbols
(predicate test: 'too many arguments to should_materialise_f16_weight';
API test: 'EngineOptions has no member f16_weights_deny_list').
- Implementation written.
- Both tests + every existing unit test re-run; all green.
What changed:
1. 2-arg overload should_materialise_f16_weight(name,
extra_deny_substrings) added alongside the existing 1-arg
version (existing test + call sites unchanged). Substring
matching matches the curated predicate's audit-friendly style;
no regex compile cost or invalid-pattern surface. The deny-
list can only flip true → false, never false → true. Empty
strings inside the deny-list are SKIPPED defensively, not
treated as universal matches (config-typo guard).
2. EngineOptions::f16_weights_deny_list (vector<string>, default
empty) — public API surface. Wired through Engine::Impl →
load_supertonic_gguf → the per-tensor allocation loop.
3. load_supertonic_gguf 7th parameter added at the end of the
signature with a {} default — every existing call site keeps
compiling without modification.
4. supertonic_model::f16_weights_excluded_count counter bumped at
load time when a curated-hot tensor is excluded by the user's
deny-list. Surfaced in bench's human + JSON output so
operators can confirm their config took effect.
5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on
supertonic-cli, tts-cli (chatterbox), and supertonic-bench
(comma-separated substring patterns).
6. Verbose-log line in load_supertonic_gguf when the deny-list is
non-empty (silent on the default path — no visual noise on
existing operator workflows).
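
The deny-list semantics in items 1 and 5 can be sketched as follows. This is a minimal illustration, not the PR's code: the 1-arg predicate body is a hypothetical stand-in for the curated allow-list, and `split_deny_patterns` is an assumed name for the shared comma-split parser.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the curated 1-arg predicate (the real allow-list
// is hand-curated and covers more cases than these three rules).
static bool should_materialise_f16_weight(const std::string & name) {
    if (name.empty()) return false;
    if (name.find("bias") != std::string::npos) return false;
    if (name.find("norm") != std::string::npos) return false;
    return name.find("weight") != std::string::npos;
}

// 2-arg overload: the extra deny-list can only flip true -> false.
static bool should_materialise_f16_weight(
        const std::string & name,
        const std::vector<std::string> & extra_deny_substrings) {
    if (!should_materialise_f16_weight(name)) return false;  // never promotes
    for (const std::string & pat : extra_deny_substrings) {
        if (pat.empty()) continue;  // "" would match every name: skip it
        if (name.find(pat) != std::string::npos) return false;
    }
    return true;
}

// Assumed helper name: splits "PAT1,PAT2,..." and drops empty fields.
static std::vector<std::string> split_deny_patterns(const std::string & arg) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    while (start <= arg.size()) {
        std::string::size_type comma = arg.find(',', start);
        if (comma == std::string::npos) comma = arg.size();
        if (comma > start) out.push_back(arg.substr(start, comma - start));
        start = comma + 1;
    }
    return out;
}
```

Checking the curated predicate first is what guarantees the deny-list can never promote a cold tensor, and skipping empty patterns keeps a stray comma in the config from silently disabling F16 weights everywhere.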
Test plan (TDD round 6):
- test-supertonic-f16-weights (UPDATED): existing 36 checks
(positives, negatives, edges) + 29 new round-6 checks across 7
new test functions (empty-list passthrough, matching-deny-
excludes, non-matching-no-op, cannot-promote-cold, multiple-
patterns ANY-match, empty-string defensive skip, empty-name
safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time
gate for EngineOptions::f16_weights_deny_list +
load_supertonic_gguf 7th param; runtime defaults check +
assignability + regression guards on every other documented
EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0
failures, 0 regressions on round-1/2/3 + audit follow-up + the
baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench
binaries: --f16-weights-deny flag parses correctly, surfaces in
--help output, and threads through to the load layer.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ype K/V flash-attention dispatch
Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a
five-valued enum + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag
so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth
as F16, no F16 underflow on small attention scores) or Q8_0 K/V
(Vulkan + half the K/V upload bandwidth) on adapters that advertise
the corresponding capability. Default `auto` falls back to
`--f16-attn` so every existing operator config sees zero behaviour
change.
Strict TDD throughout: Prereq B extends the F16 parity harness to
cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both
hot shapes) BEFORE touching any production code; new pure-logic
resolver test (`test-supertonic-kv-attn-type`, 106 checks across the
full {-1, 0..3} × legacy × probe-mask matrix); new API-surface
SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks).
Tests committed first, observed to fail on missing symbols, then
implementation added.
Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch
site (same pattern as round-3's `resolve_vulkan_device_index`).
Probe-rejected explicit requests fall back to F32 silently
(advisory-probe contract); out-of-range int throws to surface CLI
typos loudly. Vector-estimator dispatch site
(`build_text_attention_cache`) replaces the F16-only cast with a
switch on the enum; cache key promoted from `bool f16_kv_attn` to
`kv_attn_dtype kv_attn_type`. Bench surface adds `(kv_attn_type=…)`
to the human-readable backend line and `"kv_attn_type"` +
`"kv_attn_type_requested"` to the JSON output so log-grep / CI
attribution works across machines.
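
A minimal sketch of the resolver contract described above, under stated assumptions: the integer numbering of the enum, the `kv_probe_mask` struct, and the exact parameter list are illustrative guesses, and `auto` is taken to mirror the legacy boolean directly (assumed to be probe-gated upstream already).

```cpp
#include <stdexcept>
#include <string>

// Integer values are an assumption chosen to match the {-1, 0..3} matrix.
enum class kv_attn_dtype : int { autoselect = -1, f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Hypothetical probe results: which K/V dtypes the backend accepted.
struct kv_probe_mask {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

static kv_attn_dtype resolve_kv_attn_type(int requested, bool legacy_f16_attn,
                                          const kv_probe_mask & probes) {
    // Out-of-range values are treated as CLI typos and surfaced loudly.
    if (requested < -1 || requested > 3) {
        throw std::invalid_argument("kv_attn_type out of range: " + std::to_string(requested));
    }
    switch (static_cast<kv_attn_dtype>(requested)) {
        case kv_attn_dtype::autoselect:
            // auto falls back to the legacy --f16-attn boolean.
            return legacy_f16_attn ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        // Explicit requests the probe rejects fall back to F32 silently.
        case kv_attn_dtype::f16:  return probes.f16  ? kv_attn_dtype::f16  : kv_attn_dtype::f32;
        case kv_attn_dtype::bf16: return probes.bf16 ? kv_attn_dtype::bf16 : kv_attn_dtype::f32;
        case kv_attn_dtype::q8_0: return probes.q8_0 ? kv_attn_dtype::q8_0 : kv_attn_dtype::f32;
        case kv_attn_dtype::f32:  break;
    }
    return kv_attn_dtype::f32;
}
```

Keeping the function free of any backend handle is what lets the full behaviour matrix run on a CPU-only CI runner.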
Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch
so invalid values surface as a clean `error: ...` line + exit 2
(also fixes the pre-existing latent crash on `--vulkan-device abc` /
`--seed nonsense`).
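
The arg-parse hardening can be sketched roughly like this. The `cli_options` struct and both function names are hypothetical; the point is only that `std::stoi` throws on input such as `abc`, so the whole loop sits inside one try/catch that maps any failure to an `error: ...` line and exit code 2.

```cpp
#include <cstddef>
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

struct cli_options {  // hypothetical subset of the real CLI state
    int vulkan_device = 0;
    int seed = 0;
};

// Parses a whole-string integer; rejects "abc" and "12abc" alike.
static int parse_int_strict(const std::string & flag, const std::string & value) {
    std::size_t used = 0;
    int parsed = 0;
    try {
        parsed = std::stoi(value, &used);
    } catch (const std::exception &) {
        throw std::invalid_argument(flag + ": expected an integer, got '" + value + "'");
    }
    if (used != value.size()) {
        throw std::invalid_argument(flag + ": trailing characters in '" + value + "'");
    }
    return parsed;
}

// Returns 0 on success, 2 on any invalid argument (after printing the error).
static int parse_args(const std::vector<std::string> & args, cli_options & out) {
    try {
        for (std::size_t i = 0; i < args.size(); ++i) {
            const std::string flag = args[i];
            if (flag == "--vulkan-device" || flag == "--seed") {
                if (i + 1 >= args.size()) throw std::invalid_argument(flag + ": missing value");
                const int value = parse_int_strict(flag, args[++i]);
                (flag == "--seed" ? out.seed : out.vulkan_device) = value;
            } else {
                throw std::invalid_argument("unknown argument: " + flag);
            }
        }
    } catch (const std::exception & e) {
        std::fprintf(stderr, "error: %s\n", e.what());
        return 2;
    }
    return 0;
}
```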
Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…servability + voice cache + Vulkan env-var passthrough
Lowest impact-÷-risk round of the four planned in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching
the per-synth hot path beyond a single voice-cache lookup.
1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates
2 sync points / synthesize() after the first per-voice call on
Vulkan / OpenCL. Extracted to a standalone helper so the
lookup-or-load semantics are testable on CPU without instantiating
a full Engine; reference-stability contract documented for the
synthesis-pipeline call site.
2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)`
public helper + `EngineOptions::vulkan_env_overrides` field +
`--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` /
`--vulkan-disable-bfloat16` / `--vulkan-perf-logger` /
`--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on
all three binaries). ALL-OR-NOTHING validation: an operator-config
typo throws cleanly BEFORE any env var is touched.
`set_env_if_unset` semantics so an operator-set env var still WINS
over the EngineOptions override.
3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync`
opt-out). Inserts an explicit backend sync at every per-stage
timing boundary so wall-clock attributes to the right stage on
async backends. Cheap on CPU; prerequisite for measuring round-5 /
8 / 9 wins on real hardware.
4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each
`supertonic_vector_step_ggml` call individually so the first-step
(cold pipeline) cost is distinguished from steady-state. Empty
array on the default-off path = identical legacy JSON shape.
Strict TDD throughout. Two new test executables committed first,
observed to fail on missing symbols, then implementation written.
TDD also caught a real bug: the original env-key validator used
`std::string()` empty-as-success sentinel which collided with the
empty-string-as-key edge case; the test pinned the contract and
forced a `bool / out-param` API fix BEFORE any production wiring
went in.
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0
regressions (was 19; +2 new tests = 54 new checks).
Co-authored-by: Cursor <cursoragent@cursor.com>
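
The all-or-nothing and set-if-unset semantics of item 2 can be sketched as below. This is an assumption-laden illustration: the validation rules (non-empty key, `GGML_VK_` prefix, no `=`) are guesses at what the real validator checks, `validate_env_key` is a hypothetical name, and POSIX `setenv` is used, so the sketch does not cover Windows.

```cpp
#include <stdlib.h>  // POSIX setenv
#include <map>
#include <stdexcept>
#include <string>

// bool + out-param, so an empty key cannot be confused with "no error".
static bool validate_env_key(const std::string & key, std::string & error) {
    if (key.empty()) { error = "empty env-var name"; return false; }
    if (key.rfind("GGML_VK_", 0) != 0) {  // assumed rule: Vulkan backend keys only
        error = "env-var '" + key + "' does not start with GGML_VK_";
        return false;
    }
    if (key.find('=') != std::string::npos) {
        error = "env-var '" + key + "' contains '='";
        return false;
    }
    return true;
}

// overwrite=0: a value already set in the operator's shell wins.
static void set_env_if_unset(const std::string & key, const std::string & value) {
    ::setenv(key.c_str(), value.c_str(), /*overwrite=*/0);
}

static void apply_vulkan_env_overrides(const std::map<std::string, std::string> & overrides) {
    // Pass 1: validate every key before touching the environment.
    for (const auto & kv : overrides) {
        std::string error;
        if (!validate_env_key(kv.first, error)) throw std::invalid_argument(error);
    }
    // Pass 2: only reached when the whole map is valid.
    for (const auto & kv : overrides) {
        set_env_if_unset(kv.first, kv.second);
    }
}
```

Splitting validation and application into two passes is what makes a typo in one entry leave the process environment untouched.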
…PU bridge
Single largest remaining per-step sync hotspot identified in
aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6
(2C-lite) shipped the GPU device→device blit infrastructure
(`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group
attentions to use it; the front-block attn0 site was deferred because
of cache-lifetime concerns at the time. Round 8 picks it up: same
exact pattern as g1/g2/g3, ~30 LOC delta in one function.
Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth
on the production path (3 GPU→host downloads + 3 host→GPU uploads of
post-RoPE Q / K / raw V at the front-block attn0 site).
Strict gating on
`front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`:
trace mode falls back to the legacy host bridge so the trace harness
still captures pre-attention Q/K/V host vectors, and legacy GGUFs
without `vector_rope_theta` continue to take the host-rotate path.
The blit primitive parity gate already shipped with PR #16
(`test-supertonic-graph-to-graph-blit`); round 8 extends it with
explicit coverage of the front-block K / V shapes (text_len=32 and
text_len=50, both bit-exact `max_abs = 0.0`).
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…U bridge
Extends the round-8 GPU bridge pattern to the 4 style flash-attn
sites (style0 + g1_style + g2_style + g3_style). Largest
bandwidth-style optimisation that ships from pure-Supertonic-side
code: 120 sync points / synth eliminated on the production Vulkan /
OpenCL path (4× the round-8 win).
- vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu`
GPU handles, populated unconditionally by `run_res_style_qkv_cache`
(cheap: no GPU sync; just `ggml_graph_get_tensor` lookups). Same
shape as `vector_group_graph_result::q_rope_gpu` etc from the
round-1 2C-lite work.
- `run_res_style_qkv_cache` host-download gating: the 3
`tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are
now gated on `trace != nullptr`. Production path skips them
entirely. Mirrors the round-1 2C-lite
`need_host_qkv = (trace != nullptr)` gate. `post` stays
unconditional: it is consumed by the next-stage
`run_style_residual_cache` which still expects a host vector
(cross-stage GPU bridge for `post` is deferred).
- 4 dispatch sites rewired with the same gating pattern as the
round-8 front-block bridge:
`!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge;
otherwise legacy host bridge. Trace mode falls back to the legacy
host bridge so the trace harness still gets all the host vectors.
Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`)
extended with explicit style-shape coverage (`style_sq_L1` trip-wire
+ clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any
production wiring. All 24 / 24 parity checks pass at bit-exact
`max_abs = 0.0`.
Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…t upload-skip
After rounds 8 + 9 wired the GPU bridge for the 5 attention sites,
the largest remaining per-step host upload is `text_emb` (uploaded to
4 caches × 5 denoise steps = 20 times / synth, but constant data
within one synth). Round 10 generalises the F4 pointer-compare
upload-skip pattern (already used for `style_v_in` / `kctx_in`) into
a reusable `upload_skip_tracker` helper and applies it to the
front-block + 3 group caches.
CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local
`std::vector<float>` in `Engine::Impl::synthesize()` (and bench
loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very
often re-issue the SAME address for the next stack-local vector of
the same size, so synth N+1 may have
`text_emb.data() == synth_N.text_emb.data()` despite holding
completely different data. A naive pointer-compare upload-skip would
silently leak prior synth's text-encoder embedding into the next
synth's GPU buffer. Mitigation: caller MUST invoke `tracker.reset()`
at every synth boundary (`current_step == 0`). The CPU-only TDD test
includes an explicit cross-synth pointer-reuse hazard simulation that
documents the bug and verifies the reset prevents it.
Per-synth wins:
- 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth
- ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt
length)
Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions,
41 checks) committed first, observed to fail compile
(`upload_skip_tracker was not declared`), then implementation added.
Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0
regressions.
Co-authored-by: Cursor <cursoragent@cursor.com>
…PU-bridge layout fix
Critical correctness fix. Round 11 didn't add a new optimisation; it
made every prior round actually run end-to-end on real hardware.
Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the
unit tests never exercised the production code path with a real GGUF
carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU
OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside
`apply_rope_to_packed_qk`, and even past that assertion every
`ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast
paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))`
because Q/K/V matmul outputs were the byte-for-byte transpose of what
the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors
expect.
Root cause: `apply_rope_to_packed_qk` (PR #16 audit follow-up #5) was
written under the assumption that `dense_matmul_time_ggml` returns a
`ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul
(CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]`
with channel-major-flat memory, the bit-exact transpose of the
helper's input contract. The CPU unit test that landed alongside the
helper hand-built Q under the wrong `[HD, L]` shape, so the failure
mode was invisible to CI.
The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the
production matmul shape `ne=[L, HD]` (channel-major-flat memory).
Reference built in scalar `apply_rope`'s native time-major-flat
layout; test verifies the helper's output bytes match bit-for-bit
AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in`
blit cannot regress on layout. Committed RED first, observed to
abort at the same assertion the production crash hits, then landing
the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a
head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from
`ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat
(which IS the layout `q_tc_in` expects). Rest of the pipeline
unchanged. Output ne=[HD, L] time-major-flat bytes match scalar
`apply_rope`'s native layout AND `q_tc_in`'s blit target
bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip,
so open-code the same `ggml_cont(ggml_transpose(...))` at the
matmul output in `build_group_graph_cache`,
`ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all
four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from
`tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`.
The new graph-side layout puts the bytes already in the
time-major-flat shape scalar `apply_rope` / `flash_attention_qkv`
host references read, so the raw download is the correct call;
`tensor_to_time_channel` would now apply the
transpose-of-the-transpose and feed wrong-orientation Q/K/V into
the attention silently.
Verification:
| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |
CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions.
Vulkan build's `ctest` likewise 22 / 22.
The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch,
native LEAKY_RELU, F16 weights deny-list, prewarm, front-block +
style + group GPU bridges, text-input upload-skip) are now actually
exercised end-to-end on every Vulkan adapter we have; they just
couldn't run before round 11 unblocked the production path.
Co-authored-by: Cursor <cursoragent@cursor.com>
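
The two layouts at the heart of this fix can be illustrated on the host without ggml. The sketch below is only an analogy for what `ggml_cont(ggml_transpose(q))` does to the bytes of a 2-D tensor; `to_time_major_flat` is a hypothetical name, and `t` / `c` index time and channel.

```cpp
#include <cstddef>
#include <vector>

// Channel-major-flat (ggml ne=[L, HD]):  element (t, c) lives at src[t + c*L].
// Time-major-flat    (ggml ne=[HD, L]):  element (t, c) lives at dst[c + t*HD].
// Host-side equivalent of materialising the transpose contiguously.
static std::vector<float> to_time_major_flat(const std::vector<float> & src,
                                             std::size_t L, std::size_t HD) {
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < HD; ++c) {
        for (std::size_t t = 0; t < L; ++t) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

With `L = 2` and `HD = 3`, a source buffer `{0, 1, 10, 11, 20, 21}` (value `10*c + t`) becomes `{0, 10, 20, 1, 11, 21}`: same elements, different byte order, which is exactly why a same-layout blit between the two shapes cannot be valid.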
1b710d3 to
c383e70
Summary
Brings the Supertonic TTS stage of `tts-cpp` to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds eleven rounds of Vulkan-specific deltas, each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability contract against future regressions.
Rounds 1–6 are dispatch + capability infrastructure (probes, flags, multi-device auto-pick, deny-list, multi-dtype K/V). Rounds 7–10 are observability + per-step sync-point elimination on the GPU bridges. Round 11 is a critical correctness fix that turns the prior 10 rounds from "passes CI" into "actually runs end-to-end on every Vulkan adapter we have." Without round 11, every prior round was hitting a latent assertion failure during the first real synth call.
Scope vs. PR #16: this PR sits on top of the OpenCL branch (`QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic`). All Vulkan-specific deltas are restated here; the OpenCL audit work is not. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.
End-to-end validation (on real hardware)
Tested on three Vulkan adapters in one machine, the gold-standard hybrid dev-rig setup:
RTX 5090 per-step breakdown (median over 5 runs, F16 K/V default, post-prewarm):
The round-3/4/7/8/9/10 wins are all in those numbers: round 7's prewarm hides the ~2.3s cold shader-compile, and rounds 8/9/10 eliminate ~166 sync points/synth so the steady-state per-step time is dominated by actual compute rather than host↔GPU bookkeeping.
Net new surface (against PR #16):
- Capability probes (`native_leaky_relu`, `f16_kv_flash_attn`, `f16_mul_mat`, `q8_0_kv_flash_attn`, `bf16_kv_flash_attn`, `pinned_host_buffer`)
- `supertonic_model` flags (`use_native_leaky_relu`, `kv_attn_type`), joining the round-1 `use_f16_attn`
- `EngineOptions` knobs (`vulkan_device`, `prewarm_text`, `f16_weights_deny_list`, `kv_attn_type` + 4 Vulkan env-var passthroughs)
- CLI flags: `--vulkan-device`, `--prewarm`, `--f16-weights-deny`, `--kv-attn-type`, `--vulkan-prefer-host-memory`, `--vulkan-disable-coopmat2`, `--vulkan-disable-bfloat16`, `--vulkan-perf-logger`, `--vulkan-async-transfer`, `--vulkan-env KEY=VALUE`, `--bench-per-step`, `--bench-sync`, `--json-out`
- Tests (`ctest -L unit`)
Investigation methodology (TDD throughout)
Every round followed the same workflow:
- `PROGRESS_SUPERTONIC.md` + commit.
The CPU-only test strategy is deliberate: a fresh checkout's `ctest` exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer.
Commit-by-commit walkthrough
33fd5c34 - Round 1: Vulkan bring-up
Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used `model.use_f16_attn = !backend_is_cpu` because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the `HSK % 8 == 0` `supports_op` gate has to be respected, so the auto-policy needs a probe.
- `supertonic_model` flags populated at GGUF load: `backend_is_vk` (informational; appended to the backend-description string) and `use_native_leaky_relu` (resolved via `ggml_backend_supports_op(LEAKY_RELU)` against a synthetic node).
- `supertonic_backend_supports_f16_kv_flash_attn` gates the `use_f16_attn` auto-policy.
- `EngineOptions::vulkan_device` int + `--vulkan-device N` CLI flag plumbed through all three binaries. Range-checked at load (out-of-range = hard error).
- `ggml_backend_vk_get_device_description` so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
- `test-supertonic-vulkan-dispatch` (29 checks).
d080a1e4 - Pre-existing missing-include fix
`tts-cpp/src/chatterbox_tts.cpp` used `std::atomic<int>` without `#include <atomic>`. One-line fix kept as a separate commit so it's trivially revertable.
e09d4278 - Round 2: capability-cache + 3 probes + prewarm
- `cached_backend_capabilities` map keyed by `ggml_backend_t`, guarded by a single `std::mutex`. Eliminates 3× redundant probe calls per backend.
- `supertonic_backend_supports_f16_mul_mat` (gates `use_f16_weights` auto-policy), `supertonic_backend_supports_q8_0_kv_flash_attn` (forward-compat), `supertonic_backend_supports_native_leaky_relu` (wraps round 1).
- `Engine::warm_up(text)` API + `EngineOptions::prewarm_text` + `--prewarm TEXT` CLI. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines compile up-front; operator-visible first `synthesize()` hits steady-state latency. No-op on CPU.
- `test-supertonic-capability-cache`, `test-supertonic-warm-up-api`.
8ae15996 - Round 3: multi-device auto-pick + 2 forward-compat probes
- `--vulkan-device -1` auto-pick policy: `resolve_vulkan_device_index` pure-logic helper picks `argmax(free_vram)` via `ggml_backend_vk_get_device_memory()`. Tie-break = lower index.
- `supertonic_backend_supports_bf16_kv_flash_attn` (for coopmat2 on Ampere+ / RDNA3+), `supertonic_backend_supports_pinned_host_buffer` (for future per-engine input-scratchpad refactor).
- `test-supertonic-vulkan-device-select` (23 checks).
32703fcd - Round 6: F16-weights operator deny-list
- `should_materialise_f16_weight(source_name, deny_list)` overload layered on top of the curated allow-list. Each entry is a substring; any match keeps that tensor at its native storage type.
- `EngineOptions::f16_weights_deny_list` + `--f16-weights-deny PAT1,PAT2,...` CLI flag (comma-split parser shared between all three binaries).
- `test-supertonic-f16-weights` extended (+29 checks), `test-supertonic-f16-deny-list-api` (NEW, 9 checks).
2e1c9468 - Round 4: multi-dtype K/V flash-attention dispatch
Generalises the round-1 F16-only K/V path into a multi-dtype dispatch.
- `kv_attn_dtype` enum (`autoselect`, `f32`, `f16`, `bf16`, `q8_0`) + `EngineOptions::kv_attn_type` field.
- `resolve_kv_attn_type` pure-logic helper with full `{requested × legacy × probe-mask}` behaviour matrix.
- `--kv-attn-type` CLI flag on all three binaries with parse hardening.
- `test-supertonic-kv-attn-type` (106 checks), `test-supertonic-kv-attn-type-api` (18 checks), `test-supertonic-f16-attn-parity` extended for BF16.
ba6d1749 - Round 7: bench observability + voice cache + Vulkan env-var passthrough
Three independent observability/UX wins shipped together:
- `--bench-per-step` + `--bench-sync` + `--prewarm` (already from round 2) + `--json-out FILE`: per-denoise-step timings on a single timeline (cold pipeline step[0] distinguishable from steady-state step[1..4]); operator can attribute Vulkan stalls to a specific stage on real hardware without GPU-side profilers.
- `--vulkan-prefer-host-memory`, `--vulkan-disable-coopmat2`, `--vulkan-disable-bfloat16`, `--vulkan-perf-logger`, `--vulkan-async-transfer`, `--vulkan-env KEY=VALUE`: sets the corresponding `GGML_VK_*` env var before backend init. Operator-set shell env STILL wins over the CLI override (audit-friendly).
- `test-supertonic-vulkan-env-overrides` (29 checks).
e8bbc728 - Round 8: front-block attn0 GPU bridge
The single largest remaining per-step sync hotspot identified in `aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md`. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1/g2/g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns. Round 8 picks it up: same exact pattern as g1/g2/g3, ~30 LOC delta in one function.
Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0`: trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors.
Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth.
df895fd6 - Round 9: style flash-attn GPU bridge
Same pattern as round 8, applied to the 4 style attention sites (front-block style0 + style attentions in g1/g2/g3 caches). Gated Q/K/V host downloads on trace mode in `run_res_style_qkv_cache` (production path skips them entirely).
Eliminates 3 sync points × 4 sites × 5 denoise steps = 60 GPU→host downloads / synth. The round-9 commit message counts 120 sync points / synth in total (4× the round-8 win).
358d7aa8 - Round 10: per-step text-input upload-skip
Generalised the F4 pointer-compare upload-skip pattern (`style_v_in` / `kctx_in` in `vector_res_style_qkv_cache`) into a reusable `upload_skip_tracker` helper.
Applied to `text_in_t` on the front-block cache + `text_in` on 3 group caches. Caught and documented a cross-synth pointer-reuse hazard: stack-local `text_emb` vectors very often re-issue the same address (allocator size-class reuse); the `tracker.reset()` at synth boundaries prevents the naive pointer-compare from leaking prior-synth GPU data into next-synth attention.
New test `test-supertonic-upload-skip-tracker` (7 functions, 41 checks) explicitly simulates the cross-synth hazard.
Eliminates 16 redundant uploads / synth (~512 KB at text_len=32, linear in prompt length).
c383e70d - Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS
After the IDE-freeze recovery, the first end-to-end synth attempt on real hardware crashed at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk` on every backend (CPU + Vulkan RTX 5090 + RADV + lavapipe).
Root cause: `apply_rope_to_packed_qk` (introduced in PR #16 audit follow-up #5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact, the matmul (both the CPU `cblas_sgemm` fast path and the GPU `conv1d_f32(K=1)` fallback) produces `ne=[L, HD]` with channel-major-flat memory (`data[t + c*L]`), the bit-exact transpose of the helper's input contract.
The CPU unit test that landed alongside the helper (`test_supertonic_rope_packed_qk.cpp`) hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI. Rounds 8/9/10 were ALSO broken: the GPU bridge `ggml_backend_tensor_copy(q_src, q_tc_in)` would have aborted at `ggml_are_same_layout` because V (and the style sq/sk/sv, which have no RoPE to mask the layout flip) flowed into the GPU bridge from matmul → channel-major-flat bytes → mismatched layout against the time-major-flat `q_tc_in`.
The fix (strict TDD):
1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk`: head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects).
3. V (and the style sq/sk/sv): the same `ggml_cont(ggml_transpose(...))` open-coded at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` × all three sq/sk/sv outputs so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references consume, so the raw download is the correct call.
The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.
Test plan
CPU-only: a fresh checkout's `ctest -L unit` exercises every new contract without needing a Vulkan adapter.
Expected: 22 / 22 tests, 0 failures, 0 regressions.
- `test-supertonic-vulkan-dispatch`
- `test-supertonic-portable-ops` (UPDATED)
- `test-supertonic-capability-cache`
- `test-supertonic-warm-up-api`: `Engine::warm_up`
- `test-supertonic-vulkan-device-select`: `resolve_vulkan_device_index` behaviour matrix
- `test-supertonic-f16-weights` (UPDATED)
- `test-supertonic-f16-deny-list-api`
- `test-supertonic-kv-attn-type`: `resolve_kv_attn_type` behaviour matrix
- `test-supertonic-kv-attn-type-api`
- `test-supertonic-f16-attn-parity` (UPDATED)
- `test-supertonic-vulkan-env-overrides`
- `test-supertonic-upload-skip-tracker` (NEW)
- `test-supertonic-rope-packed-qk` (REWRITTEN)
Smoke testing the CLIs
Bench JSON includes `"kv_attn_type"` (resolved), `"kv_attn_type_requested"` (raw int), and per-step timings so probe misses and per-step variance are attributable in CI/operator triage.
Backwards compatibility
- `--vulkan-device 0` semantics unchanged: round 1 introduced the flag; round 3's `-1` is opt-in only.
- `--f16-weights 0|1` semantics unchanged: round 6's `--f16-weights-deny` is opt-in only.
- `--prewarm` defaults to empty (no-op).
- `--kv-attn-type` defaults to `auto`, which falls back to round-1's `use_f16_attn` boolean, so every existing config keeps the round-1 behaviour.
- `model.use_f16_attn` boolean is still populated and is kept in sync with the round-4 enum (`= (kv_attn_type == f16)`) so any external code keying on the boolean stays consistent.
- `apply_rope_to_packed_qk` contract is backwards-incompatible with the old (broken) one, but the old contract never actually worked in production: pre-fix it crashed on every backend. The 14-check test now pins both the input and output contracts so a future regression fails the test's shape check in `ctest`.
File-by-file change summary
- `tts-cpp/PROGRESS_SUPERTONIC.md`
- `tts-cpp/CMakeLists.txt`
- `tts-cpp/include/tts-cpp/supertonic/engine.h`: `EngineOptions` fields + `Engine::warm_up()`
- `tts-cpp/src/supertonic_internal.h`: `kv_attn_dtype` enum, 5 new probes, resolvers, `upload_skip_tracker` helper, `apply_rope_to_packed_qk` (round-11 fix)
- `tts-cpp/src/supertonic_gguf.cpp`
- `tts-cpp/src/supertonic_vector_estimator.cpp`
- `tts-cpp/src/supertonic_engine.cpp`: `warm_up` impl
- `tts-cpp/src/supertonic_bench.cpp`
- `tts-cpp/src/supertonic_cli.cpp`
- `tts-cpp/src/chatterbox_cli.cpp`: `tts-cli` alias
- `tts-cpp/src/chatterbox_tts.cpp`: `#include <atomic>` (pre-existing missing-include fix)
Deferred follow-ups (intentionally out of scope; pre-existing on master)
Tracked in `tts-cpp/PROGRESS_SUPERTONIC.md` "Deferred work" section.
- `argmax(free_vram)` policy picks the iGPU on machines like the one we tested (RTX 5090 + AMD RADV) because UMA reports system RAM as free VRAM. Pre-existing in this PR; fix candidate: bias against UMA when a discrete GPU is present. Workaround: explicit `--vulkan-device 0`.
- `test-supertonic-audit3-caches` F18 + F19 cache-reuse failures: these pre-existed on master (verified pairwise). Pre-round-11 they were hidden by the rope crash; post-round-11 they're newly observable but neither introduced nor fixable by this PR's content (text encoder for F18; cross-cache state-leak for F19). Both should be wired into CI as a separate ticket; F18/F19 affect the OpenCL build identically.
- `VkPipelineCache` (chatterbox PROGRESS.md §3.32): recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by `<vendorID>-<deviceID>-<driverVersion>`. This is a `ggml-vulkan` internal patch (~199 lines) that benefits all Vulkan workloads. Round 7's `--prewarm` is an in-process workaround.
- `latent` upload latency.
Linked