Qvac 18605 tts ggml add and optimize vulkan for supertonic by Zbig9000 · Pull Request #17 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-12T14:09:43Z

Summary

Brings the Supertonic TTS stage of tts-cpp to functional + tunable parity on the Vulkan backend, layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR #16). The audit-driven optimisations from #16 are backend-portable by construction, so Vulkan inherits all ~280 host↔GPU sync-point eliminations + the F16-weight roster + the in-graph RoPE / ConvNeXt fusion / GPU↔GPU blit work without modification. This PR adds eleven rounds of Vulkan-specific deltas — each round committed test-first (TDD) with a CPU-only unit gate that locks in the dispatch + capability contract for future regressions.

Rounds 1–6 are dispatch + capability infrastructure (probes, flags, multi-device auto-pick, deny-list, multi-dtype K/V). Rounds 7–10 are observability + per-step sync-point elimination on the GPU bridges. Round 11 is a critical correctness fix that turns the prior 10 rounds from "passes CI" into "actually runs end-to-end on every Vulkan adapter we have." Without round 11, every prior round hit a latent assertion failure during the first real synth call.

Scope vs. PR #16: this PR sits on top of the OpenCL branch (QVAC-18607-TTS-GGML-Add-and-optimize-OpenCL-for-supertonic). All Vulkan-specific deltas are restated here; the OpenCL audit work is not. The optimisations layer cleanly because the audit hits the GGML graph layer (backend-portable by construction); Vulkan inherits the wins automatically.

End-to-end validation (on real hardware)

Tested on three Vulkan adapters in one machine — the gold-standard hybrid dev-rig setup:

Adapter	Driver	Result	Per-synth (5-step denoise)
NVIDIA RTX 5090 (discrete, KHR_coopmat, FP16, no BF16)	NVIDIA 590.48.01, Vulkan 1.4.325	✅ 6.53s WAV	44 ms total, 74× realtime short prompt / 76 ms, 123× realtime long prompt
AMD Ryzen 9 9950X3D iGPU (UMA, RADV, FP16)	Mesa 25.2.8 RADV, Vulkan 1.4.318	✅ 3.64s WAV	178 ms total, 7× realtime
Mesa lavapipe (CPU-Vulkan correctness baseline)	Mesa 25.2.8 lavapipe (LLVM 20.1.2)	✅ 1.21s WAV	— (correctness baseline only)
CPU baseline (16-thread Ryzen 9 9950X3D)	—	✅ 3.89s WAV	121 ms total, 10× realtime

RTX 5090 per-step breakdown (median over 5 runs, F16 K/V default, post-prewarm):

preprocess             med=  0.00  ms
duration               med=  0.97  ms
text_encoder           med=  2.94  ms
vector_estimator       med= 37.70  ms (5 steps)
  vector_step[0]       med=  7.44  ms   (cold pipeline)
  vector_step[1..4]    med=  7.01–7.05  ms   (steady state)
vocoder                med=  2.47  ms
total                  med= 44.08  ms

The round-3/4/7/8/9/10 wins are all in those numbers: the prewarm (introduced in round 2, part of round 7's bench surface) hides the ~2.3s cold shader-compile, and rounds 8/9/10 eliminate ~166 sync points/synth, so the steady-state per-step time is dominated by actual compute rather than host↔GPU bookkeeping.

Net new surface (against PR #16):

Category	Delta
Vulkan-specific commits	11 (rounds 1–11)
New backend-capability probes	6 (`native_leaky_relu`, `f16_kv_flash_attn`, `f16_mul_mat`, `q8_0_kv_flash_attn`, `bf16_kv_flash_attn`, `pinned_host_buffer`)
New thread-local dispatch flags	2 (`use_native_leaky_relu`, `kv_attn_type`) — joins the round-1 `use_f16_attn`
New `EngineOptions` knobs	8 (`vulkan_device`, `prewarm_text`, `f16_weights_deny_list`, `kv_attn_type` + 4 Vulkan env-var passthroughs)
New CLI flags (× 3 binaries)	`--vulkan-device`, `--prewarm`, `--f16-weights-deny`, `--kv-attn-type`, `--vulkan-prefer-host-memory`, `--vulkan-disable-coopmat2`, `--vulkan-disable-bfloat16`, `--vulkan-perf-logger`, `--vulkan-async-transfer`, `--vulkan-env KEY=VALUE`, `--bench-per-step`, `--bench-sync`, `--json-out`
New unit tests (`ctest -L unit`)	9 new + 3 extended (vulkan-dispatch, capability-cache, warm-up-api, vulkan-device-select, f16-deny-list-api, kv-attn-type, kv-attn-type-api, vulkan-env-overrides, upload-skip-tracker; rope-packed-qk rewritten for correct contract)
Whole `ctest -L unit`	22 / 22 PASS, 0 regressions, 0 flakes (CPU build + Vulkan build)
Sync-points eliminated per synth (vs. PR #16 baseline)	~166 (30 from round 8 + 120 from round 9 + 16 from round 10)

Investigation methodology (TDD throughout)

Every round followed the same workflow:

Audit: identify a Vulkan-specific gap (capability probe, multi-GPU support, drift recovery, per-step sync hotspot, observability gap, etc.).
Test first: write the CPU-only unit gate that pins the new contract (resolver behaviour matrix, API surface, parity bound, layout contract). Commit + observe failure on the missing symbol (compile error or assertion).
Implement: minimal-surgery production change. Pure-logic helpers split out so the policy is testable on CPU without a Vulkan device.
Re-run: every new test + every existing test must pass before commit.
Update PROGRESS_SUPERTONIC.md + commit.

The CPU-only test strategy is deliberate: a fresh checkout's ctest exercises the dispatch + capability + resolver contracts without needing a Vulkan adapter, so CI on a CPU-only runner catches regressions in the policy layer.

Commit-by-commit walkthrough

`33fd5c34` — Round 1: Vulkan bring-up

Foundational Vulkan dispatch + capability probing. The OpenCL bring-up (#16) used model.use_f16_attn = !backend_is_cpu because the chatterbox OpenCL patch unconditionally accepts the F16-K/V op; on Vulkan the HSK % 8 == 0 supports_op gate has to be respected, so the auto-policy needs a probe.

Two new supertonic_model flags populated at GGUF load: backend_is_vk (informational; appended to the backend-description string) and use_native_leaky_relu (resolved via ggml_backend_supports_op(LEAKY_RELU) against a synthetic node).
New backend-capability probe supertonic_backend_supports_f16_kv_flash_attn gates the use_f16_attn auto-policy.
EngineOptions::vulkan_device int + --vulkan-device N CLI flag plumbed through all three binaries. Range-checked at load (out-of-range = hard error).
Verbose mode + bench output append ggml_backend_vk_get_device_description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran.
New CPU-only TDD harness test-supertonic-vulkan-dispatch (29 checks).

`d080a1e4` — Pre-existing missing-include fix

tts-cpp/src/chatterbox_tts.cpp used std::atomic<int> without #include <atomic>. One-line fix kept as a separate commit so it's trivially revertable.

`e09d4278` — Round 2: capability-cache + 3 probes + prewarm

Process-wide cached_backend_capabilities map keyed by ggml_backend_t, guarded by a single std::mutex. Eliminates 3× redundant probe calls per backend.
3 new probes: supertonic_backend_supports_f16_mul_mat (gates use_f16_weights auto-policy), supertonic_backend_supports_q8_0_kv_flash_attn (forward-compat), supertonic_backend_supports_native_leaky_relu (wraps round 1).
Engine::warm_up(text) API + EngineOptions::prewarm_text + --prewarm TEXT CLI. Runs one throwaway synth at engine construction so the Vulkan / OpenCL shader pipelines compile up-front; operator-visible first synthesize() hits steady-state latency. No-op on CPU.
New tests: test-supertonic-capability-cache, test-supertonic-warm-up-api.
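
As a rough illustration of the capability-cache idea above, the sketch below caches probe results per backend handle behind a single mutex and exposes a probe counter that a regression test can pin. The `backend_caps` struct, the `capability_cache` class, and the use of `const void *` in place of `ggml_backend_t` are assumptions made for this example, not the PR's actual types.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical capability record; field names echo the probes listed above.
struct backend_caps {
    bool f16_kv_flash_attn = false;
    bool f16_mul_mat       = false;
    bool native_leaky_relu = false;
};

// Process-wide cache keyed by an opaque backend handle. A const void * stands
// in for ggml_backend_t so the sketch builds without ggml headers.
class capability_cache {
public:
    using probe_fn = std::function<backend_caps(const void *)>;

    // Returns cached capabilities; runs `probe` only on the first lookup for
    // a given handle.
    backend_caps get(const void * backend, const probe_fn & probe) {
        assert(backend != nullptr);
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(backend);
        if (it == cache_.end()) {
            ++probe_calls_;
            it = cache_.emplace(backend, probe(backend)).first;
        }
        return it->second;
    }

    // Number of real probe invocations so far.
    int probe_calls() const {
        std::lock_guard<std::mutex> lock(mu_);
        return probe_calls_;
    }

private:
    mutable std::mutex mu_;
    std::unordered_map<const void *, backend_caps> cache_;
    int probe_calls_ = 0;
};
```

Repeated `get()` calls for the same handle leave `probe_calls()` unchanged, which is the kind of property a CPU-only probe-counter test can assert.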

`8ae15996` — Round 3: multi-device auto-pick + 2 forward-compat probes

--vulkan-device -1 auto-pick policy: resolve_vulkan_device_index pure-logic helper picks argmax(free_vram) via ggml_backend_vk_get_device_memory(). Tie-break = lower index.
2 new forward-compat probes: supertonic_backend_supports_bf16_kv_flash_attn (for coopmat2 on Ampere+ / RDNA3+), supertonic_backend_supports_pinned_host_buffer (for future per-engine input-scratchpad refactor).
New test test-supertonic-vulkan-device-select (23 checks).

⚠️ Known issue (pre-existing on this round's policy): on heterogeneous discrete+iGPU machines, UMA iGPUs report system RAM as "free VRAM" and win the argmax even when a discrete GPU is available. On the test machine, --vulkan-device -1 picks the AMD iGPU (178 ms) over the RTX 5090 (44 ms) — a 4× regression for users who follow the help text. Trivially worked around by explicit --vulkan-device 0. Tracked for a follow-up: bias against UMA when a discrete is present.
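
A minimal sketch of the auto-pick policy and the UMA pitfall, assuming a hypothetical `vk_device_info` record and a hypothetical `pick_vulkan_device` helper (the PR's `resolve_vulkan_device_index` may differ in signature and error handling; rejecting values below -1 is an assumption of this example).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical per-device record; in the PR the free-memory figure comes from
// ggml_backend_vk_get_device_memory().
struct vk_device_info {
    std::uint64_t free_vram_bytes;
};

// requested >= 0 : explicit index, range-checked (out-of-range is a hard error).
// requested == -1: auto-pick argmax(free_vram); the lower index wins ties.
inline int pick_vulkan_device(int requested, const std::vector<vk_device_info> & devices) {
    if (devices.empty()) {
        throw std::runtime_error("no Vulkan devices available");
    }
    const int n = static_cast<int>(devices.size());
    if (requested >= 0) {
        if (requested >= n) {
            throw std::out_of_range("vulkan device index out of range");
        }
        return requested;
    }
    if (requested != -1) {
        throw std::out_of_range("vulkan device index must be >= -1");
    }
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index on ties.
        if (devices[i].free_vram_bytes > devices[best].free_vram_bytes) {
            best = i;
        }
    }
    assert(best >= 0 && best < n);
    return best;
}
```

With illustrative figures such as `{32 GiB, 96 GiB}` (device 1 being a UMA adapter that reports system RAM), the auto-pick returns index 1, which is the behaviour the known issue describes.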

`32703fcd` — Round 6: F16-weights operator deny-list

2-arg should_materialise_f16_weight(source_name, deny_list) overload layered on top of the curated allow-list. Each entry is a substring; any match keeps that tensor at its native storage type.
EngineOptions::f16_weights_deny_list + --f16-weights-deny PAT1,PAT2,... CLI flag (comma-split parser shared between all three binaries).
Tests: test-supertonic-f16-weights extended (+29 checks), test-supertonic-f16-deny-list-api (NEW, 9 checks).
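
The deny-list layering can be sketched as follows. The function names, the stand-in allow-list rule, and the choice to drop empty comma-separated entries are assumptions for illustration; the PR's `should_materialise_f16_weight` and its parser may behave differently in the details.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits "PAT1,PAT2,..." into patterns, dropping empty entries (assumption).
inline std::vector<std::string> split_deny_list(const std::string & csv) {
    std::vector<std::string> out;
    std::string cur;
    for (char ch : csv) {
        if (ch == ',') {
            if (!cur.empty()) out.push_back(cur);
            cur.clear();
        } else {
            cur.push_back(ch);
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

// Stand-in for the curated allow-list; the real roster lives in the PR.
inline bool on_allow_list(const std::string & name) {
    return name.find("weight") != std::string::npos;
}

// Deny-list layered on the allow-list: any substring match keeps the tensor
// at its native storage type.
inline bool should_convert_to_f16(const std::string & name,
                                  const std::vector<std::string> & deny) {
    assert(!name.empty());
    if (!on_allow_list(name)) return false;
    for (const auto & pat : deny) {
        if (name.find(pat) != std::string::npos) return false;
    }
    return true;
}
```

Because matching is by substring, a short pattern such as `vocoder` vetoes every tensor whose name contains it, so operators can exclude a whole stage with one entry.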

`2e1c9468` — Round 4: multi-dtype K/V flash-attention dispatch

Generalises the round-1 F16-only K/V path into a multi-dtype dispatch.

kv_attn_dtype enum (autoselect, f32, f16, bf16, q8_0) + EngineOptions::kv_attn_type field.
resolve_kv_attn_type pure-logic helper with full {requested × legacy × probe-mask} behaviour matrix.
--kv-attn-type CLI flag on all three binaries with parse hardening.
Tests: test-supertonic-kv-attn-type (106 checks), test-supertonic-kv-attn-type-api (18 checks), test-supertonic-f16-attn-parity extended for BF16.
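
One way to picture the {requested × legacy × probe-mask} resolver is the sketch below. The bitmask encoding, the function names, and the exact fallback rules are assumptions; the only behaviours taken from this description are that `auto` defers to the round-1 `use_f16_attn` boolean, that unsupported dtypes fall back to F32, and that the boolean mirrors `kv_attn_type == f16`.

```cpp
#include <cassert>
#include <cstdint>

enum class kv_attn_dtype { autoselect, f32, f16, bf16, q8_0 };

// Probe results packed as a bitmask (hypothetical encoding).
enum kv_probe_bits : std::uint32_t {
    KV_PROBE_F16  = 1u << 0,
    KV_PROBE_BF16 = 1u << 1,
    KV_PROBE_Q8_0 = 1u << 2,
};

// requested          : value from --kv-attn-type / EngineOptions::kv_attn_type
// legacy_use_f16_attn: result of the round-1 auto-policy (already probe-gated)
// probe_mask         : K/V dtypes the backend's flash-attention accepts
inline kv_attn_dtype resolve_kv_attn_type_sketch(kv_attn_dtype requested,
                                                 bool legacy_use_f16_attn,
                                                 std::uint32_t probe_mask) {
    // An explicit request is honoured only if the probe bit is set;
    // otherwise the dispatch silently falls back to F32.
    const auto gated = [probe_mask](std::uint32_t bit, kv_attn_dtype want) {
        return (probe_mask & bit) != 0 ? want : kv_attn_dtype::f32;
    };
    switch (requested) {
        case kv_attn_dtype::autoselect:
            return legacy_use_f16_attn ? kv_attn_dtype::f16 : kv_attn_dtype::f32;
        case kv_attn_dtype::f32:  return kv_attn_dtype::f32;
        case kv_attn_dtype::f16:  return gated(KV_PROBE_F16,  kv_attn_dtype::f16);
        case kv_attn_dtype::bf16: return gated(KV_PROBE_BF16, kv_attn_dtype::bf16);
        case kv_attn_dtype::q8_0: return gated(KV_PROBE_Q8_0, kv_attn_dtype::q8_0);
    }
    assert(false && "unhandled kv_attn_dtype");
    return kv_attn_dtype::f32;
}

// Keeps the legacy boolean in sync with the resolved enum.
inline bool derive_use_f16_attn(kv_attn_dtype resolved) {
    return resolved == kv_attn_dtype::f16;
}
```

Keeping the resolver free of backend calls is what lets the full behaviour matrix be enumerated in a CPU-only unit test.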

`ba6d1749` — Round 7: bench observability + voice cache + Vulkan env-var passthrough

Three independent observability/UX wins shipped together:

--bench-per-step + --bench-sync + --prewarm (already from round 2) + --json-out FILE: per-denoise-step timings on a single timeline (cold pipeline step[0] distinguishable from steady-state step[1..4]); operator can attribute Vulkan stalls to a specific stage on real hardware without GPU-side profilers.
Voice cache: precomputed style buffers reused across synths.
Vulkan env-var CLI passthrough: --vulkan-prefer-host-memory, --vulkan-disable-coopmat2, --vulkan-disable-bfloat16, --vulkan-perf-logger, --vulkan-async-transfer, --vulkan-env KEY=VALUE — sets the corresponding GGML_VK_* env var before backend init. Operator-set shell env STILL wins over the CLI override (audit-friendly).
New test test-supertonic-vulkan-env-overrides (29 checks).
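
The "operator-set shell env wins" rule can be sketched with POSIX `setenv`. The helper name `apply_env_override` is hypothetical, and the sketch assumes a POSIX platform; it is not the PR's implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Applies a CLI-supplied value for an environment variable unless the operator
// already set that variable in the shell. Returns true if the CLI value was
// applied, false if the existing value was kept (or setenv failed).
inline bool apply_env_override(const std::string & key, const std::string & value) {
    assert(!key.empty());
    if (std::getenv(key.c_str()) != nullptr) {
        return false; // operator-set env wins over the CLI override
    }
    // POSIX setenv; must run before the backend reads its environment.
    return ::setenv(key.c_str(), value.c_str(), 1) == 0;
}
```

Calling the helper before backend initialisation matters, because environment variables that are read once at init are not re-read afterwards.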

`e8bbc728` — Round 8: front-block attn0 GPU bridge

The single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR #16's audit follow-up #6 (2C-lite) shipped the GPU device→device blit infrastructure (run_text_attention_cache_gpu) and wired g1/g2/g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function.

Strict gating on front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0 — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors.

Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth.

`df895fd6` — Round 9: style flash-attn GPU bridge

Same pattern as round 8, applied to the 4 style attention sites (front-block style0 + style attentions in g1/g2/g3 caches). Gated Q/K/V host downloads on trace mode in run_res_style_qkv_cache (production path skips them entirely).

Eliminates 3 sync points × 4 sites × 5 denoise steps = 60 GPU→host downloads / synth.

`358d7aa8` — Round 10: per-step text-input upload-skip

Generalised the F4 pointer-compare upload-skip pattern (style_v_in / kctx_in in vector_res_style_qkv_cache) into a reusable upload_skip_tracker helper.

Applied to text_in_t on front-block cache + text_in on 3 group caches. Caught and documented a cross-synth pointer-reuse hazard: stack-local text_emb vectors very often re-issue the same address (allocator size-class reuse); the tracker.reset() at synth boundaries prevents the naive pointer-compare from leaking prior-synth GPU data into next-synth attention.

New test test-supertonic-upload-skip-tracker (7 functions, 41 checks) explicitly simulates the cross-synth hazard.

Eliminates 16 redundant uploads / synth (~512 KB at text_len=32, linear in prompt length).
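
A minimal sketch of a pointer-compare upload-skip tracker, including the reset that guards against the cross-synth pointer-reuse hazard. The class name and method names are illustrative assumptions, not the PR's `upload_skip_tracker` interface.

```cpp
#include <cassert>
#include <cstddef>

// Remembers the last host pointer + byte count uploaded to a GPU input tensor.
class upload_skip_tracker_sketch {
public:
    // Returns true when the caller must upload; false when the same host
    // buffer was already uploaded since the last reset().
    bool needs_upload(const void * host_ptr, std::size_t n_bytes) {
        assert(host_ptr != nullptr);
        if (host_ptr == last_ptr_ && n_bytes == last_bytes_) {
            return false;
        }
        last_ptr_   = host_ptr;
        last_bytes_ = n_bytes;
        return true;
    }

    // Call at every synth boundary: a new synth may reuse the same address
    // for different contents, so pointer equality proves nothing across synths.
    void reset() {
        last_ptr_   = nullptr;
        last_bytes_ = 0;
    }

private:
    const void * last_ptr_   = nullptr;
    std::size_t  last_bytes_ = 0;
};
```

Within one synth the second and later denoise steps skip the upload; after `reset()` the first step uploads again even if the allocator hands back the same address.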

`c383e70d` — Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

After the IDE-freeze recovery, the first end-to-end synth attempt on real hardware crashed at:

supertonic_internal.h:1154: GGML_ASSERT(HD == n_heads * head_dim) failed

on every backend (CPU + Vulkan RTX 5090 + RADV + lavapipe).

Root cause: apply_rope_to_packed_qk (introduced in PR #16 audit follow-up #5) was written under the assumption that dense_matmul_time_ggml returns a ne=[HD, L] channel-fastest-in-memory tensor. In fact, the matmul (both the CPU cblas_sgemm fast path and the GPU conv1d_f32(K=1) fallback) produces ne=[L, HD] with channel-major-flat memory (data[t + c*L]) — the bit-exact transpose of the helper's input contract.

The CPU unit test that landed alongside the helper (test_supertonic_rope_packed_qk.cpp) hand-built Q under the wrong [HD, L] shape, so the failure mode was invisible to CI. Rounds 8/9/10 were ALSO broken: V (and the style sq/sk/sv, which have no RoPE to mask the layout flip) flowed from the matmul into the GPU bridge as channel-major-flat bytes, while q_tc_in is time-major-flat, so the GPU bridge ggml_backend_tensor_copy(q_src, q_tc_in) would have aborted at ggml_are_same_layout.

The fix (strict TDD):

Test rewritten under production matmul shape ne=[L, HD] (channel-major-flat memory). Reference built in scalar apply_rope's native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins y->ne[0] = HD, y->ne[1] = L so the downstream q_tc_in blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then GREEN (14 / 14 checks).
apply_rope_to_packed_qk head-of-pipeline ggml_cont(ggml_transpose(q)) to flip from ne=[L, HD] channel-major-flat to ne=[HD, L] time-major-flat (which IS the layout q_tc_in expects).
V (and style sq/sk/sv) graph-side transpose: V has no RoPE to hide behind — open-coded the same ggml_cont(ggml_transpose(...)) at the matmul output in build_group_graph_cache, ve_front_block_proj_cache, and build_res_style_qkv_cache × all three sq/sk/sv outputs so all four GPU-bridge attention sites get bit-for-bit matching layouts.
Legacy host-bridge downloads switched from tensor_to_time_channel(<post-rope-or-v>) to tensor_raw_f32(...). The new graph-side layout puts the bytes already in the time-major-flat shape scalar apply_rope / flash_attention_qkv host references consume, so the raw download is the correct call.
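
To make the two layouts concrete, here is a host-side sketch of the index mapping that the graph-side ggml_cont(ggml_transpose(...)) performs. It is plain C++ on a flat buffer, written for illustration only; the function name is hypothetical and no ggml calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// channel-major-flat: element (t, c) lives at src[t + c * L]   (ne = [L, HD])
// time-major-flat   : element (t, c) lives at dst[c + t * HD]  (ne = [HD, L])
inline std::vector<float> channel_major_to_time_major(const std::vector<float> & src,
                                                      std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(src.size());
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

For L = 2 and HD = 3, a channel-major buffer `{0, 10, 1, 11, 2, 12}` (value = 10·t + c) becomes `{0, 1, 2, 10, 11, 12}` in time-major order, which is why a byte-for-byte copy between the two layouts cannot be correct.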

Backend	Pre-fix	Post-fix
CPU	abort on first denoise step	writes 3.89s 44.1 kHz WAV
Vulkan RTX 5090	abort	writes 6.53s WAV; 44 ms / 5 steps; 74× realtime
Vulkan AMD RADV iGPU	abort	writes 3.64s WAV; 178 ms; 7× realtime
Vulkan Mesa lavapipe	abort	writes 1.21s WAV

The round-1..10 wins are now actually exercised end-to-end on every Vulkan adapter we have — they just couldn't run before round 11 unblocked the production path.

Test plan

CPU-only — a fresh checkout's ctest -L unit exercises every new contract without needing a Vulkan adapter.

cmake -S tts-cpp -B build-tts
cmake --build build-tts --parallel
ctest --test-dir build-tts -L unit --output-on-failure

Expected: 22 / 22 tests, 0 failures, 0 regressions.

Test	Purpose	Round	Checks
`test-supertonic-vulkan-dispatch`	Backend-flag dispatch + F16-K/V probe smoke	1	29
`test-supertonic-portable-ops` (UPDATED)	LEAKY_RELU decomposition path stays exercised	1	—
`test-supertonic-capability-cache`	Probe-counter regression + new-probe coverage	2 + 3	—
`test-supertonic-warm-up-api`	SFINAE gate for `Engine::warm_up`	2	—
`test-supertonic-vulkan-device-select`	`resolve_vulkan_device_index` behaviour matrix	3	23
`test-supertonic-f16-weights` (UPDATED)	Deny-list overload	6	65
`test-supertonic-f16-deny-list-api`	SFINAE gate for the deny-list field	6	9
`test-supertonic-kv-attn-type`	`resolve_kv_attn_type` behaviour matrix	4	106
`test-supertonic-kv-attn-type-api`	SFINAE gate for the enum + EngineOptions field	4	18
`test-supertonic-f16-attn-parity` (UPDATED)	F16 + BF16 K/V parity vs F32 reference	4	8
`test-supertonic-vulkan-env-overrides`	Env-var CLI passthrough; operator-set env wins	7	29
`test-supertonic-upload-skip-tracker` (NEW)	Pointer-compare upload-skip + cross-synth pointer-reuse hazard	10	41
`test-supertonic-rope-packed-qk` (REWRITTEN)	Production matmul shape contract + output layout pin	11	14
Every other unit test	Zero-regression gate	—	unchanged
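
The "SFINAE gate" rows can be read as compile-time detection of an API surface. The sketch below shows the general C++17 detection idiom on two stand-in types; `has_warm_up`, `engine_with_warm_up`, and `engine_without_warm_up` are hypothetical names, and the PR's actual tests may be structured differently.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct has_warm_up : std::false_type {};

// Chosen only when `t.warm_up(const std::string &)` is a valid expression.
template <typename T>
struct has_warm_up<
    T,
    std::void_t<decltype(std::declval<T &>().warm_up(std::declval<const std::string &>()))>>
    : std::true_type {};

// Stand-in engine types (not the PR's Engine class).
struct engine_with_warm_up    { void warm_up(const std::string &) {} };
struct engine_without_warm_up {};

static_assert(has_warm_up<engine_with_warm_up>::value,
              "warm_up(text) must be part of the API surface");
static_assert(!has_warm_up<engine_without_warm_up>::value,
              "detector must reject types without warm_up");
```

A gate of this shape lets a test file compile before the API exists and fail on an assertion instead, which fits the test-first workflow described earlier.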

Smoke testing the CLIs

./build-tts/supertonic-cli --help 2>&1 | grep -A 6 kv-attn-type
./build-tts/supertonic-bench --help 2>&1 | grep -A 5 bench-per-step

# Real-Vulkan validation on RTX 5090 (74× realtime)
./build-tts/supertonic-cli --model models/supertonic2.gguf --text "Hello world" \
  --out /tmp/out.wav --voice M1 --n-gpu-layers 99 --vulkan-device 0 --prewarm "warm up"

./build-tts/supertonic-bench --model models/supertonic2.gguf --text "Hello world" \
  --voice M1 --n-gpu-layers 99 --vulkan-device 0 --runs 5 --warmup 1 \
  --prewarm "warm" --bench-per-step --json-out /tmp/bench.json

Bench JSON includes "kv_attn_type" (resolved), "kv_attn_type_requested" (raw int), and per-step timings so probe misses and per-step variance are attributable in CI/operator triage.

Backwards compatibility

--vulkan-device 0 semantics unchanged — round 1 introduced the flag; round 3's -1 is opt-in only.
--f16-weights 0|1 semantics unchanged — round 6's --f16-weights-deny is opt-in only.
--prewarm defaults to empty (no-op).
--kv-attn-type defaults to auto which falls back to round-1's use_f16_attn boolean — every existing config keeps the round-1 behaviour.
model.use_f16_attn boolean is still populated and is kept in sync with the round-4 enum (= (kv_attn_type == f16)) so any external code keying on the boolean stays consistent.
Round-1 / round-3 device selection throws on out-of-range CLI input (loud failure for actual config errors); all probe-gated dispatches fall back to F32 silently (advisory-probe contract, visible in bench output).
Round 11 fix: the new apply_rope_to_packed_qk contract is backwards-incompatible with the old (broken) one, but the old contract never actually worked in production: pre-fix it crashed on every backend. The 14-check test now pins both the input and output contracts so a future regression fails in the unit gate on the shape check.

File-by-file change summary

38 files changed, 13713 insertions(+), 692 deletions(-)

File	Δ	Notes
`tts-cpp/PROGRESS_SUPERTONIC.md`	+1219	11 round writeups + cross-references
`tts-cpp/CMakeLists.txt`	+252	New test targets + Vulkan-build wiring
`tts-cpp/include/tts-cpp/supertonic/engine.h`	+155	New `EngineOptions` fields + `Engine::warm_up()`
`tts-cpp/src/supertonic_internal.h`	+1254	`kv_attn_dtype` enum, 6 new probes, resolvers, `upload_skip_tracker` helper, `apply_rope_to_packed_qk` (round-11 fix)
`tts-cpp/src/supertonic_gguf.cpp`	+1509	Capability cache, multi-device auto-pick, dispatch-scope plumbing, deny-list, env-var passthrough
`tts-cpp/src/supertonic_vector_estimator.cpp`	+1781	Round-4 enum dispatch, round-8/9 GPU bridges, round-10 upload-skip, round-11 V/QKV transposes + helper rewrites
`tts-cpp/src/supertonic_engine.cpp`	+147	Probe-gated auto-policy, multi-device auto-pick, `warm_up` impl
`tts-cpp/src/supertonic_bench.cpp`	+406	All round flags + bench surface (per-step, sync, JSON, env passthrough)
`tts-cpp/src/supertonic_cli.cpp`	+80	Round flags + try/catch arg-parse hardening
`tts-cpp/src/chatterbox_cli.cpp`	+139	Round flags mirrored on the `tts-cli` alias
`tts-cpp/src/chatterbox_tts.cpp`	+1	`#include <atomic>` (pre-existing missing-include fix)
13 new test files	+3640	Rounds 1, 2, 3, 4, 6, 7, 10, 11 + audit-follow-up parity harnesses
3 updated test files	+900	Round 1, 4, 6, 11 extensions

Deferred follow-ups (intentionally out of scope; pre-existing on master)

Tracked in tts-cpp/PROGRESS_SUPERTONIC.md "Deferred work" section.

Auto-pick on hybrid discrete+iGPU machines — round 3's argmax(free_vram) policy picks the iGPU on machines like the one we tested (RTX 5090 + AMD RADV) because UMA reports system RAM as free VRAM. Pre-existing in this PR; fix candidate: bias against UMA when a discrete is present. Workaround: explicit --vulkan-device 0.
test-supertonic-audit3-caches F18 + F19 cache-reuse failures — these pre-existed on master (verified pairwise). Pre-round-11 they were hidden by the rope crash; post-round-11 they're newly observable but neither introduced nor fixable by this PR's content (text encoder for F18; cross-cache state-leak for F19). Both should be wired into CI as a separate ticket; F18/F19 affect the OpenCL build identically.
Persistent VkPipelineCache (chatterbox PROGRESS.md §3.32): recovers ~91 % of cold→warm shader-compilation gap on first warm run, keyed by <vendorID>-<deviceID>-<driverVersion>. This is a ggml-vulkan internal patch (~199 lines) that benefits all Vulkan workloads. The --prewarm flag (introduced in round 2) is an in-process workaround.
Pinned-host-buffer per-step uploads: round 3 added the capability probe so the cache + bench surface know whether the path is available. The actual per-engine input-scratchpad refactor is deferred until measured on a real Vulkan adapter so we can quantify the reduction in latent upload latency.

Linked

Asana: QVAC-18605 [TTS GGML] Add and optimize Vulkan for supertonic
Stacks on: PR Qvac 18607 tts ggml add and optimize open cl for supertonic #16 (QVAC-18607 OpenCL bring-up + audit follow-ups)
Reference: chatterbox.cpp's PROGRESS.md OpenCL / Vulkan optimization log

- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0

QVAC-7457: Add seed parameter for reproducible sampling

add_codeowners file

…orker added approval check worker

…ners DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

chore: rebase fork to whisper.cpp v1.8.4

Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor

- Add n_audio_window_size and n_audio_last_window_layer hparams
- When present, encoder self-attention is restricted to a local window for layers up to last_window_layer
- Bypass flash attention when windowed mask is active (Metal FA does not support custom F32 masks); flash attention remains enabled for non-BCI models and for the decoder
- Populate window_mask data on the encoder graph (not the cross graph)
- Add proper SOS token (language + transcribe) initialization for BCI models
Backward-compatible: n_audio_window_size defaults to 0 and n_audio_last_window_layer defaults to -1, disabling windowed attention entirely for standard whisper models.
Made-with: Cursor

Address review feedback:
1. Guard read_safe for BCI-specific hparams (n_audio_conv1_kernel, n_audio_window_size, n_audio_last_window_layer) behind a n_mels > 256 check. Standard whisper models have n_mels <= 128 and do not contain these fields; reading them unconditionally would corrupt the file position and break model loading.
2. Add explicit is_bci flag to hparams struct, set when BCI fields are detected during loading.
3. Use is_bci flag (instead of n_audio_window_size > 0) to guard the BCI-specific decoder SOS token initialization.
4. Log BCI-specific hparams when a BCI model is detected.
Made-with: Cursor

The windowed attention mask values depend only on n_ctx and window_size, both fixed after model load. Move the O(n_ctx^2) computation from whisper_encode_internal (called every encode) to whisper_init_state (called once). The encode path now just copies the precomputed data to the graph tensor. Made-with: Cursor

…, Threads
1. Fix window_mask_data / exp_n_audio_ctx mismatch: the precomputed mask uses hparams.n_audio_ctx, but the graph tensor is sized from exp_n_audio_ctx when params.audio_ctx is overridden. Now falls back to recomputing the mask at the effective n_ctx when sizes differ, preventing a buffer overflow into the smaller tensor.
2. Update whisper.pc.in: the install interface was changed to include/whisper but the pkg-config includedir still pointed to include/. Consumers using pkg-config would not find whisper.h.
3. Fix whisper-config.cmake.in: the whisper target publicly links Threads::Threads but find_dependency(Threads) was skipped on Windows, leaving downstream find_package(whisper) with an unresolved imported target. Now always resolve Threads.

…ash attention
1. Cache fallback mask recompute: when exp_n_audio_ctx overrides the default n_audio_ctx, the window mask is now recomputed once and cached in wstate (keyed on window_mask_n_ctx) instead of allocating a new std::vector on every whisper_encode_internal call.
2. Per-layer flash attention: layers above last_window_layer no longer need the windowed attention mask. The flash attention path is now used for those layers even when BCI windowed attention is active, instead of globally falling back to the softmax path for the entire encoder.
3. Use std::abs instead of C abs in both init-time and encode-time mask computation paths.

…alidation
1. Extract compute_window_mask() helper on whisper_state to eliminate the duplicated O(n_ctx^2) mask fill loop that appeared in both whisper_init_state and whisper_encode_internal. Both call sites now use the single helper, preventing future drift.
2. Guard the encode-time mask block with hparams.is_bci before doing the ggml_graph_get_tensor lookup. Cheaper and more explicit than relying on the tensor name string to determine whether BCI windowed attention is active.
3. Add hparams.is_bci to the graph builder guard for window_mask tensor creation, aligning it with the other BCI code paths.
4. Add validation for BCI hparams after reading from file: n_audio_conv1_kernel must be > 0, n_audio_window_size must be >= 0. Log an error and return false on invalid values instead of proceeding with garbage.
5. Add comment explaining the n_mels > 256 threshold used to discriminate BCI models from standard whisper models, and noting that a dedicated file-format marker should be introduced if this assumption ever breaks.
Made-with: Cursor

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)

…review)
Address @gianni-cor review on PR tetherto#11: switch the bundled ggml filename prefix from `libparakeet-ggml-*` to `libspeech-ggml-*` so the QVAC speech stack (whisper, parakeet, chatterbox, supertonic, ...) can co-vendor a single ggml file set instead of each library shipping its own copy.
- parakeet-cpp/CMakeLists.txt: OUTPUT_NAME prefix `parakeet-` -> `speech-`, GGML_BACKEND_DL_PROJECT_PREFIX macro `"parakeet-"` -> `"speech-"`, option blurb + status message updated.
- parakeet-cpp/README.md, patches/README.md, scripts/setup-ggml.sh, patches/ggml-backend-reg-filename-prefix.patch: doc / comment / example updated to reference the new `speech-` prefix.
Verified: setup-ggml.sh re-applies all patches cleanly; CMake configure prints `bundled ggml libraries will be emitted as libspeech-ggml-*`; build emits libspeech-ggml{,-base,-cpu,-blas,-metal}.{0,0.9.11}.dylib; parakeet binary's otool -L now references `libspeech-ggml*` exclusively.
Co-authored-by: Cursor <cursoragent@cursor.com>

Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml

…herto#6)
The standalone setup-ggml.sh + patches/ tooling was dropped from qvac-ext-lib-whisper.cpp/tts-cpp/ in the integration commit, but the CMakeLists.txt still:
* defaulted TTS_CPP_USE_SYSTEM_GGML=OFF, and
* unconditionally compile-defined GGML_BACKEND_DL_PROJECT_PREFIX="speech-" on the bundled ggml target.
That combination quietly broke standalone bundled-ggml builds: the filename-prefix patch was no longer applied, so libspeech-ggml-*.so files existed on disk but ggml's runtime loader still searched for libggml-*.so under GGML_BACKEND_DL=ON. Vulkan / OpenCL / CUDA backends silently failed to load on Android.
Fix per reviewer guidance: converge the speech stack on a single ggml source-of-truth. Standalone-bundled-ggml is no longer a supported build mode out of this in-tree subtree; the canonical path is `-DTTS_CPP_USE_SYSTEM_GGML=ON` against the QVAC speech-stack `ggml-speech` vcpkg port (qvac-ext-ggml/speech branch), which ships the patches pre-applied.
Edits:
- TTS_CPP_USE_SYSTEM_GGML default flipped from OFF to ON in this tree. Docstring spells out the rationale + points users at the standalone github.com/gianni-cor/chatterbox.cpp repo if they need a bundled-ggml dev build with patches/ present.
- The bundled-ggml branch of `if (NOT TARGET ggml)` now refuses to configure when patches/ is absent: a FATAL_ERROR points at the right consumption path (vcpkg ggml-speech) and the standalone fallback. Doesn't break in-tree-with-patches builds (parakeet-cpp in this same repo still ships patches/, so its bundled path is unaffected by this guard inside tts-cpp).
- Verified locally: `cmake -S tts-cpp -B build` (no flags) errors out at find_package(ggml CONFIG REQUIRED) with our new message pointing at the ggml-speech port; `cmake -S tts-cpp -B build -DTTS_CPP_USE_SYSTEM_GGML=OFF` errors out at the patches/ guard with the no-patches message.
- tts-cpp/scripts/setup-ggml.sh deleted: it referenced patches/ that no longer exist; running it would have errored out anyway. The standalone repo keeps its own setup-ggml.sh; only the in-tree subtree drops it.
The standalone chatterbox.cpp repo (the one tts-cpp/ was copied from) keeps TTS_CPP_USE_SYSTEM_GGML=OFF default + the patches/ folder + scripts/setup-ggml.sh. This commit is therefore an integration-time delta against that source, not a change to the standalone build flow.
Co-authored-by: Cursor <cursoragent@cursor.com>

… / vector graph caches
QVAC-18607 follow-up #3. Three more audit findings landed on top of follow-up #2 (commit 5f457c9); eliminates another ~30 GPU↔host sync points + ~6 allocator churn cycles per synth.
F17 Duration scalar-continuation `read_f32` cache. Generic `cached_read_f32(model, name)` helper backed by the new `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend tensor reads per synth across `self_attention`, `ffn_block`, and the `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V, conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers + activation). Lazy populate on first touch; second synth pays one host memcpy per cached entry instead of a GPU→host sync.
F18 Text-encoder convnext-front graph cached across synths. `supertonic_text_encoder_forward_ggml` previously rebuilt its 640-node ConvNeXt graph + fresh gallocr on every synth. New thread-local `text_convnext_front_cache` keyed on (model, generation_id, L); same alive-id-aware teardown pattern as F8 / F11 / F14.
F19 Vector-estimator front-block graph cached across denoise steps. The ~200-node front-block graph (proj_in → masked → block0 convnext × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh per step (5 alloc/free cycles per synth on the default schedule). Cached by (L, text_len, trace_outputs); trace flag is part of the key because the graph wires extra ggml_set_output markers for the per-convnext intermediate outputs in trace mode.
New TDD harness (fixture-bound): test-supertonic-audit3-caches (279 lines)
- F17 (structural): asserts the scalar_weight_cache map contains the expected entries after the first duration call and does NOT grow on the second; duration scalar is bit-exact across the two calls.
- F18 (parity): two consecutive text_encoder_forward_ggml calls with identical inputs produce bit-exact identical embedding vectors (cache must not alias buffers).
- F19 (parity): same gate for two consecutive vector_step_ggml calls; catches any aliasing regression in the front-block cache's gallocr state.
Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same `ggml_backend_tensor_get` source the old `read_f32` did → bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones; per-call path is unchanged (tensor_set inputs → compute → tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector estimator (after F8 + F11-style siblings); thread-local teardown order matches the alive-id contract used by all of them.
Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU sync points eliminated per steady-state synth.
Diff: 6 sources changed, 1 new test, 1 CMakeLists update. +327 / -172 in src/ + CMakeLists + internal header. +279 new test.
What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points / synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder layout flip vs remaining audit candidates from the CSV.
Co-authored-by: Cursor <cursoragent@cursor.com>

… helper (F20 partial)

Adds `apply_rope_in_graph(ctx, x, cos, sin)` plus a host-side `make_rope_cos_sin_tables(theta, L, half)` precompute helper in supertonic_internal.h. Both use only universally-supported GGML ops (reshape / view / permute / mul / add) so the rotation can later run on the OpenCL / Metal / Vulkan backends without per-element scalar CPU work or extra get/set sync points.

Integration into the 8 attention sites is deferred to keep this change small and reviewable; the existing scalar `apply_rope` path is unchanged.

Test: new test/test_supertonic_rope_in_graph.cpp verifies
- parity vs scalar apply_rope on a synthetic Q tensor
- identity behaviour when cos=1 / sin=0

Wired into CMakeLists.txt with the "unit" label.

Co-authored-by: Cursor <cursoragent@cursor.com>
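As a rough illustration of the rotation that the cos/sin tables feed, here is a minimal scalar sketch. It assumes a half-split pairing (element `i` is rotated together with element `i + half`), which the `half` parameter suggests but the commit message does not confirm; the function name is hypothetical and this is not the PR's `apply_rope` or `apply_rope_in_graph`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference, NOT the PR's helper.
// x holds one head at one position: 2 * half floats.
// cos_t / sin_t hold `half` precomputed entries for that position.
// Assumption: element i is paired with element i + half.
void apply_rope_half_split_sketch(std::vector<float>& x,
                                  const std::vector<float>& cos_t,
                                  const std::vector<float>& sin_t) {
    const std::size_t half = cos_t.size();
    for (std::size_t i = 0; i < half; ++i) {
        const float a = x[i];
        const float b = x[i + half];
        // Standard 2D rotation, expressed with mul / add / sub only.
        x[i]        = a * cos_t[i] - b * sin_t[i];
        x[i + half] = a * sin_t[i] + b * cos_t[i];
    }
}
```

With cos=1 / sin=0 the input comes back unchanged, which mirrors the identity check the commit's test describes.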

… integration (F20+F23) Bakes the per-step apply_rope rotation into the same GGML graphs that produce Q/K (4 attention sites: front block + 3 group caches), eliminating the 40 host-side CPU rotations / synth (~2 ms wall-time) plus the implicit "host can't dispatch next graph until rotation completes" ordering constraint. Helper: new inline `apply_rope_to_packed_qk(ctx, q, cos, sin, n_heads, head_dim)` in supertonic_internal.h — a zero-cost layout adapter between the `[head_dim, n_heads, L]` contract of the already-landed `apply_rope_in_graph` helper (F20-h) and the `[H*D, L]` packed tensor that `dense_matmul_time_ggml` produces. Universally-supported ops only (view, cont, reshape, mul, sub, add, repeat, concat) — green on baseline upstream OpenCL. Graph wiring: each Q/K-producing cache (vector_group_graph_cache + ve_front_block_graph_cache) now owns four host-uploaded cos/sin input tensors (Q's L + K's text_len) and emits `<q_name>_rope` / `<k_name>_rope` outputs alongside the pre-RoPE entries. cos/sin tables are populated once at cache build time (stable for the cache's lifetime since they depend only on L / text_len / θ). Call sites: the 4 RoPE-using sites in `supertonic_vector_trace_proj_ggml` consume the cache's `q_rope` / `k_rope` outputs directly and only fall back to host apply_rope when the GGUF didn't ship `vector_rope_theta` (legacy safety net). The pre-RoPE Q/K trace entries remain unchanged so scalar-parity harnesses keep their existing contract. Test: new test/test_supertonic_rope_packed_qk.cpp — CPU-backend parity vs scalar apply_rope on the two hot vector-estimator shapes (q_len=20×H=4×D=64, kv_len=32×H=4×D=64) + an L=1 degenerate trip-wire. Bit-exact (max_abs_err=0.0). Wired into CMakeLists.txt with LABEL "unit" (no GGUF required). 
Full sweep verification: - 9 / 9 supertonic source files: clean syntax-check - 21 / 21 test files: clean syntax-check - 98 / 98 CPU-only unit-test checks pass across test-supertonic-{rope-packed-qk, rope-in-graph, portable-ops, backend-dispatch, f16-attn-parity, profile-csv}. Audit pass tetherto#5 catalogued the remaining hot-path opportunities; deferred items (F7 vocoder layout flip, F12 host transposes, 2C full Q/K/V graph fusion, 2B Q8_0 quantization) tracked in aiDocs/AUDIT_SUPERTONIC_OPENCL.md. Co-authored-by: Cursor <cursoragent@cursor.com>

…on, in-graph transpose, Q/K/V GPU bridge Three optimizations targeted by audit findings F7, F12, and a new F24 (2C-lite), each landed with a TDD unit test that runs CPU-only (no GGUF fixture required). F7 — Vocoder ConvNeXt block fusion: * convnext_block_fused_ggml (supertonic_internal.h) keeps the LN output in [C, T0] (channel-major) and lowers the two K=1 pointwise convs to direct ggml_mul_mat against that layout, eliminating the layer-norm back-permute and both im2col copies the previous chain paid (~16.8 MiB / vocoder pass across the 10 blocks). * test_supertonic_convnext_block_fused.cpp — CPU parity vs scalar reference, max_abs_err = 3.815e-06 over a vocoder-realistic [C=64, T=20] shape. F12 — In-graph time/channel transpose: * transpose_time_channel_ggml (supertonic_internal.h) replaces the pack_time_channel_for_ggml host loops at every run_*_cache ingestion site in supertonic_vector_estimator.cpp (group / res-style QKV / style residual / tail). Cache inputs now declare ne=[C, L]; callers upload CPU-native x_tc directly and the graph does ggml_cont(ggml_transpose(...)). * Also drops a redundant double-transpose on the tail-graph noisy_latent path. * test_supertonic_in_graph_transpose.cpp — 9 checks, bit-exact (max_abs_err = 0.0) across group_graph, tail_noise, and L=1 trip-wire shapes. F24 (2C-lite) — GPU→GPU Q/K/V bridge between group graph and attention graph: * vector_group_graph_result exposes q_rope_gpu / k_rope_gpu / v_gpu tensor handles harvested from the group cache's graph. * run_text_attention_cache_gpu — new overload that consumes those handles via ggml_backend_tensor_copy (same-backend device→device blit) instead of the historical tensor_get + tensor_set pair. * Host downloads of q_rope/k_rope/v inside run_group_graph_cache are now gated on (trace != nullptr || !apply_rope); production runs with in-graph RoPE skip them entirely. 
* g1 / g2 / g3 attn call sites in supertonic_vector_trace_proj_ggml use the GPU fast path (legacy host-RoPE fallback preserved for GGUFs without vector_rope_theta).

Net: 90 sync points / synth eliminated. Front-block and the four style attention sites still pay the round-trip; targeting them is the next iteration.

* test_supertonic_graph_to_graph_blit.cpp: 15 checks, bit-exact across the five representative attn/style shapes plus L=1.

Verification: all five new + pre-existing CPU unit tests pass (38/38 checks).

Co-authored-by: Cursor <cursoragent@cursor.com>

The plan document is an AI-authored R&D scratchpad that doesn't belong in the committed source tree alongside production code. Move it out of tts-cpp/ so the subtree only ships the implementation; the file continues to live locally under aiDocs/ for ongoing iteration. No code or build changes; documentation-only. Co-authored-by: Cursor <cursoragent@cursor.com>

…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic

Layered on top of the QVAC-18607 OpenCL bring-up + audit follow-ups (PR tetherto#16): the audit-driven optimisations there are backend-portable by construction (every host-sync / bandwidth / fusion win uses the same GPU dispatch path Vulkan walks), so this PR only adds the Vulkan-specific dispatch deltas the OpenCL bring-up did not need. Vulkan-specific deltas - supertonic_model gains backend_is_vk + use_native_leaky_relu, both resolved at GGUF load time: - backend_is_vk via ggml_backend_is_vk so verbose / bench / engine backend_name() can annotate the device with ggml_backend_vk_get_device_description(). - use_native_leaky_relu via a ggml_backend_supports_op probe against a synthetic LEAKY_RELU node — short-circuits leaky_relu_portable_ggml to the fused builtin on Vulkan / Metal / CUDA / chatterbox-patched OpenCL, keeps the conservative RELU+SCALE+ADD decomposition for plain upstream OpenCL. Dynamic probe self-adapts to whichever ggml-opencl-chatterbox-ops.patch state the consumer's vendored ggml ships in. - supertonic_backend_supports_f16_kv_flash_attn probe (synthetic Supertonic-shaped ggml_flash_attn_ext(Q=F32, K/V=F16) node) gates the use_f16_attn auto-policy so a backend that ships flash_attn_ext but rejects the F16-K/V variant for Supertonic shapes keeps the F32 path instead of crashing at first synth call. Manual --f16-attn 1 still forces F16 (debug knob). - Vulkan device selection: replaces the historical hard-coded ggml_backend_vk_init(0) with --vulkan-device N CLI flag plumbed through EngineOptions::vulkan_device, range-checked against ggml_backend_vk_get_device_count() at load (out-of-range index is a hard error — surfaces operator typos / wrong-machine config loud rather than silently falling back to CPU). Verbose mode + bench output append the Vulkan device description so multi-GPU / multi-ICD machines unambiguously identify which adapter ran. 
- supertonic_op_dispatch_scope extended with prev_use_native_leaky_relu slot so the scope correctly mirrors the new model field through thread-local dispatch. Tests - test-supertonic-vulkan-dispatch (new, LABEL "unit"): CPU-only harness covering the new flags through supertonic_op_dispatch_scope plus a smoke test for the F16-K/V flash-attn probe. 29/29 checks pass. - test-supertonic-portable-ops (existing): fixture model now requests use_native_leaky_relu = false explicitly so the GPU-decomposition correctness gate stays green now that the helper short-circuits on backends with native LEAKY_RELU. 10/10 checks pass. - test-supertonic-backend-dispatch (existing): unchanged, 27/27 pass. - All audit follow-up tests from tetherto#16 unchanged, all PASS. Build - All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined; non-Vulkan builds compile clean. - No public-API break: EngineOptions::vulkan_device defaults to 0 (the historical hard-coded value), load_supertonic_gguf gains a new optional last argument with the same default; existing callers are source-compatible. Deferred (tracked in PROGRESS_SUPERTONIC.md "GPU bring-up: Vulkan"): persistent VkPipelineCache (ggml-vulkan-internal patch, benefits all Vulkan workloads), BF16 / Q4_0 / Q8_0 K/V flash-attention, multi-device load-balancing (--vulkan-device -1 auto-pick). Co-authored-by: Cursor <cursoragent@cursor.com>

`g_s3gen_cache_refcount` is a `std::atomic<int>` (line 189) but `<atomic>` was never included; the file relied on a transitive include chain that broke once any consumer rearranged includes. Surfaces as `error: variable 'std::atomic<int> ... has initializer but incomplete type'` on a clean build. Pre-existing bug, unrelated to QVAC-18605 itself but blocked local CTest runs against the Vulkan-optimisation work. Trivial additive include with no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>

…s + prewarm Layered on top of the QVAC-18605 Vulkan bring-up commit; the round-2 changes generalise the bring-up's "load-time backend probe" pattern into a process-wide capability cache and add three more probes / dispatch hooks that fit the same shape. Net effect on Vulkan: redundant supports_op traffic eliminated, defensive auto-policy gating extended to F16 weights, forward- compat Q8_0 K/V probe primed for a follow-up dispatch flip, and an opt-in --prewarm hook that lets operators amortise the ~hundreds-of-ms cold-start shader-compile cost outside the operator-visible first synth call. 1) Process-wide capability-probe cache keyed by ggml_backend_t The bring-up's three load sites (load_supertonic_gguf, Engine::Engine, supertonic_bench's main) each ran the LEAKY_RELU + F16-K/V flash-attn supports_op queries independently — 2-3x redundant probe traffic per backend. On Vulkan, supports_op may inspect the device's pipeline state (~50-200 us per query on Adreno / llvmpipe / RADV in microbenchmarks); the cache short-circuits 100 % of the duplicates. Test seam (supertonic_clear_capability_cache + supertonic_capability_probe_call_count) lets the unit test verify the cache is hit on the second call by comparing the counter before / after. Per-backend independence verified against two distinct CPU backend handles. 2) F16 mul_mat backend-capability probe Symmetric to the F16-K/V flash-attn probe. The bring-up auto-enabled use_f16_weights on `!backend_is_cpu` blindly; a partial-port backend that ships F16 storage but rejects the hot vector-estimator W_query mul_mat shape would crash at first synth call. Probe builds the live shape ([256,256] F16 weight x [256,16] F32 activation) and asks the backend; auto-policy refuses materialisation on a `false` answer (slower F32 path stays correct). Manual --f16-weights 1 still forces materialisation (debug-shim escape hatch). Probe cached; test verifies CPU returns true. 
3) Q8_0 K/V flash-attn forward-compat probe Vulkan's GGML_OP_FLASH_ATTN_EXT supports_op advertises Q8_0 (and Q4_0) K/V types in scalar + coopmat2 paths. Switching K/V from F16 to Q8_0 would halve the per-step upload bandwidth (50 KB -> 25 KB per K/V on Supertonic's hot shape; ~1 MB / synth on the default 5-step x 4-site schedule) in exchange for a small (~0.5 %) drift on the attention output. This commit adds the probe + caches the result; live dispatch site is NOT yet wired pending F16-vs-Q8_0 K/V drift measurement against the parity harness on a real Vulkan adapter. Bench output annotates `(q8_0_kv_attn=available)` when the probe says yes so operators can confirm their hardware is ready for the follow-up. 4) Engine::warm_up(text) + EngineOptions::prewarm_text + --prewarm CLI flag (supertonic-cli, tts-cli, supertonic-bench) First-synth-latency reduction on Vulkan / OpenCL. In-tree thread_local graph caches handle every subsequent call but can't avoid the first pipeline-compile cost (~hundreds of ms on Adreno / RADV per chatterbox PROGRESS.md). warm_up runs one throwaway synth at construction time on a caller- supplied sample text so the operator-visible first synth sees steady-state latency. Auto-no-op on CPU (no shader- compile cost). Bench's --prewarm runs the cold-start synth BEFORE the timed loop (independent of --warmup N which only discards N timed runs from the median); cold-start latency logged as `[prewarm] cold-start synth on '...' took N.Nms` and emitted to --json-out as "prewarm_ms". 5) Bench output extended Backend log line surfaces every dispatch flag plus the cold-start prewarm latency: Vulkan (device 0: ...) (f16_attn=on) (f16_weights=on) (native_leaky_relu=on) (q8_0_kv_attn=available) --json-out gains "f16_attn", "f16_weights", "native_leaky_relu", "q8_0_kv_attn_available", "prewarm_ms" keys for downstream analysis tooling. 
Tests
- test-supertonic-capability-cache (NEW, LABEL "unit"): probe cache short-circuit + clear seam + per-backend independence + idempotency + F16 mul_mat probe + Q8_0 K/V probe smoke. 18 / 18 checks pass.
- test-supertonic-warm-up-api (NEW, LABEL "unit"): API-surface contract for EngineOptions::prewarm_text + Engine::warm_up via SFINAE. 9 / 9 checks pass.
- All existing CPU-only unit tests (test-supertonic-vulkan-dispatch, -portable-ops, -backend-dispatch, -rope-in-graph, -rope-packed-qk, -in-graph-transpose, -convnext-block-fused, -graph-to-graph-blit, -profile-csv, -f16-attn-parity, plus resample / cpu-caches / t3-caches): all 13 pass unchanged.
- ctest -L unit reports 100 % pass (15 / 15 binaries; 184+ / 184+ individual checks).

Build
- All changed source files compile clean with both -DGGML_USE_VULKAN defined and undefined.
- No public-API break: EngineOptions::prewarm_text is a new optional field defaulting to empty (no-op), Engine::warm_up is a new method (existing callers don't have to invoke it).

Deferred (tracked in PROGRESS_SUPERTONIC.md "Deferred work"): persistent VkPipelineCache (cross-process), BF16 K/V flash-attn, Q8_0 K/V live dispatch wiring, multi-device load-balancing.

Co-authored-by: Cursor <cursoragent@cursor.com>
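The process-wide probe cache with its test seam can be sketched roughly as follows. This is a hypothetical stand-in: it keys on an opaque `const void*` instead of `ggml_backend_t`, takes the probe as an injected callable, and uses invented names (`capability_cache_sketch`, `cached_caps_sketch`); the real `supertonic_clear_capability_cache` / `supertonic_capability_probe_call_count` functions may be shaped differently.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical capability record; the PR caches more flags than this.
struct cached_caps_sketch {
    bool native_leaky_relu = false;
    bool f16_kv_flash_attn = false;
};

class capability_cache_sketch {
public:
    using probe_fn = std::function<cached_caps_sketch(const void*)>;

    explicit capability_cache_sketch(probe_fn probe) : probe_(std::move(probe)) {}

    // Runs the probe at most once per backend handle.
    cached_caps_sketch get(const void* backend) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = cache_.find(backend);
        if (it != cache_.end()) return it->second;  // cache hit, no probe traffic
        ++probe_calls_;
        cached_caps_sketch caps = probe_(backend);
        cache_.emplace(backend, caps);
        return caps;
    }

    // Test seam: drop cached entries (the call counter is kept).
    void clear() {
        std::lock_guard<std::mutex> lock(mu_);
        cache_.clear();
    }

    // Test seam: how many times the underlying probe actually ran.
    int probe_call_count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return probe_calls_;
    }

private:
    probe_fn probe_;
    mutable std::mutex mu_;
    std::unordered_map<const void*, cached_caps_sketch> cache_;
    int probe_calls_ = 0;
};
```

A test can then compare `probe_call_count()` before and after a second `get()` on the same handle to confirm the cache was hit, which is the pattern the commit describes.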

…vice auto-pick + 2 forward-compat probes Three more Vulkan-specific deltas, all developed test-first. New tests were committed first, observed to fail on the missing symbol, and only then was the implementation written and the tests re-run to verify green. 1. BF16 K/V flash-attn capability probe (5th cached_backend_capabilities flag). Symmetric to the round-2 Q8_0 K/V probe. Vulkan's FLASH_ATTN_EXT supports_op advertises BF16 K/V via the coopmat2- only path; BF16 has the same 2-byte per-element footprint as F16 (so identical upload bandwidth) but the wider 8-bit exponent range avoids the F16 underflow on small attention scores. Forward-compat — the live --kv-attn-type bf16 dispatch wiring is deferred to a follow-up that measures drift against the parity harness on a real Vulkan adapter. 2. Multi-device auto-pick for --vulkan-device -1. Wires the previously-reserved auto-pick API: walks every visible adapter, queries ggml_backend_vk_get_device_memory() to read free VRAM, and dispatches into a pure-logic helper resolve_vulkan_device_index(requested, free_vram_per_device) that picks argmax(free_vram); ties → lower index for stable per-run assignment on identical-spec multi-GPU machines. The pure-logic helper is testable on CPU with synthetic inputs (8 test functions, 23 checks). Reserved-future negative values (-2, -100, ...) now throw instead of silently falling through to device 0. Verbose mode logs the per-device VRAM table so operators can confirm the auto-pick chose the expected adapter. 3. Pinned-host-buffer-type capability probe (6th cache flag) + bench surface. Probes whether ggml_backend_vk_host_buffer_type() is callable on the resolved backend (Vulkan + non-null buffer- type). Forward-compat — primes the capability cache for a follow-up per-engine input-scratchpad refactor that skips ggml-vulkan's internal staging-buffer hop on per-step uploads. 
Bench output now shows bf16_kv_attn_available + pinned_host_buffer_available in both the human-readable backend tag and the JSON output so operators can pre-flight whether a future opt-in will be effective on their machine.

Test plan (TDD round 3):
- test-supertonic-capability-cache: 27 / 27 checks pass (was 18, +9 checks for round-3: BF16 K/V smoke + cache-slot share, pinned-host-buffer smoke + cache-slot share, null-backend defensive checks for both new probes).
- test-supertonic-vulkan-device-select (NEW): 23 / 23 checks pass (8 test functions: empty-list, single-device, argmax-VRAM, tie-break, explicit-index passthrough, out-of-range, reserved-negative, zero-VRAM handling).
- Whole CPU-only ctest -L unit reports 16 / 16 tests passing, zero regressions on round-1 / round-2 / audit-follow-up tests.

CLI surface:
- supertonic CLI + chatterbox CLI usage strings updated to document --vulkan-device -1 = auto-pick adapter with most free VRAM.
- supertonic-bench usage string updated likewise.

Co-authored-by: Cursor <cursoragent@cursor.com>
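The auto-pick rule (argmax of free VRAM, ties to the lower index, reserved negatives rejected) is simple enough to sketch as pure logic. The version below is a hypothetical re-creation, not the PR's `resolve_vulkan_device_index`: the exception types, the `_sketch` name, and the choice to throw on an empty device list are assumptions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical re-creation of the pure-logic resolver described above.
//   requested >= 0  : explicit index, range-checked
//   requested == -1 : auto-pick the adapter with the most free VRAM
//   requested < -1  : reserved, rejected
int resolve_vulkan_device_index_sketch(int requested,
                                       const std::vector<std::uint64_t>& free_vram_per_device) {
    const int n = static_cast<int>(free_vram_per_device.size());
    if (n == 0) {
        throw std::runtime_error("no Vulkan devices visible");  // assumed behaviour
    }
    if (requested < -1) {
        throw std::invalid_argument("reserved negative device index");
    }
    if (requested >= 0) {
        if (requested >= n) throw std::out_of_range("Vulkan device index out of range");
        return requested;
    }
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict '>' keeps the lower index on ties, so assignment is stable
        // across runs on identical-spec multi-GPU machines.
        if (free_vram_per_device[i] > free_vram_per_device[best]) best = i;
    }
    return best;
}
```

Keeping this function free of any Vulkan calls is what lets it be tested on CPU with synthetic VRAM tables.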

…hts operator deny-list Round 6 layers a user-overridable extra deny-list on top of the existing hand-curated should_materialise_f16_weight() allow-list. The curated allow-list (Phase 2A) already excludes biases, norms, embeddings, depthwise convs, and pre-transposed companions; the round-6 deny-list lets operators force-keep specific additional tensors as F32 even when --f16-weights is on. Use cases: - A/B testing: researcher excludes a specific tensor pattern temporarily without recompiling. - Hardware-specific drift mitigation: operator pins a problematic tensor to F32 via config rather than disabling F16 weights wholesale. - Future-GGUF safety net: new tensor patterns added in future GGUFs that the curated allow-list inadvertently scoops in can be excluded via config without a code change. Smallest blast radius of the four follow-up rounds — load-time policy only, runtime dispatch unaffected, zero behaviour change on the empty-deny-list default path. Strict TDD discipline (per the user's "double check, don't break anything" constraint): - Both new tests committed FIRST. - Both confirmed to fail to compile on the missing symbols (predicate test: 'too many arguments to should_materialise_f16_weight'; API test: 'EngineOptions has no member f16_weights_deny_list'). - Implementation written. - Both tests + every existing unit test re-run; all green. What changed: 1. 2-arg overload should_materialise_f16_weight(name, extra_deny_substrings) added alongside the existing 1-arg version (existing test + call sites unchanged). Substring matching matches the curated predicate's audit-friendly style; no regex compile cost or invalid-pattern surface. The deny- list can only flip true → false, never false → true. Empty strings inside the deny-list are SKIPPED defensively, not treated as universal matches (config-typo guard). 2. EngineOptions::f16_weights_deny_list (vector<string>, default empty) — public API surface. 
Wired through Engine::Impl → load_supertonic_gguf → the per-tensor allocation loop.
3. load_supertonic_gguf 7th parameter added at the end of the signature with a {} default; every existing call site keeps compiling without modification.
4. supertonic_model::f16_weights_excluded_count counter bumped at load time when a curated-hot tensor is excluded by the user's deny-list. Surfaced in bench's human + JSON output so operators can confirm their config took effect.
5. CLI plumbing: --f16-weights-deny PAT1,PAT2,... flag on supertonic-cli, tts-cli (chatterbox), and supertonic-bench (comma-separated substring patterns).
6. Verbose-log line in load_supertonic_gguf when the deny-list is non-empty (silent on the default path, so no visual noise on existing operator workflows).

Test plan (TDD round 6):
- test-supertonic-f16-weights (UPDATED): existing 36 checks (positives, negatives, edges) + 29 new round-6 checks across 7 new test functions (empty-list passthrough, matching-deny-excludes, non-matching-no-op, cannot-promote-cold, multiple-patterns ANY-match, empty-string defensive skip, empty-name safety) → 65 / 65 PASS.
- test-supertonic-f16-deny-list-api (NEW): SFINAE compile-time gate for EngineOptions::f16_weights_deny_list + load_supertonic_gguf 7th param; runtime defaults check + assignability + regression guards on every other documented EngineOptions default → 9 / 9 PASS.
- Whole CPU-only ctest -L unit reports 17 / 17 tests, 0 failures, 0 regressions on round-1/2/3 + audit follow-up + the baseline tests.
- Smoke-tested supertonic-cli + tts-cli + supertonic-bench binaries: --f16-weights-deny flag parses correctly, surfaces in --help output, and threads through to the load layer.

Co-authored-by: Cursor <cursoragent@cursor.com>
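The deny-list contract (substring match, can only flip true → false, empty patterns skipped) can be sketched like this. The curated predicate here is a made-up stand-in, since the real `should_materialise_f16_weight` allow-list is not reproduced in the commit message; all names ending in `_sketch` are hypothetical.

```cpp
#include <string>
#include <vector>

// Made-up stand-in for the curated allow-list: accept names ending in
// ".weight" that do not mention "norm". The real rules differ.
bool curated_allows_f16_sketch(const std::string& name) {
    const std::string suffix = ".weight";
    const bool is_weight =
        name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    return is_weight && name.find("norm") == std::string::npos;
}

// Two-argument form: the deny-list can only veto, never promote.
bool should_materialise_f16_weight_sketch(const std::string& name,
                                          const std::vector<std::string>& extra_deny) {
    if (!curated_allows_f16_sketch(name)) return false;  // cold stays cold
    for (const std::string& pat : extra_deny) {
        if (pat.empty()) continue;  // config-typo guard: "" is not a wildcard
        if (name.find(pat) != std::string::npos) return false;
    }
    return true;
}
```

Skipping empty patterns matters because `std::string::find("")` matches every name, so a stray trailing comma in `--f16-weights-deny` would otherwise disable F16 weights wholesale.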

…ype K/V flash-attention dispatch Generalises the round-1 `--f16-attn` boolean (F16 vs F32 only) into a four-valued enum + `--kv-attn-type {auto,f32,f16,bf16,q8_0}` CLI flag so operators can opt into BF16 K/V (Vulkan coopmat2 — same bandwidth as F16, no F16 underflow on small attention scores) or Q8_0 K/V (Vulkan + half the K/V upload bandwidth) on adapters that advertise the corresponding capability. Default `auto` falls back to `--f16-attn` so every existing operator config sees zero behaviour change. Strict TDD throughout: Prereq B extends the F16 parity harness to cover BF16 (4 → 8 checks, 5e-3 abs / 5e-3 rel tolerance band, both hot shapes) BEFORE touching any production code; new pure-logic resolver test (`test-supertonic-kv-attn-type`, 106 checks across the full {-1, 0..3} × legacy × probe-mask matrix); new API-surface SFINAE lockdown (`test-supertonic-kv-attn-type-api`, 18 checks). Tests committed first, observed to fail on missing symbols, then implementation added. Pure-logic resolver (`resolve_kv_attn_type`) split from the dispatch site (same pattern as round-3's `resolve_vulkan_device_index`). Probe-rejected explicit requests fall back to F32 silently (advisory-probe contract); out-of-range int throws to surface CLI typos loudly. Vector-estimator dispatch site (`build_text_attention_cache`) replaces the F16-only cast with a switch on the enum; cache key promoted from `bool f16_kv_attn` to `kv_attn_dtype kv_attn_type`. Bench surface adds `(kv_attn_type=…)` to the human-readable backend line and `"kv_attn_type"` + `"kv_attn_type_requested"` to the JSON output so log-grep / CI attribution works across machines. Bonus: `supertonic-cli`'s arg-parse loop is now wrapped in try/catch so invalid values surface as a clean `error: ...` line + exit 2 (also fixes the pre-existing latent crash on `--vulkan-device abc` / `--seed nonsense`). Whole CPU-only `ctest -L unit` reports 19 / 19 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
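A minimal sketch of the resolver's decision table, under stated assumptions: the integer encoding (-1 = auto, 0..3 = f32, f16, bf16, q8_0) is inferred from the "{-1, 0..3}" test matrix, `auto` is assumed to pass the legacy `--f16-attn` result through unchanged, and the struct and function names are hypothetical rather than the PR's `resolve_kv_attn_type`.

```cpp
#include <stdexcept>

// Assumed encoding; the PR's enum values are not shown in the commit message.
enum class kv_attn_dtype_sketch { f32 = 0, f16 = 1, bf16 = 2, q8_0 = 3 };

// Hypothetical probe results for the resolved backend.
struct kv_probe_mask_sketch {
    bool f16  = false;
    bool bf16 = false;
    bool q8_0 = false;
};

kv_attn_dtype_sketch resolve_kv_attn_type_sketch(int requested,
                                                 bool legacy_f16_attn,
                                                 const kv_probe_mask_sketch& probes) {
    if (requested < -1 || requested > 3) {
        // Out-of-range values are treated as CLI typos and surfaced loudly.
        throw std::invalid_argument("invalid kv-attn-type value");
    }
    if (requested == -1) {
        // auto: defer to the legacy boolean so existing configs are unchanged.
        return legacy_f16_attn ? kv_attn_dtype_sketch::f16 : kv_attn_dtype_sketch::f32;
    }
    const auto want = static_cast<kv_attn_dtype_sketch>(requested);
    bool supported = true;
    switch (want) {
        case kv_attn_dtype_sketch::f32:  supported = true;        break;
        case kv_attn_dtype_sketch::f16:  supported = probes.f16;  break;
        case kv_attn_dtype_sketch::bf16: supported = probes.bf16; break;
        case kv_attn_dtype_sketch::q8_0: supported = probes.q8_0; break;
    }
    // Probe-rejected explicit requests fall back to F32 silently.
    return supported ? want : kv_attn_dtype_sketch::f32;
}
```

Because the function takes only plain values, the whole request × legacy × probe matrix can be enumerated in a CPU-only test.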

…servability + voice cache + Vulkan env-var passthrough Lowest impact-÷-risk round of the four planned in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. Four sub-features, none touching the per-synth hot path beyond a single voice-cache lookup. 1. Voice ttl/dp host cache (`detail::voice_host_cache`). Eliminates 2 sync points / synthesize() after the first per-voice call on Vulkan / OpenCL. Extracted to a standalone helper so the lookup-or-load semantics are testable on CPU without instantiating a full Engine; reference-stability contract documented for the synthesis-pipeline call site. 2. Vulkan env-var passthrough (`apply_vulkan_env_overrides(map)` public helper + `EngineOptions::vulkan_env_overrides` field + `--vulkan-prefer-host-memory` / `--vulkan-disable-coopmat2` / `--vulkan-disable-bfloat16` / `--vulkan-perf-logger` / `--vulkan-async-transfer` / `--vulkan-env KEY=VALUE` CLI flags on all three binaries). ALL-OR-NOTHING validation: an operator-config typo throws cleanly BEFORE any env var is touched. `set_env_if_unset` semantics so an operator-set env var still WINS over the EngineOptions override. 3. Bench `ggml_backend_synchronize` boundaries (`--no-bench-sync` opt-out). Inserts an explicit backend sync at every per-stage timing boundary so wall-clock attributes to the right stage on async backends. Cheap on CPU; prerequisite for measuring round-5 / 8 / 9 wins on real hardware. 4. Bench per-denoise-step breakdown (`--bench-per-step`). Times each `supertonic_vector_step_ggml` call individually so the first-step (cold pipeline) cost is distinguished from steady-state. Empty array on the default-off path = identical legacy JSON shape. Strict TDD throughout. Two new test executables committed first, observed to fail on missing symbols, then implementation written. 
TDD also caught a real bug: the original env-key validator used an empty `std::string()` as its success sentinel, which collided with the empty-string-as-key edge case; the test pinned the contract and forced a `bool` / out-param API fix BEFORE any production wiring went in.

Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions (was 19; +2 new tests = 54 new checks).

Co-authored-by: Cursor <cursoragent@cursor.com>
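The all-or-nothing validation and `set_env_if_unset` semantics might look roughly like the sketch below. It is POSIX-only (`setenv`), the rule that keys must start with `GGML_VK_` is an assumption made for illustration, and the function names are hypothetical rather than the PR's `apply_vulkan_env_overrides`.

```cpp
#include <cstdlib>
#include <map>
#include <stdexcept>
#include <string>

// bool result + out-param error text, so an empty key cannot be confused
// with "success" the way an empty-string sentinel could.
bool validate_env_key_sketch(const std::string& key, std::string& err) {
    if (key.empty()) {
        err = "empty env key";
        return false;
    }
    if (key.compare(0, 8, "GGML_VK_") != 0) {  // assumed allow-rule
        err = "unsupported env key: " + key;
        return false;
    }
    return true;
}

void apply_env_overrides_sketch(const std::map<std::string, std::string>& overrides) {
    // Pass 1: validate everything before touching the environment.
    std::string err;
    for (const auto& kv : overrides) {
        if (!validate_env_key_sketch(kv.first, err)) throw std::invalid_argument(err);
    }
    // Pass 2: set only variables the operator has not already set
    // (overwrite = 0), so an operator-set value wins.
    for (const auto& kv : overrides) {
        ::setenv(kv.first.c_str(), kv.second.c_str(), 0);
    }
}
```

Splitting validation and mutation into two passes is what guarantees a typo in one entry leaves every other variable untouched.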

…PU bridge Single largest remaining per-step sync hotspot identified in aiDocs/PLAN_VULKAN_NEXT_ROUNDS.md. PR tetherto#16's audit follow-up tetherto#6 (2C-lite) shipped the GPU device→device blit infrastructure (`run_text_attention_cache_gpu`) and wired g1 / g2 / g3 group attentions to use it; the front-block attn0 site was deferred because of cache-lifetime concerns at the time. Round 8 picks it up — same exact pattern as g1/g2/g3, ~30 LOC delta in one function. Eliminates 6 sync points × 5 denoise steps = 30 sync points / synth on the production path (3 GPU→host downloads + 3 host→GPU uploads of post-RoPE Q / K / raw V at the front-block attn0 site). Strict gating on `front_in_graph_rope && !include_ggml_trace && v_gpu_attn0 && k_rope_gpu_attn0` — trace mode falls back to the legacy host bridge so the trace harness still captures pre-attention Q/K/V host vectors, and legacy GGUFs without `vector_rope_theta` continue to take the host- rotate path. The blit primitive parity gate already shipped with PR tetherto#16 (`test-supertonic-graph-to-graph-blit`); round 8 extends it with explicit coverage of the front-block K / V shapes (text_len=32 and text_len=50, both bit-exact `max_abs = 0.0`). Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>

…U bridge Extends the round-8 GPU bridge pattern to the 4 style flash-attn sites (style0 + g1_style + g2_style + g3_style). Largest bandwidth-style optimisation that ships from pure-Supertonic-side code: 120 sync points / synth eliminated on the production Vulkan / OpenCL path (4× the round-8 win). - vector_res_style_qkv_result extended with `sq_gpu / sk_gpu / sv_gpu` GPU handles, populated unconditionally by `run_res_style_qkv_cache` (cheap — no GPU sync; just `ggml_graph_get_tensor` lookups). Same shape as `vector_group_graph_result::q_rope_gpu` etc from the round-1 2C-lite work. - `run_res_style_qkv_cache` host-download gating: the 3 `tensor_to_time_channel(...)` downloads of `sq` / `sk` / `sv` are now gated on `trace != nullptr`. Production path skips them entirely. Mirrors the round-1 2C-lite `need_host_qkv = (trace != nullptr)` gate. `post` stays unconditional — consumed by the next-stage `run_style_residual_cache` which still expects a host vector (cross-stage GPU bridge for `post` is deferred). - 4 dispatch sites rewired with the same gating pattern as the round-8 front-block bridge: `!include_ggml_trace && sq_gpu && sk_gpu && sv_gpu` → GPU bridge; otherwise legacy host bridge. Trace mode falls back to the legacy host bridge so the trace harness still gets all the host vectors. Strict TDD: parity test (`test-supertonic-graph-to-graph-blit`) extended with explicit style-shape coverage (`style_sq_L1` trip-wire + clarified `style0_q_rope_L20` / `style0_k_rope_kv50`) BEFORE any production wiring. All 24 / 24 parity checks pass at bit-exact `max_abs = 0.0`. Whole CPU-only `ctest -L unit` reports 21 / 21 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>

…t upload-skip After rounds 8 + 9 wired the GPU bridge for the 5 attention sites, the largest remaining per-step host upload is `text_emb` (uploaded to 4 caches × 5 denoise steps = 20 times / synth, but constant data within one synth). Round 10 generalises the F4 pointer-compare upload-skip pattern (already used for `style_v_in` / `kctx_in`) into a reusable `upload_skip_tracker` helper and applies it to the front-block + 3 group caches. CRITICAL CORRECTNESS HAZARD addressed: `text_emb` is a stack-local `std::vector<float>` in `Engine::Impl::synthesize()` (and bench loops). Modern heap allocators (jemalloc / tcmalloc / glibc) very often re-issue the SAME address for the next stack-local vector of the same size — so synth N+1 may have `text_emb.data() == synth_N.text_emb.data()` despite holding completely different data. A naive pointer-compare upload-skip would silently leak prior synth's text-encoder embedding into the next synth's GPU buffer. Mitigation: caller MUST invoke `tracker.reset()` at every synth boundary (`current_step == 0`). The CPU-only TDD test includes an explicit cross-synth pointer-reuse hazard simulation that documents the bug and verifies the reset prevents it. Per-synth wins: - 16 fewer `ggml_backend_tensor_set` host→GPU uploads per synth - ~512 KB / synth bandwidth saved at text_len=32 (linear in prompt length) Strict TDD: `test-supertonic-upload-skip-tracker` (NEW, 7 functions, 41 checks) committed first, observed to fail compile (`upload_skip_tracker was not declared`), then implementation added. Whole CPU-only `ctest -L unit` reports 22 / 22 tests, 0 failures, 0 regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
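The pointer-compare upload-skip and its reset requirement can be illustrated with a small sketch. The struct below is hypothetical (the PR's `upload_skip_tracker` interface is not shown); it only demonstrates why a reset at every synth boundary is needed when the allocator hands back the same address for new data.

```cpp
#include <cstddef>

// Hypothetical tracker: remembers the last host pointer + size uploaded
// to one GPU input tensor.
struct upload_skip_tracker_sketch {
    const void* last_ptr   = nullptr;
    std::size_t last_bytes = 0;

    // Returns true when the caller must upload, false when the previous
    // upload from the same pointer + size can be reused.
    bool needs_upload(const void* ptr, std::size_t bytes) {
        if (ptr != nullptr && ptr == last_ptr && bytes == last_bytes) {
            return false;
        }
        last_ptr   = ptr;
        last_bytes = bytes;
        return true;
    }

    // Must be called at every synth boundary: pointer equality says
    // nothing about the contents once the buffer has been rewritten
    // or its address re-issued for a new vector.
    void reset() {
        last_ptr   = nullptr;
        last_bytes = 0;
    }
};
```

Within one synth, steps 1..4 skip the upload; without the `reset()` at step 0 of the next synth, a recycled address would make the tracker skip an upload of genuinely new data.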

…PU-bridge layout fix

Critical correctness fix. Round 11 didn't add a new optimisation; it made every prior round actually run end-to-end on real hardware. Rounds 8 + 9 + 10 had all shipped CPU-only unit-test green, but the unit tests never exercised the production code path with a real GGUF carrying `vector_rope_theta`. The first end-to-end synth attempt (CPU OR Vulkan) aborted at `GGML_ASSERT(HD == n_heads * head_dim)` inside `apply_rope_to_packed_qk`, and even past that assertion every `ggml_backend_tensor_copy(q_src, q_tc_in)` on the GPU-bridge fast paths would have hit `GGML_ASSERT(ggml_are_same_layout(src, dst))` because Q/K/V matmul outputs were the byte-for-byte transpose of what the attention cache's `q_tc_in` / `k_tc_in` / `v_tc_in` tensors expect.

Root cause: `apply_rope_to_packed_qk` (PR tetherto#16 audit follow-up tetherto#5) was written under the assumption that `dense_matmul_time_ggml` returns a `ne=[HD, L]` channel-fastest-in-memory tensor. In fact the matmul (CPU `cblas_sgemm` and GPU `conv1d_f32(K=1)`) produces `ne=[L, HD]` with channel-major-flat memory, the bit-exact transpose of the helper's input contract. The CPU unit test that landed alongside the helper hand-built Q under the wrong `[HD, L]` shape, so the failure mode was invisible to CI.

The fix (strict TDD):

1. `test_supertonic_rope_packed_qk.cpp` rewritten under the production matmul shape `ne=[L, HD]` (channel-major-flat memory). Reference built in scalar `apply_rope`'s native time-major-flat layout; test verifies the helper's output bytes match bit-for-bit AND pins `y->ne[0] = HD, y->ne[1] = L` so the downstream `q_tc_in` blit cannot regress on layout. Committed RED first, observed to abort at the same assertion the production crash hits, then landing the helper fix turned it GREEN (14 / 14 checks).
2. `apply_rope_to_packed_qk` (`supertonic_internal.h`): add a head-of-pipeline `ggml_cont(ggml_transpose(q))` to flip from `ne=[L, HD]` channel-major-flat to `ne=[HD, L]` time-major-flat (which IS the layout `q_tc_in` expects). Rest of the pipeline unchanged. Output `ne=[HD, L]` time-major-flat bytes match scalar `apply_rope`'s native layout AND `q_tc_in`'s blit target bit-for-bit.
3. V (and the style sq/sk/sv) have no RoPE to mask the layout flip, so open-code the same `ggml_cont(ggml_transpose(...))` at the matmul output in `build_group_graph_cache`, `ve_front_block_proj_cache`, and `build_res_style_qkv_cache` so all four GPU-bridge attention sites get bit-for-bit matching layouts.
4. Legacy host-bridge fallbacks switched from `tensor_to_time_channel(<post-rope-or-v>)` to `tensor_raw_f32(...)`. The new graph-side layout puts the bytes already in the time-major-flat shape scalar `apply_rope` / `flash_attention_qkv` host references read, so the raw download is the correct call; `tensor_to_time_channel` would now apply the transpose-of-the-transpose and feed wrong-orientation Q/K/V into the attention silently.

Verification:

| Backend | Pre-fix | Post-fix |
|---|---|---|
| CPU | abort on first step | writes 3.89s 44.1 kHz WAV |
| Vulkan RTX 5090 | abort | writes 6.53s WAV; 44 ms / 5 steps; 74x realtime |
| Vulkan AMD RADV iGPU | abort | writes 3.64s WAV; 178 ms; 7x realtime |
| Vulkan Mesa lavapipe | abort | writes 1.21s WAV |

CPU-only `ctest -L unit`: 22 / 22 tests, 0 failures, 0 regressions. Vulkan build's `ctest` likewise 22 / 22.

The round-1..10 wins (multi-device cache, BF16 / Q8_0 K/V dispatch, native LEAKY_RELU, F16 weights deny-list, prewarm, front-block + style + group GPU bridges, text-input upload-skip) are now actually exercised end-to-end on every Vulkan adapter we have; they just couldn't run before round 11 unblocked the production path.

Co-authored-by: Cursor <cursoragent@cursor.com>
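The layout mismatch behind this fix can be illustrated on plain host memory. The sketch below assumes the usual ggml convention that `ne[0]` is the contiguous dimension; `to_time_major` is a hypothetical stand-in for what `ggml_cont(ggml_transpose(q))` produces for a contiguous 2-D F32 tensor, and it does not call ggml itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed ggml convention: ne[0] is the fastest-varying (contiguous) dimension.
//   ne=[L, HD]  -> element (time t, channel c) lives at index t + c*L
//                  (channel-major-flat: each channel's L samples are contiguous)
//   ne=[HD, L]  -> element (channel c, time t) lives at index c + t*HD
//                  (time-major-flat: each timestep's HD channels are contiguous)
//
// Hypothetical host-side equivalent of ggml_cont(ggml_transpose(q)) for a
// contiguous 2-D tensor with ne=[L, HD]; returns the ne=[HD, L] bytes.
std::vector<float> to_time_major(const std::vector<float> & src,
                                 std::size_t L, std::size_t HD) {
    assert(src.size() == L * HD);
    std::vector<float> dst(L * HD);
    for (std::size_t t = 0; t < L; ++t) {
        for (std::size_t c = 0; c < HD; ++c) {
            dst[c + t * HD] = src[t + c * L];
        }
    }
    return dst;
}
```

For `L = 3`, `HD = 2`, the channel-major buffer `{0, 1, 2, 10, 11, 12}` becomes `{0, 10, 1, 11, 2, 12}`. Applying the same flip a second time (with the dimensions swapped) returns the original bytes, which is why a leftover host-side transpose after the graph-side fix would undo it.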

nik and others added 30 commits November 10, 2025 13:02

Add seed parameter for reproducible sampling

80f95ec

- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
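The per-decoder seeding rule in this commit message can be sketched as follows. This is an illustration only: `make_decoder_rngs` is a hypothetical helper and `std::mt19937` is an assumed generator type; the commit message only states that each decoder uses `seed + decoder_index`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: one RNG per decoder, seeded with seed + decoder_index,
// so runs with the same seed are reproducible while decoders stay distinct.
std::vector<std::mt19937> make_decoder_rngs(std::uint32_t seed, int n_decoders) {
    std::vector<std::mt19937> rngs;
    rngs.reserve(static_cast<std::size_t>(n_decoders));
    for (int i = 0; i < n_decoders; ++i) {
        rngs.emplace_back(seed + static_cast<std::uint32_t>(i));
    }
    return rngs;
}
```

Two calls with the same seed yield generators in identical states, while decoder 0 and decoder 1 within one call start from different states.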

add_codeowners file

e421573

added approval check worker

962c380

Merge pull request tetherto#4 from aegioscy/master

71d3308

QVAC-7457: Add seed parameter for reproducible sampling

Merge pull request tetherto#5 from shikha-tether/add_codeowners_file

010d8f9

add_codeowners file

Merge pull request tetherto#6 from shikha-tether/add_approval_check_w…

eda3d21

…orker added approval check worker

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

06c478e

Merge pull request tetherto#7 from Proletter/DEVOPS-916_update_codeow…

65d5225

…ners DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

Add seed parameter for reproducible sampling

7ce31d4

- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0

add_codeowners file

2cc2313

added approval check worker

6befb6f

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

2a94ba2

Merge branch 'master' into rebase-v1.8.4

8519283

Merge pull request tetherto#8 from tetherto/rebase-v1.8.4

e361028

chore: rebase fork to whisper.cpp v1.8.4

feat(bci): add variable conv1 kernel size support

775d436

Read n_audio_conv1_kernel from model hparams to allow BCI models to use a non-standard first convolution kernel size. Standard whisper models default to kernel size 3. Made-with: Cursor

fix vcpkg build

e6fcbaa

Made-with: Cursor

fix apple silicon cross compile

461f07d

Made-with: Cursor

fix windows pthread

bbb3535

Made-with: Cursor

Merge pull request tetherto#10 from tetherto/feat/bci-patches-v184

2b1e04f

[BCI] QVAC-17071 feat: add BCI neural signal support (variable conv1 kernel + windowed attention)

Add parakeet-cpp port

d7ab516

Merge pull request tetherto#11 from GustavoA1604/add-parakeet-cpp

a6785de

Add parakeet-cpp: NVIDIA Parakeet ASR + Sortformer diarization in pure C++/ggml

Add tts-cpp files

ef840d5

Zbig9000 and others added 5 commits May 12, 2026 10:53

Zbig9000 requested review from a team as code owners May 12, 2026 14:09

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2 and ogad-tether May 12, 2026 15:57

GustavoA1604 and others added 12 commits May 12, 2026 18:45

Merge pull request tetherto#16 from Zbig9000/QVAC-18607-TTS-GGML-Add-…

eed9c52

…and-optimize-OpenCL-for-supertonic Qvac 18607 tts ggml add and optimize open cl for supertonic

Zbig9000 force-pushed the QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic branch from 1b710d3 to c383e70 Compare May 13, 2026 16:01

GustavoA1604 force-pushed the master branch from 6c60e4c to f5f914b Compare May 13, 2026 22:06

gianni-cor force-pushed the master branch from f8af247 to eabcf6d Compare May 28, 2026 12:36

gianni-cor requested review from a team as code owners May 28, 2026 12:36

Zbig9000 wants to merge 66 commits into tetherto:master from Zbig9000:QVAC-18605-TTS-GGML-Add-and-optimize-Vulkan-for-supertonic

Zbig9000 commented May 12, 2026 (edited)

4 participants


Summary

End-to-end validation (on real hardware)

Investigation methodology (TDD throughout)

Commit-by-commit walkthrough

33fd5c34 — Round 1: Vulkan bring-up

d080a1e4 — Pre-existing missing-include fix

e09d4278 — Round 2: capability-cache + 3 probes + prewarm

8ae15996 — Round 3: multi-device auto-pick + 2 forward-compat probes

32703fcd — Round 6: F16-weights operator deny-list

2e1c9468 — Round 4: multi-dtype K/V flash-attention dispatch

ba6d1749 — Round 7: bench observability + voice cache + Vulkan env-var passthrough

e8bbc728 — Round 8: front-block attn0 GPU bridge

df895fd6 — Round 9: style flash-attn GPU bridge

358d7aa8 — Round 10: per-step text-input upload-skip

c383e70d — Round 11: Packed-QK RoPE + GPU-bridge layout fix ⚡ CRITICAL CORRECTNESS

Test plan

Smoke testing the CLIs

Backwards compatibility

File-by-file change summary

Deferred follow-ups (intentionally out of scope; pre-existing on master)

Linked

