QVAC-19252 tts-cpp: don't block-quantize S3Gen spk_embed_affine (fixed sub-f16 NaN audio) by pratiknarola-t · Pull Request #37 · tetherto/qvac-ext-lib-whisper.cpp

pratiknarola-t · 2026-06-01T09:12:23Z

flow/spk_embed_affine/w was eligible for block quantization (2-D, ne[0]=192 divisible by 32, not deny-listed) but is read back as raw F32 by cached_cpu_weights_f32() in the S3Gen synth path. At q8_0/q5_0/q4_0 the quantized bytes were reinterpreted as IEEE floats -> NaN speaker embedding -> all-NaN mel -> noise. Backend-independent; hit every sub-f16 level.
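The failure mode above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: it assumes the standard ggml q8_0 block layout (one fp16 scale followed by 32 int8 quants, 34 bytes per block), and the helper names and the hand-picked weight values are hypothetical. The values are chosen so that one 4-byte group of quants lands on a float32 NaN bit pattern when the block is read as raw F32.

```python
import math
import struct

QK8_0 = 32  # elements per q8_0 block (assumed ggml layout)


def quantize_q8_0_block(values):
    """Pack 32 floats as one q8_0-style block: fp16 scale, then 32 int8 quants."""
    if len(values) != QK8_0:
        raise ValueError("a q8_0 block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    quants = [round(v / d) if d else 0 for v in values]
    return struct.pack("<e32b", d, *quants)


def dequantize_q8_0_block(block):
    """Correct read path: multiply each int8 quant by the block scale."""
    d, *quants = struct.unpack("<e32b", block)
    return [d * q for q in quants]


def misread_as_f32(block, count):
    """Buggy read path: reinterpret the first 4*count bytes as little-endian float32."""
    return list(struct.unpack("<%df" % count, block[:4 * count]))


# Hypothetical weights; with amax = 127 the scale is exactly 1.0.
weights = [0.0] * QK8_0
weights[0] = 127.0
weights[2:6] = [3.0, 7.0, -5.0, -1.0]

block = quantize_q8_0_block(weights)
correct = dequantize_q8_0_block(block)
misread = misread_as_f32(block, 8)

print(len(block))    # bytes in one block
print(correct[:6])   # dequantized values match the input
print(misread[:2])   # raw reinterpretation: garbage, including a NaN
```

The quants 3, 7, -5, -1 occupy bytes 4 to 7 of the block, which a raw float32 read sees as the bit pattern 0xFFFB0703: all exponent bits set and a nonzero mantissa, i.e. NaN. Real weights will not hit NaN in every element, but a single NaN in the speaker embedding is enough to poison everything downstream.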

Fix: add "flow/spk_embed_affine" to _DENY_SUBSTRINGS in requantize-gguf.py (the single source of truth shared by both converters and the offline requantizer), so the affine weights stay F32. Surgical: exactly one fewer tensor block-quantized; everything else is unchanged (encoder_proj stays quantized -- it is consumed via ggml_mul_mat and dequantized correctly).
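As a rough illustration of how a substring deny-list gates quantization, here is a sketch under stated assumptions. Only the `_DENY_SUBSTRINGS` name and the "flow/spk_embed_affine" entry come from this PR; the function `is_block_quantizable`, the tensor name "flow/encoder_proj/w", and the second dimension of each shape are hypothetical, and any other entries in the real list are not shown here.

```python
# Sketch of the eligibility rule described above: 2-D, ne[0] divisible by 32,
# and not matched by any deny-list substring.
QK = 32  # block size shared by q8_0 / q5_0 / q4_0

# Only the entry added by this PR is listed; other entries are omitted.
_DENY_SUBSTRINGS = [
    "flow/spk_embed_affine",
]


def is_block_quantizable(name, ne):
    """Return True if a tensor with this name and shape would be block-quantized."""
    if len(ne) != 2:
        return False  # only 2-D tensors are candidates
    if ne[0] % QK != 0:
        return False  # rows must split into whole blocks
    return not any(sub in name for sub in _DENY_SUBSTRINGS)


# Shapes are illustrative except ne[0] = 192, which is stated in the PR.
print(is_block_quantizable("flow/spk_embed_affine/w", (192, 80)))  # False
print(is_block_quantizable("flow/encoder_proj/w", (512, 80)))      # True
```

Matching on the substring rather than the full tensor name means every tensor under that prefix is kept out of the quant set.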

Harden: cached_cpu_weights_f32() now throws if handed a non-F32 tensor, naming it, so any future raw-F32-read weight that slips into the quant set fails loudly at load time instead of emitting silent NaN audio.

Verified on Turbo S3Gen (host CPU, full text->speech): pre-fix q8 output is noise (cosine similarity 0.003 vs f16), fixed q8 output is clean (cosine similarity 0.990 vs f16); the guard trips on the pre-fix GGUF and names flow/spk_embed_affine/w.
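For reference, a comparison of this kind can be sketched as below. This is an assumption about the method, since the PR does not say whether the similarity was computed on waveforms or mel frames; the function name and the synthetic signals are hypothetical stand-ins for real f16 and quantized outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sample sequences, truncated to the shorter length."""
    n = min(len(a), len(b))
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    norm_a = math.sqrt(sum(x * x for x in a[:n]))
    norm_b = math.sqrt(sum(y * y for y in b[:n]))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # silent signal: define similarity as 0
    return dot / (norm_a * norm_b)


# Synthetic stand-ins: a reference tone, a slightly perturbed copy,
# and an unrelated (orthogonal) signal.
N = 1000
reference = [math.sin(2 * math.pi * 10 * i / N) for i in range(N)]
unrelated = [math.cos(2 * math.pi * 10 * i / N) for i in range(N)]
close = [r + 0.05 * u for r, u in zip(reference, unrelated)]

print(cosine_similarity(reference, close))      # near 1: same content
print(cosine_similarity(reference, unrelated))  # near 0: unrelated content
```

A value near 1 means the quantized output tracks the f16 reference; a value near 0 means the two signals share essentially nothing, which is what noise looks like under this metric.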

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1215009851268640


pratiknarola-t requested review from a team as code owners June 1, 2026 09:12

Zbig9000 approved these changes Jun 1, 2026

pratiknarola-t wants to merge 1 commit into master from QVAC-19252-s3gen-spk-affine-quant-fix
