QVAC-19252 tts-cpp: don't block-quantize S3Gen spk_embed_affine (fixed sub-f16 NaN audio) #37

Open
pratiknarola-t wants to merge 1 commit into master from QVAC-19252-s3gen-spk-affine-quant-fix

pratiknarola-t commented Jun 1, 2026

flow/spk_embed_affine/w was eligible for block quantization (2-D, ne[0]=192 divisible by 32, not deny-listed), but the S3Gen synth path reads it back as raw F32 via cached_cpu_weights_f32(). At q8_0/q5_0/q4_0 the quantized bytes were reinterpreted as IEEE floats -> NaN speaker embedding -> all-NaN mel -> noise. Backend-independent; affects every sub-f16 quantization level.
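
To illustrate the failure mode, here is a minimal standalone sketch, not the tts-cpp code. It assumes a simplified q8_0-style block (one fp16 scale followed by 32 int8 quants, 34 bytes) and uses Python's round(), which may differ from the real quantizer's rounding. The helper names and the sample weights are hypothetical. It packs one block, then reads the same bytes both correctly (dequantize) and incorrectly (as packed float32):

```python
import math
import struct

QK8_0 = 32  # elements per block in this simplified q8_0-style layout


def quantize_q8_0_block(values):
    """Pack 32 floats as one block: fp16 scale followed by 32 int8 quants."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0
    quants = [max(-127, min(127, round(v * inv))) for v in values]
    return struct.pack("<e32b", scale, *quants)  # 2 + 32 = 34 bytes


def dequantize_q8_0_block(block):
    """Correct read: decode the scale, then multiply each int8 quant by it."""
    scale, *quants = struct.unpack("<e32b", block)
    return [scale * q for q in quants]


def misread_as_f32(block):
    """Buggy read: treat the same bytes as packed little-endian float32."""
    n = len(block) // 4
    return list(struct.unpack(f"<{n}f", block[: 4 * n]))


# Hypothetical weights: one large value, the rest small and negative.
weights = [1.27] + [-0.01] * 31
block = quantize_q8_0_block(weights)
good = dequantize_q8_0_block(block)
bad = misread_as_f32(block)

print("max dequantization error:", max(abs(g - w) for g, w in zip(good, weights)))
print("NaNs in misread values:", sum(math.isnan(x) for x in bad), "of", len(bad))
```

Small negative quants such as -1 are stored as 0xFF bytes, and any float32 whose bytes are all 0xFF is a NaN, so NaNs come up readily when quantized data is misread this way.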

Fix: add "flow/spk_embed_affine" to _DENY_SUBSTRINGS in requantize-gguf.py (the single source of truth shared by both converters and the offline requantizer), so the affine weights stay F32. Surgical: exactly one fewer tensor block-quantized; everything else is unchanged (encoder_proj stays quantized -- it is consumed via ggml_mul_mat and dequantized correctly).
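
A rough sketch of the eligibility rule described above, for readers unfamiliar with the requantizer. This is not the actual requantize-gguf.py code: the function name, the one-entry deny list, the ne[1] value, and the full encoder_proj tensor name are placeholders. Only the three checks (2-D, ne[0] divisible by 32, not deny-listed) and the new deny entry come from this PR.

```python
QK = 32  # block size used for the divisibility check cited in this PR

# Placeholder deny list: only the entry added by this PR is known here.
DENY_SUBSTRINGS = ("flow/spk_embed_affine",)


def is_block_quantizable(name, ne, deny=DENY_SUBSTRINGS):
    """Return True if a tensor passes the three checks named in the PR text."""
    if len(ne) != 2:       # only 2-D tensors are candidates
        return False
    if ne[0] % QK != 0:    # each row must be a whole number of blocks
        return False
    return not any(s in name for s in deny)


# ne[1]=80 and the encoder_proj tensor name are placeholders for illustration.
print(is_block_quantizable("flow/spk_embed_affine/w", (192, 80), deny=()))  # pre-fix rule
print(is_block_quantizable("flow/spk_embed_affine/w", (192, 80)))           # with the new entry
print(is_block_quantizable("flow/encoder_proj/w", (192, 80)))               # still eligible
```

Because the match is by substring, one entry covers every tensor under flow/spk_embed_affine without listing each suffix.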

Harden: cached_cpu_weights_f32() now throws if handed a non-F32 tensor, naming it, so any future raw-F32-read weight that slips into the quant set fails loudly at load time instead of emitting silent NaN audio.
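
The guard is implemented in the C++ synth path. The Python analogue below only sketches the idea (check the stored type before reinterpreting bytes, and name the offending tensor). The Tensor class, the dtype strings, and read_weights_f32 are hypothetical stand-ins, not the real API.

```python
import struct
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for a loaded GGUF tensor."""
    name: str
    dtype: str   # e.g. "F32", "Q8_0"
    data: bytes


def read_weights_f32(tensor):
    """Return the payload as floats, refusing anything not stored as F32."""
    if tensor.dtype != "F32":
        raise ValueError(
            f"tensor '{tensor.name}' is {tensor.dtype}, expected F32; "
            "it must be excluded from block quantization"
        )
    n = len(tensor.data) // 4
    return list(struct.unpack(f"<{n}f", tensor.data[: 4 * n]))


ok = Tensor("flow/spk_embed_affine/w", "F32", struct.pack("<2f", 0.5, -0.25))
print(read_weights_f32(ok))

bad = Tensor("flow/spk_embed_affine/w", "Q8_0", bytes(34))
try:
    read_weights_f32(bad)
except ValueError as err:
    print("load failed:", err)
```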

Verified on Turbo S3Gen (host CPU, full text->speech): pre-fix q8 = noise (cos 0.003 vs f16), fixed q8 = clean (cos 0.990 vs f16); the guard trips on the pre-fix GGUF, naming flow/spk_embed_affine/w.
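
For reference, a sketch of the comparison metric, assuming "cos" above means plain cosine similarity between the quantized and f16 outputs. Which signal was compared (waveform or mel) is not stated here, so the function below is a generic illustration with made-up inputs, not the actual verification script.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity over the common prefix of two sample sequences."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for silent signals; report 0 by convention
    return dot / (norm_a * norm_b)


# Made-up signals: a reference, a slightly perturbed copy, an unrelated one.
reference = [math.sin(0.1 * i) for i in range(1000)]
close = [x + 0.01 * math.cos(0.37 * i) for i, x in enumerate(reference)]
unrelated = [math.sin(1.7 * i + 0.5) for i in range(1000)]

print("close vs reference:", cosine_similarity(close, reference))
print("unrelated vs reference:", cosine_similarity(unrelated, reference))
```

A value near 1 means the two outputs have nearly the same shape up to scale; a value near 0 means they are essentially uncorrelated.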


pratiknarola-t requested review from a team as code owners on June 1, 2026, 09:12