QVAC-19252 tts-cpp: don't block-quantize S3Gen spk_embed_affine (fixed sub-f16 NaN audio) #37
Open
pratiknarola-t wants to merge 1 commit into
Zbig9000
approved these changes
Jun 1, 2026
flow/spk_embed_affine/w was eligible for block quantization (2-D, ne[0]=192 divisible by 32, not deny-listed), but the S3Gen synth path reads it back as raw F32 via cached_cpu_weights_f32(). At q8_0/q5_0/q4_0 the quantized bytes were reinterpreted as IEEE floats -> NaN speaker embedding -> all-NaN mel -> noise. The bug is backend-independent and hit every sub-f16 level.
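To illustrate the failure mode, here is a minimal standalone sketch (not code from this PR). It packs 32 values as one ggml q8_0 block (an f16 scale followed by 32 int8 quants, 34 bytes) and then reads those bytes back as float32 without dequantizing. The helper names and the sample weights are hypothetical, chosen so that runs of 0xFF quant bytes land on NaN bit patterns.

```python
import math
import struct

QK8_0 = 32  # values per q8_0 block in ggml


def quantize_q8_0_block(values):
    """Pack 32 floats as one q8_0 block: f16 scale, then 32 int8 quants."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax else 0.0
    quants = [round(v / scale) if scale else 0 for v in values]
    return struct.pack("<e32b", scale, *quants)


def misread_as_f32(raw):
    """Reinterpret raw bytes as little-endian float32, with no dequantization."""
    usable = len(raw) - len(raw) % 4  # drop trailing bytes that do not fill a float
    return list(struct.unpack(f"<{usable // 4}f", raw[:usable]))


# Hypothetical weights: one large value sets the scale to 1.0, the rest quantize to -1 (0xFF).
weights = [127.0] + [-1.0] * 31
block = quantize_q8_0_block(weights)
garbage = misread_as_f32(block)
nan_count = sum(math.isnan(x) for x in garbage)
print(f"{len(block)} block bytes -> {len(garbage)} bogus floats, {nan_count} of them NaN")
```

Real weights will not produce NaN in every slot, but any slot whose reinterpreted exponent bits are all ones does, and a single NaN in the speaker embedding is enough to poison everything downstream.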
Fix: add "flow/spk_embed_affine" to _DENY_SUBSTRINGS in requantize-gguf.py (the single source of truth shared by both converters and the offline requantizer), so the affine weights stay F32. Surgical: exactly one fewer tensor is block-quantized; everything else is unchanged (encoder_proj stays quantized -- it is consumed via ggml_mul_mat and dequantized correctly).
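A rough sketch of how the eligibility rule and the deny-list interact, based only on the conditions named above (2-D, ne[0] divisible by 32, not deny-listed). The function name is_block_quantizable, the second dimension of the affine weight, and the encoder_proj tensor name and shape are assumptions for illustration; the real _DENY_SUBSTRINGS in requantize-gguf.py contains other entries that are not shown here.

```python
QUANT_BLOCK = 32  # block size of q8_0 / q5_0 / q4_0

# Only the entry added by this PR; the real list has other entries not reproduced here.
_DENY_SUBSTRINGS = ("flow/spk_embed_affine",)


def is_block_quantizable(name, ne):
    """Hypothetical eligibility check; ne is the shape in ggml order (ne[0] innermost)."""
    if len(ne) != 2:
        return False  # only 2-D tensors are block-quantized
    if ne[0] % QUANT_BLOCK != 0:
        return False  # rows must split evenly into 32-value blocks
    # A substring match protects every tensor under the denied prefix.
    return not any(s in name for s in _DENY_SUBSTRINGS)


# ne[0]=192 is from the PR text; the second dimension (80) is an assumed placeholder.
print(is_block_quantizable("flow/spk_embed_affine/w", (192, 80)))
# Hypothetical name and shape for a weight that stays quantized.
print(is_block_quantizable("flow/encoder_proj/w", (512, 80)))
```

Matching on a substring rather than the full tensor name means that any other tensor stored under flow/spk_embed_affine is kept out of the quant set as well.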
Harden: cached_cpu_weights_f32() now throws if handed a non-F32 tensor, naming it, so any future raw-F32-read weight that slips into the quant set fails loudly at load time instead of emitting silent NaN audio.
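The guard's logic, sketched in Python for consistency with the other examples here. The real cached_cpu_weights_f32() is C++ and its signature is not shown in this PR, so the function name read_weights_f32, its parameters, and the error text below are all assumptions.

```python
import struct


def read_weights_f32(name, dtype, raw):
    """Hypothetical analog of the guard: refuse to reinterpret non-F32 bytes as floats."""
    if dtype != "F32":
        # Fail at load time and name the offending tensor, instead of returning garbage.
        raise TypeError(
            f"{name}: expected F32 tensor for raw read, got {dtype}; "
            "add it to the quantization deny-list"
        )
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


try:
    read_weights_f32("flow/spk_embed_affine/w", "Q8_0", b"\x00" * 34)
except TypeError as err:
    print(err)
```

The check is a single type comparison made before any bytes are read, so a mis-quantized tensor is caught when the model loads, not later in the audio output.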
Verified on Turbo S3Gen (host CPU, full text->speech): pre-fix q8 = noise (cos 0.003 vs f16), fixed q8 = clean (cos 0.990 vs f16); the guard trips on the pre-fix GGUF, naming flow/spk_embed_affine/w.
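For reference, a minimal sketch of the kind of comparison behind the "cos" figures, assuming they are cosine similarity between the quantized model's output and the f16 reference. The PR does not say which signal was compared (waveform or mel) or how it was aligned, so the function below is illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sample sequences, truncated to the shorter length."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat silence as uncorrelated rather than dividing by zero
    return dot / (norm_a * norm_b)


# Toy vectors only; real inputs would be the decoded audio of the two runs.
reference = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
candidate = [0.0, 0.4, 0.9, 0.6, 0.1, -0.5]
print(round(cosine_similarity(reference, candidate), 3))
```

A value near 1 means the two outputs have nearly the same shape, while a value near 0 means they are essentially unrelated, which is consistent with the pre-fix q8 output being noise.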