Fix TurboQuant centroid initialization wasting most codes at high dimensions #8076
mprammer wants to merge 1 commit into
…ensions

Seed Lloyd-Max centroids on ±sqrt(bit_width)·sigma instead of the full support [-1, 1], so they start where the rotated-coordinate marginal has mass and no cell freezes in the zero-mass tails. The same change lands in both centroid implementations (vortex-tensor and vortex-turboquant), kept in sync by the cross-crate parity test, with a regression test and an ignored sweep harness.

Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
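To make the "zero-mass tails" point concrete, here is a minimal sketch that counts how many seeds land within a few sigma of zero under each seeding. It is not the Vortex code: the function names, the cell-midpoint seed layout, and the choice of a `k`-sigma window are all assumptions made for illustration. Only `sigma = 1/sqrt(dimension)` and the `sqrt(bit_width) * sigma` half-width come from the PR text.

```rust
/// Standard deviation of one rotated coordinate, as stated in the PR.
fn sigma(dimension: usize) -> f64 {
    1.0 / (dimension as f64).sqrt()
}

/// `n` evenly spaced seeds (cell midpoints) across [-half_width, half_width].
/// The midpoint layout is an assumption, not the actual Vortex seeding code.
fn uniform_seeds(n: usize, half_width: f64) -> Vec<f64> {
    (0..n)
        .map(|i| half_width * (-1.0 + (2 * i + 1) as f64 / n as f64))
        .collect()
}

/// Number of seeds that lie within `k` standard deviations of zero.
fn seeds_within(seeds: &[f64], k: f64, sigma: f64) -> usize {
    seeds.iter().filter(|s| s.abs() <= k * sigma).count()
}

/// Hypothetical helper: (old, new) counts of seeds within `k` sigma of zero,
/// where "old" spans [-1, 1] and "new" spans +-sqrt(bit_width) * sigma.
fn compare_seedings(dimension: usize, bit_width: u32, k: f64) -> (usize, usize) {
    let n = 1usize << bit_width;
    let s = sigma(dimension);
    let old = uniform_seeds(n, 1.0);
    let new = uniform_seeds(n, (bit_width as f64).sqrt() * s);
    (seeds_within(&old, k, s), seeds_within(&new, k, s))
}
```

Under these assumptions, `compare_seedings(1024, 8, 6.0)` reports that only 48 of the 256 full-support seeds fall within 6 sigma of zero, while all 256 of the rescaled seeds do.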
If we are going to do this, can this also figure out better boundaries? These are initialization changes, but we should also change the boundaries from [-1, 1] to something like [-10.5σ, 10.5σ] (following the vLLM impl).
We checked this — the outer boundary is inert. Holding the init fixed (3σ here, to isolate the boundary) and sweeping the edge over ±1 (support), ±9σ, and ±10.5σ gives identical distortion (d=1024, error as a multiple of the Theorem 1 bound, lower is better):
The marginal sits within a few σ of 0, so the outermost cell's conditional mean is set by the mass near 0 no matter where the far boundary lands, independent of the init. (At d=128, ±1 is already ≈11σ; at higher d the extra span carries no mass.) We left the boundaries at [-1, 1] to keep the diff minimal, but it doesn't matter either way: the init spread is the only lever that moves distortion, which is what this PR changes. The full sweep, including the vLLM 3.5σ/10.5σ config, is reproducible through the ignored …
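As a quick check of the "≈11σ" figure, using the standard deviation 1/sqrt(d) quoted in the PR description (the d = 1024 line is added here only for comparison):

$$
\sigma = \frac{1}{\sqrt{d}} \quad\Longrightarrow\quad 1 = \sqrt{d}\,\sigma
$$

$$
d = 128:\quad 1 = \sqrt{128}\,\sigma \approx 11.3\,\sigma
\qquad\qquad
d = 1024:\quad 1 = \sqrt{1024}\,\sigma = 32\,\sigma
$$

So at d = 1024 a boundary at 10.5σ would sit at 10.5/32 ≈ 0.33, well inside [-1, 1].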
In that case, I want a proof of why sqrt(bit_width) * sigma is optimal (either as a formal mathematical proof or numerically with a chart). It is still super unclear to me why changing the initial parameters has such a massive effect on the distortion, and it is further unclear why this is supposed to be "ideal".
@claude give this a detailed review, paying special attention to the accuracy and precision of this method.
Claude finished @danking's task in 5m 59s
PR Review: Fix TurboQuant centroid initialization
Summary
The fix is mathematically sound and correct as a fix for the stated bug. Seeding centroids in …
1. Mathematical Correctness of the Core Fix ✅
The key relationships are correct: …
danking left a comment
I believe this is a correct improvement to TurboQuant but it changes the interpretation of already written vectors. I think this is OK as long as no releases of Vortex have been made since the original TQ PR merged.
Here's my reasoning for approving:
I also think Claude is wrong to bring up sqrt(2 ln N). AFAICT, what it is saying is true for the Normal distribution. In particular, you need to truncate the Normal distribution's support from (-inf, inf) to something reasonable. Claude may be correct that setting the boundaries to sqrt(2 ln N) is "optimal" under some settings for the Normal; however, that's a distraction. We've already chosen the interval: [-1, 1]. As I describe above, we're just trying to choose a decent initialization. And as Martin points out, the optimizer gets stuck if the initial points land in a "too flat" section of the distribution.
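The "optimizer gets stuck" behaviour can be reproduced with a toy Lloyd-Max step. This is only a sketch under stated assumptions, not the Vortex implementation: it works on a handful of weighted sample points rather than the real marginal density, and the function name, the half-open cell convention, and the `mass > 0.0` guard are all illustrative choices.

```rust
/// One Lloyd-Max iteration over weighted sample points on [-1, 1].
/// `centroids` must be sorted ascending; `points` holds (x, weight) pairs.
/// Illustrative only: a toy stand-in for a density-based update.
fn lloyd_max_step(centroids: &[f64], points: &[(f64, f64)]) -> Vec<f64> {
    let n = centroids.len();
    // Cell edges: fixed outer boundaries at -1 and 1, midpoints in between.
    let mut edges = Vec::with_capacity(n + 1);
    edges.push(-1.0);
    for pair in centroids.windows(2) {
        edges.push(0.5 * (pair[0] + pair[1]));
    }
    edges.push(1.0);

    (0..n)
        .map(|i| {
            let (lo, hi) = (edges[i], edges[i + 1]);
            let last = i + 1 == n;
            let (mut mass, mut moment) = (0.0, 0.0);
            for &(x, w) in points {
                // Half-open cells, except the last one, which includes 1.0.
                if x >= lo && (x < hi || (last && x <= hi)) {
                    mass += w;
                    moment += w * x;
                }
            }
            // Zero-denominator guard: an empty cell keeps its old centroid.
            if mass > 0.0 { moment / mass } else { centroids[i] }
        })
        .collect()
}
```

With all the weight concentrated near zero, for example points `(-0.1, 1.0)`, `(0.0, 2.0)`, `(0.1, 1.0)` and seeds `[-0.75, -0.25, 0.25, 0.75]`, the two outer cells never contain any mass, so the guard leaves those centroids at ±0.75 on every iteration while the inner two move to the data.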
I'd like to have a different PR that only changes the centroid initialization, rather than one that also has this init spread thing that I still don't understand.
Closing in favor of #8116. I don't understand the … Regardless, it is in the old vortex-tensor implementation only, so merging this wouldn't do anything anyway.
## Summary

Tracking issue: #7830

Followup of #8076

Changes the centroid initialization for `vortex-turboquant` to use `+- sqrt(bit_width) * sigma` instead of just [-1, 1]

## Testing

```sh
uv run benchmarks/vector-search-bench/scripts/plot-turboquant-distortion.py \
  --dataset cohere-small-100k:single \
  --dataset openai-medium-500k:single \
  --dataset bioasq-medium-1m \
  --dataset glove-small-100k \
  --dataset gist-small-100k \
  --dataset sift-small-500k \
  --output ~/Downloads/distortion-sqrt-sigma.png
```

<img width="2816" height="950" alt="distortion-3-sigma" src="https://github.com/user-attachments/assets/cb2db4eb-13a7-4083-949b-ee7f04ab7428" />

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
TurboQuant's Lloyd–Max scalar quantizer seeded its centroids uniformly across the full support `[-1, 1]`, but after rotation each coordinate's marginal concentrates around 0 with standard deviation `1/sqrt(dimension)`. For `dimension >= 256` most seeds start in the near-zero-mass tails, where the conditional-mean update's zero-denominator guard freezes them on the first iteration, so a large fraction of codes end up quantizing coordinate values that never occur. At 8 bits and `dimension = 1024` that left reconstruction error at ~40x the paper's Theorem 1 bound and stopped it falling as bits increased. The fix seeds centroids on `±sqrt(bit_width) · sigma`, where the mass actually is, so no code freezes.

The seed scales with `sqrt(bit_width)` rather than a fixed multiple of sigma because the optimal spread grows with codebook size: a constant tight enough to be optimal at low bit widths under-covers the tail at high ones, and a constant wide enough for high bit widths wastes resolution lower down. So any single constant is dominated (2.5σ is best in the mid range, 3.0σ at the top, neither everywhere), while `sqrt(bit_width)·sigma` ties the best constant at every bit width and is strictly best at 8 bits. Reconstruction error as a multiple of the Theorem 1 bound at `dimension = 1024`, lower is better (the pattern holds across dimensions):

[Table not recovered; surviving column headers: `[-1,1]` (before), `√bits·σ` (this PR)]

The same seed change lands in both Lloyd–Max copies (`vortex-tensor` and `vortex-turboquant`), kept in lockstep by a cross-crate parity test, with a regression test asserting no code freezes across the dimension/bit grid and an `#[ignore]`d `sweep_centroid_init` that reproduces the table. Note the codebook itself changes: `vortex-tensor` arrays store their centroids inline so existing files are unaffected, but the `vortex-turboquant` extension path derives centroids at decode, so data written through it will dequantize against the new codebook.

🤖 Generated with Claude Code