[diskann-garnet] Implement BIN and Q8 quantizers #1050
Conversation
Pull request overview
This PR adds support for BIN (spherical 1-bit) and Q8 (MinMax 8-bit) quantization in diskann-garnet, introducing a type-erased quantizer interface and a “dynamic” accessor/strategy that can switch between full-precision and quantized operation as the quantizer becomes available. It also extends the vectorset tool to allow specifying quantization and distance metric during ingestion, and adds RFC documentation describing the bootstrap/backfill approach.
Changes:
- Add new `diskann-garnet` quantizer implementations (BIN + Q8) behind a new type-erased `GarnetQuantizer` trait, plus FFI entrypoints for training/backfill.
- Rework the `diskann-garnet` provider/accessor/strategy to support dynamic switching and reranking behavior.
- Update tooling/docs: `vectorset` CLI options for quantizer/metric, new RFC, and small provider/common helpers.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| vectorset/src/main.rs | Adds CLI flags for quantizer/metric and forwards them into VADD ingestion. |
| vectorset/src/loader.rs | Refactors dataset loader locking/state handling and removes unused fields. |
| rfcs/00000-quantizer-bootstrap.md | Adds an RFC describing quantization bootstrapping/backfill phases and provider responsibilities. |
| diskann-providers/src/common/mod.rs | Re-exports MinMax distance helper types used by Garnet quantization. |
| diskann-providers/src/common/minmax_repr.rs | Adds from_bytes convenience for interpreting raw bytes as MinMax elements. |
| diskann-garnet/src/test_utils.rs | Updates tests to match the new GarnetProvider::new signature (quant type parameter). |
| diskann-garnet/src/quantization.rs | New quantizer implementations and type-erased distance/query computer adapters. |
| diskann-garnet/src/provider.rs | Major provider/accessor/search strategy updates for dynamic quantization, training readiness, and backfill. |
| diskann-garnet/src/lib.rs | Updates index creation for quant types, changes insert() return type, and adds FFI for training/backfill. |
| diskann-garnet/src/garnet.rs | Adds (currently unused) exists_* helpers with explicit dead_code expectations. |
| diskann-garnet/src/fsm.rs | Adds visit_used, reuse locking, and total_used tracking to support safe backfill. |
| diskann-garnet/src/dyn_index.rs | Switches type-erased index operations to use the new dynamic strategy and exposes train/backfill hooks. |
| diskann-garnet/src/alloc.rs | Makes AlignToEight public for cross-module aligned byte container usage. |
| diskann-garnet/Cargo.toml | Adds rand and adjusts tokio features for new code usage. |
| Cargo.lock | Locks new dependency resolution (notably rand). |
hildebrandmw left a comment
Thanks Jack! It will be great to have quantization land in diskann-garnet.
After seeing the 1000-foot view, the things that stand out to me are:
- The somewhat awkward representation of the quantizer (`Option<Box<dyn>>` in `GarnetProvider` with `RwLock<Option<_>>` as the implementation).
- Heavy use of conditionals throughout vector retrieval checking for quantization, especially with respect to the start point cache. This also necessitated the working-set check and drain if we switched to quantization part-way through insert.
- Race conditions in the free space map and quantization.
One idea I had regarding the first two points is to lean more heavily into trait objects with something like:

```rust
trait Vector {
    // The number of bytes needed per vector.
    fn bytes(&self) -> usize;
    fn is_quantized(&self) -> bool;
    // The term to provide to `Garnet` for data retrieval.
    fn term(&self) -> Term;
    fn cached(&self, id: u32) -> Option<&[u8]>;
    fn distance_computer(&self) -> Box<dyn _>;
    fn query_computer(&self, ...) -> Box<dyn _>;
}
```

The idea being that both full-precision and quantized vectors can live behind the same facade. Then in the Provider, we can have:
```rust
// Naming and exact structure are just meant to describe the idea
struct GarnetProvider {
    primary: Arc<dyn Vector>,
    quant: Option<Arc<dyn Vector>>,
    ...
}
```
The Accessor can then be
```rust
struct Accessor<'a> {
    provider: &'a GarnetProvider,
    vectors: &'a dyn Vector,
}
```
Most of the non-reranking vector needs would then go through `Accessor::vectors`. When no quantization is active, `vectors = GarnetProvider::primary`; otherwise, `vectors = GarnetProvider::quant`. Then all the vector retrieval code stays pretty much the same without needing to constantly branch on whether or not the vectors are quantized. Further, the issue with the working set goes away, because vectors that start their insert as full-precision stay full-precision for the duration of the insert.
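If it helps to make the idea concrete, here is a minimal, compressed sketch of how an accessor constructor could select its vector source under that design (the `accessor()` method and the trimmed-down trait are hypothetical, not code from this PR):

```rust
use std::sync::Arc;

// Hypothetical, compressed version of the facade sketched above.
trait Vector {
    fn bytes(&self) -> usize;
    fn is_quantized(&self) -> bool;
    fn cached(&self, id: u32) -> Option<&[u8]>;
}

struct GarnetProvider {
    primary: Arc<dyn Vector>,
    quant: Option<Arc<dyn Vector>>,
}

struct Accessor<'a> {
    provider: &'a GarnetProvider,
    vectors: &'a dyn Vector,
}

impl GarnetProvider {
    // Pick the vector source once; everything downstream goes through
    // `Accessor::vectors` and never branches on quantization state again.
    fn accessor(&self) -> Accessor<'_> {
        let vectors: &dyn Vector = match &self.quant {
            Some(quant) => quant.as_ref(),
            None => self.primary.as_ref(),
        };
        Accessor { provider: self, vectors }
    }
}
```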
```rust
/// to return the correct status at the end of insert() since the provider code that becomes
/// aware of the state change cannot directly communicate it back to the caller.
pub static QUANTIZER_READY: AtomicBool = const { AtomicBool::new(false) };
}
```
Can we rely on a thread-local for this? What if the same thread is being used to search multiple indexes?
Could this be stored in the Context instead?
I could store it in context if I add an internal field to context. I suppose that is slightly nicer.

I'm not sure what issue you are envisioning with the threads. The check is between the guts of insert and the return of insert. There's no way for another chunk of code to run on our thread in the meantime. If something were to happen, the bad behavior is limited to returning `SUCCESS_TRAINING_READY`, which will cause garnet to dispatch `train_quantizer()`. That getting called a second time is fine, because `train_quantizer` has detection for whether training has already started and exits if that's the case.
What happens, though, if there are multiple indexes? One has had quantization bootstrapped and the other has not. If a thread inserts on the bootstrapped index and then runs on the non-bootstrapped index, the thread-local will still be set to quantized. Does garnet not reuse threads for multiple indexes?
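As one possible alternative (purely illustrative, not code from this PR), the readiness flag could live on per-index state instead of in a thread-local, so a thread moving between indexes cannot carry the flag across:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical per-index flag: owned by the index/provider, not the thread.
struct QuantizerReadiness {
    ready: AtomicBool,
}

impl QuantizerReadiness {
    // Called from deep inside insert() when enough vectors have accumulated.
    fn mark_ready(&self) {
        self.ready.store(true, Ordering::Release);
    }

    // Read-and-clear at the end of insert() so the status is reported once,
    // and only for the index it belongs to.
    fn take_ready(&self) -> bool {
        self.ready.swap(false, Ordering::AcqRel)
    }
}
```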
```rust
VectorQuantType::NoQuant | VectorQuantType::XPreQ8 => (None, 0, false),
VectorQuantType::Invalid => return Err(GarnetProviderError::InvalidQuantizer),
VectorQuantType::Q8 => {
    if TypeId::of::<T>() != TypeId::of::<f32>() {
```
I don't think relying on TypeId is a good idea. It feels like ad-hoc specialization and can cause issues if we add f16 support in the future. I understand that the Redis API doesn't support f16, but it has historically proved itself to be useful. Can this be packaged behind a trait? Or use VectorRepr::as_f32?
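For illustration only (none of this is the PR's actual API), the same dispatch could hang off a small conversion trait instead of `TypeId`, which keeps the door open for `f16`:

```rust
// Hypothetical trait-based alternative to the TypeId check. Each element type
// knows how to widen itself to f32; adding f16 later (e.g. via the `half`
// crate) would just be another impl rather than another TypeId branch.
trait ToF32: Copy {
    fn to_f32(self) -> f32;
}

impl ToF32 for f32 {
    fn to_f32(self) -> f32 {
        self
    }
}

impl ToF32 for f64 {
    fn to_f32(self) -> f32 {
        self as f32
    }
}

// Quantizer training can then consume any supported element type uniformly.
fn widen_into<T: ToF32>(src: &[T], dst: &mut [f32]) {
    for (d, s) in dst.iter_mut().zip(src.iter().copied()) {
        *d = s.to_f32();
    }
}
```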
```rust
diskann::always_escalate!(GarnetProviderError);

/// The Garnet DataProvider implementation.
pub struct GarnetProvider<T: VectorRepr> {
```
Why parameterize by T anymore? If we ever end up varying T, a bunch of code will get re-monomorphized when what actually changes based on T is relatively small.
```rust
    &self,
    metric_type: Metric,
    data: MatrixView<f32>,
) -> Result<(), GarnetQuantizerError> {
```
Should the whole training loop be done under a lock, with training skipped if the inner quantizer is already initialized?
Otherwise, we risk overwriting the quantizer with a different one when that risk can be mitigated locally.
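A rough sketch of that shape, with a placeholder type standing in for the trained quantizer state (assumption: the inner quantizer already sits behind an `RwLock<Option<_>>`, as discussed above):

```rust
use std::sync::RwLock;

// Placeholder for the trained quantizer tables.
struct TrainedTables(Vec<f32>);

struct QuantizerSlot {
    inner: RwLock<Option<TrainedTables>>,
}

impl QuantizerSlot {
    // Run the whole training step under the write lock: a concurrent caller
    // blocks, then sees the slot already filled and skips, so an existing
    // quantizer is never overwritten.
    fn train_once(&self, train: impl FnOnce() -> TrainedTables) {
        let mut guard = self.inner.write().unwrap();
        if guard.is_some() {
            return;
        }
        *guard = Some(train());
    }
}
```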
```rust
let mut v = vec![0f32; self.dim];
let mut q = vec![0u8; quantizer.canonical_bytes()];
for id in start_id..=end_id {
```
This entire loop suppresses errors rather than correcting or reporting them; why is that?
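For example (a generic sketch, not tied to this PR's types), the loop could keep going on per-id failures but still surface them to the caller:

```rust
// Hypothetical error-reporting shape for a backfill-style loop: continue past
// individual failures, but return a summary instead of silently dropping them.
fn backfill_range(
    ids: std::ops::RangeInclusive<u32>,
    mut quantize_one: impl FnMut(u32) -> Result<(), String>,
) -> Result<(), String> {
    let mut failed = 0usize;
    let mut first_error = None;
    for id in ids {
        if let Err(e) = quantize_one(id) {
            failed += 1;
            first_error.get_or_insert(e);
        }
    }
    match first_error {
        None => Ok(()),
        Some(e) => Err(format!("{failed} id(s) failed to backfill; first error: {e}")),
    }
}
```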
```rust
let backfill_finished = self.fsm.unlock_reuse();

if backfill_finished {
```
This isn't a reliable signal: it races. For example, if you have two tasks: the first one gets preempted before acquiring the reuse lock. The second one then acquires the lock, processes data, and then releases the lock. This release will see the counter as zero and think everything is good, but the first task never even started.
In general, the external interaction with the free-space map feels brittle. There are many atomic variables interacting with each other. As another example, checking can_reuse in FreeSpaceMap::next_id also doesn't feel right. See my other comment.
```diff
 /// previously belonged to a deleted element. The returned ID is marked as used.
 pub fn next_id(&self, ctx: Context) -> Result<u32, FsmError> {
-    if self.has_free_ids.load(Ordering::Acquire) {
+    if self.can_reuse() && self.has_free_ids.load(Ordering::Acquire) {
```
Is this check correct? `self.can_reuse` can return true before any quantization happens, causing this check to pass. Then quantization can be initiated, which means we can be in the loop below while quantization is in progress. I assume this isn't the intended goal since the check exists, but is there a mechanism by which `next_id` broadcasts "I'm running"?
```rust
    && !quantizer.is_prepared()
    && self.fsm.total_used() > quantizer.required_vectors()
{
    QUANTIZER_READY.with(|v| v.store(true, Ordering::Release));
```
How does this avoid mayhem when the same thread is used to search another index?
```rust
};

// Ensure we don't kick off training twice.
match self.training_started.compare_exchange(
```
If training fails, training_started is never reset. Should we be able to restart training?
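If retrying is desirable, one simple (hypothetical) shape is to release the claim when training errors out:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: claim the "training started" flag with compare_exchange,
// but give it back on failure so a later insert can trigger training again.
fn maybe_train(
    training_started: &AtomicBool,
    train: impl FnOnce() -> Result<(), String>,
) {
    if training_started
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return; // another task already claimed training
    }
    if train().is_err() {
        // Reset so training can be retried later.
        training_started.store(false, Ordering::Release);
    }
}
```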
Force-pushed from 14b6fe9 to b5a3759.
This implements the `BIN` and `Q8` quantizers for diskann-garnet. This PR includes the following:

- Vectors are stored as `Poly<[u8], AlignOfEight>` instead of `Vec<T>`. This means that the same type can be used for both quantized and full precision vectors.
- A `GarnetQuantizer` trait is introduced which allows us to type-erase the different quantizers instead of parameterizing the index/provider types.
- `insert()` now returns a success flag which can also signal that an associated quantizer is ready for training.
- `build_quant_table()` is called by Garnet asynchronously after `insert()` signals readiness for training to build quantization tables, and `backfill_quant_vectors` is called when training is complete to quantize previously inserted vectors. These new calls allow Garnet to control the training and backfill threads.
- The new accessor is called `DynamicAccessor` because it dynamically switches between full-precision-only and quantized-only operation depending on the availability and readiness of the associated quantizer.
- A `visit_used()` call was added to the FSM to facilitate walking all stored vectors.

There is one task left to do, which is to store and set up quantization tables when disk persistence is enabled in Garnet, but the rest should be ready for review.
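To make the type-erasure point concrete for readers of this description, here is a rough illustration of what a type-erased quantizer facade can look like; `canonical_bytes` and `is_prepared` appear in the diff, while `encode` and the trait name are stand-ins rather than the actual `GarnetQuantizer` definition:

```rust
/// Illustrative only: a type-erased quantizer facade in the spirit of the
/// `GarnetQuantizer` trait described above (not the actual trait definition).
trait Quantizer: Send + Sync {
    /// Number of bytes one quantized vector occupies.
    fn canonical_bytes(&self) -> usize;
    /// True once training has produced usable tables.
    fn is_prepared(&self) -> bool;
    /// Encode one full-precision vector into `out` (stand-in method).
    fn encode(&self, vector: &[f32], out: &mut [u8]);
}

// The provider can then hold `Box<dyn Quantizer>` / `Arc<dyn Quantizer>`
// instead of being generic over a concrete quantizer type.
```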
Poly<[u8], AlignOfEight>instead ofVec<T>. This means that the same type can be used for both quantized and full precision vectors.GarnetQuantizertrait is inroduced which allows us to type-erase the different quantizers instead of parameterizing the index/provider types.insert()now returns a success flag which can also signal that an associated quantizer is ready for training.build_quant_table()is called by Garnet asynchronously afterinsert()signals readiness for training to build quantization tables, andbackfill_quant_vectorsis called when training is complete to quantize previously inserted vectors. These new calls allow Garnet to control the training and backfill threads.DynamicAccessorbecause it dynamically switches between full precision only and quantized only operation depending on availability and readiness of the associated quantizer.visit_used()call was added to the FSM to facilitate walking all stored vectors.There is one task left to do, which is to store and set up quantization tables when disk persistence is enabled in Garnet, but the rest should be ready for review.