Significantly speed up bitmap computation by magdalendobson · Pull Request #1099 · microsoft/DiskANN

magdalendobson · 2026-05-21T22:09:59Z

Introduction

Bitmap computation in diskann-label-filter is unacceptably slow. Currently, with a 1M slice of yfcc and a 10k query set, computing the query bitmaps takes 132.043 seconds. With just a 100K slice of the caselaw dataset and a 10k query set, it takes 24.059 seconds. This makes it hard to run experiments on filtered search algorithms.

Speeding up the bitmap computation is conceptually simple. Instead of iterating over every base label for every query filter, we compute an inverted index for each label type, which maps the label value to the documents with the same value. Then, at query time, we query the inverted index for the relevant label values, and compose the resulting sets as necessary to find the documents satisfying the entire filter expression. At a high level, that is what this PR does.
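To make the idea concrete, here is a minimal sketch of building a per-label inverted index and composing AND/OR over it. It is not the code in this PR: `Expr`, `DocSet`, `Index`, `build_index`, and `evaluate` are hypothetical names, labels are simplified to string-to-string maps, and `BTreeSet` from the standard library stands in for a bitset. `evaluate` returns `None` on a NOT clause to signal that a caller would have to fall back to a slower evaluator.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified filter expression (not the crate's ASTExpr).
enum Expr {
    Eq(String, String), // label == value
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
}

type DocSet = BTreeSet<usize>;
/// label -> value -> ids of the documents carrying that value
type Index = HashMap<String, HashMap<String, DocSet>>;

/// One pass over the base labels builds the index for every label at once.
fn build_index(docs: &[HashMap<String, String>]) -> Index {
    let mut index = Index::new();
    for (doc_id, labels) in docs.iter().enumerate() {
        for (label, value) in labels {
            index
                .entry(label.clone())
                .or_default()
                .entry(value.clone())
                .or_default()
                .insert(doc_id);
        }
    }
    index
}

/// Returns None when the expression cannot be answered from the index alone.
fn evaluate(expr: &Expr, index: &Index) -> Option<DocSet> {
    match expr {
        // Equality is a lookup; an unknown label or value matches nothing.
        Expr::Eq(label, value) => Some(
            index
                .get(label)
                .and_then(|values| values.get(value))
                .cloned()
                .unwrap_or_default(),
        ),
        // AND intersects the children. An empty AND would match every
        // document, which this sketch cannot represent, so it yields None.
        Expr::And(children) => {
            let mut iter = children.iter();
            let mut acc = evaluate(iter.next()?, index)?;
            for child in iter {
                let next = evaluate(child, index)?;
                acc = acc.intersection(&next).cloned().collect();
            }
            Some(acc)
        }
        // OR unions the children, starting from the empty set.
        Expr::Or(children) => {
            let mut acc = DocSet::new();
            for child in children {
                acc.extend(evaluate(child, index)?);
            }
            Some(acc)
        }
        // The complement needs the full document universe, so give up here.
        Expr::Not(_) => None,
    }
}
```

With this shape, the cost per query depends on the sizes of the matched sets rather than on the number of base documents times the number of clauses.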

Lower level details

The overall workflow of the main function, compute_query_bitmaps, is as follows:

Check whether the query expression contains any ASTExpr::Not clauses. If so, default to the existing slow path. This is because we don't store the document universe for each label, and thus can't compute the complement of an arbitrary bitset.
Otherwise, move to the fast path.
Flatten the base labels so that each nested key path maps to a single string key (e.g. the JSON object {"car": {"color": "red", "make": "Mazda"}} would be transformed to {"car.color": "red", "car.make": "Mazda"}), and reorganize as a hash map of labels to values.
For each label, compute either an inverted index (strings and bools) or an R-tree (ints and floats) depending on its type.
At query time, use either the inverted index or the R-tree to produce a bitset for each CompareOp in the clause, and then compose them with AND and OR as needed to produce the final bitset.
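The flattening step above can be sketched roughly as follows. This is an illustration only: `Value` is a hypothetical stand-in for a parsed JSON value, `flatten` is not the crate's flattening utility, and arrays are left out for brevity. The separator is passed in because the crate's flattening is described as configurable.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed JSON value; the crate's types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Int(i64),
    Bool(bool),
    Object(Vec<(String, Value)>),
}

/// Walks `value` recursively and records every leaf under its joined key path.
fn flatten(prefix: &str, value: &Value, sep: &str, out: &mut HashMap<String, Value>) {
    match value {
        Value::Object(fields) => {
            for (key, child) in fields {
                // Top-level keys get no leading separator.
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}{sep}{key}")
                };
                flatten(&path, child, sep, out);
            }
        }
        // Any non-object value is a leaf: store it under the accumulated path.
        leaf => {
            out.insert(prefix.to_string(), leaf.clone());
        }
    }
}
```

Calling `flatten("", &doc, ".", &mut out)` on the car example yields the keys `car.color` and `car.make`, after which each key's values can be indexed independently.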

We also add a utility to diskann-label-filter for computing the specificity of a set of query filters with respect to a base set, outputting some statistics on it, and optionally outputting the individual specificity values to a file for further processing.
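As a rough illustration of what such a utility computes, the sketch below treats the specificity of a query as the fraction of base documents its bitmap matches. The function names, the `Summary` struct, and the particular statistics (min, max, mean, median) are assumptions made for the example; the statistics the actual utility reports may differ.

```rust
/// Fraction of the base set matched by each query. `match_counts[i]` is the
/// number of set bits in query i's bitmap (assumed to be computed elsewhere).
fn specificities(match_counts: &[usize], total_base: usize) -> Vec<f64> {
    match_counts
        .iter()
        .map(|&count| {
            if total_base == 0 {
                0.0 // avoid dividing by zero on an empty base set
            } else {
                count as f64 / total_base as f64
            }
        })
        .collect()
}

/// Hypothetical summary; the real utility may report other statistics.
struct Summary {
    min: f64,
    max: f64,
    mean: f64,
    median: f64,
}

/// Returns None for an empty input, since no statistic is defined then.
fn summarize(values: &[f64]) -> Option<Summary> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(Summary {
        min: sorted[0],
        max: sorted[n - 1],
        mean: sorted.iter().sum::<f64>() / n as f64,
        median,
    })
}
```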

Inverted Index

The inverted index maps each label value, converted to a string, to a bitset containing the doc ids corresponding to that value.

R-Tree

For simplicity, the R-tree implementation converts integers to floats before inserting so that we don't have to deal with two different types of R-tree. The performance of this piece of code isn't sensitive enough that it makes sense to differentiate, but this could be changed in the future.

The R-tree maps each value to a vector of ids instead of a bitset, because concatenating vectors is much cheaper than extending bitsets, and potentially many vectors would be concatenated during a range query.
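A minimal sketch of the numeric path, under some assumptions: the diff excerpts in this PR key a `BTreeMap` by `OrderedFloat`, so the example uses the same shape but with a hypothetical std-only `Key` wrapper so that it compiles without extra crates. `build_numeric_index` and `ids_in_range` are illustrative names, not functions from the PR.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Hypothetical totally ordered f64 key (a stand-in for OrderedFloat).
#[derive(Debug, Clone, Copy)]
struct Key(f64);

impl Key {
    /// Rejects NaN and folds -0.0 into 0.0 so both zeros share one bucket.
    fn new(x: f64) -> Option<Self> {
        if x.is_nan() {
            None
        } else {
            Some(Key(if x == 0.0 { 0.0 } else { x }))
        }
    }
}

impl Ord for Key {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Key {}

/// `values[doc_id]` is the document's numeric label, if it has one.
/// Integers are assumed to have been converted with `as f64` beforehand.
fn build_numeric_index(values: &[Option<f64>]) -> BTreeMap<Key, Vec<usize>> {
    let mut map: BTreeMap<Key, Vec<usize>> = BTreeMap::new();
    for (doc_id, value) in values.iter().copied().enumerate() {
        if let Some(key) = value.and_then(Key::new) {
            map.entry(key).or_default().push(doc_id);
        }
    }
    map
}

/// Concatenates the id vectors of every value in the closed range [lo, hi].
fn ids_in_range(index: &BTreeMap<Key, Vec<usize>>, lo: f64, hi: f64) -> Vec<usize> {
    let (Some(lo), Some(hi)) = (Key::new(lo), Key::new(hi)) else {
        return Vec::new();
    };
    if lo > hi {
        return Vec::new(); // BTreeMap::range panics on an inverted range
    }
    let mut ids = Vec::new();
    for (_, bucket) in index.range(lo..=hi) {
        ids.extend_from_slice(bucket);
    }
    ids
}
```

The returned vector can be turned into a bitset once at the end, so the per-bucket work during the range scan stays a plain append.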

Timings

Returning to the earlier timings: for the 1M slice of yfcc and a 10k query set, computing the query bitmaps now takes 7.811 seconds, down from 132.043 (roughly 17x faster). For the 100K slice of the caselaw dataset and a 10k query set, it now takes 5.805 seconds, down from 24.059 (roughly 4x faster). This is a lot better :)

Copilot

Pull request overview

This PR targets a major performance improvement in diskann-label-filter by introducing a fast-path for computing per-query bitmaps using precomputed per-field accelerators (inverted-index style maps for equality and a numeric BTree for range queries), while falling back to the existing evaluator when NOT is present. It also adds an example utility for computing “specificity” statistics over query filters.

Changes:

Add utils::compute_bitmap::compute_query_bitmaps implementing an accelerated bitmap computation path (with a NOT-guarded slow fallback).
Export the new bitmap API from diskann-label-filter and add an example (compute_specificities) to compute stats/output.
Minor doc comment updates in flattening utilities and dependency updates for the new module.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `diskann-label-filter/src/utils/flatten_utils.rs` | Updates doc examples for configurable flattening (one example is currently inconsistent with behavior). |
| `diskann-label-filter/src/utils/compute_bitmap.rs` | New accelerated bitmap computation implementation plus unit tests. |
| `diskann-label-filter/src/lib.rs` | Exposes the new module and re-exports `compute_query_bitmaps`. |
| `diskann-label-filter/examples/compute_specificities.rs` | New example for computing/saving specificity stats from computed bitmaps. |
| `diskann-label-filter/Cargo.toml` | Adds dependencies needed by the new bitmap computation module. |
| `Cargo.lock` | Locks new transitive deps (`bit-set`, `rayon`) for this crate. |


 ///
 /// Example:
-/// With config.separator="/": {"a": {"b": [1, 2]}} -> [ ("/a/b/0", 1), ("/a/b/1", 2) ]
+/// With config.separator=".": {"a": {"b": [1, 2]}} -> [ ("/a/b/0", 1), ("/a/b/1", 2) ]


+pub fn compute_inverted_index_accelerator(
+    key: String,
+    labels: Vec<HashMap<String, AttributeValue>>,
+) -> Result<HashMap<AttributeValue, BitSet>, anyhow::Error> {
+    let mut inverted_index: HashMap<AttributeValue, BitSet> = HashMap::new();
+    for (doc_id, label) in labels.iter().enumerate() {
+        if let Some(value) = label.get(&key) {
+            inverted_index
+                .entry(value.clone())
+                .or_insert_with(BitSet::new)
+                .insert(doc_id);
+        }
+    }


+pub fn compute_btree_accelerator(
+    key: String,
+    labels: Vec<HashMap<String, AttributeValue>>,
+) -> Result<BTreeMap<OrderedFloat, Vec<usize>>, anyhow::Error> {
+    // Implementation for computing BTree accelerator
+    let mut map: BTreeMap<OrderedFloat, Vec<usize>> = BTreeMap::new();
+    for (doc_id, label) in labels.iter().enumerate() {
+        if let Some(value) = label.get(&key) {
+            if let Some(f64_value) = value.as_float() {
+                let f64_value = OrderedFloat::new(f64_value)
+                    .map_err(|e| anyhow::anyhow!("Failed to create OrderedFloat: {e}"))?;
+                map.entry(f64_value).or_default().push(doc_id);
+            } else if let Some(i64_value) = value.as_integer() {
+                let i64_value = OrderedFloat::new(i64_value as f64)
+                    .map_err(|e| anyhow::anyhow!("Failed to create OrderedFloat: {e}"))?;
+                map.entry(i64_value).or_default().push(doc_id);
+            } else {


+pub fn compute_query_accelerator(
+    key: String,
+    value: AttributeValue,
+    flattened_base_labels: &[HashMap<String, AttributeValue>],
+) -> Result<QueryAccelerator, anyhow::Error> {
+    match value {
+        AttributeValue::String(_) | AttributeValue::Bool(_) => {
+            let bitmap =
+                compute_inverted_index_accelerator(key.clone(), flattened_base_labels.to_vec())
+                    .unwrap_or_default();
+            Ok(QueryAccelerator::InvertedIndex(bitmap))
+        }
+        AttributeValue::Integer(_) | AttributeValue::Real(_) => {
+            // For integers and reals, we use an BTree
+            let btree = compute_btree_accelerator(key.clone(), flattened_base_labels.to_vec())
+                .unwrap_or_default();
+            Ok(QueryAccelerator::BTree(btree))
+        }


+                                let mut all_ids = Vec::new();
+                                for (val, ids) in btree.iter() {
+                                    let fval = OrderedFloat::new(fval).map_err(|e| anyhow::anyhow!("Failed to create OrderedFloat: {e}"))?;
+                                    if val != &fval {
+                                        all_ids.extend(ids.iter().cloned());
+                                    }
+                                }
+                                let mut bitset = BitSet::new();
+                                bitset.extend(all_ids);


+        .map(|bitmap| {
+            let count = bitmap.len();
+            let specificity = count as f64 / total_base as f64;
+            specificity
+        })
+        .collect();


+        }
+    };
+    let elapsed = start.elapsed();
+    println!("read_labels_and_compute_bitmap_naive took {:.3?}", elapsed);


Magdalen Manohar and others added 5 commits May 19, 2026 13:01

add specificity utility

0a9980f

refactor example, add compute_bitmap

e99177a

commit to switch

ce447a3

work out kinks in OrderedFloat

e039051

undo change in docstring

8b47aef

magdalendobson marked this pull request as ready for review May 21, 2026 22:13

magdalendobson requested review from a team and Copilot May 21, 2026 22:13

Copilot started reviewing on behalf of magdalendobson May 21, 2026 22:13

Copilot AI reviewed May 21, 2026

magdalendobson wants to merge 5 commits into main from users/magdalen/add_filter_utils
