feat(embedded): add native Hnsw class for ann-benchmarks integration#70
feat(embedded): add native Hnsw class for ann-benchmarks integration#70polaz wants to merge 7 commits into
Conversation
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 9 minutes and 38 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThis PR introduces a native Hnsw PyO3 class to coordinode-embedded, wrapping the coordinode_vector HNSW index to expose a fast Python API for vector search. The implementation includes metric parsing, GIL-aware thread-safe state management, typed stubs, package wiring, dependency updates, and tests validating recall and output correctness. ChangesHNSW vector index feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@coordinode-embedded/src/hnsw.rs`:
- Around line 116-118: The early-return for n == 0 incorrectly returns (0, 0)
ignoring the current next_id; update the branch in the function containing the n
check to return the empty range at the current position by returning
(self.next_id, self.next_id) (or next_id, next_id if in a non-method scope) so
the returned contiguous range [first_id, last_id+1) reflects the actual
insertion point; ensure you reference the same next_id symbol used elsewhere in
this function.
In `@tests/unit/test_hnsw.py`:
- Around line 27-30: Add an explicit test case constructing ce.Hnsw with the
metric alias "inner_product" to ensure alias compatibility; specifically,
alongside the existing ce.Hnsw(dim=4, metric="dot", M=4, ef_construction=20)
call add ce.Hnsw(dim=4, metric="inner_product", M=4, ef_construction=20) so the
Hnsw constructor (ce.Hnsw) is validated for the inner_product alias.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d79854e3-010d-4ead-9318-650cccff7362
⛔ Files ignored due to path filters (1)
coordinode-embedded/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
coordinode-embedded/Cargo.tomlcoordinode-embedded/python/coordinode_embedded/__init__.pycoordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyicoordinode-embedded/src/hnsw.rscoordinode-embedded/src/lib.rscoordinode-rstests/unit/test_hnsw.py
There was a problem hiding this comment.
Pull request overview
Adds a new native PyO3 Hnsw binding in coordinode-embedded to expose coordinode_vector::hnsw::HnswIndex directly (Cypher-bypass) for ann-benchmarks-style in-process ANN evaluation.
Changes:
- Introduces a Rust/PyO3
Hnswclass with NumPy-basedfit,set_ef, andknn_queryAPIs. - Exposes
Hnswfrom the Python package (__init__.py) and adds a typing stub surface. - Adds unit tests covering metric parsing, dimension validation, ID bookkeeping, recall thresholds, and return dtypes.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_hnsw.py | Adds API + recall/unit tests for the new Hnsw binding. |
| coordinode-embedded/src/lib.rs | Registers the new Hnsw PyO3 class in the extension module. |
| coordinode-embedded/src/hnsw.rs | Implements the Hnsw PyO3 wrapper over the native engine index. |
| coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi | Adds Python typing stubs for the new Hnsw API. |
| coordinode-embedded/python/coordinode_embedded/init.py | Re-exports Hnsw from the package top-level. |
| coordinode-embedded/Cargo.toml | Adds Rust-side dependencies needed for the binding (notably numpy). |
| coordinode-embedded/Cargo.lock | Updates lockfile for the new dependency graph / submodule bump. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fa85604 to
a23c983
Compare
The existing LocalClient routes through Cypher (parser + planner + executor), which carries overhead that's not comparable with in-process HNSW libraries like hnswlib, FAISS-HNSW, ScaNN, or Annoy. To put CoordiNode on the canonical ann-benchmarks.com leaderboard — and to drive the per-commit vector benchmark on docs.coordinode.com — we need a Cypher-bypass path that calls coordinode_vector::hnsw::HnswIndex directly. This commit: - Adds `coordinode-vector` and `numpy` as deps on coordinode-embedded. - New `Hnsw` PyO3 class with fit / set_ef / knn_query, mirroring the hnswlib BaseANN surface so the ann-benchmarks adapter is a one-line wrapper around it. - Numpy-native I/O: fit takes `np.ndarray[float32, 2D]`, knn_query returns `np.ndarray[int64, 1D]`. Zero per-row Python-side conversion on the query hot path. - GIL released around the build and search calls. - Internal IDs auto-assigned sequentially; fit returns the (start, end) range so callers can map their own labels. - Build uses per-item `insert` rather than `insert_batch`. The batch path's known recall divergence vs serial (engine parity bar is 0.7 top-10 agreement) is unacceptable for fair library-tier comparisons; a `fit_fast` opt-in can land later if a real workload needs the build-throughput trade. Bumps the `coordinode-rs` submodule to current main (engine HEAD has `HnswIndex::insert_batch` and `set_ef_search` which earlier pinned SHAs did not yet have, plus the f763f86 saturation-path back-edge fix). Tests at tests/unit/test_hnsw.py cover metric parsing, dimension validation, ID-range bookkeeping, and a recall@10 bar at 10K vectors (≥ 0.95 at ef=50, ≥ 0.99 at ef=200). Closes #69
- hnsw.rs: empty-batch fit() returns (next_id, next_id) instead of (0, 0). Previous early-return broke the documented "contiguous range at the actual insertion point" contract once the index already held vectors — every subsequent fit() call would see a phantom (0, 0) origin from an upstream empty batch. - test_hnsw.py: add explicit alias test for "inner_product" alongside "dot". Locks the parser against a future rename silently dropping the alias. - pyproject.toml: declare numpy as a runtime dependency. The compiled extension links the Rust `numpy` crate which calls into the NumPy C-API on first PyArray operation; without NumPy installed the wheel panics on first call. Lower bound 1.26 is the oldest release shipping matching ABI headers for cp311-314. - lib.rs: comment documenting why there is no explicit numpy initialisation in the pymodule fn. The numpy 0.23 crate auto-imports numpy.core lazily; it exposes no public init function to call.
- hnsw.rs: add `module = "coordinode_embedded"` to the `#[pyclass]` attribute so the class reports its module the same way LocalClient does (the public package, not the inner extension module). Affects `__module__`, generated docs, and pickling. - hnsw.rs: extend the unknown-metric error string to list every accepted spelling (cosine, angular, euclidean, l2, dot, dot_product, ip, inner_product, manhattan, l1) instead of a partial list. Pulled into a module-level ACCEPTED_METRICS constant kept next to the parser so the two cannot drift. - hnsw.rs: the string-representation method now emits a poison marker when the next_id mutex is poisoned instead of pretending the index is empty. The length accessor already surfaces the same condition as a RuntimeError; this brings the string-representation into the same shape so a debugger sees the failure deterministically. - pyproject.toml: rewrite the numpy-dependency note to match the actual import behaviour. The `numpy` crate imports `numpy.core` lazily on the first PyArray call, not eagerly at extension import time. Aligns the wording with the matching note in lib.rs so the two do not contradict each other.
a23c983 to
a0eb616
Compare
- hnsw.rs: `__len__` now reads the count from the underlying HnswIndex instead of `next_id`. The previous derivation could observe phantom IDs under concurrent fit/__len__ — `fit` bumps next_id under its own mutex, releases it, then acquires the index lock for the actual inserts (with the GIL released around the build). A concurrent `__len__` between those two windows would see the bumped counter before the inserts landed. Locking `inner` makes the count reflect committed inserts only. - hnsw.rs: `__repr__` now reports the same len source plus a `<busy>` marker when the index lock is contended (via `try_lock`). Stops a debug REPL from blocking on a concurrent build. - hnsw.rs: docstring for the `metric` constructor argument now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1) instead of a partial list. The error message and the parser were already exhaustive; the docstring just needed to catch up. - tests/unit/test_hnsw.py: the brute-force top-k helper uses `np.argpartition` (O(N)) instead of `np.argsort` (O(N log N)) — only the SET of nearest k matters for the recall metric, not the order inside it. - tests/unit/test_hnsw.py: the recall test allocates the 10K × 16 float matrix directly as float32 via `standard_normal(dtype=np.float32)` instead of allocating float64 then `.astype(np.float32)`. Halves the peak memory for that test.
…artition pivot - src/lib.rs: the explanation of why the pymodule fn does not call a NumPy initializer no longer mentions a specific numpy crate line. The numpy crate has held the same lazy-import policy across 0.23 and 0.24, and that note was already stale when the crate dependency moved to 0.24 in the previous round. - tests/unit/test_hnsw.py: `_brute_force_topk` gains an inline note documenting numpy's argpartition pivot semantics — the `k` argument is a pivot index, not an off-by-one. Verified empirically against np.argsort on random vectors. Heads off the recurring "should be k-1" review suggestion.
`m_max0: M * 2` could panic on debug builds or wrap on release builds for adversarial M values at or above usize::MAX / 2 + 1. The realistic M range for ann-benchmarks workloads is 4..96, so any overflow at the binding boundary is caller error rather than a workload to support. Replace the plain multiplication with `M.checked_mul(2)` and surface the overflow as a ValueError before the engine sees a broken HnswConfig.
…ion pivot - _coordinode_embedded.pyi: type stub docstring for `metric` now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1). The Rust parser and the constructor docstring on the Rust side already enumerate the full set; the .pyi stub had a stale subset. - tests/unit/test_hnsw.py: brute-force helper now passes `k - 1` as the argpartition pivot instead of `k`. Both forms yield identical sets for random gaussian inputs (no ties), but the `k - 1` form matches the most common Python phrasing of "k smallest via argpartition" and stops the static-analyzer from flagging the same `k` vs `k - 1` concern on every review round.
|
| via ``maturin develop``. Exercised by the ``build-embedded`` CI job after | ||
| the wheel is installed. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import pytest | ||
|
|
||
| np = pytest.importorskip("numpy") | ||
| ce = pytest.importorskip("coordinode_embedded") |



Summary
Why
Existing `LocalClient` exposes the engine through Cypher (parser + planner + executor). Useful for end-to-end SDK demos but not comparable with in-process HNSW libraries (hnswlib / FAISS / ScaNN / Annoy) on a like-for-like basis. To get CoordiNode on the canonical ann-benchmarks.com leaderboard, and to drive the per-commit vector benchmark dashboard on docs.coordinode.com, we need a Cypher-bypass binding.
API surface
```python
import numpy as np
from coordinode_embedded import Hnsw
rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 16), dtype=np.float32)
q = rng.standard_normal(16, dtype=np.float32)
idx = Hnsw(dim=16, metric="euclidean", M=16, ef_construction=200)
idx.fit(X)
idx.set_ef(80)
labels = idx.knn_query(q, k=10) # np.ndarray[int64], shape (10,)
```
Design notes
Tests
`tests/unit/test_hnsw.py` — 4 tests:
```
pytest tests/unit/test_hnsw.py -v
============================== 4 passed in 5.51s ==============================
```
Local smoke (M1 macOS):
```
Built N=10000, dim=16 in 2.96s
ef=10: recall@10 = 0.872, QPS = 39115
ef=50: recall@10 = 0.990, QPS = 10908
ef=100: recall@10 = 0.998, QPS = 5262
ef=200: recall@10 = 0.999, QPS = 3047
```
Follow-up (separate work, NOT in this PR)
Closes #69