feat(embedded): add native Hnsw class for ann-benchmarks integration by polaz · Pull Request #70 · structured-world/coordinode-python

polaz · 2026-05-22T19:33:25Z

Summary

New PyO3 `Hnsw` class in `coordinode-embedded` — a Cypher-bypass path to `coordinode_vector::hnsw::HnswIndex`. Mirrors the hnswlib `BaseANN` surface so the ann-benchmarks Docker adapter is a one-line wrapper around it.
Numpy-native I/O: `fit(np.ndarray[float32, 2D])`, `knn_query(np.ndarray[float32, 1D], k) -> np.ndarray[int64, 1D]`.
Bumps `coordinode-rs` submodule to current main (newer HnswIndex API + saturation-path fix from f763f86).

Why

Existing `LocalClient` exposes the engine through Cypher (parser + planner + executor). Useful for end-to-end SDK demos but not comparable with in-process HNSW libraries (hnswlib / FAISS / ScaNN / Annoy) on a like-for-like basis. To get CoordiNode on the canonical ann-benchmarks.com leaderboard, and to drive the per-commit vector benchmark dashboard on docs.coordinode.com, we need a Cypher-bypass binding.

API surface

```python
import numpy as np
from coordinode_embedded import Hnsw

rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 16), dtype=np.float32)
q = rng.standard_normal(16, dtype=np.float32)

idx = Hnsw(dim=16, metric="euclidean", M=16, ef_construction=200)
idx.fit(X)
idx.set_ef(80)
labels = idx.knn_query(q, k=10) # np.ndarray[int64], shape (10,)
```

Design notes

GIL released around build and search via `py.allow_threads`.
Auto IDs: `fit` returns the contiguous `(start, end)` range it assigned. Multiple `fit` calls extend the index. Adapters that need to track their own labels keep their own mapping.
Serial insert, not insert_batch. `insert_batch` trades within-batch plan staleness for ~5-8× build throughput, but the engine's parity test only requires 0.7 top-10 agreement vs serial — that's fine inside the engine but unacceptable in a head-to-head with hnswlib on ann-benchmarks. A `fit_fast` opt-in can land later for workloads that prefer build throughput.
Metric aliases track ann-benchmarks conventions: `angular` / `cosine`, `euclidean` / `l2`, `dot` / `inner_product`, `manhattan` / `l1`. Unknown metric raises `ValueError`.
Dimension validation raises `ValueError` rather than panic on shape mismatch.

Tests

`tests/unit/test_hnsw.py` — 4 tests:

Metric parsing + dim validation
ID range bookkeeping across multiple `fit` calls
Recall@10 bar at N=10 000, dim=16: ≥ 0.95 at ef=50, ≥ 0.98 at ef=100, ≥ 0.99 at ef=200
Return type assertion (int64 ndarray, shape (k,))

```
pytest tests/unit/test_hnsw.py -v
============================== 4 passed in 5.51s ==============================
```

Local smoke (M1 macOS):
```
Built N=10000, dim=16 in 2.96s
ef=10: recall@10 = 0.872, QPS = 39115
ef=50: recall@10 = 0.990, QPS = 10908
ef=100: recall@10 = 0.998, QPS = 5262
ef=200: recall@10 = 0.999, QPS = 3047
```

Follow-up (separate work, NOT in this PR)

`benches/ann-benchmarks-adapter/` in coordinode repo — thin BaseANN wrapper + Dockerfile + sweep config
Bench CI on self-hosted ro: per-commit submodule swap → wheel build → ann-benchmarks run → JSON to bench-data branch
Upstream PR to `erikbern/ann-benchmarks` once `coordinode-embedded` ships a stable PyPI release

Closes #69

coderabbitai · 2026-05-22T19:33:37Z

Warning

Review limit reached

@polaz, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 9 minutes and 38 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f729a91-765d-4bb6-a0fa-ae93fb56859e

📥 Commits

Reviewing files that changed from the base of the PR and between fa85604 and 2042d0f.

⛔ Files ignored due to path filters (1)

coordinode-embedded/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

coordinode-embedded/Cargo.toml
coordinode-embedded/pyproject.toml
coordinode-embedded/python/coordinode_embedded/__init__.py
coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi
coordinode-embedded/src/hnsw.rs
coordinode-embedded/src/lib.rs
coordinode-rs
tests/unit/test_hnsw.py

📝 Walkthrough

Walkthrough

This PR introduces a native Hnsw PyO3 class to coordinode-embedded, wrapping the coordinode_vector HNSW index to expose a fast Python API for vector search. The implementation includes metric parsing, GIL-aware thread-safe state management, typed stubs, package wiring, dependency updates, and tests validating recall and output correctness.

Changes

HNSW vector index feature

Layer / File(s)	Summary
Dependencies and Python module wiring `coordinode-embedded/Cargo.toml`, `coordinode-embedded/pyproject.toml`, `coordinode-embedded/src/lib.rs`, `coordinode-embedded/python/coordinode_embedded/__init__.py`, `coordinode-rs`	Adds `numpy` and `coordinode-vector` dependencies, updates the submodule reference, declares the `hnsw` Rust module, registers `Hnsw` as a Python-exposed class, and exports it at the package level.
Type interface and contract `coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi`	Defines type stubs for `Hnsw` with typed constructor parameters (dim, metric, M, ef_construction, max_elements) and method signatures for `fit` returning a contiguous ID range, `set_ef` for runtime tuning, `knn_query` returning an `int64` array of neighbor IDs, plus `__len__` and `__repr__`.
Hnsw native implementation `coordinode-embedded/src/hnsw.rs`	Implements the PyO3-wrapped `Hnsw` class with metric normalization (cosine/angular, euclidean/l2, dot, manhattan/l1), mutex-protected native index and monotonic ID counter, constructor parameter validation, bulk vector insertion via `fit` with dimension checks and GIL release, `set_ef` for search parameter tuning, `knn_query` with dimension validation and NumPy `int64` array return, plus `__len__` and `__repr__`.
Test suite `tests/unit/test_hnsw.py`	Tests metric string parsing and rejection of unknown metrics, dimension validation on `fit` and `knn_query`, `fit` contiguity and index length tracking across multiple calls, recall@10 thresholds against brute-force ground truth with varying ef settings, and `knn_query` output dtype and shape verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(embedded): add native Hnsw class for ann-benchmarks integration' accurately describes the main change — adding a new PyO3 Hnsw class to coordinode-embedded for benchmarking purposes.
Linked Issues check	✅ Passed	The PR successfully implements all acceptance criteria from `#69`: Hnsw class registered, metric aliases supported, fit/set_ef/knn_query implemented with correct signatures and return types, comprehensive unit tests included, wheel builds, and no LocalClient regressions.
Out of Scope Changes check	✅ Passed	All changes are scoped to the Hnsw implementation: new PyO3 bindings, type stubs, tests, dependency updates, and a submodule bump for HnswIndex API alignment. No unrelated changes detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	The PR description clearly describes the addition of a new PyO3 Hnsw class with numpy-native I/O for ann-benchmarks integration, design rationale, API surface, tests, and follow-up work.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/#69-hnsw-native-binding

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@coordinode-embedded/src/hnsw.rs`:
- Around line 116-118: The early-return for n == 0 incorrectly returns (0, 0)
ignoring the current next_id; update the branch in the function containing the n
check to return the empty range at the current position by returning
(self.next_id, self.next_id) (or next_id, next_id if in a non-method scope) so
the returned contiguous range [first_id, last_id+1) reflects the actual
insertion point; ensure you reference the same next_id symbol used elsewhere in
this function.

In `@tests/unit/test_hnsw.py`:
- Around line 27-30: Add an explicit test case constructing ce.Hnsw with the
metric alias "inner_product" to ensure alias compatibility; specifically,
alongside the existing ce.Hnsw(dim=4, metric="dot", M=4, ef_construction=20)
call add ce.Hnsw(dim=4, metric="inner_product", M=4, ef_construction=20) so the
Hnsw constructor (ce.Hnsw) is validated for the inner_product alias.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d79854e3-010d-4ead-9318-650cccff7362

📥 Commits

Reviewing files that changed from the base of the PR and between b5c1ed2 and f6b57c8.

⛔ Files ignored due to path filters (1)

coordinode-embedded/Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (7)

coordinode-embedded/Cargo.toml
coordinode-embedded/python/coordinode_embedded/__init__.py
coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi
coordinode-embedded/src/hnsw.rs
coordinode-embedded/src/lib.rs
coordinode-rs
tests/unit/test_hnsw.py

Copilot

Pull request overview

Adds a new native PyO3 Hnsw binding in coordinode-embedded to expose coordinode_vector::hnsw::HnswIndex directly (Cypher-bypass) for ann-benchmarks-style in-process ANN evaluation.

Changes:

Introduces a Rust/PyO3 Hnsw class with NumPy-based fit, set_ef, and knn_query APIs.
Exposes Hnsw from the Python package (__init__.py) and adds a typing stub surface.
Adds unit tests covering metric parsing, dimension validation, ID bookkeeping, recall thresholds, and return dtypes.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/unit/test_hnsw.py	Adds API + recall/unit tests for the new `Hnsw` binding.
coordinode-embedded/src/lib.rs	Registers the new `Hnsw` PyO3 class in the extension module.
coordinode-embedded/src/hnsw.rs	Implements the `Hnsw` PyO3 wrapper over the native engine index.
coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi	Adds Python typing stubs for the new `Hnsw` API.
coordinode-embedded/python/coordinode_embedded/init.py	Re-exports `Hnsw` from the package top-level.
coordinode-embedded/Cargo.toml	Adds Rust-side dependencies needed for the binding (notably `numpy`).
coordinode-embedded/Cargo.lock	Updates lockfile for the new dependency graph / submodule bump.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

The existing LocalClient routes through Cypher (parser + planner + executor), which carries overhead that's not comparable with in-process HNSW libraries like hnswlib, FAISS-HNSW, ScaNN, or Annoy. To put CoordiNode on the canonical ann-benchmarks.com leaderboard — and to drive the per-commit vector benchmark on docs.coordinode.com — we need a Cypher-bypass path that calls coordinode_vector::hnsw::HnswIndex directly. This commit: - Adds `coordinode-vector` and `numpy` as deps on coordinode-embedded. - New `Hnsw` PyO3 class with fit / set_ef / knn_query, mirroring the hnswlib BaseANN surface so the ann-benchmarks adapter is a one-line wrapper around it. - Numpy-native I/O: fit takes `np.ndarray[float32, 2D]`, knn_query returns `np.ndarray[int64, 1D]`. Zero per-row Python-side conversion on the query hot path. - GIL released around the build and search calls. - Internal IDs auto-assigned sequentially; fit returns the (start, end) range so callers can map their own labels. - Build uses per-item `insert` rather than `insert_batch`. The batch path's known recall divergence vs serial (engine parity bar is 0.7 top-10 agreement) is unacceptable for fair library-tier comparisons; a `fit_fast` opt-in can land later if a real workload needs the build-throughput trade. Bumps the `coordinode-rs` submodule to current main (engine HEAD has `HnswIndex::insert_batch` and `set_ef_search` which earlier pinned SHAs did not yet have, plus the f763f86 saturation-path back-edge fix). Tests at tests/unit/test_hnsw.py cover metric parsing, dimension validation, ID-range bookkeeping, and a recall@10 bar at 10K vectors (≥ 0.95 at ef=50, ≥ 0.99 at ef=200). Closes #69

- hnsw.rs: empty-batch fit() returns (next_id, next_id) instead of (0, 0). Previous early-return broke the documented "contiguous range at the actual insertion point" contract once the index already held vectors — every subsequent fit() call would see a phantom (0, 0) origin from an upstream empty batch. - test_hnsw.py: add explicit alias test for "inner_product" alongside "dot". Locks the parser against a future rename silently dropping the alias. - pyproject.toml: declare numpy as a runtime dependency. The compiled extension links the Rust `numpy` crate which calls into the NumPy C-API on first PyArray operation; without NumPy installed the wheel panics on first call. Lower bound 1.26 is the oldest release shipping matching ABI headers for cp311-314. - lib.rs: comment documenting why there is no explicit numpy initialisation in the pymodule fn. The numpy 0.23 crate auto-imports numpy.core lazily; it exposes no public init function to call.

- hnsw.rs: add `module = "coordinode_embedded"` to the `#[pyclass]` attribute so the class reports its module the same way LocalClient does (the public package, not the inner extension module). Affects `__module__`, generated docs, and pickling. - hnsw.rs: extend the unknown-metric error string to list every accepted spelling (cosine, angular, euclidean, l2, dot, dot_product, ip, inner_product, manhattan, l1) instead of a partial list. Pulled into a module-level ACCEPTED_METRICS constant kept next to the parser so the two cannot drift. - hnsw.rs: the string-representation method now emits a poison marker when the next_id mutex is poisoned instead of pretending the index is empty. The length accessor already surfaces the same condition as a RuntimeError; this brings the string-representation into the same shape so a debugger sees the failure deterministically. - pyproject.toml: rewrite the numpy-dependency note to match the actual import behaviour. The `numpy` crate imports `numpy.core` lazily on the first PyArray call, not eagerly at extension import time. Aligns the wording with the matching note in lib.rs so the two do not contradict each other.

- hnsw.rs: `__len__` now reads the count from the underlying HnswIndex instead of `next_id`. The previous derivation could observe phantom IDs under concurrent fit/__len__ — `fit` bumps next_id under its own mutex, releases it, then acquires the index lock for the actual inserts (with the GIL released around the build). A concurrent `__len__` between those two windows would see the bumped counter before the inserts landed. Locking `inner` makes the count reflect committed inserts only. - hnsw.rs: `__repr__` now reports the same len source plus a `<busy>` marker when the index lock is contended (via `try_lock`). Stops a debug REPL from blocking on a concurrent build. - hnsw.rs: docstring for the `metric` constructor argument now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1) instead of a partial list. The error message and the parser were already exhaustive; the docstring just needed to catch up. - tests/unit/test_hnsw.py: the brute-force top-k helper uses `np.argpartition` (O(N)) instead of `np.argsort` (O(N log N)) — only the SET of nearest k matters for the recall metric, not the order inside it. - tests/unit/test_hnsw.py: the recall test allocates the 10K × 16 float matrix directly as float32 via `standard_normal(dtype=np.float32)` instead of allocating float64 then `.astype(np.float32)`. Halves the peak memory for that test.

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

…artition pivot - src/lib.rs: the explanation of why the pymodule fn does not call a NumPy initializer no longer mentions a specific numpy crate line. The numpy crate has held the same lazy-import policy across 0.23 and 0.24, and that note was already stale when the crate dependency moved to 0.24 in the previous round. - tests/unit/test_hnsw.py: `_brute_force_topk` gains an inline note documenting numpy's argpartition pivot semantics — the `k` argument is a pivot index, not an off-by-one. Verified empirically against np.argsort on random vectors. Heads off the recurring "should be k-1" review suggestion.

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

`m_max0: M * 2` could panic on debug builds or wrap on release builds for adversarial M values at or above usize::MAX / 2 + 1. The realistic M range for ann-benchmarks workloads is 4..96, so any overflow at the binding boundary is caller error rather than a workload to support. Replace the plain multiplication with `M.checked_mul(2)` and surface the overflow as a ValueError before the engine sees a broken HnswConfig.

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

…ion pivot - _coordinode_embedded.pyi: type stub docstring for `metric` now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1). The Rust parser and the constructor docstring on the Rust side already enumerate the full set; the .pyi stub had a stale subset. - tests/unit/test_hnsw.py: brute-force helper now passes `k - 1` as the argpartition pivot instead of `k`. Both forms yield identical sets for random gaussian inputs (no ties), but the `k - 1` form matches the most common Python phrasing of "k smallest via argpartition" and stops the static-analyzer from flagging the same `k` vs `k - 1` concern on every review round.

sonarqubecloud · 2026-05-23T12:41:36Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

+via ``maturin develop``.  Exercised by the ``build-embedded`` CI job after
+the wheel is installed.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+np = pytest.importorskip("numpy")
+ce = pytest.importorskip("coordinode_embedded")


coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread coordinode-embedded/src/hnsw.rs

Comment thread tests/unit/test_hnsw.py

polaz requested a review from Copilot May 22, 2026 20:17

Copilot started reviewing on behalf of polaz May 22, 2026 20:17 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread coordinode-embedded/src/hnsw.rs Outdated

Comment thread coordinode-embedded/src/lib.rs

Comment thread coordinode-embedded/Cargo.toml

polaz requested a review from Copilot May 22, 2026 20:52

Copilot started reviewing on behalf of polaz May 22, 2026 20:52 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread coordinode-embedded/src/hnsw.rs Outdated

Comment thread coordinode-embedded/src/hnsw.rs

Comment thread coordinode-embedded/src/hnsw.rs Outdated

Comment thread coordinode-embedded/pyproject.toml Outdated

polaz force-pushed the feat/#69-hnsw-native-binding branch from fa85604 to a23c983 Compare May 22, 2026 21:53

polaz requested a review from Copilot May 22, 2026 22:00

Copilot started reviewing on behalf of polaz May 22, 2026 22:00 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread coordinode-embedded/src/hnsw.rs Outdated

Comment thread coordinode-embedded/src/hnsw.rs

Comment thread tests/unit/test_hnsw.py Outdated

Comment thread tests/unit/test_hnsw.py Outdated

polaz added 3 commits May 23, 2026 01:05

polaz force-pushed the feat/#69-hnsw-native-binding branch from a23c983 to a0eb616 Compare May 22, 2026 22:08

Copilot AI review requested due to automatic review settings May 22, 2026 22:12

Copilot started reviewing on behalf of polaz May 22, 2026 22:12 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread tests/unit/test_hnsw.py Outdated

Comment thread coordinode-embedded/src/lib.rs Outdated

polaz requested a review from Copilot May 23, 2026 09:15

Copilot started reviewing on behalf of polaz May 23, 2026 09:15 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread coordinode-embedded/src/hnsw.rs

polaz requested a review from Copilot May 23, 2026 11:53

Copilot started reviewing on behalf of polaz May 23, 2026 11:53 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi Outdated

Comment thread tests/unit/test_hnsw.py Outdated

polaz requested a review from Copilot May 23, 2026 12:46

Copilot started reviewing on behalf of polaz May 23, 2026 12:46 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Conversation

polaz commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

API surface

Design notes

Tests

Follow-up (separate work, NOT in this PR)

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 23, 2026

Quality Gate passed

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

polaz commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading