Skip to content

feat(embedded): add native Hnsw class for ann-benchmarks integration#70

Open
polaz wants to merge 7 commits into
mainfrom
feat/#69-hnsw-native-binding
Open

feat(embedded): add native Hnsw class for ann-benchmarks integration#70
polaz wants to merge 7 commits into
mainfrom
feat/#69-hnsw-native-binding

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented May 22, 2026

Summary

  • New PyO3 `Hnsw` class in `coordinode-embedded` — a Cypher-bypass path to `coordinode_vector::hnsw::HnswIndex`. Mirrors the hnswlib `BaseANN` surface so the ann-benchmarks Docker adapter is a one-line wrapper around it.
  • Numpy-native I/O: `fit(np.ndarray[float32, 2D])`, `knn_query(np.ndarray[float32, 1D], k) -> np.ndarray[int64, 1D]`.
  • Bumps `coordinode-rs` submodule to current main (newer HnswIndex API + saturation-path fix from f763f86).

Why

Existing `LocalClient` exposes the engine through Cypher (parser + planner + executor). Useful for end-to-end SDK demos but not comparable with in-process HNSW libraries (hnswlib / FAISS / ScaNN / Annoy) on a like-for-like basis. To get CoordiNode on the canonical ann-benchmarks.com leaderboard, and to drive the per-commit vector benchmark dashboard on docs.coordinode.com, we need a Cypher-bypass binding.

API surface

```python
import numpy as np
from coordinode_embedded import Hnsw

rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 16), dtype=np.float32)
q = rng.standard_normal(16, dtype=np.float32)

idx = Hnsw(dim=16, metric="euclidean", M=16, ef_construction=200)
idx.fit(X)
idx.set_ef(80)
labels = idx.knn_query(q, k=10) # np.ndarray[int64], shape (10,)
```

Design notes

  • GIL released around build and search via `py.allow_threads`.
  • Auto IDs: `fit` returns the contiguous `(start, end)` range it assigned. Multiple `fit` calls extend the index. Adapters that need to track their own labels keep their own mapping.
  • Serial insert, not insert_batch. `insert_batch` trades within-batch plan staleness for ~5-8× build throughput, but the engine's parity test only requires 0.7 top-10 agreement vs serial — that's fine inside the engine but unacceptable in a head-to-head with hnswlib on ann-benchmarks. A `fit_fast` opt-in can land later for workloads that prefer build throughput.
  • Metric aliases track ann-benchmarks conventions: `angular` / `cosine`, `euclidean` / `l2`, `dot` / `inner_product`, `manhattan` / `l1`. Unknown metric raises `ValueError`.
  • Dimension validation raises `ValueError` rather than panic on shape mismatch.

Tests

`tests/unit/test_hnsw.py` — 4 tests:

  • Metric parsing + dim validation
  • ID range bookkeeping across multiple `fit` calls
  • Recall@10 bar at N=10 000, dim=16: ≥ 0.95 at ef=50, ≥ 0.98 at ef=100, ≥ 0.99 at ef=200
  • Return type assertion (int64 ndarray, shape (k,))

```
pytest tests/unit/test_hnsw.py -v
============================== 4 passed in 5.51s ==============================
```

Local smoke (M1 macOS):
```
Built N=10000, dim=16 in 2.96s
ef=10: recall@10 = 0.872, QPS = 39115
ef=50: recall@10 = 0.990, QPS = 10908
ef=100: recall@10 = 0.998, QPS = 5262
ef=200: recall@10 = 0.999, QPS = 3047
```

Follow-up (separate work, NOT in this PR)

  • `benches/ann-benchmarks-adapter/` in coordinode repo — thin BaseANN wrapper + Dockerfile + sweep config
  • Bench CI on self-hosted ro: per-commit submodule swap → wheel build → ann-benchmarks run → JSON to bench-data branch
  • Upstream PR to `erikbern/ann-benchmarks` once `coordinode-embedded` ships a stable PyPI release

Closes #69

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Review Change Stack

Warning

Review limit reached

@polaz, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 9 minutes and 38 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f729a91-765d-4bb6-a0fa-ae93fb56859e

📥 Commits

Reviewing files that changed from the base of the PR and between fa85604 and 2042d0f.

⛔ Files ignored due to path filters (1)
  • coordinode-embedded/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • coordinode-embedded/Cargo.toml
  • coordinode-embedded/pyproject.toml
  • coordinode-embedded/python/coordinode_embedded/__init__.py
  • coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi
  • coordinode-embedded/src/hnsw.rs
  • coordinode-embedded/src/lib.rs
  • coordinode-rs
  • tests/unit/test_hnsw.py
📝 Walkthrough

Walkthrough

This PR introduces a native Hnsw PyO3 class to coordinode-embedded, wrapping the coordinode_vector HNSW index to expose a fast Python API for vector search. The implementation includes metric parsing, GIL-aware thread-safe state management, typed stubs, package wiring, dependency updates, and tests validating recall and output correctness.

Changes

HNSW vector index feature

Layer / File(s) Summary
Dependencies and Python module wiring
coordinode-embedded/Cargo.toml, coordinode-embedded/pyproject.toml, coordinode-embedded/src/lib.rs, coordinode-embedded/python/coordinode_embedded/__init__.py, coordinode-rs
Adds numpy and coordinode-vector dependencies, updates the submodule reference, declares the hnsw Rust module, registers Hnsw as a Python-exposed class, and exports it at the package level.
Type interface and contract
coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi
Defines type stubs for Hnsw with typed constructor parameters (dim, metric, M, ef_construction, max_elements) and method signatures for fit returning a contiguous ID range, set_ef for runtime tuning, knn_query returning an int64 array of neighbor IDs, plus __len__ and __repr__.
Hnsw native implementation
coordinode-embedded/src/hnsw.rs
Implements the PyO3-wrapped Hnsw class with metric normalization (cosine/angular, euclidean/l2, dot, manhattan/l1), mutex-protected native index and monotonic ID counter, constructor parameter validation, bulk vector insertion via fit with dimension checks and GIL release, set_ef for search parameter tuning, knn_query with dimension validation and NumPy int64 array return, plus __len__ and __repr__.
Test suite
tests/unit/test_hnsw.py
Tests metric string parsing and rejection of unknown metrics, dimension validation on fit and knn_query, fit contiguity and index length tracking across multiple calls, recall@10 thresholds against brute-force ground truth with varying ef settings, and knn_query output dtype and shape verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(embedded): add native Hnsw class for ann-benchmarks integration' accurately describes the main change — adding a new PyO3 Hnsw class to coordinode-embedded for benchmarking purposes.
Linked Issues check ✅ Passed The PR successfully implements all acceptance criteria from #69: Hnsw class registered, metric aliases supported, fit/set_ef/knn_query implemented with correct signatures and return types, comprehensive unit tests included, wheel builds, and no LocalClient regressions.
Out of Scope Changes check ✅ Passed All changes are scoped to the Hnsw implementation: new PyO3 bindings, type stubs, tests, dependency updates, and a submodule bump for HnswIndex API alignment. No unrelated changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description clearly describes the addition of a new PyO3 Hnsw class with numpy-native I/O for ann-benchmarks integration, design rationale, API surface, tests, and follow-up work.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#69-hnsw-native-binding

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@coordinode-embedded/src/hnsw.rs`:
- Around line 116-118: The early-return for n == 0 incorrectly returns (0, 0)
ignoring the current next_id; update the branch in the function containing the n
check to return the empty range at the current position by returning
(self.next_id, self.next_id) (or next_id, next_id if in a non-method scope) so
the returned contiguous range [first_id, last_id+1) reflects the actual
insertion point; ensure you reference the same next_id symbol used elsewhere in
this function.

In `@tests/unit/test_hnsw.py`:
- Around line 27-30: Add an explicit test case constructing ce.Hnsw with the
metric alias "inner_product" to ensure alias compatibility; specifically,
alongside the existing ce.Hnsw(dim=4, metric="dot", M=4, ef_construction=20)
call add ce.Hnsw(dim=4, metric="inner_product", M=4, ef_construction=20) so the
Hnsw constructor (ce.Hnsw) is validated for the inner_product alias.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d79854e3-010d-4ead-9318-650cccff7362

📥 Commits

Reviewing files that changed from the base of the PR and between b5c1ed2 and f6b57c8.

⛔ Files ignored due to path filters (1)
  • coordinode-embedded/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • coordinode-embedded/Cargo.toml
  • coordinode-embedded/python/coordinode_embedded/__init__.py
  • coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi
  • coordinode-embedded/src/hnsw.rs
  • coordinode-embedded/src/lib.rs
  • coordinode-rs
  • tests/unit/test_hnsw.py

Comment thread coordinode-embedded/src/hnsw.rs
Comment thread tests/unit/test_hnsw.py
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new native PyO3 Hnsw binding in coordinode-embedded to expose coordinode_vector::hnsw::HnswIndex directly (Cypher-bypass) for ann-benchmarks-style in-process ANN evaluation.

Changes:

  • Introduces a Rust/PyO3 Hnsw class with NumPy-based fit, set_ef, and knn_query APIs.
  • Exposes Hnsw from the Python package (__init__.py) and adds a typing stub surface.
  • Adds unit tests covering metric parsing, dimension validation, ID bookkeeping, recall thresholds, and return dtypes.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/test_hnsw.py Adds API + recall/unit tests for the new Hnsw binding.
coordinode-embedded/src/lib.rs Registers the new Hnsw PyO3 class in the extension module.
coordinode-embedded/src/hnsw.rs Implements the Hnsw PyO3 wrapper over the native engine index.
coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi Adds Python typing stubs for the new Hnsw API.
coordinode-embedded/python/coordinode_embedded/init.py Re-exports Hnsw from the package top-level.
coordinode-embedded/Cargo.toml Adds Rust-side dependencies needed for the binding (notably numpy).
coordinode-embedded/Cargo.lock Updates lockfile for the new dependency graph / submodule bump.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread coordinode-embedded/src/hnsw.rs Outdated
Comment thread coordinode-embedded/src/lib.rs
Comment thread coordinode-embedded/Cargo.toml
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 4 comments.

Comment thread coordinode-embedded/src/hnsw.rs Outdated
Comment thread coordinode-embedded/src/hnsw.rs
Comment thread coordinode-embedded/src/hnsw.rs Outdated
Comment thread coordinode-embedded/pyproject.toml Outdated
@polaz polaz force-pushed the feat/#69-hnsw-native-binding branch from fa85604 to a23c983 Compare May 22, 2026 21:53
@polaz polaz requested a review from Copilot May 22, 2026 22:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Comment thread coordinode-embedded/src/hnsw.rs Outdated
Comment thread coordinode-embedded/src/hnsw.rs
Comment thread tests/unit/test_hnsw.py Outdated
Comment thread tests/unit/test_hnsw.py Outdated
polaz added 3 commits May 23, 2026 01:05
The existing LocalClient routes through Cypher (parser + planner +
executor), which carries overhead that's not comparable with in-process
HNSW libraries like hnswlib, FAISS-HNSW, ScaNN, or Annoy. To put
CoordiNode on the canonical ann-benchmarks.com leaderboard — and to
drive the per-commit vector benchmark on docs.coordinode.com — we need
a Cypher-bypass path that calls coordinode_vector::hnsw::HnswIndex
directly.

This commit:

- Adds `coordinode-vector` and `numpy` as deps on coordinode-embedded.
- New `Hnsw` PyO3 class with fit / set_ef / knn_query, mirroring the
  hnswlib BaseANN surface so the ann-benchmarks adapter is a one-line
  wrapper around it.
- Numpy-native I/O: fit takes `np.ndarray[float32, 2D]`, knn_query
  returns `np.ndarray[int64, 1D]`. Zero per-row Python-side conversion
  on the query hot path.
- GIL released around the build and search calls.
- Internal IDs auto-assigned sequentially; fit returns the (start, end)
  range so callers can map their own labels.
- Build uses per-item `insert` rather than `insert_batch`. The batch
  path's known recall divergence vs serial (engine parity bar is 0.7
  top-10 agreement) is unacceptable for fair library-tier comparisons;
  a `fit_fast` opt-in can land later if a real workload needs the
  build-throughput trade.

Bumps the `coordinode-rs` submodule to current main (engine HEAD has
`HnswIndex::insert_batch` and `set_ef_search` which earlier pinned SHAs
did not yet have, plus the f763f86 saturation-path back-edge fix).

Tests at tests/unit/test_hnsw.py cover metric parsing, dimension
validation, ID-range bookkeeping, and a recall@10 bar at 10K vectors
(≥ 0.95 at ef=50, ≥ 0.99 at ef=200).

Closes #69
- hnsw.rs: empty-batch fit() returns (next_id, next_id) instead of
  (0, 0).  Previous early-return broke the documented "contiguous
  range at the actual insertion point" contract once the index already
  held vectors — every subsequent fit() call would see a phantom (0, 0)
  origin from an upstream empty batch.

- test_hnsw.py: add explicit alias test for "inner_product" alongside
  "dot".  Locks the parser against a future rename silently dropping
  the alias.

- pyproject.toml: declare numpy as a runtime dependency. The compiled
  extension links the Rust `numpy` crate which calls into the NumPy
  C-API on first PyArray operation; without NumPy installed the wheel
  panics on first call.  Lower bound 1.26 is the oldest release shipping
  matching ABI headers for cp311-314.

- lib.rs: comment documenting why there is no explicit numpy
  initialisation in the pymodule fn.  The numpy 0.23 crate auto-imports
  numpy.core lazily; it exposes no public init function to call.
- hnsw.rs: add `module = "coordinode_embedded"` to the `#[pyclass]`
  attribute so the class reports its module the same way LocalClient
  does (the public package, not the inner extension module). Affects
  `__module__`, generated docs, and pickling.

- hnsw.rs: extend the unknown-metric error string to list every
  accepted spelling (cosine, angular, euclidean, l2, dot, dot_product,
  ip, inner_product, manhattan, l1) instead of a partial list. Pulled
  into a module-level ACCEPTED_METRICS constant kept next to the parser
  so the two cannot drift.

- hnsw.rs: the string-representation method now emits a poison marker
  when the next_id mutex is poisoned instead of pretending the index
  is empty. The length accessor already surfaces the same condition as
  a RuntimeError; this brings the string-representation into the same
  shape so a debugger sees the failure deterministically.

- pyproject.toml: rewrite the numpy-dependency note to match the actual
  import behaviour. The `numpy` crate imports `numpy.core` lazily on
  the first PyArray call, not eagerly at extension import time. Aligns
  the wording with the matching note in lib.rs so the two do not
  contradict each other.
@polaz polaz force-pushed the feat/#69-hnsw-native-binding branch from a23c983 to a0eb616 Compare May 22, 2026 22:08
- hnsw.rs: `__len__` now reads the count from the underlying HnswIndex
  instead of `next_id`. The previous derivation could observe phantom
  IDs under concurrent fit/__len__ — `fit` bumps next_id under its own
  mutex, releases it, then acquires the index lock for the actual
  inserts (with the GIL released around the build). A concurrent
  `__len__` between those two windows would see the bumped counter
  before the inserts landed. Locking `inner` makes the count reflect
  committed inserts only.

- hnsw.rs: `__repr__` now reports the same len source plus a `<busy>`
  marker when the index lock is contended (via `try_lock`). Stops a
  debug REPL from blocking on a concurrent build.

- hnsw.rs: docstring for the `metric` constructor argument now lists
  every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/
  ip/inner_product, manhattan/l1) instead of a partial list. The error
  message and the parser were already exhaustive; the docstring just
  needed to catch up.

- tests/unit/test_hnsw.py: the brute-force top-k helper uses
  `np.argpartition` (O(N)) instead of `np.argsort` (O(N log N)) — only
  the SET of nearest k matters for the recall metric, not the order
  inside it.

- tests/unit/test_hnsw.py: the recall test allocates the 10K × 16 float
  matrix directly as float32 via `standard_normal(dtype=np.float32)`
  instead of allocating float64 then `.astype(np.float32)`. Halves the
  peak memory for that test.
Copilot AI review requested due to automatic review settings May 22, 2026 22:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Comment thread tests/unit/test_hnsw.py Outdated
Comment thread coordinode-embedded/src/lib.rs Outdated
…artition pivot

- src/lib.rs: the explanation of why the pymodule fn does not call a
  NumPy initializer no longer mentions a specific numpy crate line.
  The numpy crate has held the same lazy-import policy across 0.23
  and 0.24, and that note was already stale when the crate dependency
  moved to 0.24 in the previous round.

- tests/unit/test_hnsw.py: `_brute_force_topk` gains an inline note
  documenting numpy's argpartition pivot semantics — the `k` argument
  is a pivot index, not an off-by-one. Verified empirically against
  np.argsort on random vectors. Heads off the recurring "should be
  k-1" review suggestion.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Comment thread coordinode-embedded/src/hnsw.rs
`m_max0: M * 2` could panic on debug builds or wrap on release builds for
adversarial M values at or above usize::MAX / 2 + 1. The realistic M range
for ann-benchmarks workloads is 4..96, so any overflow at the binding
boundary is caller error rather than a workload to support. Replace the
plain multiplication with `M.checked_mul(2)` and surface the overflow as
a ValueError before the engine sees a broken HnswConfig.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Comment thread coordinode-embedded/python/coordinode_embedded/_coordinode_embedded.pyi Outdated
Comment thread tests/unit/test_hnsw.py Outdated
…ion pivot

- _coordinode_embedded.pyi: type stub docstring for `metric` now lists
  every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/
  ip/inner_product, manhattan/l1). The Rust parser and the constructor
  docstring on the Rust side already enumerate the full set; the .pyi
  stub had a stale subset.

- tests/unit/test_hnsw.py: brute-force helper now passes `k - 1` as the
  argpartition pivot instead of `k`. Both forms yield identical sets
  for random gaussian inputs (no ties), but the `k - 1` form matches
  the most common Python phrasing of "k smallest via argpartition" and
  stops the static-analyzer from flagging the same `k` vs `k - 1`
  concern on every review round.
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Comment thread tests/unit/test_hnsw.py
Comment on lines +4 to +13
via ``maturin develop``. Exercised by the ``build-embedded`` CI job after
the wheel is installed.
"""

from __future__ import annotations

import pytest

np = pytest.importorskip("numpy")
ce = pytest.importorskip("coordinode_embedded")
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add native Hnsw class to coordinode-embedded for ann-benchmarks integration

2 participants