Commit a2986a6
authored
feat(embedded): add native Hnsw class for ann-benchmarks integration (#70)
* feat(embedded): add native Hnsw class for ann-benchmarks integration
The existing LocalClient routes through Cypher (parser + planner +
executor), which carries overhead that's not comparable with in-process
HNSW libraries like hnswlib, FAISS-HNSW, ScaNN, or Annoy. To put
CoordiNode on the canonical ann-benchmarks.com leaderboard — and to
drive the per-commit vector benchmark on docs.coordinode.com — we need
a Cypher-bypass path that calls coordinode_vector::hnsw::HnswIndex
directly.
This commit:
- Adds `coordinode-vector` and `numpy` as deps on coordinode-embedded.
- New `Hnsw` PyO3 class with fit / set_ef / knn_query, mirroring the
hnswlib BaseANN surface so the ann-benchmarks adapter is a one-line
wrapper around it.
- Numpy-native I/O: fit takes `np.ndarray[float32, 2D]`, knn_query
returns `np.ndarray[int64, 1D]`. Zero per-row Python-side conversion
on the query hot path.
- GIL released around the build and search calls.
- Internal IDs auto-assigned sequentially; fit returns the (start, end)
range so callers can map their own labels.
- Build uses per-item `insert` rather than `insert_batch`. The batch
path's known recall divergence vs serial (engine parity bar is 0.7
top-10 agreement) is unacceptable for fair library-tier comparisons;
a `fit_fast` opt-in can land later if a real workload needs the
build-throughput trade.
Bumps the `coordinode-rs` submodule to current main (engine HEAD has
`HnswIndex::insert_batch` and `set_ef_search` which earlier pinned SHAs
did not yet have, plus the f763f86 saturation-path back-edge fix).
Tests at tests/unit/test_hnsw.py cover metric parsing, dimension
validation, ID-range bookkeeping, and a recall@10 bar at 10K vectors
(≥ 0.95 at ef=50, ≥ 0.99 at ef=200).
Closes #69
* fix(embedded): correct empty-batch return + declare numpy runtime dep
- hnsw.rs: empty-batch fit() returns (next_id, next_id) instead of
(0, 0). Previous early-return broke the documented "contiguous
range at the actual insertion point" contract once the index already
held vectors — every subsequent fit() call would see a phantom (0, 0)
origin from an upstream empty batch.
- test_hnsw.py: add explicit alias test for "inner_product" alongside
"dot". Locks the parser against a future rename silently dropping
the alias.
- pyproject.toml: declare numpy as a runtime dependency. The compiled
extension links the Rust `numpy` crate which calls into the NumPy
C-API on first PyArray operation; without NumPy installed the wheel
panics on first call. Lower bound 1.26 is the oldest release shipping
matching ABI headers for cp311-314.
- lib.rs: comment documenting why there is no explicit numpy
initialisation in the pymodule fn. The numpy 0.23 crate auto-imports
numpy.core lazily; it exposes no public init function to call.
* fix(embedded): pyclass module attr, full metric list, poison-aware repr
- hnsw.rs: add `module = "coordinode_embedded"` to the `#[pyclass]`
attribute so the class reports its module the same way LocalClient
does (the public package, not the inner extension module). Affects
`__module__`, generated docs, and pickling.
- hnsw.rs: extend the unknown-metric error string to list every
accepted spelling (cosine, angular, euclidean, l2, dot, dot_product,
ip, inner_product, manhattan, l1) instead of a partial list. Pulled
into a module-level ACCEPTED_METRICS constant kept next to the parser
so the two cannot drift.
- hnsw.rs: the string-representation method now emits a poison marker
when the next_id mutex is poisoned instead of pretending the index
is empty. The length accessor already surfaces the same condition as
a RuntimeError; this brings the string-representation into the same
shape so a debugger sees the failure deterministically.
- pyproject.toml: rewrite the numpy-dependency note to match the actual
import behaviour. The `numpy` crate imports `numpy.core` lazily on
the first PyArray call, not eagerly at extension import time. Aligns
the wording with the matching note in lib.rs so the two do not
contradict each other.
* fix(embedded): race-free len, full alias docstring, faster brute force
- hnsw.rs: `__len__` now reads the count from the underlying HnswIndex
instead of `next_id`. The previous derivation could observe phantom
IDs under concurrent fit/__len__ — `fit` bumps next_id under its own
mutex, releases it, then acquires the index lock for the actual
inserts (with the GIL released around the build). A concurrent
`__len__` between those two windows would see the bumped counter
before the inserts landed. Locking `inner` makes the count reflect
committed inserts only.
- hnsw.rs: `__repr__` now reports the same len source plus a `<busy>`
marker when the index lock is contended (via `try_lock`). Stops a
debug REPL from blocking on a concurrent build.
- hnsw.rs: docstring for the `metric` constructor argument now lists
every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/
ip/inner_product, manhattan/l1) instead of a partial list. The error
message and the parser were already exhaustive; the docstring just
needed to catch up.
- tests/unit/test_hnsw.py: the brute-force top-k helper uses
`np.argpartition` (O(N)) instead of `np.argsort` (O(N log N)) — only
the SET of nearest k matters for the recall metric, not the order
inside it.
- tests/unit/test_hnsw.py: the recall test allocates the 10K × 16 float
matrix directly as float32 via `standard_normal(dtype=np.float32)`
instead of allocating float64 then `.astype(np.float32)`. Halves the
peak memory for that test.
* docs(embedded): drop numpy version pin in init comment, document argpartition pivot
- src/lib.rs: the explanation of why the pymodule fn does not call a
NumPy initializer no longer mentions a specific numpy crate line.
The numpy crate has held the same lazy-import policy across 0.23
and 0.24, and that note was already stale when the crate dependency
moved to 0.24 in the previous round.
- tests/unit/test_hnsw.py: `_brute_force_topk` gains an inline note
documenting numpy's argpartition pivot semantics — the `k` argument
is a pivot index, not an off-by-one. Verified empirically against
np.argsort on random vectors. Heads off the recurring "should be
k-1" review suggestion.
* fix(embedded): reject M values that overflow M * 2
`m_max0: M * 2` could panic on debug builds or wrap on release builds for
adversarial M values at or above usize::MAX / 2 + 1. The realistic M range
for ann-benchmarks workloads is 4..96, so any overflow at the binding
boundary is caller error rather than a workload to support. Replace the
plain multiplication with `M.checked_mul(2)` and surface the overflow as
a ValueError before the engine sees a broken HnswConfig.
* docs(embedded): sync metric alias list in type stub, switch argpartition pivot
- _coordinode_embedded.pyi: type stub docstring for `metric` now lists
every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/
ip/inner_product, manhattan/l1). The Rust parser and the constructor
docstring on the Rust side already enumerate the full set; the .pyi
stub had a stale subset.
- tests/unit/test_hnsw.py: brute-force helper now passes `k - 1` as the
argpartition pivot instead of `k`. Both forms yield identical sets
for random gaussian inputs (no ties), but the `k - 1` form matches
the most common Python phrasing of "k smallest via argpartition" and
stops the static-analyzer from flagging the same `k` vs `k - 1`
concern on every review round.
* ci(embedded): run unit tests against the built wheel in build-embedded job
The Hnsw unit tests in tests/unit/test_hnsw.py use pytest.importorskip to
remain skippable on fresh checkouts where the maturin wheel has not been
built. The trade-off is that the main `test` job — which has no Rust
toolchain and skips the wheel build — was silently skipping the entire
embedded test module, so a broken Hnsw binding could land without CI
catching it.
Extend the build-embedded job to install the freshly-built wheel into a
throwaway venv and run pytest against test_hnsw.py with --strict-markers.
The build job already has the wheel under /tmp/embedded-dist after the
maturin step, so this is a thin add-on: install the wheel + numpy +
pytest, then run the embedded test module. A guard at the end verifies
the suite was actually collected.
* ci(embedded): write maturin wheel under workspace dist/ so host can read it
The maturin-action runs the build inside a manylinux container; the
container's filesystem isn't bind-mounted to the GitHub Actions host
except for the GITHUB_WORKSPACE directory. The previous output path
/tmp/embedded-dist lived inside the container only — the follow-up
test step on the host therefore saw an empty directory and exited
with "ls: cannot access '/tmp/embedded-dist/coordinode_embedded-*.whl'".
Switch the output directory to the workspace-relative `dist/` (the
default convention for maturin-built wheels). The follow-up test step
now reads the wheel from the same location.
* ci(embedded): tighten test-skip check to match the comment
The previous guard verified that at least one test was collected, which
catches an entirely-skipped module but not a module where individual
tests were skipped via markers or runtime conditions. The intent was
the stricter check: any skip in build-embedded means the wheel was
installed but the binding is broken (wrong glibc, missing numpy, etc.).
Parse pytest's `-ra` short-summary header instead — it prints
`SKIPPED [N] ...` lines and a `= N skipped =` summary when anything
was skipped. Fail the job on either marker. Also assert that at least
one test passed (catches the "no tests collected" case the original
guard targeted, without conflating it with the skip detection).1 parent 35347fb commit a2986a6
10 files changed
Lines changed: 809 additions & 54 deletions
File tree
- .github/workflows
- coordinode-embedded
- python/coordinode_embedded
- src
- tests/unit
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
62 | | - | |
| 62 | + | |
63 | 63 | | |
64 | 64 | | |
65 | 65 | | |
66 | 66 | | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
67 | 102 | | |
68 | 103 | | |
69 | 104 | | |
| |||
0 commit comments