Skip to content

Commit a2986a6

Browse files
authored
feat(embedded): add native Hnsw class for ann-benchmarks integration (#70)
* feat(embedded): add native Hnsw class for ann-benchmarks integration The existing LocalClient routes through Cypher (parser + planner + executor), which carries overhead that's not comparable with in-process HNSW libraries like hnswlib, FAISS-HNSW, ScaNN, or Annoy. To put CoordiNode on the canonical ann-benchmarks.com leaderboard — and to drive the per-commit vector benchmark on docs.coordinode.com — we need a Cypher-bypass path that calls coordinode_vector::hnsw::HnswIndex directly. This commit: - Adds `coordinode-vector` and `numpy` as deps on coordinode-embedded. - New `Hnsw` PyO3 class with fit / set_ef / knn_query, mirroring the hnswlib BaseANN surface so the ann-benchmarks adapter is a one-line wrapper around it. - Numpy-native I/O: fit takes `np.ndarray[float32, 2D]`, knn_query returns `np.ndarray[int64, 1D]`. Zero per-row Python-side conversion on the query hot path. - GIL released around the build and search calls. - Internal IDs auto-assigned sequentially; fit returns the (start, end) range so callers can map their own labels. - Build uses per-item `insert` rather than `insert_batch`. The batch path's known recall divergence vs serial (engine parity bar is 0.7 top-10 agreement) is unacceptable for fair library-tier comparisons; a `fit_fast` opt-in can land later if a real workload needs the build-throughput trade. Bumps the `coordinode-rs` submodule to current main (engine HEAD has `HnswIndex::insert_batch` and `set_ef_search` which earlier pinned SHAs did not yet have, plus the f763f86 saturation-path back-edge fix). Tests at tests/unit/test_hnsw.py cover metric parsing, dimension validation, ID-range bookkeeping, and a recall@10 bar at 10K vectors (≥ 0.95 at ef=50, ≥ 0.99 at ef=200). Closes #69 * fix(embedded): correct empty-batch return + declare numpy runtime dep - hnsw.rs: empty-batch fit() returns (next_id, next_id) instead of (0, 0). Previous early-return broke the documented "contiguous range at the actual insertion point" contract once the index already held vectors — every subsequent fit() call would see a phantom (0, 0) origin from an upstream empty batch. - test_hnsw.py: add explicit alias test for "inner_product" alongside "dot". Locks the parser against a future rename silently dropping the alias. - pyproject.toml: declare numpy as a runtime dependency. The compiled extension links the Rust `numpy` crate which calls into the NumPy C-API on first PyArray operation; without NumPy installed the wheel panics on first call. Lower bound 1.26 is the oldest release shipping matching ABI headers for cp311-314. - lib.rs: comment documenting why there is no explicit numpy initialisation in the pymodule fn. The numpy 0.23 crate auto-imports numpy.core lazily; it exposes no public init function to call. * fix(embedded): pyclass module attr, full metric list, poison-aware repr - hnsw.rs: add `module = "coordinode_embedded"` to the `#[pyclass]` attribute so the class reports its module the same way LocalClient does (the public package, not the inner extension module). Affects `__module__`, generated docs, and pickling. - hnsw.rs: extend the unknown-metric error string to list every accepted spelling (cosine, angular, euclidean, l2, dot, dot_product, ip, inner_product, manhattan, l1) instead of a partial list. Pulled into a module-level ACCEPTED_METRICS constant kept next to the parser so the two cannot drift. - hnsw.rs: the string-representation method now emits a poison marker when the next_id mutex is poisoned instead of pretending the index is empty. The length accessor already surfaces the same condition as a RuntimeError; this brings the string-representation into the same shape so a debugger sees the failure deterministically. - pyproject.toml: rewrite the numpy-dependency note to match the actual import behaviour. The `numpy` crate imports `numpy.core` lazily on the first PyArray call, not eagerly at extension import time. Aligns the wording with the matching note in lib.rs so the two do not contradict each other. * fix(embedded): race-free len, full alias docstring, faster brute force - hnsw.rs: `__len__` now reads the count from the underlying HnswIndex instead of `next_id`. The previous derivation could observe phantom IDs under concurrent fit/__len__ — `fit` bumps next_id under its own mutex, releases it, then acquires the index lock for the actual inserts (with the GIL released around the build). A concurrent `__len__` between those two windows would see the bumped counter before the inserts landed. Locking `inner` makes the count reflect committed inserts only. - hnsw.rs: `__repr__` now reports the same len source plus a `<busy>` marker when the index lock is contended (via `try_lock`). Stops a debug REPL from blocking on a concurrent build. - hnsw.rs: docstring for the `metric` constructor argument now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1) instead of a partial list. The error message and the parser were already exhaustive; the docstring just needed to catch up. - tests/unit/test_hnsw.py: the brute-force top-k helper uses `np.argpartition` (O(N)) instead of `np.argsort` (O(N log N)) — only the SET of nearest k matters for the recall metric, not the order inside it. - tests/unit/test_hnsw.py: the recall test allocates the 10K × 16 float matrix directly as float32 via `standard_normal(dtype=np.float32)` instead of allocating float64 then `.astype(np.float32)`. Halves the peak memory for that test. * docs(embedded): drop numpy version pin in init comment, document argpartition pivot - src/lib.rs: the explanation of why the pymodule fn does not call a NumPy initializer no longer mentions a specific numpy crate line. The numpy crate has held the same lazy-import policy across 0.23 and 0.24, and that note was already stale when the crate dependency moved to 0.24 in the previous round. - tests/unit/test_hnsw.py: `_brute_force_topk` gains an inline note documenting numpy's argpartition pivot semantics — the `k` argument is a pivot index, not an off-by-one. Verified empirically against np.argsort on random vectors. Heads off the recurring "should be k-1" review suggestion. * fix(embedded): reject M values that overflow M * 2 `m_max0: M * 2` could panic on debug builds or wrap on release builds for adversarial M values at or above usize::MAX / 2 + 1. The realistic M range for ann-benchmarks workloads is 4..96, so any overflow at the binding boundary is caller error rather than a workload to support. Replace the plain multiplication with `M.checked_mul(2)` and surface the overflow as a ValueError before the engine sees a broken HnswConfig. * docs(embedded): sync metric alias list in type stub, switch argpartition pivot - _coordinode_embedded.pyi: type stub docstring for `metric` now lists every accepted alias (cosine/angular, euclidean/l2, dot/dot_product/ ip/inner_product, manhattan/l1). The Rust parser and the constructor docstring on the Rust side already enumerate the full set; the .pyi stub had a stale subset. - tests/unit/test_hnsw.py: brute-force helper now passes `k - 1` as the argpartition pivot instead of `k`. Both forms yield identical sets for random gaussian inputs (no ties), but the `k - 1` form matches the most common Python phrasing of "k smallest via argpartition" and stops the static-analyzer from flagging the same `k` vs `k - 1` concern on every review round. * ci(embedded): run unit tests against the built wheel in build-embedded job The Hnsw unit tests in tests/unit/test_hnsw.py use pytest.importorskip to remain skippable on fresh checkouts where the maturin wheel has not been built. The trade-off is that the main `test` job — which has no Rust toolchain and skips the wheel build — was silently skipping the entire embedded test module, so a broken Hnsw binding could land without CI catching it. Extend the build-embedded job to install the freshly-built wheel into a throwaway venv and run pytest against test_hnsw.py with --strict-markers. The build job already has the wheel under /tmp/embedded-dist after the maturin step, so this is a thin add-on: install the wheel + numpy + pytest, then run the embedded test module. A guard at the end verifies the suite was actually collected. * ci(embedded): write maturin wheel under workspace dist/ so host can read it The maturin-action runs the build inside a manylinux container; the container's filesystem isn't bind-mounted to the GitHub Actions host except for the GITHUB_WORKSPACE directory. The previous output path /tmp/embedded-dist lived inside the container only — the follow-up test step on the host therefore saw an empty directory and exited with "ls: cannot access '/tmp/embedded-dist/coordinode_embedded-*.whl'". Switch the output directory to the workspace-relative `dist/` (the default convention for maturin-built wheels). The follow-up test step now reads the wheel from the same location. * ci(embedded): tighten test-skip check to match the comment The previous guard verified that at least one test was collected, which catches an entirely-skipped module but not a module where individual tests were skipped via markers or runtime conditions. The intent was the stricter check: any skip in build-embedded means the wheel was installed but the binding is broken (wrong glibc, missing numpy, etc.). Parse pytest's `-ra` short-summary header instead — it prints `SKIPPED [N] ...` lines and a `= N skipped =` summary when anything was skipped. Fail the job on either marker. Also assert that at least one test passed (catches the "no tests collected" case the original guard targeted, without conflating it with the skip detection).
1 parent 35347fb commit a2986a6

10 files changed

Lines changed: 809 additions & 54 deletions

File tree

.github/workflows/ci.yml

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,11 +59,46 @@ jobs:
5959
command: build
6060
args: >-
6161
--manifest-path coordinode-embedded/Cargo.toml
62-
--out /tmp/embedded-dist
62+
--out dist
6363
manylinux: manylinux_2_28
6464
before-script-linux: |
6565
dnf install -y protobuf-compiler
6666
67+
- uses: astral-sh/setup-uv@38f3f104447c67c051c4a08e39b64a148898af3a # v4
68+
with:
69+
python-version: "3.12"
70+
71+
- name: Install wheel + run embedded tests
72+
# The main `test` job skips coordinode_embedded with importorskip
73+
# because that runner doesn't build the wheel. This job has the
74+
# wheel — install it and run the embedded tests here, then assert
75+
# nothing was skipped (a skip with the wheel installed means a
76+
# broken manylinux glibc / missing numpy / etc., not the expected
77+
# "no wheel" condition).
78+
# maturin-action runs inside a manylinux container; only paths
79+
# under $GITHUB_WORKSPACE are bind-mounted back to the host, so
80+
# the wheel goes to ./dist/ (workspace-relative), not /tmp/.
81+
run: |
82+
set -euo pipefail
83+
WHL=$(ls dist/coordinode_embedded-*.whl | head -1)
84+
test -n "$WHL"
85+
uv venv --python 3.12 /tmp/test-venv
86+
uv pip install --python /tmp/test-venv/bin/python "$WHL" numpy pytest pytest-timeout
87+
OUTPUT=$(/tmp/test-venv/bin/python -m pytest tests/unit/test_hnsw.py -v --strict-markers -ra 2>&1)
88+
echo "$OUTPUT"
89+
# `-ra` prints a "SKIPPED [N]" short summary header when anything was
90+
# skipped; importorskip-based skip also surfaces this way. Use the
91+
# full output instead of the exit code because pytest treats skips
92+
# as success.
93+
if echo "$OUTPUT" | grep -qE '^SKIPPED|=+ .* skipped'; then
94+
echo "::error::Tests were skipped in build-embedded; the wheel is installed so this is a real failure (broken glibc / missing numpy / etc.)"
95+
exit 1
96+
fi
97+
if ! echo "$OUTPUT" | grep -qE '=+ [0-9]+ passed'; then
98+
echo "::error::No tests passed — the module was likely not collected"
99+
exit 1
100+
fi
101+
67102
test-integration:
68103
name: Integration tests
69104
runs-on: ubuntu-latest

0 commit comments

Comments
 (0)