
feat(arangodb): add ArangoDocumentStore and ArangoEmbeddingRetriever#3340

Open
SyedShahmeerAli12 wants to merge 2 commits into deepset-ai:main from SyedShahmeerAli12:feat/arangodb-document-store

Conversation

SyedShahmeerAli12 commented May 21, 2026

Related Issues

Proposed Changes:

Adds a new arangodb-haystack integration with:

  • ArangoDocumentStore: a full Haystack DocumentStore backed by ArangoDB, supporting write_documents, filter_documents, delete_documents, count_documents, and vector similarity retrieval via AQL COSINE_SIMILARITY (requires ArangoDB 3.12+)
  • ArangoEmbeddingRetriever: a @component wrapper that calls _embedding_retrieval on the document store
  • filters.py: translates Haystack metadata filter dicts to AQL FILTER expressions
  • Auth via python-arango client; password stored as a Haystack Secret
  • Serialization/deserialization with to_dict / from_dict

How did you test it?

  • 26 unit tests (fully mocked) covering init, serialization round-trip, write/filter/delete/count, duplicate policies, and embedding retrieval
  • 1 integration test verified against a live ArangoDB instance running locally via Docker (docker run -p 8529:8529 -e ARANGO_ROOT_PASSWORD=testpw arangodb:latest): writes 2 documents with embeddings, retrieves by cosine similarity, deletes one, asserts count

Notes for the reviewer

  • AQL vector search uses COSINE_SIMILARITY(doc.embedding, @query_vec), which requires ArangoDB 3.12+; this is documented in the docstring
  • python-arango>=8.0.0 added as the only new dependency
  • Integration test is gated on ARANGO_HOST and ARANGO_PASSWORD env vars

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: feat:

SyedShahmeerAli12 requested a review from a team as a code owner May 21, 2026 14:45
SyedShahmeerAli12 requested review from bogdankostic and removed request for a team May 21, 2026 14:45
github-actions bot added the topic:CI and type:documentation labels May 21, 2026
socket-security Bot commented May 21, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

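| Diff  | Package             | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|-------|---------------------|-----------------------|---------------|---------|-------------|---------|
| Added | python-arango@8.3.2 | 99                    | 100           | 100     | 100         | 100     |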

@SyedShahmeerAli12 (Contributor, Author)

@bogdankostic All 26 unit tests pass and the integration test was verified locally against ArangoDB via Docker.

@bogdankostic bogdankostic left a comment

Thank you for the PR @SyedShahmeerAli12!

I left a couple of inline comments. Apart from those, this integration could be further improved by adding async counterparts to write_documents etc.; python-arango ships an async client. Moreover, we could add an ArangoBM25Retriever to support text search.

*,
host: str = "http://localhost:8529",
database: str = "haystack",
username: str = "root",

Let's make username also a Secret that can be set via an environment variable.

        written += 1
        continue
    else:
        col.insert(arango_doc)

Could we use col.import_bulk or col.insert_many here? Both should be optimized for writing many documents at once.
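A minimal sketch of what a batched write could look like, assuming `col` is a python-arango collection whose `insert_many` returns one entry per document, with failed inserts reported as exception objects in that list. The helper names, the `batch_size` default, and the `FakeCollection` stand-in are hypothetical and only there so the snippet runs without a server; duplicate-policy handling is left out.

```python
from typing import Any, Iterator


def _batches(docs: list[dict[str, Any]], size: int) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start : start + size]


def write_in_bulk(col: Any, arango_docs: list[dict[str, Any]], batch_size: int = 500) -> int:
    """Insert documents batch by batch and return how many were written."""
    written = 0
    for batch in _batches(arango_docs, batch_size):
        results = col.insert_many(batch)
        # Assumption: per-document failures come back as exception objects
        # in the result list instead of being raised.
        written += sum(1 for r in results if not isinstance(r, Exception))
    return written


class FakeCollection:
    """Hypothetical stand-in for a python-arango collection."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def insert_many(self, documents: list[dict[str, Any]]) -> list[dict[str, Any]]:
        self.batch_sizes.append(len(documents))
        return [{"_key": d["_key"]} for d in documents]


fake = FakeCollection()
count = write_in_bulk(fake, [{"_key": str(i)} for i in range(5)], batch_size=2)
```

With five documents and a batch size of two, the stand-in sees three calls instead of five single inserts.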

col = cast(StandardCollection, self._col)
for doc_id in document_ids:
    if col.has(doc_id):
        col.delete(doc_id)

Can we use col.delete_many here?
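As a rough sketch, the loop could collapse into a single call, assuming `delete_many` accepts a list of document keys and reports missing documents as exception objects in its result list rather than raising. That would also remove the extra `col.has` round trip per document. `delete_in_bulk` and `FakeDeleteCollection` are hypothetical names used only for illustration.

```python
from typing import Any


def delete_in_bulk(col: Any, document_ids: list[str]) -> int:
    """Delete all given IDs in one request and return how many were removed."""
    if not document_ids:
        return 0
    results = col.delete_many(document_ids)
    # Assumption: IDs that do not exist show up as exception objects here.
    return sum(1 for r in results if not isinstance(r, Exception))


class FakeDeleteCollection:
    """Hypothetical stand-in for a python-arango collection."""

    def __init__(self, keys: list[str]) -> None:
        self.keys = set(keys)

    def delete_many(self, documents: list[str]) -> list[Any]:
        results: list[Any] = []
        for key in documents:
            if key in self.keys:
                self.keys.remove(key)
                results.append({"_key": key})
            else:
                results.append(KeyError(key))
        return results


fake_col = FakeDeleteCollection(["a", "b", "c"])
deleted = delete_in_bulk(fake_col, ["a", "missing", "b"])
```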

FOR doc IN {self.collection_name}
FILTER doc.embedding != null
{filter_clause}
LET score = COSINE_SIMILARITY(doc.embedding, @query_vec)

Should we use the approximate functions mentioned here? They are probably optimized for vector search. Also, it would be nice to support dot product distance as well.
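For context on the dot product request, here is a small pure-Python illustration of how the two scores relate. It is not the integration's code and does not touch AQL; the function names are hypothetical. Cosine similarity is the dot product divided by the product of the vector lengths, so the two agree when embeddings are normalized to unit length.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product scaled by the lengths of both vectors."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norm == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot_product(a, b) / norm


def normalize(a: Sequence[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot_product(a, a))
    return [x / length for x in a]


raw_dot = dot_product([3.0, 4.0], [4.0, 3.0])
cosine = cosine_similarity([3.0, 4.0], [4.0, 3.0])
normalized_dot = dot_product(normalize([3.0, 4.0]), normalize([4.0, 3.0]))
```

On unnormalized embeddings the two scores can rank documents differently, which is why exposing the metric as an option would matter.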

Comment on lines +35 to +40
if "operator" in node and "conditions" in node:
    op = node["operator"].upper()
    parts = [_parse_filter(c, bind_vars, counter) for c in node["conditions"]]
    joiner = " AND " if op == "AND" else " OR "
    inner = joiner.join(parts)
    return f"({inner})"

A NOT filter like {"operator": "NOT", "conditions": [A, B]} silently produces (A OR B) instead of NOT (A AND B).
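One way the logical branch could handle NOT explicitly is sketched below. This is a simplified, self-contained variant, not the PR's `_parse_filter`: the leaf handling, the `doc.<field>` path, and the `v0`, `v1` bind variable names are assumptions made for illustration. Unknown operators raise instead of silently falling through to OR.

```python
from typing import Any

# Comparison operators this sketch understands; the real module may support more.
_COMPARISONS = {"==", "!=", ">", ">=", "<", "<="}


def parse_filter(node: dict[str, Any], bind_vars: dict[str, Any], counter: list[int]) -> str:
    """Translate a Haystack-style filter dict into an AQL boolean expression."""
    op = node["operator"]
    if "conditions" in node:
        logical = op.upper()
        parts = [parse_filter(c, bind_vars, counter) for c in node["conditions"]]
        if logical == "AND":
            return "(" + " AND ".join(parts) + ")"
        if logical == "OR":
            return "(" + " OR ".join(parts) + ")"
        if logical == "NOT":
            # NOT negates the conjunction of its conditions.
            return "NOT (" + " AND ".join(parts) + ")"
        raise ValueError(f"Unknown logical operator: {op}")
    if op not in _COMPARISONS:
        raise ValueError(f"Unknown comparison operator: {op}")
    # Values go through bind variables; the field path is interpolated as-is here.
    name = f"v{counter[0]}"
    counter[0] += 1
    bind_vars[name] = node["value"]
    return f"doc.{node['field']} {op} @{name}"


bind_vars: dict[str, Any] = {}
aql = parse_filter(
    {
        "operator": "NOT",
        "conditions": [
            {"field": "meta.a", "operator": "==", "value": 1},
            {"field": "meta.b", "operator": ">", "value": 2},
        ],
    },
    bind_vars,
    [0],
)
```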

reason="Export ARANGO_HOST and ARANGO_PASSWORD to run integration tests.",
)
@pytest.mark.integration
class TestArangoDocumentStoreIntegration:

Let's use our DocumentStoreTestMixins for the integration tests

readme = "README.md"
requires-python = ">=3.10"
license = "Apache-2.0"
keywords = []

Let's add some relevant keywords.

all = 'pytest {args:tests}'
unit-cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x -m "not integration" {args:tests}'
integration-cov-append-retry = 'pytest --cov=haystack_integrations --cov-append --reruns 3 --reruns-delay 30 -x -m "integration" {args:tests}'
types = "mypy -p haystack_integrations.document_stores.arangodb {args}"

Let's add the retriever here.

:param top_k: Maximum number of documents to return.
:param filters: Optional Haystack metadata filters applied at retrieval time.
"""
self.document_store = document_store

Let's validate that document_store is an ArangoDocumentStore.
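A sketch of the kind of check meant here. The `ArangoDocumentStore` class below is an empty stand-in so the snippet runs on its own, and the `top_k` default and the error message wording are assumptions rather than the PR's actual values.

```python
from typing import Any, Optional


class ArangoDocumentStore:
    """Empty stand-in for the real document store class."""


class ArangoEmbeddingRetriever:
    def __init__(
        self,
        document_store: Any,
        top_k: int = 10,
        filters: Optional[dict[str, Any]] = None,
    ) -> None:
        # Fail early with a clear message instead of erroring at retrieval time.
        if not isinstance(document_store, ArangoDocumentStore):
            msg = (
                "document_store must be an instance of ArangoDocumentStore, "
                f"got {type(document_store).__name__}"
            )
            raise ValueError(msg)
        self.document_store = document_store
        self.top_k = top_k
        self.filters = filters


retriever = ArangoEmbeddingRetriever(ArangoDocumentStore())
```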

Labels

integration:arangodb, topic:CI, type:documentation

2 participants