feat(arangodb): add ArangoDocumentStore and ArangoEmbeddingRetriever#3340
feat(arangodb): add ArangoDocumentStore and ArangoEmbeddingRetriever#3340SyedShahmeerAli12 wants to merge 2 commits into
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
@bogdankostic All 26 unit tests pass and the integration test was verified locally against ArangoDB via Docker. |
bogdankostic
left a comment
There was a problem hiding this comment.
Thank you for the PR @SyedShahmeerAli12!
I left a couple of inline comments. Apart from those, this integration could be further improved by adding async counterpart to write_documents etc.m python-arango ships an async client. Moreover, we could add an AranfoBM25Retriever to support text search.
| *, | ||
| host: str = "http://localhost:8529", | ||
| database: str = "haystack", | ||
| username: str = "root", |
There was a problem hiding this comment.
Let's make username also a Secret that can be set via an environment variable.
| written += 1 | ||
| continue | ||
| else: | ||
| col.insert(arango_doc) |
There was a problem hiding this comment.
Could we make use here of col.import_bulk or col.insert_many which should be optimized for writing many document at once?
| col = cast(StandardCollection, self._col) | ||
| for doc_id in document_ids: | ||
| if col.has(doc_id): | ||
| col.delete(doc_id) |
| FOR doc IN {self.collection_name} | ||
| FILTER doc.embedding != null | ||
| {filter_clause} | ||
| LET score = COSINE_SIMILARITY(doc.embedding, @query_vec) |
There was a problem hiding this comment.
Should we use the approximate functions mentioned here? They are probably optimized for vector sector. Also, it would be nice to support dot product distance as well.
| if "operator" in node and "conditions" in node: | ||
| op = node["operator"].upper() | ||
| parts = [_parse_filter(c, bind_vars, counter) for c in node["conditions"]] | ||
| joiner = " AND " if op == "AND" else " OR " | ||
| inner = joiner.join(parts) | ||
| return f"({inner})" |
There was a problem hiding this comment.
A NOT filter like {"operator": "NOT", "conditions": [A, B]} silently produces (A OR B) instead of NOT (A AND B).
| reason="Export ARANGO_HOST and ARANGO_PASSWORD to run integration tests.", | ||
| ) | ||
| @pytest.mark.integration | ||
| class TestArangoDocumentStoreIntegration: |
There was a problem hiding this comment.
Let's use our DocumentStoreTestMixins for the integration tests
| readme = "README.md" | ||
| requires-python = ">=3.10" | ||
| license = "Apache-2.0" | ||
| keywords = [] |
There was a problem hiding this comment.
Let's add some relevant keywords.
| all = 'pytest {args:tests}' | ||
| unit-cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x -m "not integration" {args:tests}' | ||
| integration-cov-append-retry = 'pytest --cov=haystack_integrations --cov-append --reruns 3 --reruns-delay 30 -x -m "integration" {args:tests}' | ||
| types = "mypy -p haystack_integrations.document_stores.arangodb {args}" |
There was a problem hiding this comment.
Let's add the retriever here.
| :param top_k: Maximum number of documents to return. | ||
| :param filters: Optional Haystack metadata filters applied at retrieval time. | ||
| """ | ||
| self.document_store = document_store |
There was a problem hiding this comment.
Let's validate that document_store is an ArangoDocumentStore.
Related Issues
Proposed Changes:
Adds a new
arangodb-haystackintegration with:ArangoDocumentStorefull Haystack DocumentStore backed by ArangoDB, supportingwrite_documents,filter_documents,delete_documents,count_documents, and vector similarity retrieval via AQLCOSINE_SIMILARITY(requires ArangoDB 3.12+)ArangoEmbeddingRetriever@componentwrapper that calls_embedding_retrievalon the document storefilters.pytranslates Haystack metadata filter dicts to AQL FILTER expressionspython-arangoclient; password stored as a HaystackSecretto_dict/from_dictHow did you test it?
docker run -p 8529:8529 -e ARANGO_ROOT_PASSWORD=testpw arangodb:latest): writes 2 documents with embeddings, retrieves by cosine similarity, deletes one, asserts countNotes for the reviewer
COSINE_SIMILARITY(doc.embedding, @query_vec)requires ArangoDB 3.12+; documented in the docstringpython-arango>=8.0.0added as the only new dependencyARANGO_HOSTandARANGO_PASSWORDenv varsChecklist
feat: