-
Notifications
You must be signed in to change notification settings - Fork 204
feat: Add operations to PineConeDocumentStore #2772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add operations to PineConeDocumentStore #2772
Conversation
761bd89 to
c0e6bbd
Compare
b96b0d3 to
cbcb796
Compare
| documents = await self.filter_documents_async(filters=filters) | ||
|
|
||
| result = {} | ||
| for field in metadata_fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should refactor the duplicated logic in both sync and async functions, avoiding duplicate code. That logic can be extracted into helper functions, which can then be used by both synchronous and asynchronous methods.
| # Collect all field values to infer types accurately | ||
| field_samples: dict[str, set[str]] = {} | ||
|
|
||
| for doc in documents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above regarding duplicated logic.
|
|
||
| def get_metadata_field_min_max(self, metadata_field: str) -> dict[str, Any]: | ||
| """ | ||
| Returns the minimum and maximum values for a numeric metadata field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also work for other types: bool and keyword
davidsbatista
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution @GauravBansal29! I've left some initial comments, mostly regarding the duplicated code logic between sync and async versions, and considering all possible metadata types in the min_max operation.
Related Issues
PineConeDocumentStore#2641Proposed Changes:
Implemented five new metadata query and filtering operations for
PineconeDocumentStore:Methods Added:
count_documents_by_filter(filters)- Count documents matching filter criteriacount_unique_metadata_by_filter(filters, metadata_fields)- Count unique values across specified metadata fieldsget_metadata_fields_info()- Infer metadata field types by sampling documentsget_metadata_field_min_max(metadata_field)- Get min/max values for numeric metadata fieldsget_metadata_field_unique_values(metadata_field, search_term, from_, size)- Retrieve unique values with pagination and search supportAll methods include both synchronous and asynchronous versions (e.g.,
count_documents_by_filter_async).Implementation Details:
TOP_K_LIMITof 1000) and perform aggregations in Pythonget_metadata_fields_info()samples documents and infers types with the following mappings:bool→"boolean"int/float→"long"str→"keyword"listelements → inferred from first element type"keyword"with warningget_metadata_field_unique_values()supportsfrom_andsizeparametersTOP_K_LIMIT, indicating potential data incompletenessDesign Decisions:
filter_documents()method to leverage Pinecone's filtering capabilitiesboolfirst (beforeint) to avoid false positives sinceboolis a subclass ofintin PythonHow did you test it?
Integration Tests:
9 sync tests in
test_document_store.py:5 async tests in
test_document_store_async.py:All tests run against a real Pinecone instance (marked with
@pytest.mark.integration) and passed successfully in CI.Test Coverage:
Notes for the reviewer
Client-side limitations: Due to Pinecone's
TOP_K_LIMITof 1000 documents, these methods may return incomplete results for large collections. Warning logs inform users when this limit is reached.Type inference approach:
get_metadata_fields_info()collects all observed types for each field across sampled documents and detects inconsistencies. If mixed types are found, it logs a warning and defaults to"keyword".No
query_sqlmethod: Unlike some document stores, Pinecone doesn't support SQL queries, so this optional method from the issue was not implemented.Consistency with codebase:
Checklist
feat:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.