@GauravBansal29
Contributor

Related Issues

Proposed Changes:

Implemented five new metadata query and filtering operations for PineconeDocumentStore:

Methods Added:

  1. count_documents_by_filter(filters) - Count documents matching filter criteria
  2. count_unique_metadata_by_filter(filters, metadata_fields) - Count unique values across specified metadata fields
  3. get_metadata_fields_info() - Infer metadata field types by sampling documents
  4. get_metadata_field_min_max(metadata_field) - Get min/max values for numeric metadata fields
  5. get_metadata_field_unique_values(metadata_field, search_term, from_, size) - Retrieve unique values with pagination and search support

All methods include both synchronous and asynchronous versions (e.g., count_documents_by_filter_async).

Implementation Details:

  • Client-side aggregation: Pinecone lacks native aggregation APIs, so all operations fetch documents (up to TOP_K_LIMIT of 1000) and perform aggregations in Python
  • Schema inference: get_metadata_fields_info() samples documents and infers types with the following mappings:
    • bool → "boolean"
    • int/float → "long"
    • str → "keyword"
    • list elements → inferred from first element type
    • Mixed types → defaults to "keyword" with warning
  • Pagination: get_metadata_field_unique_values() supports from_ and size parameters
  • Search filtering: Case-insensitive substring matching for unique value searches
  • Warning logs: Users are informed when hitting Pinecone's TOP_K_LIMIT, indicating potential data incompleteness
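The pagination and search behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `unique_values_page`, the plain-dict input, the sorted ordering, and the `(page, total)` return shape are all assumptions made for the example.

```python
from typing import Any, Optional


def unique_values_page(
    metadata_dicts: list[dict[str, Any]],
    metadata_field: str,
    search_term: Optional[str] = None,
    from_: int = 0,
    size: int = 10,
) -> tuple[list[str], int]:
    """Collect unique values for one field, filter them, and return one page.

    Hypothetical helper: operates on already-fetched metadata dicts.
    """
    seen: set[str] = set()
    for meta in metadata_dicts:
        value = meta.get(metadata_field)
        if value is None:
            continue
        # List-valued metadata contributes each element separately.
        items = value if isinstance(value, list) else [value]
        seen.update(str(item) for item in items)

    # Sort so that pagination is stable across calls (an assumption here).
    values = sorted(seen)
    if search_term:
        needle = search_term.lower()
        # Case-insensitive substring match.
        values = [v for v in values if needle in v.lower()]

    # Return the requested page plus the total number of matches.
    return values[from_ : from_ + size], len(values)
```

Because the aggregation happens in Python on fetched documents, the total reflects only what was retrieved, which is where the TOP_K_LIMIT caveat comes in.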

Design Decisions:

  • Followed OpenSearch document store as reference for output format and type mappings
  • Used existing filter_documents() method to leverage Pinecone's filtering capabilities
  • All methods handle edge cases (empty documents, non-existent fields, non-numeric values)
  • Type inference checks for bool first (before int) to avoid false positives since bool is a subclass of int in Python
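To illustrate why the bool check has to come first, here is a small sketch of the type inference described in this PR. The helper names `infer_value_type` and `infer_field_types`, the plain-dict input, and the `{"field": {"type": ...}}` output shape are assumptions for the example rather than the merged implementation.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def infer_value_type(value: Any) -> Optional[str]:
    """Map a single metadata value to a type name (hypothetical helper)."""
    # bool must be tested before int: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "long"
    if isinstance(value, str):
        return "keyword"
    if isinstance(value, list) and value:
        # Lists are typed by their first element.
        return infer_value_type(value[0])
    return None


def infer_field_types(metadata_dicts: list[dict[str, Any]]) -> dict[str, dict[str, str]]:
    """Collect every observed type per field and fall back to "keyword" on conflict."""
    observed: dict[str, set[str]] = {}
    for meta in metadata_dicts:
        for field, value in meta.items():
            inferred = infer_value_type(value)
            if inferred is not None:
                observed.setdefault(field, set()).add(inferred)

    result: dict[str, dict[str, str]] = {}
    for field, types in observed.items():
        if len(types) > 1:
            logger.warning("Field '%s' has mixed types %s; defaulting to 'keyword'.", field, sorted(types))
            result[field] = {"type": "keyword"}
        else:
            result[field] = {"type": next(iter(types))}
    return result
```

If the int check ran first, every `True`/`False` value would be reported as "long", since `bool` is a subclass of `int`.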

How did you test it?

Integration Tests:

  • 9 sync tests in test_document_store.py:

    • Basic filtering and counting operations
    • Empty document store handling
    • Mixed types detection in metadata
    • Numeric min/max with error handling for non-numeric fields
    • Unique values with pagination and list metadata handling
  • 5 async tests in test_document_store_async.py:

    • Async versions of all core operations
    • Pagination and search term filtering

All tests run against a real Pinecone instance (marked with @pytest.mark.integration) and passed successfully in CI.

Test Coverage:

  • ✅ Happy path scenarios
  • ✅ Edge cases (empty stores, missing fields, non-numeric values)
  • ✅ Type inference with mixed types
  • ✅ List metadata handling
  • ✅ Pagination boundaries
  • ✅ Search term filtering

Notes for the reviewer

  1. Client-side limitations: Due to Pinecone's TOP_K_LIMIT of 1000 documents, these methods may return incomplete results for large collections. Warning logs inform users when this limit is reached.

  2. Type inference approach: get_metadata_fields_info() collects all observed types for each field across sampled documents and detects inconsistencies. If mixed types are found, it logs a warning and defaults to "keyword".

  3. No query_sql method: Unlike some document stores, Pinecone doesn't support SQL queries, so this optional method from the issue was not implemented.

  4. Consistency with codebase:

    • Followed existing Pinecone test patterns (integration tests with fixtures)
    • Used same logging conventions and error handling patterns
    • Matched type mappings used in OpenSearch integration
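The limit warning mentioned in note 1 could look something like the sketch below. The `fetch` callable stands in for the store's document retrieval and `count_with_limit_warning` is a hypothetical name; only the limit of 1000 comes from the PR description.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

TOP_K_LIMIT = 1000  # limit quoted in the PR description


def count_with_limit_warning(fetch: Callable[[], list[Any]], limit: int = TOP_K_LIMIT) -> int:
    """Count fetched documents and warn when the result may be truncated.

    `fetch` is a placeholder for whatever call retrieves the documents.
    """
    documents = fetch()
    if len(documents) >= limit:
        # Reaching the limit means more matching documents may exist.
        logger.warning(
            "Fetched %d documents, which reaches the limit of %d; the count may be incomplete.",
            len(documents),
            limit,
        )
    return len(documents)
```

A caller cannot distinguish "exactly 1000 matches" from "more than 1000 matches" here, so the warning is emitted whenever the limit is reached.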

Checklist

@GauravBansal29 requested a review from a team as a code owner January 22, 2026 05:16
@GauravBansal29 requested review from sjrl and removed request for a team January 22, 2026 05:16
@github-actions bot added the integration:pinecone and type:documentation (Improvements or additions to documentation) labels Jan 22, 2026
@GauravBansal29 force-pushed the feat-pinecone-operations-2641 branch from 761bd89 to c0e6bbd January 22, 2026 05:21
@GauravBansal29 force-pushed the feat-pinecone-operations-2641 branch from b96b0d3 to cbcb796 January 22, 2026 05:26
documents = await self.filter_documents_async(filters=filters)

result = {}
for field in metadata_fields:

The sync and async functions duplicate this logic. We should extract it into helper functions that both the synchronous and asynchronous methods can call.
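One possible shape for that refactor, as a rough sketch: the aggregation moves into a module-level helper and each public method only differs in how it fetches documents. `_count_unique_values` and the `InMemoryStore` stand-in are hypothetical, and the store here returns plain metadata dicts rather than Document objects to keep the example self-contained.

```python
import asyncio
from typing import Any, Optional


def _count_unique_values(metadata_dicts: list[dict[str, Any]], metadata_fields: list[str]) -> dict[str, int]:
    """Shared aggregation used by both the sync and async methods (hypothetical helper)."""
    result: dict[str, int] = {}
    for field in metadata_fields:
        unique: set[str] = set()
        for meta in metadata_dicts:
            value = meta.get(field)
            if value is None:
                continue
            items = value if isinstance(value, list) else [value]
            # str() keeps unhashable values countable; it also merges 1 and "1".
            unique.update(str(item) for item in items)
        result[field] = len(unique)
    return result


class InMemoryStore:
    """Stand-in for the document store, for illustration only."""

    def __init__(self, metadata_dicts: list[dict[str, Any]]) -> None:
        self._metas = metadata_dicts

    def filter_documents(self, filters: Optional[dict[str, Any]] = None) -> list[dict[str, Any]]:
        return list(self._metas)

    async def filter_documents_async(self, filters: Optional[dict[str, Any]] = None) -> list[dict[str, Any]]:
        return list(self._metas)

    def count_unique_metadata_by_filter(self, filters: dict[str, Any], metadata_fields: list[str]) -> dict[str, int]:
        documents = self.filter_documents(filters=filters)
        return _count_unique_values(documents, metadata_fields)

    async def count_unique_metadata_by_filter_async(
        self, filters: dict[str, Any], metadata_fields: list[str]
    ) -> dict[str, int]:
        documents = await self.filter_documents_async(filters=filters)
        return _count_unique_values(documents, metadata_fields)
```

With this layout, a bug fix in the counting logic lands in one place and both entry points pick it up.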

# Collect all field values to infer types accurately
field_samples: dict[str, set[str]] = {}

for doc in documents:

Same comment as above regarding duplicated logic.


def get_metadata_field_min_max(self, metadata_field: str) -> dict[str, Any]:
    """
    Returns the minimum and maximum values for a numeric metadata field.

This should also work for the other types: bool and keyword.
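A sketch of what a type-agnostic min/max might look like, assuming the values of a field are mutually comparable. The name `field_min_max`, the plain-dict input, the `{"min": ..., "max": ...}` return shape, and raising `ValueError` on mixed strings and numbers are all assumptions for illustration.

```python
from typing import Any


def field_min_max(metadata_dicts: list[dict[str, Any]], metadata_field: str) -> dict[str, Any]:
    """Return min/max for numeric, boolean, or string (keyword) values of one field."""
    values = [m[metadata_field] for m in metadata_dicts if m.get(metadata_field) is not None]
    if not values:
        return {"min": None, "max": None}

    all_strings = all(isinstance(v, str) for v in values)
    # bool is a subclass of int, so this also accepts True/False.
    all_numbers = all(isinstance(v, (int, float)) for v in values)
    if not (all_strings or all_numbers):
        raise ValueError(f"Field '{metadata_field}' holds values that cannot be compared with each other.")

    # Strings compare lexicographically; booleans compare as False < True.
    return {"min": min(values), "max": max(values)}
```

List-valued metadata is not handled in this sketch and would need to be flattened first.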


@davidsbatista left a comment


Thanks for your contribution @GauravBansal29! I've left some initial comments, mostly regarding the duplicated code logic between sync and async versions, and considering all possible metadata types in the min_max operation.


Labels

integration:pinecone type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add the following operations to PineConeDocumentStore

2 participants