Skip to content

Guard local vector store field size#2327

Open
ZaynJarvis wants to merge 1 commit into
volcengine:mainfrom
ZaynJarvis:fix/local-vector-store-field-limit
Open

Guard local vector store field size#2327
ZaynJarvis wants to merge 1 commit into
volcengine:mainfrom
ZaynJarvis:fix/local-vector-store-field-limit

Conversation

@ZaynJarvis
Copy link
Copy Markdown
Collaborator

Summary

  • cap local vector store scalar payloads to the internal 65535-byte string-field limit before native serialization
  • truncate oversized string fields with a marker, preserving primary keys and routing fields where possible
  • update the large-fields regression test to assert upsert succeeds and stores valid bounded JSON

Tests

  • ruff check openviking/storage/vectordb/collection/local_collection.py tests/vectordb/test_filter_ops.py
  • ruff format --check openviking/storage/vectordb/collection/local_collection.py tests/vectordb/test_filter_ops.py
  • python -m pytest tests/vectordb/test_filter_ops.py tests/vectordb/test_openviking_vectordb.py -q --no-cov

@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Incorrect Primary Key Logging

The warning log uses the primary key field name instead of the actual record's primary key value, making it impossible to identify which specific record was truncated.

logger.warning(
    f"Truncated local vector store fields for collection '{collection_name}' "
    f"primary_key='{primary_key}' fields={truncated_fields} "
    f"bytes={initial_size}->{_utf8_len(fields)} "
    f"limit={FIELDS_STORAGE_MAX_BYTES}"
)

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Use primary key value in logs/errors

The log and error messages use the primary key field name instead of the actual
primary key value from the data. This makes debugging harder because you can't
identify which record was truncated. Extract the primary key value from data at the
start of the function and use that in messages.

openviking/storage/vectordb/collection/local_collection.py [105-151]

 def _serialize_fields_for_store(
     data: Dict[str, Any],
     collection_name: str,
     primary_key: str,
 ) -> str:
+    pk_value = data.get(primary_key, "<unknown>")
     fields = safe_json_dumps(data, ensure_ascii=False)
     initial_size = _utf8_len(fields)
     if initial_size <= FIELDS_STORAGE_MAX_BYTES:
         return fields
 
     compacted = dict(data)
     truncated_fields: List[str] = []
     for field in _truncatable_string_fields(compacted, primary_key):
         ...
         if _utf8_len(fields) <= FIELDS_STORAGE_MAX_BYTES:
             logger.warning(
                 f"Truncated local vector store fields for collection '{collection_name}' "
-                f"primary_key='{primary_key}' fields={truncated_fields} "
+                f"primary_key='{pk_value}' fields={truncated_fields} "
                 f"bytes={initial_size}->{_utf8_len(fields)} "
                 f"limit={FIELDS_STORAGE_MAX_BYTES}"
             )
             return fields
 
     final_size = _serialized_fields_len(compacted)
     raise ValueError(
         f"local vector store fields exceed {FIELDS_STORAGE_MAX_BYTES} bytes "
-        f"after truncation: collection={collection_name}, primary_key={primary_key}, "
+        f"after truncation: collection={collection_name}, primary_key={pk_value}, "
         f"bytes={initial_size}->{final_size}"
     )
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the log/error messages use the primary key field name instead of the actual value, which improves debugging. This is a moderate quality improvement that enhances maintainability.

Low

@ZaynJarvis ZaynJarvis force-pushed the fix/local-vector-store-field-limit branch from f681cfd to 02cdf0b Compare May 31, 2026 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

1 participant