Skip to content

fix(embedding): use word-boundary matching for numeric error patterns in classify_api_error#2370

Open
abinch wants to merge 2 commits into
volcengine:mainfrom
abinch:fix/embedding-error-classify
Open

fix(embedding): use word-boundary matching for numeric error patterns in classify_api_error#2370
abinch wants to merge 2 commits into
volcengine:mainfrom
abinch:fix/embedding-error-classify

Conversation

@abinch
Copy link
Copy Markdown
Contributor

@abinch abinch commented Jun 1, 2026

Description

Fix classify_api_error() in openviking/utils/model_retry.py to use regex word-boundary matching (\b) for purely numeric error patterns (e.g. "413", "400", "429"). Previously, bare substring matching via Python's in operator caused false positives when API error messages contained request IDs or hex strings that coincidentally included those digit sequences (e.g., "413" appearing inside request ID d7c9130f344...).

This caused 429 (RPM rate limit) errors to be misclassified as INPUT_TOO_LARGE when the request ID contained "413". In TextEmbeddingHandler.on_dequeue(), INPUT_TOO_LARGE errors are permanently dropped without re-enqueue, while TRANSIENT errors are re-enqueued for retry. During a full reindex of ~476K records, 2,741 embedding messages were permanently lost due to this bug.

Related Issue

Fixes #2369

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Changes Made

  • openviking/utils/model_retry.py

    • Added _pattern_matches() helper: numeric-only patterns use \b word-boundary regex, non-numeric patterns keep substring matching
    • Applied _pattern_matches() to all 4 pattern-matching locations in classify_api_error() (INPUT_TOO_LARGE, PERMANENT, TRANSIENT)
    • Also applied to PERMANENT and TRANSIENT patterns which had the same latent risk ("400" could match inside "1400", etc.)
  • tests/unit/test_model_retry.py

    • Added test_429_with_request_id_containing_413_is_transient: reproduces the exact bug scenario
    • Added test_numeric_status_code_inside_longer_number_is_not_matched: verifies "400" does not match "1400", "502" does not match "5020"

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes (19 passed)
  • I have tested this on the following platforms:
    • Linux
    • macOS

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

Additional Notes

This bug was discovered during production debugging of a full reindex operation. The fix is minimal and backward-compatible — all existing text-based patterns continue to use substring matching, and all numeric patterns produce the same results for "clean" error messages (e.g., "Error code: 413 - Payload Too Large" still correctly classifies as INPUT_TOO_LARGE).

Related to closed issue #2132 which originally added "413" to INPUT_TOO_LARGE_PATTERNS but did not anticipate the substring matching false positive.

… in classify_api_error

classify_api_error() used plain substring matching (Python `in` operator)
for numeric patterns like "413", "400", "429" etc. When API error messages
contained request IDs with those digit sequences as substrings (e.g.
"d7c9130f344..." containing "413"), 429 errors were misclassified as
INPUT_TOO_LARGE. This caused TextEmbeddingHandler to permanently drop
those messages instead of re-enqueuing for retry, losing 2741 records
during a full reindex of 476K entries.

Fix by using regex \b word-boundary matching for numeric-only patterns
while preserving substring matching for text patterns. Also adds two
unit tests covering the exact bug scenario and longer-number false
positives.

Fixes volcengine#2369

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

2369 - Fully compliant

Compliant requirements:

  • Added _pattern_matches() helper with word-boundary regex for numeric patterns
  • Applied _pattern_matches() to INPUT_TOO_LARGE, PERMANENT, and TRANSIENT pattern checks
  • Added tests reproducing the exact bug scenario and edge cases

2132 - Not compliant

Non-compliant requirements:

  • Fix memory_updater to truncate input before enqueue
  • Add 413 to PERMANENT_API_ERROR_PATTERNS (this was already done in prior work, not part of this PR)
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Inconsistent pattern handling for QUOTA_EXCEEDED

QUOTA_EXCEEDED_PATTERNS still uses plain substring matching instead of _pattern_matches(), which could lead to false positives if QUOTA_EXCEEDED_PATTERNS contains numeric patterns that appear inside request IDs or other numbers.

for pattern in QUOTA_EXCEEDED_PATTERNS:
    if pattern in text_lower:
        return ERROR_CLASS_QUOTA_EXCEEDED

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Use lru_cache for regex pattern caching

Use functools.lru_cache instead of a manual module-level dict to cache compiled
regex patterns. This removes boilerplate and avoids potential thread-safety concerns
with unsynchronized mutable state.

openviking/utils/model_retry.py [74-80]

-_NUMERIC_PATTERN_RE: dict[str, re.Pattern] = {}
+from functools import lru_cache
 
 
+@lru_cache(maxsize=None)
 def _get_numeric_pattern_re(pattern: str) -> re.Pattern:
-    if pattern not in _NUMERIC_PATTERN_RE:
-        _NUMERIC_PATTERN_RE[pattern] = re.compile(r'\b' + re.escape(pattern) + r'\b')
-    return _NUMERIC_PATTERN_RE[pattern]
+    return re.compile(r'\b' + re.escape(pattern) + r'\b')
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly refactors manual regex caching to use functools.lru_cache, removing boilerplate and improving maintainability. It's a valid, non-critical improvement aligned with Python best practices.

Low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

fix(embedding): classify_api_error misclassifies 429 errors as INPUT_TOO_LARGE when request ID contains numeric pattern substrings

1 participant