fix(embedding): use word-boundary matching for numeric error patterns in classify_api_error by abinch · Pull Request #2370 · volcengine/OpenViking

abinch · 2026-06-01T13:44:57Z

Description

Fix classify_api_error() in openviking/utils/model_retry.py to use regex word-boundary matching (\b) for purely numeric error patterns (e.g. "413", "400", "429"). Previously, bare substring matching via Python's in operator caused false positives when API error messages contained request IDs or hex strings that coincidentally included those digit sequences (e.g., "413" appearing inside request ID d7c9130f344...).

This caused 429 (RPM rate limit) errors to be misclassified as INPUT_TOO_LARGE when the request ID contained "413". In TextEmbeddingHandler.on_dequeue(), INPUT_TOO_LARGE errors are permanently dropped without re-enqueue, while TRANSIENT errors are re-enqueued for retry. During a full reindex of ~476K records, 2,741 embedding messages were permanently lost due to this bug.

Related Issue

Fixes #2369

Type of Change

Bug fix (non-breaking change that fixes an issue)

Changes Made

openviking/utils/model_retry.py
- Added _pattern_matches() helper: numeric-only patterns use \b word-boundary regex, non-numeric patterns keep substring matching
- Applied _pattern_matches() to all 4 pattern-matching locations in classify_api_error() (INPUT_TOO_LARGE, PERMANENT, TRANSIENT)
- Also applied to PERMANENT and TRANSIENT patterns which had the same latent risk ("400" could match inside "1400", etc.)
tests/unit/test_model_retry.py
- Added test_429_with_request_id_containing_413_is_transient: reproduces the exact bug scenario
- Added test_numeric_status_code_inside_longer_number_is_not_matched: verifies "400" does not match "1400", "502" does not match "5020"

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes (19 passed)
I have tested this on the following platforms:
- Linux
- macOS

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings

Additional Notes

This bug was discovered during production debugging of a full reindex operation. The fix is minimal and backward-compatible — all existing text-based patterns continue to use substring matching, and all numeric patterns produce the same results for "clean" error messages (e.g., "Error code: 413 - Payload Too Large" still correctly classifies as INPUT_TOO_LARGE).

Related to closed issue #2132 which originally added "413" to INPUT_TOO_LARGE_PATTERNS but did not anticipate the substring matching false positive.

… in classify_api_error classify_api_error() used plain substring matching (Python `in` operator) for numeric patterns like "413", "400", "429" etc. When API error messages contained request IDs with those digit sequences as substrings (e.g. "d7c9130f344..." containing "413"), 429 errors were misclassified as INPUT_TOO_LARGE. This caused TextEmbeddingHandler to permanently drop those messages instead of re-enqueuing for retry, losing 2741 records during a full reindex of 476K entries. Fix by using regex \b word-boundary matching for numeric-only patterns while preserving substring matching for text patterns. Also adds two unit tests covering the exact bug scenario and longer-number false positives. Fixes volcengine#2369 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-06-01T13:46:43Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶 2369 - Fully compliant Compliant requirements: Added _pattern_matches() helper with word-boundary regex for numeric patterns Applied _pattern_matches() to INPUT_TOO_LARGE, PERMANENT, and TRANSIENT pattern checks Added tests reproducing the exact bug scenario and edge cases 2132 - Not compliant Non-compliant requirements: Fix memory_updater to truncate input before enqueue Add 413 to PERMANENT_API_ERROR_PATTERNS (this was already done in prior work, not part of this PR)
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Inconsistent pattern handling for QUOTA_EXCEEDED QUOTA_EXCEEDED_PATTERNS still uses plain substring matching instead of _pattern_matches(), which could lead to false positives if QUOTA_EXCEEDED_PATTERNS contains numeric patterns that appear inside request IDs or other numbers. for pattern in QUOTA_EXCEEDED_PATTERNS: if pattern in text_lower: return ERROR_CLASS_QUOTA_EXCEEDED

github-actions · 2026-06-01T13:47:59Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Use lru_cache for regex pattern caching Use `functools.lru_cache` instead of a manual module-level dict to cache compiled regex patterns. This removes boilerplate and avoids potential thread-safety concerns with unsynchronized mutable state. openviking/utils/model_retry.py [74-80] -_NUMERIC_PATTERN_RE: dict[str, re.Pattern] = {} +from functools import lru_cache +@lru_cache(maxsize=None) def _get_numeric_pattern_re(pattern: str) -> re.Pattern: - if pattern not in _NUMERIC_PATTERN_RE: - _NUMERIC_PATTERN_RE[pattern] = re.compile(r'\b' + re.escape(pattern) + r'\b') - return _NUMERIC_PATTERN_RE[pattern] + return re.compile(r'\b' + re.escape(pattern) + r'\b') Suggestion importance[1-10]: 5 __ Why: The suggestion correctly refactors manual regex caching to use `functools.lru_cache`, removing boilerplate and improving maintainability. It's a valid, non-critical improvement aligned with Python best practices.	Low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-project-automation Bot added this to OpenViking project Jun 1, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 1, 2026

github-actions Bot added the Review effort 2/5 label Jun 1, 2026

style: fix ruff format for string quote style in model_retry.py

b1e5c2e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(embedding): use word-boundary matching for numeric error patterns in classify_api_error#2370

fix(embedding): use word-boundary matching for numeric error patterns in classify_api_error#2370
abinch wants to merge 2 commits into
volcengine:mainfrom
abinch:fix/embedding-error-classify

abinch commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abinch commented Jun 1, 2026

Description

Related Issue

Type of Change

Changes Made

Testing

Checklist

Additional Notes

Uh oh!

github-actions Bot commented Jun 1, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Jun 1, 2026

PR Code Suggestions ✨

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant