… in classify_api_error
classify_api_error() used plain substring matching (Python `in` operator)
for numeric patterns like "413", "400", "429" etc. When API error messages
contained request IDs with those digit sequences as substrings (e.g.
"d7c9130f344..." containing "413"), 429 errors were misclassified as
INPUT_TOO_LARGE. This caused TextEmbeddingHandler to permanently drop
those messages instead of re-enqueuing for retry, losing 2741 records
during a full reindex of 476K entries.
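The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the code from `model_retry.py`: the pattern lists, the check order (`INPUT_TOO_LARGE` before `TRANSIENT`), the `UNKNOWN` fallback, and the sample message with its request ID are all assumptions made for illustration.

```python
# Hypothetical pattern lists and check order, for illustration only.
INPUT_TOO_LARGE_PATTERNS = ["413"]
TRANSIENT_PATTERNS = ["429"]


def classify_by_substring(message: str) -> str:
    """Mimic the pre-fix behaviour: plain `in` checks, INPUT_TOO_LARGE first."""
    if any(p in message for p in INPUT_TOO_LARGE_PATTERNS):
        return "INPUT_TOO_LARGE"
    if any(p in message for p in TRANSIENT_PATTERNS):
        return "TRANSIENT"
    return "UNKNOWN"  # hypothetical fallback category


# Made-up 429 message whose request ID happens to embed the digits "413".
message = "Error code: 429 - rate limit exceeded (request id: a7f4130b9c)"
print(classify_by_substring(message))  # INPUT_TOO_LARGE, although the status is 429
```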
Fix by using regex \b word-boundary matching for numeric-only patterns
while preserving substring matching for text patterns. Also adds two
unit tests covering the exact bug scenario and longer-number false
positives.
Fixes volcengine#2369
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Description
Fix `classify_api_error()` in `openviking/utils/model_retry.py` to use regex word-boundary matching (`\b`) for purely numeric error patterns (e.g. `"413"`, `"400"`, `"429"`). Previously, bare substring matching via Python's `in` operator caused false positives when API error messages contained request IDs or hex strings that coincidentally included those digit sequences (e.g., `"413"` appearing inside request ID `d7c9130f344...`).

This caused 429 (RPM rate limit) errors to be misclassified as `INPUT_TOO_LARGE` when the request ID contained `"413"`. In `TextEmbeddingHandler.on_dequeue()`, `INPUT_TOO_LARGE` errors are permanently dropped without re-enqueue, while `TRANSIENT` errors are re-enqueued for retry. During a full reindex of ~476K records, 2,741 embedding messages were permanently lost due to this bug.

Related Issue
Fixes #2369
Type of Change
Changes Made
`openviking/utils/model_retry.py`
- Adds a `_pattern_matches()` helper: numeric-only patterns use `\b` word-boundary regex, non-numeric patterns keep substring matching
- Applies `_pattern_matches()` to all 4 pattern-matching locations in `classify_api_error()` (`INPUT_TOO_LARGE`, `PERMANENT`, `TRANSIENT`)
- Also covers the `PERMANENT` and `TRANSIENT` patterns, which had the same latent risk (`"400"` could match inside `"1400"`, etc.)

`tests/unit/test_model_retry.py`
- `test_429_with_request_id_containing_413_is_transient`: reproduces the exact bug scenario
- `test_numeric_status_code_inside_longer_number_is_not_matched`: verifies that `"400"` does not match `"1400"` and `"502"` does not match `"5020"`

Testing
Checklist
Additional Notes
This bug was discovered during production debugging of a full reindex operation. The fix is minimal and backward-compatible: all existing text-based patterns continue to use substring matching, and all numeric patterns produce the same results for "clean" error messages (e.g., `"Error code: 413 - Payload Too Large"` still correctly classifies as `INPUT_TOO_LARGE`).

Related to closed issue #2132, which originally added `"413"` to `INPUT_TOO_LARGE_PATTERNS` but did not anticipate the substring-matching false positive.