fix(backend): tag BYOK LLM errors by source#7321
Conversation
Greptile SummaryThis PR adds structured BYOK vs. platform source tagging to LLM error logs across HTTP, WebSocket, and background processing paths. A new
Confidence Score: 4/5Safe to merge; changes are observability-only and do not alter any request path or data handling logic. The ContextVar plumbing is correct across all three entry points. The main gaps are that Anthropic credit-exhaustion 429s don't contain the word 'quota' so they log reason='unknown', and platform client factories register callbacks without feature context so platform failures always log feature='unknown'. Neither affects correctness. backend/utils/llm/byok_errors.py (quota keyword coverage) and backend/utils/llm/clients.py (feature context for platform callbacks) Important Files Changed
|
| _QUOTA_ERROR_NAMES = frozenset({'RateLimitError'}) | ||
|
|
||
|
|
||
| def get_llm_error_source(provider: Optional[str]) -> str: |
There was a problem hiding this comment.
The
_QUOTA_ERROR_NAMES frozenset is named after quota errors but contains RateLimitError, which is a superset of quota failures. More critically, Anthropic's credit-exhaustion 429 response body typically reads "Your credit balance is too low…" — the word "quota" never appears. As a result, an Anthropic BYOK user who runs out of credits will log reason='unknown' instead of reason='quota', defeating the classification for the one provider wired to this path via handle_llm_error_async. Adding a 'credit' / 'balance' keyword check (and optionally renaming the frozenset) would fix the gap.
| _QUOTA_ERROR_NAMES = frozenset({'RateLimitError'}) | |
| def get_llm_error_source(provider: Optional[str]) -> str: | |
| _RATE_LIMIT_ERROR_NAMES = frozenset({'RateLimitError'}) | |
| _QUOTA_KEYWORDS = frozenset({'insufficient_quota', 'quota', 'credit', 'balance'}) | |
| def get_llm_error_source(provider: Optional[str]) -> str: |
| if status_code == 429 or error_name in _QUOTA_ERROR_NAMES: | ||
| if 'insufficient_quota' in error_text or 'quota' in error_text: | ||
| return 'quota' |
There was a problem hiding this comment.
If the rename above is applied, the 429-branch should use the updated name and keyword set to catch Anthropic credit errors.
| if status_code == 429 or error_name in _QUOTA_ERROR_NAMES: | |
| if 'insufficient_quota' in error_text or 'quota' in error_text: | |
| return 'quota' | |
| if status_code == 429 or error_name in _RATE_LIMIT_ERROR_NAMES: | |
| if any(kw in error_text for kw in _QUOTA_KEYWORDS): | |
| return 'quota' |
| _llm_error_callbacks: Dict[Tuple[str, str, str], _LLMErrorCallback] = {} | ||
|
|
||
|
|
||
| def _get_llm_error_callback(provider: str, model: str = '', feature: str = '') -> _LLMErrorCallback: | ||
| key = (provider, model, feature) | ||
| if key not in _llm_error_callbacks: | ||
| _llm_error_callbacks[key] = _LLMErrorCallback(provider, model=model, feature=feature) | ||
| return _llm_error_callbacks[key] |
There was a problem hiding this comment.
Unbounded module-level callback cache
_llm_error_callbacks has no maximum size. While the set of (provider, model, feature) tuples is small in practice today, it will silently grow indefinitely as new models or features are introduced, and there is no eviction path. Consider using a bounded structure (e.g. TTLCache) consistent with the caches already used in this file, or at minimum document the expected upper bound.
| async def handle_llm_error_async( | ||
| error: Exception, | ||
| provider: Optional[str], | ||
| feature: Optional[str] = None, | ||
| model: Optional[str] = None, | ||
| operation: str = 'chat', | ||
| ) -> None: | ||
| """Run LLM error handling off the event loop while preserving BYOK context.""" | ||
| future = submit_with_context(storage_executor, handle_llm_error, error, provider, feature, model, operation) | ||
| try: | ||
| await asyncio.wrap_future(future) | ||
| except Exception as e: | ||
| logger.error('Async LLM error handler failed provider=%s feature=%s: %s', provider, feature, e) |
There was a problem hiding this comment.
storage_executor is the wrong executor for error logging
storage_executor is documented and dimensioned (16 threads) for audio file pre-caching and GCS operations. Submitting logging work there mixes latency classes and could steal threads from ongoing audio uploads during busy periods. postprocess_executor or the default loop.run_in_executor(None, ...) would be more appropriate.
Summary
Closes #7290
Verification