fix(backend): surface BYOK LLM failures and chat fallback#7320
fix(backend): surface BYOK LLM failures and chat fallback#7320apoorvdarshan wants to merge 1 commit into
Conversation
Greptile SummaryThis PR introduces source-aware BYOK LLM error logging with rate-limited push notifications, propagates BYOK UID context into background and async processing, and adds a
Confidence Score: 3/5Safe to merge for most users, but the Anthropic chat error handler now makes synchronous Firestore + FCM network calls inside an async function, which stalls the event loop on the first BYOK failure for each user. The UID context propagation and Redis lock are solid, and the notification deduplication logic is correct. The concern is backend/utils/retrieval/agentic.py warrants the closest look — the Anthropic error handler's sync I/O and the OpenAI stream's missing error surfacing and task-cancel gap are all in this file. Important Files Changed
Reviews (1): Last reviewed commit: "fix(backend): surface BYOK LLM failures ..." | Re-trigger Greptile |
| except Exception as e: | ||
| logger.error(f"Anthropic API error: {e}") | ||
| handle_llm_error(e, 'anthropic', feature='chat_agent', model=ANTHROPIC_AGENT_MODEL) | ||
| await callback.put_data(f"\n\nSorry, I encountered an error. Please try again.") | ||
| await callback.end() | ||
| return |
There was a problem hiding this comment.
Blocking sync I/O called directly in async error handler
handle_llm_error is a synchronous function that unconditionally calls _send_byok_llm_error_notification, which makes a synchronous Firestore read (notification_db.get_all_tokens), a Redis write (try_acquire_byok_llm_error_notification_lock), and a synchronous FCM send (messaging.send_each). Calling this directly in an async def — without asyncio.get_event_loop().run_in_executor or similar — will stall the entire event loop on the first BYOK error until all three network calls resolve, which could be several hundred milliseconds and will delay every other in-flight coroutine on the server.
| except Exception as e: | ||
| logger.error(f"Error in OpenAI agent stream: {e}") | ||
| traceback.print_exc() | ||
| await callback.end() |
There was a problem hiding this comment.
OpenAI stream-level errors bypass
handle_llm_error
The LangChain _LLMErrorCallback fires for errors routed through LangChain's callback system (e.g., model invocation failures), but stream-level or chain-level failures that propagate directly to astream_events as an exception can reach this except block without triggering the callback. Unlike the Anthropic path, this handler does not call handle_llm_error, so those errors are silently swallowed without BYOK push notifications.
| except Exception as e: | |
| logger.error(f"Error in OpenAI agent stream: {e}") | |
| traceback.print_exc() | |
| await callback.end() | |
| except Exception as e: | |
| handle_llm_error(e, 'openai', feature='chat_agent') | |
| logger.error(f"Error in OpenAI agent stream: {e}") | |
| traceback.print_exc() | |
| await callback.end() |
| except Exception as e: | ||
| logger.error(f"Error in execute_agentic_chat_stream openai: {e}") | ||
| traceback.print_exc() | ||
| if callback_data is not None: | ||
| callback_data['error'] = str(e) |
There was a problem hiding this comment.
Background task not cancelled on outer
except Exception
When the outer generator's except Exception as e: block is hit (a non-CancelledError failure in the queue-draining loop), task is never cancelled or awaited. The _run_openai_agent_stream coroutine will keep running — consuming OpenAI tokens and filling callback.queue — until it finishes on its own, which could be many seconds for a long tool chain.
| except Exception as e: | |
| logger.error(f"Error in execute_agentic_chat_stream openai: {e}") | |
| traceback.print_exc() | |
| if callback_data is not None: | |
| callback_data['error'] = str(e) | |
| except Exception as e: | |
| task.cancel() | |
| logger.error(f"Error in execute_agentic_chat_stream openai: {e}") | |
| traceback.print_exc() | |
| if callback_data is not None: | |
| callback_data['error'] = str(e) |
|
Hey @apoorvdarshan 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜 |
Summary
Closes #7290
Closes #7291
Closes #7238
Verification
Note: test.sh completed with the existing fair-use integration block skipped because Redis is not available locally.