fix(backend): surface BYOK LLM failures and chat fallback by apoorvdarshan · Pull Request #7320 · BasedHardware/omi

apoorvdarshan · 2026-05-15T18:03:45Z

Summary

add source-aware LLM error logging and rate-limited BYOK push notifications for invalid, denied, or exhausted user keys
propagate BYOK uid context into request and background processing so provider failures can be tied to the right user
add CHAT_PROVIDER=openai LangGraph fallback for agentic chat and lazy Anthropic client construction when Anthropic env is unavailable
clean up VAD hosted calls to use httpx and move account deletion background wipe onto the shared storage executor

Closes #7290
Closes #7291
Closes #7238

Verification

PATH=/tmp/omi-backend-py311/bin:$PATH ./test-preflight.sh
PATH=/tmp/omi-backend-py311/bin:$PATH ./test.sh

Note: test.sh completed with the existing fair-use integration block skipped because Redis is not available locally.

greptile-apps · 2026-05-15T18:09:04Z

Greptile Summary

This PR introduces source-aware BYOK LLM error logging with rate-limited push notifications, propagates BYOK UID context into background and async processing, and adds a CHAT_PROVIDER=openai LangGraph/ReAct fallback path alongside the existing Anthropic agent.

byok_errors.py + clients.py: New handle_llm_error + _LLMErrorCallback surface 401/403/429-quota errors to BYOK users via FCM; _AnthropicClientProxy defers client construction until first call so missing ANTHROPIC_API_KEY doesn't crash at import time.
agentic.py: Adds a full _execute_agentic_chat_stream_openai path using LangGraph create_react_agent; the existing Anthropic error handler is upgraded to call handle_llm_error; CHAT_PROVIDER env var selects the active backend at startup.
byok.py / sync.py / pusher.py: _byok_uid_ctx ContextVar is threaded through HTTP, WebSocket, and background-thread request contexts so provider errors can be tied back to the originating user.

Confidence Score: 3/5

Safe to merge for most users, but the Anthropic chat error handler now makes synchronous Firestore + FCM network calls inside an async function, which stalls the event loop on the first BYOK failure for each user.

The UID context propagation and Redis lock are solid, and the notification deduplication logic is correct. The concern is handle_llm_error being called at line 438 of agentic.py inside an async def — the downstream path does a Firestore read, a Redis write, and an FCM batch send all synchronously, with no executor offload. This blocks the event loop for every BYOK user who hits a broken key, which is the exact scenario this PR aims to handle more gracefully. The new OpenAI stream path also has a gap: its top-level exception handler doesn't call handle_llm_error, and the outer generator's non-cancel exception path doesn't cancel the background LangGraph task.

backend/utils/retrieval/agentic.py warrants the closest look — the Anthropic error handler's sync I/O and the OpenAI stream's missing error surfacing and task-cancel gap are all in this file.

Important Files Changed

Filename	Overview
backend/utils/retrieval/agentic.py	Adds new OpenAI/LangGraph ReAct agent path and CHAT_PROVIDER switch. The Anthropic error handler now calls `handle_llm_error` but does so synchronously in an async context; the new OpenAI stream error handler doesn't call `handle_llm_error`; and the outer generator's non-cancel exception path doesn't cancel its background task.
backend/utils/llm/byok_errors.py	New module that classifies BYOK LLM errors and sends rate-limited FCM push notifications. Logic is clean; all I/O is synchronous (Firestore + Redis + FCM), which becomes a problem when called from async callers.
backend/utils/llm/clients.py	Adds `_LLMErrorCallback` and `_AnthropicClientProxy` with lazy construction. Error callback registration and BYOK proxy pattern look correct; `_AnthropicClientProxy._get_default()` avoids creating the client at import time when `ANTHROPIC_API_KEY` is absent.
backend/utils/byok.py	Adds `_byok_uid_ctx` ContextVar and `get/set_byok_uid` helpers. Both `validate_byok_request` and `validate_byok_websocket` now call `set_byok_uid`; the middleware correctly resets and restores both tokens in its `finally` block.
backend/database/redis_db.py	Adds `try_acquire_byok_llm_error_notification_lock` with a 24-hour TTL using SET NX. Implementation is correct; no issues found.
backend/routers/sync.py	Propagates BYOK UID into background pipeline via `set_byok_uid`; correctly clears it in the `finally` block alongside `set_byok_keys`.
backend/routers/pusher.py	Sets BYOK UID in the async background task context when BYOK keys are present; correct since asyncio task context is independent of the request context.
backend/routers/users.py	Replaces bare `threading.Thread` for account deletion wipe with `submit_with_context(storage_executor, ...)`, moving the heavy background job onto the shared storage thread pool.
backend/utils/stt/vad.py	Switches the hosted VAD call from `requests` to `httpx`; timeout is passed as a float (300.0) which is correct for httpx's API.
backend/tests/unit/test_byok_llm_errors.py	New test file covering classification, notification gating, UID propagation, and the lazy Anthropic proxy. Coverage of the happy-path notification flow and platform-error exclusion looks solid.

_{Reviews (1): Last reviewed commit: "fix(backend): surface BYOK LLM failures ..." | Re-trigger Greptile}

greptile-apps · 2026-05-15T18:09:08Z

        except Exception as e:
-            logger.error(f"Anthropic API error: {e}")
+            handle_llm_error(e, 'anthropic', feature='chat_agent', model=ANTHROPIC_AGENT_MODEL)
            await callback.put_data(f"\n\nSorry, I encountered an error. Please try again.")
            await callback.end()
            return


Blocking sync I/O called directly in async error handler

handle_llm_error is a synchronous function that unconditionally calls _send_byok_llm_error_notification, which makes a synchronous Firestore read (notification_db.get_all_tokens), a Redis write (try_acquire_byok_llm_error_notification_lock), and a synchronous FCM send (messaging.send_each). Calling this directly in an async def — without asyncio.get_event_loop().run_in_executor or similar — will stall the entire event loop on the first BYOK error until all three network calls resolve, which could be several hundred milliseconds and will delay every other in-flight coroutine on the server.

greptile-apps · 2026-05-15T18:09:09Z

+    except Exception as e:
+        logger.error(f"Error in OpenAI agent stream: {e}")
+        traceback.print_exc()
+        await callback.end()


OpenAI stream-level errors bypass handle_llm_error

The LangChain _LLMErrorCallback fires for errors routed through LangChain's callback system (e.g., model invocation failures), but stream-level or chain-level failures that propagate directly to astream_events as an exception can reach this except block without triggering the callback. Unlike the Anthropic path, this handler does not call handle_llm_error, so those errors are silently swallowed without BYOK push notifications.

Suggested change

except Exception as e:

logger.error(f"Error in OpenAI agent stream: {e}")

traceback.print_exc()

await callback.end()

except Exception as e:

handle_llm_error(e, 'openai', feature='chat_agent')

logger.error(f"Error in OpenAI agent stream: {e}")

traceback.print_exc()

await callback.end()

greptile-apps · 2026-05-15T18:09:10Z

+    except Exception as e:
+        logger.error(f"Error in execute_agentic_chat_stream openai: {e}")
+        traceback.print_exc()
+        if callback_data is not None:
+            callback_data['error'] = str(e)


Background task not cancelled on outer except Exception

When the outer generator's except Exception as e: block is hit (a non-CancelledError failure in the queue-draining loop), task is never cancelled or awaited. The _run_openai_agent_stream coroutine will keep running — consuming OpenAI tokens and filling callback.queue — until it finishes on its own, which could be many seconds for a long tool chain.

Suggested change

except Exception as e:

logger.error(f"Error in execute_agentic_chat_stream openai: {e}")

traceback.print_exc()

if callback_data is not None:

callback_data['error'] = str(e)

except Exception as e:

task.cancel()

logger.error(f"Error in execute_agentic_chat_stream openai: {e}")

traceback.print_exc()

if callback_data is not None:

callback_data['error'] = str(e)

apoorvdarshan · 2026-05-15T18:28:12Z

Closing in favor of split issue-specific PRs: #7321 (#7290), #7322 (#7238), and #7323 (#7291).

github-actions · 2026-05-15T18:28:22Z

Hey @apoorvdarshan 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

Project standards — Ensuring consistency across the codebase
User needs — Making sure changes align with what our users need
Code best practices — Maintaining code quality and maintainability
Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

fix(backend): surface BYOK LLM failures and chat fallback

936680a

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

apoorvdarshan closed this May 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(backend): surface BYOK LLM failures and chat fallback#7320

fix(backend): surface BYOK LLM failures and chat fallback#7320
apoorvdarshan wants to merge 1 commit into
BasedHardware:mainfrom
apoorvdarshan:fix/llm-byok-chat-provider

apoorvdarshan commented May 15, 2026

Uh oh!

greptile-apps Bot commented May 15, 2026

Uh oh!

greptile-apps Bot May 15, 2026

Uh oh!

greptile-apps Bot May 15, 2026

Uh oh!

greptile-apps Bot May 15, 2026

Uh oh!

apoorvdarshan commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

apoorvdarshan commented May 15, 2026

Summary

Verification

Uh oh!

greptile-apps Bot commented May 15, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Uh oh!

greptile-apps Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

apoorvdarshan commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant