Skip to content

fix(backend): notify users about BYOK LLM key failures#7323

Open
apoorvdarshan wants to merge 2 commits into
BasedHardware:mainfrom
apoorvdarshan:fix/byok-llm-error-notifications
Open

fix(backend): notify users about BYOK LLM key failures#7323
apoorvdarshan wants to merge 2 commits into
BasedHardware:mainfrom
apoorvdarshan:fix/byok-llm-error-notifications

Conversation

@apoorvdarshan
Copy link
Copy Markdown

Summary

  • add rate-limited FCM notifications for actionable BYOK LLM failures: invalid key, permission denied, and quota exhaustion
  • deduplicate notifications per user/provider/reason with a Redis SET NX lock
  • remove permanently invalid FCM tokens returned by Firebase

Closes #7291

Stack note

This PR builds on the BYOK error-source plumbing from #7290. GitHub shows that base commit here because the upstream repo does not have the fork branch as a selectable base; after #7290 lands this can be rebased to show only the notification commit.

Verification

  • PATH=/tmp/omi-backend-py311/bin:$PATH ./test-preflight.sh
  • PATH=/tmp/omi-backend-py311/bin:$PATH pytest tests/unit/test_byok_llm_logging.py tests/unit/test_byok_llm_notifications.py -q

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds end-to-end BYOK LLM failure notifications: a new _byok_uid_ctx ContextVar threads the authenticated user ID through LLM call stacks, a new byok_errors.py module classifies actionable errors (invalid key, permission denied, quota exhaustion) and dispatches FCM push notifications, and a Redis SET NX lock deduplicates notifications to at most one per user/provider/reason per 24 hours.

  • byok_errors.py introduces classify_byok_llm_error, handle_llm_error/handle_llm_error_async, and _send_byok_llm_error_notification; the Redis lock is acquired before the FCM send, so a Firebase failure consumes the dedup window silently.
  • clients.py adds _LLMErrorCallback (a LangChain BaseCallbackHandler) and wires it into every LLM/embedding factory via _with_llm_callbacks.
  • byok.py, pusher.py, sync.py, and agentic.py are updated to propagate and reset the new byok_uid context variable correctly.

Confidence Score: 3/5

Safe to merge after addressing the lock-before-send ordering in _send_byok_llm_error_notification; all other changes are additive and well-tested.

The core notification function acquires the 24-hour Redis dedup lock before attempting the FCM send. If Firebase throws, the lock is held but the user was never notified, defeating the feature's primary goal.

backend/utils/llm/byok_errors.py — specifically the lock-acquire-then-send ordering in _send_byok_llm_error_notification.

Important Files Changed

Filename Overview
backend/utils/llm/byok_errors.py New module implementing BYOK LLM error classification, structured logging, and FCM notification delivery. Contains a logic bug where the Redis dedup lock is consumed before a successful FCM send, silencing notifications for 24h on any Firebase failure. Token fetch also occurs before the lock check, wasting Firestore reads on every suppressed duplicate.
backend/utils/llm/clients.py Adds _LLMErrorCallback LangChain callback handler and wires it into all LLM client factories via _with_llm_callbacks. Embeddings proxy now wraps async/sync calls to catch and route errors. Changes are additive and non-breaking.
backend/utils/byok.py Adds _byok_uid_ctx ContextVar and corresponding get/set helpers; properly reset in BYOKMiddleware and validate_byok_websocket. Change is minimal and well-scoped.
backend/database/redis_db.py Adds try_acquire_byok_llm_error_notification_lock using Redis SET NX with a 24h default TTL. Implementation is correct.
backend/routers/sync.py Adds set_byok_uid call at pipeline start and a matching reset in the finally block.
backend/routers/pusher.py Adds set_byok_uid(uid) alongside the existing set_byok_keys call inside an async task.
backend/utils/retrieval/agentic.py Replaces a plain logger.error with handle_llm_error_async in the Anthropic agent stream error handler.
backend/tests/unit/test_byok_llm_notifications.py Unit tests covering successful send, deduplication, platform errors not notified, and bad-token cleanup.
backend/tests/unit/test_byok_llm_logging.py Unit tests for error classification and the logging/callback plumbing.

Sequence Diagram

sequenceDiagram
    participant Router as Router (pusher/sync)
    participant BYOK as byok.py
    participant Handler as handle_llm_error
    participant Redis as Redis
    participant Firestore as Firestore
    participant FCM as Firebase FCM

    Router->>BYOK: set_byok_uid(uid)
    Router->>Handler: LLM error triggers callback
    Handler->>Firestore: get_all_tokens(uid)
    Handler->>Redis: SET NX lock EX 86400
    alt Lock acquired
        Handler->>FCM: send_each(messages)
        alt FCM success
            FCM-->>Handler: BatchResponse
        else FCM failure
            Note over Handler,Redis: Lock held, no notification sent
        end
    else Lock not acquired
        Handler-->>Router: return
    end
Loading

Reviews (1): Last reviewed commit: "fix(backend): notify users about BYOK LL..." | Re-trigger Greptile

Comment on lines +140 to +158
try:
acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason)
except Exception as e:
logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e)
return

if not acquired:
logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason)
return

notification = messaging.Notification(title='omi', body=body)
data = {'type': 'byok_llm_error', 'provider': provider, 'reason': reason}
messages = [messaging.Message(token=token, notification=notification, data=data) for token in tokens]

try:
response = messaging.send_each(messages)
except Exception as e:
logger.error('BYOK LLM notification send failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Redis lock consumed but notification silently lost on FCM failure

The Redis dedup lock is acquired at line 141 with a 24-hour TTL, then if messaging.send_each raises (lines 154-158), the function returns without releasing the key. The lock is now held for 24 hours even though the user received no notification. Every subsequent BYOK error of the same type within that window will be suppressed by the held lock. A transient Firebase outage silently swallows the user's entire next-day notification quota for that provider/reason.

Comment on lines +128 to +148
try:
tokens = notification_db.get_all_tokens(uid)
except Exception as e:
logger.error(
'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e
)
return

if not tokens:
logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason)
return

try:
acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason)
except Exception as e:
logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e)
return

if not acquired:
logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Tokens fetched from DB before lock is checked — wastes reads for every suppressed duplicate

notification_db.get_all_tokens(uid) queries Firestore on every error occurrence, even when the Redis dedup lock is already held and no notification will be sent. For quota or auth errors (which repeat on every LLM call until the user fixes the key), this adds a Firestore read per call for what is effectively a no-op. Checking the lock first avoids that cost for the overwhelmingly common "already notified" path.

Suggested change
try:
tokens = notification_db.get_all_tokens(uid)
except Exception as e:
logger.error(
'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e
)
return
if not tokens:
logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason)
return
try:
acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason)
except Exception as e:
logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e)
return
if not acquired:
logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason)
return
try:
acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason)
except Exception as e:
logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e)
return
if not acquired:
logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason)
return
try:
tokens = notification_db.get_all_tokens(uid)
except Exception as e:
logger.error(
'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e
)
return
if not tokens:
logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason)
return

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

backend: notify users when their BYOK key is invalid or exhausted

1 participant