fix(backend): notify users about BYOK LLM key failures#7323
fix(backend): notify users about BYOK LLM key failures#7323apoorvdarshan wants to merge 2 commits into
Conversation
Greptile SummaryThis PR adds end-to-end BYOK LLM failure notifications: a new
Confidence Score: 3/5Safe to merge after addressing the lock-before-send ordering in _send_byok_llm_error_notification; all other changes are additive and well-tested. The core notification function acquires the 24-hour Redis dedup lock before attempting the FCM send. If Firebase throws, the lock is held but the user was never notified, defeating the feature's primary goal. backend/utils/llm/byok_errors.py — specifically the lock-acquire-then-send ordering in _send_byok_llm_error_notification. Important Files Changed
Sequence DiagramsequenceDiagram
participant Router as Router (pusher/sync)
participant BYOK as byok.py
participant Handler as handle_llm_error
participant Redis as Redis
participant Firestore as Firestore
participant FCM as Firebase FCM
Router->>BYOK: set_byok_uid(uid)
Router->>Handler: LLM error triggers callback
Handler->>Firestore: get_all_tokens(uid)
Handler->>Redis: SET NX lock EX 86400
alt Lock acquired
Handler->>FCM: send_each(messages)
alt FCM success
FCM-->>Handler: BatchResponse
else FCM failure
Note over Handler,Redis: Lock held, no notification sent
end
else Lock not acquired
Handler-->>Router: return
end
Reviews (1): Last reviewed commit: "fix(backend): notify users about BYOK LL..." | Re-trigger Greptile |
| try: | ||
| acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason) | ||
| except Exception as e: | ||
| logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e) | ||
| return | ||
|
|
||
| if not acquired: | ||
| logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason) | ||
| return | ||
|
|
||
| notification = messaging.Notification(title='omi', body=body) | ||
| data = {'type': 'byok_llm_error', 'provider': provider, 'reason': reason} | ||
| messages = [messaging.Message(token=token, notification=notification, data=data) for token in tokens] | ||
|
|
||
| try: | ||
| response = messaging.send_each(messages) | ||
| except Exception as e: | ||
| logger.error('BYOK LLM notification send failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e) | ||
| return |
There was a problem hiding this comment.
Redis lock consumed but notification silently lost on FCM failure
The Redis dedup lock is acquired at line 141 with a 24-hour TTL, then if messaging.send_each raises (lines 154-158), the function returns without releasing the key. The lock is now held for 24 hours even though the user received no notification. Every subsequent BYOK error of the same type within that window will be suppressed by the held lock. A transient Firebase outage silently swallows the user's entire next-day notification quota for that provider/reason.
| try: | ||
| tokens = notification_db.get_all_tokens(uid) | ||
| except Exception as e: | ||
| logger.error( | ||
| 'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e | ||
| ) | ||
| return | ||
|
|
||
| if not tokens: | ||
| logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason) | ||
| return | ||
|
|
||
| try: | ||
| acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason) | ||
| except Exception as e: | ||
| logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e) | ||
| return | ||
|
|
||
| if not acquired: | ||
| logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason) | ||
| return |
There was a problem hiding this comment.
Tokens fetched from DB before lock is checked — wastes reads for every suppressed duplicate
notification_db.get_all_tokens(uid) queries Firestore on every error occurrence, even when the Redis dedup lock is already held and no notification will be sent. For quota or auth errors (which repeat on every LLM call until the user fixes the key), this adds a Firestore read per call for what is effectively a no-op. Checking the lock first avoids that cost for the overwhelmingly common "already notified" path.
| try: | |
| tokens = notification_db.get_all_tokens(uid) | |
| except Exception as e: | |
| logger.error( | |
| 'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e | |
| ) | |
| return | |
| if not tokens: | |
| logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason) | |
| return | |
| try: | |
| acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason) | |
| except Exception as e: | |
| logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e) | |
| return | |
| if not acquired: | |
| logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason) | |
| return | |
| try: | |
| acquired = try_acquire_byok_llm_error_notification_lock(uid, provider, reason) | |
| except Exception as e: | |
| logger.error('BYOK LLM notification lock failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e) | |
| return | |
| if not acquired: | |
| logger.info('BYOK LLM notification already sent recently uid=%s provider=%s reason=%s', uid, provider, reason) | |
| return | |
| try: | |
| tokens = notification_db.get_all_tokens(uid) | |
| except Exception as e: | |
| logger.error( | |
| 'BYOK LLM notification token lookup failed uid=%s provider=%s reason=%s: %s', uid, provider, reason, e | |
| ) | |
| return | |
| if not tokens: | |
| logger.info('No tokens found for BYOK LLM notification uid=%s provider=%s reason=%s', uid, provider, reason) | |
| return |
Summary
Closes #7291
Stack note
This PR builds on the BYOK error-source plumbing from #7290. GitHub shows that base commit here because the upstream repo does not have the fork branch as a selectable base; after #7290 lands this can be rebased to show only the notification commit.
Verification