
fix: Redis resilience - replace permanent failure flag with retry #16

Merged
shuveb merged 5 commits into main from fix/redis-resilience-stale-connections
Mar 13, 2026

Conversation

ritwik-g (Contributor) commented Mar 13, 2026

Summary

  • Bug: When Redis connections failed (stale TCP connections dropped by k8s conntrack/kube-proxy), _redis_available was permanently set to False, disabling caching for the entire process lifetime. This caused all VFS operations to hit the database directly, leading to slow MCP responses (>30s) and 502 backend_timeout errors at the GCP load balancer.
  • Fix: Replace the permanent _redis_available = False flag with a time-based backoff (_redis_unavailable_until) that retries after 30 seconds.
  • Add health_check_interval=15 to detect stale connections before use (sends PING if 15s since last command).
  • On ConnectionError or TimeoutError during cache read/write/invalidate, reset the client and enter backoff so subsequent requests don't hammer a dead connection.
  • Re-fetch Redis client before cache writes to avoid using a stale local reference after a read failure.

Context

In production (unstract-prod cluster), Redis connections were being dropped by k8s networking after idle periods. The first failure set _redis_available = False permanently, disabling all caching until pod restart. With no cache, VFS operations for large projects exceeded the 30s LB timeout.

Related PR: Zipstack/mfbt-enterprise - BackendConfig timeout increase and Redis tcp-keepalive

Test plan

  • Verify Redis client reconnects after transient connection failure
  • Verify 30s backoff prevents hammering Redis during outage
  • Verify health_check_interval detects stale connections
  • Verify reset_redis_connection() still works for tests
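The "30s backoff prevents hammering" item above can be checked with a small simulation. The `Backoff` class and explicit `now` parameter are stand-ins for the real module's globals, chosen so the behaviour is testable without a clock or a Redis server.

```python
# Illustrative simulation of the backoff window the test plan verifies;
# Backoff is a stand-in, not the real analytics_cache module.
class Backoff:
    def __init__(self, window: float = 30.0):
        self.window = window
        self.unavailable_until = 0.0

    def available(self, now: float) -> bool:
        return now >= self.unavailable_until

    def mark_failed(self, now: float) -> None:
        self.unavailable_until = now + self.window


b = Backoff()
b.mark_failed(now=100.0)
# Attempts inside the 30 s window are rejected; only t=131.0 gets through.
attempts = [t for t in (105.0, 115.0, 125.0, 131.0) if b.available(t)]
```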

🤖 Generated with Claude Code

When Redis connections failed (e.g., stale TCP connections dropped by
k8s networking), the _redis_available flag was permanently set to False,
disabling caching for the entire process lifetime. This caused all VFS
operations to hit the database directly, leading to slow MCP responses
and 502 backend_timeout errors at the load balancer.

Changes:
- Replace permanent _redis_available flag with time-based backoff (30s)
- Add health_check_interval=15 to detect stale connections before use
- Add retry_on_timeout=True for transient timeout errors
- Reset client on ConnectionError so pool can create fresh connections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ritwik-g requested a review from shuveb March 13, 2026 06:12
greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a critical production issue in analytics_cache.py where a single Redis ConnectionError would permanently disable caching for the entire pod lifetime (via _redis_available = False). The fix replaces that flag with a time-based _redis_unavailable_until float that expires after 30 seconds, allowing the pod to self-heal without a restart. It also adds health_check_interval=15 to proactively detect stale TCP connections, separates the cache-write client lookup so a mid-request failure doesn't attempt a write on a known-dead connection, and adds specific ConnectionError/TimeoutError handlers in all three Redis-touching paths (cache_result read, cache_result write, invalidate_analytics_cache).
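The read → compute → re-fetch → write flow described above can be sketched as a decorator. This is a rough reconstruction, not the merged `cache_result`: `RedisLikeError` stands in for redis-py's `ConnectionError`/`TimeoutError`, and the `get_client`/`mark_unavailable` callables are injected so the sketch runs without a Redis server.

```python
# Sketch of the separated read/write client lookup Greptile describes.
import functools
import json


class RedisLikeError(Exception):
    """Stand-in for redis.ConnectionError / redis.TimeoutError."""


def cache_result(get_client, mark_unavailable, key: str, ttl: int = 300):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            client = get_client()
            if client is not None:
                try:
                    cached = client.get(key)
                    if cached is not None:
                        return json.loads(cached)
                except RedisLikeError:
                    mark_unavailable()  # enter backoff; fall through to func
            result = func(*args, **kwargs)
            # Re-fetch instead of reusing the local reference: if the read
            # failed we are now in backoff, this returns None, and the write
            # on the known-dead connection is skipped entirely.
            client = get_client()
            if client is not None:
                try:
                    client.setex(key, ttl, json.dumps(result))
                except RedisLikeError:
                    mark_unavailable()
            return result
        return wrapper
    return decorator
```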

  • The core backoff logic is correct and addresses the described k8s conntrack/kube-proxy TCP-drop scenario.
  • One discrepancy: the PR description states "Add retry_on_timeout=True for transient timeout errors" but line 58 has retry_on_timeout=False. If the intent is to let redis-py retry once on a TimeoutError before surfacing it to the application, this should be True. If the intent is to fail-fast and let the application-level backoff take over (avoiding 2× socket timeout latency), False is correct — but the PR description must be updated.
  • No tests appear to have been added or updated alongside this change; CLAUDE.md requires all new code to have tests, and the PR test plan items remain unchecked.

Confidence Score: 4/5

  • Safe to merge with one discrepancy to resolve: clarify whether retry_on_timeout should be True or False and align with the PR description.
  • The change is a targeted, well-scoped resilience improvement with graceful fallback throughout. The backoff logic is sound and directly addresses the described production failure. The only notable issue is the retry_on_timeout=False vs. the stated True — depending on intent this is either a comment error or a logic bug, but either way it does not make the system less reliable than the old code. The lack of tests is a process gap per the project's TDD policy.
  • backend/app/services/analytics_cache.py — confirm retry_on_timeout value intent

Important Files Changed

Filename Overview
backend/app/services/analytics_cache.py Replaces the permanent _redis_available boolean with a time-based _redis_unavailable_until float for a 30-second retry backoff. Adds health_check_interval=15, splits the cache-write path to re-fetch the client after a connection failure, and adds explicit ConnectionError/TimeoutError handling in all three Redis-touching code paths. One discrepancy: retry_on_timeout=False in code vs. retry_on_timeout=True stated in PR description.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Request: get_redis_client] --> B{redis_url configured?}
    B -- No --> C[Return None]
    B -- Yes --> D{_redis_unavailable_until\n> time.monotonic?}
    D -- Yes\nbackoff active --> C
    D -- No --> E{_redis_client\nnot None?}
    E -- Yes --> F[Return existing client]
    E -- No --> G[Create new client\nhealth_check_interval=15\nretry_on_timeout=False]
    G --> H[client.ping]
    H -- Success --> I[_redis_unavailable_until = 0\nReturn client]
    H -- Exception --> J[_redis_unavailable_until =\nnow + 30s\n_redis_client = None\nReturn None]

    K[cache_result wrapper] --> L[get_redis_client]
    L -- None --> M[Execute func directly]
    L -- client --> N[client.get cache_key]
    N -- Hit --> O[Return cached value]
    N -- ConnectionError/TimeoutError --> P[_mark_redis_unavailable\n_redis_client=None\n_redis_unavailable_until=now+30s]
    P --> Q[Execute func]
    Q --> R[get_redis_client again]
    R -- None\nstill in backoff --> S[Skip write\nReturn result]
    R -- client --> T[client.setex cache_key]
    T -- ConnectionError/TimeoutError --> U[_mark_redis_unavailable]
    T -- Success --> V[Return result]

Comments Outside Diff (1)

  1. backend/app/services/analytics_cache.py, line 58 (link)

    retry_on_timeout contradicts PR description

    The PR description explicitly states: "Add retry_on_timeout=True for transient timeout errors." However, the code sets retry_on_timeout=False.

    With retry_on_timeout=False, a TimeoutError during the health_check_interval PING (or any redis command) will immediately propagate to the application, causing _mark_redis_unavailable() to be called and entering the 30-second backoff. With retry_on_timeout=True, redis-py would attempt one automatic retry before raising — which could help with momentary network hiccups without triggering the full backoff.

    If the intent is to rely solely on the application-level backoff (fail-fast to avoid doubling the 2s socket timeout), then False may be the correct choice — but the PR description and the comment in the PR need to be corrected. If the intent was True, this is a bug.
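Given the fail-fast choice the reviewer describes, one plausible client construction looks like the fragment below. This is a hedged sketch: the URL is a placeholder, and the parameter values are taken from the review discussion (2 s socket timeout, 15 s health check), not copied from the merged diff.

```python
# Config fragment only; connection is lazy, so nothing is contacted here.
import redis

client = redis.Redis.from_url(
    "redis://localhost:6379/0",  # placeholder URL, not the production one
    socket_timeout=2,            # each command fails within ~2 s
    health_check_interval=15,    # PING before use if idle > 15 s
    retry_on_timeout=False,      # fail fast; 30 s app-level backoff handles retry
    decode_responses=True,
)
```

With `retry_on_timeout=True`, a dead connection would cost roughly two socket timeouts (~4 s) per command before the application sees the error; with `False`, it costs one.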


Last reviewed commit: b1a3099

ritwik-g and others added 4 commits March 13, 2026 11:49
…r read failure

- Add ConnectionError handling to invalidate_analytics_cache so it
  resets the stale client instead of leaving it for subsequent callers.
- Skip cache write attempt if read already marked Redis unavailable,
  avoiding a redundant 2s timeout on the dead connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Error

- Write path now calls get_redis_client() for a fresh reference instead
  of reusing the potentially stale local `client` from the read phase.
  Prevents a reconnected client from being killed by a stale reference.
- Catch redis.TimeoutError alongside ConnectionError in all three sites
  (read, write, invalidate). TimeoutError is not a ConnectionError
  subclass, so it was falling through to the generic handler without
  triggering backoff — causing 4s penalty per request under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With retry_on_timeout=True, timed-out commands block for 4s (2s × 2)
before fallback. health_check_interval=15 already detects stale
connections proactively, making the retry redundant and harmful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
shuveb merged commit c490443 into main Mar 13, 2026
3 checks passed
shuveb deleted the fix/redis-resilience-stale-connections branch March 13, 2026 07:08