fix: Redis resilience - replace permanent failure flag with retry #16
Conversation
When Redis connections failed (e.g., stale TCP connections dropped by k8s networking), the `_redis_available` flag was permanently set to `False`, disabling caching for the entire process lifetime. This caused all VFS operations to hit the database directly, leading to slow MCP responses and 502 `backend_timeout` errors at the load balancer.

Changes:
- Replace the permanent `_redis_available` flag with a time-based backoff (30s)
- Add `health_check_interval=15` to detect stale connections before use
- Add `retry_on_timeout=True` for transient timeout errors
- Reset the client on `ConnectionError` so the pool can create fresh connections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR fixes a critical production issue in …
Confidence Score: 4/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Request: get_redis_client] --> B{redis_url configured?}
B -- No --> C[Return None]
B -- Yes --> D{_redis_unavailable_until\n> time.monotonic?}
D -- Yes\nbackoff active --> C
D -- No --> E{_redis_client\nnot None?}
E -- Yes --> F[Return existing client]
E -- No --> G[Create new client\nhealth_check_interval=15\nretry_on_timeout=False]
G --> H[client.ping]
H -- Success --> I[_redis_unavailable_until = 0\nReturn client]
H -- Exception --> J[_redis_unavailable_until =\nnow + 30s\n_redis_client = None\nReturn None]
K[cache_result wrapper] --> L[get_redis_client]
L -- None --> M[Execute func directly]
L -- client --> N[client.get cache_key]
N -- Hit --> O[Return cached value]
N -- ConnectionError/TimeoutError --> P[_mark_redis_unavailable\n_redis_client=None\n_redis_unavailable_until=now+30s]
P --> Q[Execute func]
Q --> R[get_redis_client again]
R -- None\nstill in backoff --> S[Skip write\nReturn result]
R -- client --> T[client.setex cache_key]
T -- ConnectionError/TimeoutError --> U[_mark_redis_unavailable]
T -- Success --> V[Return result]
…r read failure

- Add `ConnectionError` handling to `invalidate_analytics_cache` so it resets the stale client instead of leaving it for subsequent callers.
- Skip the cache write attempt if the read already marked Redis unavailable, avoiding a redundant 2s timeout on the dead connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Error

- Write path now calls `get_redis_client()` for a fresh reference instead of reusing the potentially stale local `client` from the read phase. Prevents a reconnected client from being killed by a stale reference.
- Catch `redis.TimeoutError` alongside `ConnectionError` in all three sites (read, write, invalidate). `TimeoutError` is not a `ConnectionError` subclass, so it was falling through to the generic handler without triggering backoff, causing a 4s penalty per request under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With `retry_on_timeout=True`, timed-out commands block for 4s (2s × 2) before falling back. `health_check_interval=15` already detects stale connections proactively, making the retry redundant and harmful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
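A sketch of the client construction after this change, using redis-py's `Redis.from_url`. The URL is a placeholder, and `socket_timeout=2` is an assumption inferred from the 2s-per-command timeout mentioned in the earlier commit; only `health_check_interval=15` and the removal of `retry_on_timeout` come directly from this PR.

```python
import redis

client = redis.Redis.from_url(
    "redis://localhost:6379/0",  # placeholder URL
    socket_timeout=2,            # assumption: 2s per-command timeout
    health_check_interval=15,    # PING before use if idle > 15s
    # retry_on_timeout deliberately omitted: with it, a timed-out
    # command would block 2s x 2 = 4s before falling back.
)
```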
Summary
When Redis connections failed, `_redis_available` was permanently set to `False`, disabling caching for the entire process lifetime. This caused all VFS operations to hit the database directly, leading to slow MCP responses (>30s) and 502 `backend_timeout` errors at the GCP load balancer.

Changes:
- Replace the permanent `_redis_available = False` flag with a time-based backoff (`_redis_unavailable_until`) that retries after 30 seconds.
- Add `health_check_interval=15` to detect stale connections before use (sends PING if 15s since last command).
- On `ConnectionError` or `TimeoutError` during cache read/write/invalidate, reset the client and enter backoff so subsequent requests don't hammer a dead connection.

Context
In production (`unstract-prod` cluster), Redis connections were being dropped by k8s networking after idle periods. The first failure set `_redis_available = False` permanently, disabling all caching until pod restart. With no cache, VFS operations for large projects exceeded the 30s LB timeout.

Related PR: Zipstack/mfbt-enterprise - BackendConfig timeout increase and Redis `tcp-keepalive`

Test plan
- `health_check_interval` detects stale connections
- `reset_redis_connection()` still works for tests

🤖 Generated with Claude Code
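The backoff expiry in the test plan can be verified without sleeping 30s by injecting a fake clock. The `Backoff` class below is a hypothetical test harness, not the PR's actual identifiers; it only assumes the behaviour described above (mark on failure, retry after the window passes).

```python
import time

class Backoff:
    """Time-based backoff with an injectable clock for testability."""
    def __init__(self, seconds: float, clock=time.monotonic):
        self._seconds = seconds
        self._clock = clock
        self._until = 0.0

    def mark_unavailable(self) -> None:
        """Failure observed: skip Redis until the deadline passes."""
        self._until = self._clock() + self._seconds

    def active(self) -> bool:
        """True while the backoff window is still open."""
        return self._clock() < self._until

# Drive the window with a fake clock under test control.
now = [0.0]
b = Backoff(30.0, clock=lambda: now[0])
assert not b.active()
b.mark_unavailable()   # simulate a ConnectionError
assert b.active()      # within the 30s window: Redis is skipped
now[0] = 31.0          # advance past the window
assert not b.active()  # next get_redis_client() may retry
```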