
fix: Redis resilience - replace permanent failure flag with retry #16

Merged
shuveb merged 5 commits into main from fix/redis-resilience-stale-connections
Mar 13, 2026

Conversation

ritwik-g (Contributor) commented Mar 13, 2026

Summary

  • Bug: When Redis connections failed (stale TCP connections dropped by k8s conntrack/kube-proxy), _redis_available was permanently set to False, disabling caching for the entire process lifetime. This caused all VFS operations to hit the database directly, leading to slow MCP responses (>30s) and 502 backend_timeout errors at the GCP load balancer.
  • Fix: Replace the permanent _redis_available = False flag with a time-based backoff (_redis_unavailable_until) that retries after 30 seconds.
  • Add health_check_interval=15 to detect stale connections before use (sends PING if 15s since last command).
  • On ConnectionError or TimeoutError during cache read/write/invalidate, reset the client and enter backoff so subsequent requests don't hammer a dead connection.
  • Re-fetch Redis client before cache writes to avoid using a stale local reference after a read failure.

Context

In production (unstract-prod cluster), Redis connections were being dropped by k8s networking after idle periods. The first failure set _redis_available = False permanently, disabling all caching until pod restart. With no cache, VFS operations for large projects exceeded the 30s LB timeout.

Related PR: Zipstack/mfbt-enterprise - BackendConfig timeout increase and Redis tcp-keepalive

Test plan

  • Verify Redis client reconnects after transient connection failure
  • Verify 30s backoff prevents hammering Redis during outage
  • Verify health_check_interval detects stale connections
  • Verify reset_redis_connection() still works for tests
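The "30s backoff prevents hammering" item above can be checked with a small simulation. The `Backoff` class and explicit `now` parameter are stand-ins for the real module's globals, chosen so the behaviour is testable without a clock or a Redis server.

```python
# Illustrative simulation of the backoff window the test plan verifies;
# Backoff is a stand-in, not the real analytics_cache module.
class Backoff:
    def __init__(self, window: float = 30.0):
        self.window = window
        self.unavailable_until = 0.0

    def available(self, now: float) -> bool:
        return now >= self.unavailable_until

    def mark_failed(self, now: float) -> None:
        self.unavailable_until = now + self.window


b = Backoff()
b.mark_failed(now=100.0)
# Attempts inside the 30 s window are rejected; only t=131.0 gets through.
attempts = [t for t in (105.0, 115.0, 125.0, 131.0) if b.available(t)]
```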

🤖 Generated with Claude Code

When Redis connections failed (e.g., stale TCP connections dropped by
k8s networking), the _redis_available flag was permanently set to False,
disabling caching for the entire process lifetime. This caused all VFS
operations to hit the database directly, leading to slow MCP responses
and 502 backend_timeout errors at the load balancer.

Changes:
- Replace permanent _redis_available flag with time-based backoff (30s)
- Add health_check_interval=15 to detect stale connections before use
- Add retry_on_timeout=True for transient timeout errors
- Reset client on ConnectionError so pool can create fresh connections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ritwik-g requested a review from shuveb March 13, 2026 06:12
greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a critical production issue in analytics_cache.py where a single Redis ConnectionError would permanently disable caching for the entire pod lifetime (via _redis_available = False). The fix replaces that flag with a time-based _redis_unavailable_until float that expires after 30 seconds, allowing the pod to self-heal without a restart. It also adds health_check_interval=15 to proactively detect stale TCP connections, separates the cache-write client lookup so a mid-request failure doesn't attempt a write on a known-dead connection, and adds specific ConnectionError/TimeoutError handlers in all three Redis-touching paths (cache_result read, cache_result write, invalidate_analytics_cache).
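The read → compute → re-fetch → write flow described above can be sketched as a decorator. This is a rough reconstruction, not the merged `cache_result`: `RedisLikeError` stands in for redis-py's `ConnectionError`/`TimeoutError`, and the `get_client`/`mark_unavailable` callables are injected so the sketch runs without a Redis server.

```python
# Sketch of the separated read/write client lookup Greptile describes.
import functools
import json


class RedisLikeError(Exception):
    """Stand-in for redis.ConnectionError / redis.TimeoutError."""


def cache_result(get_client, mark_unavailable, key: str, ttl: int = 300):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            client = get_client()
            if client is not None:
                try:
                    cached = client.get(key)
                    if cached is not None:
                        return json.loads(cached)
                except RedisLikeError:
                    mark_unavailable()  # enter backoff; fall through to func
            result = func(*args, **kwargs)
            # Re-fetch instead of reusing the local reference: if the read
            # failed we are now in backoff, this returns None, and the write
            # on the known-dead connection is skipped entirely.
            client = get_client()
            if client is not None:
                try:
                    client.setex(key, ttl, json.dumps(result))
                except RedisLikeError:
                    mark_unavailable()
            return result
        return wrapper
    return decorator
```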

  • The core backoff logic is correct and addresses the described k8s conntrack/kube-proxy TCP-drop scenario.
  • One discrepancy: the PR description states "Add retry_on_timeout=True for transient timeout errors" but line 58 has retry_on_timeout=False. If the intent is to let redis-py retry once on a TimeoutError before surfacing it to the application, this should be True. If the intent is to fail-fast and let the application-level backoff take over (avoiding 2× socket timeout latency), False is correct — but the PR description must be updated.
  • No tests appear to have been added or updated alongside this change; CLAUDE.md requires all new code to have tests, and the PR test plan items remain unchecked.

Confidence Score: 4/5

  • Safe to merge with one discrepancy to resolve: clarify whether retry_on_timeout should be True or False and align with the PR description.
  • The change is a targeted, well-scoped resilience improvement with graceful fallback throughout. The backoff logic is sound and directly addresses the described production failure. The only notable issue is the retry_on_timeout=False vs. the stated True — depending on intent this is either a comment error or a logic bug, but either way it does not make the system less reliable than the old code. The lack of tests is a process gap per the project's TDD policy.
  • backend/app/services/analytics_cache.py — confirm retry_on_timeout value intent

Important Files Changed

Filename Overview
backend/app/services/analytics_cache.py Replaces the permanent _redis_available boolean with a time-based _redis_unavailable_until float for a 30-second retry backoff. Adds health_check_interval=15, splits the cache-write path to re-fetch the client after a connection failure, and adds explicit ConnectionError/TimeoutError handling in all three Redis-touching code paths. One discrepancy: retry_on_timeout=False in code vs. retry_on_timeout=True stated in PR description.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Request: get_redis_client] --> B{redis_url configured?}
    B -- No --> C[Return None]
    B -- Yes --> D{_redis_unavailable_until\n> time.monotonic?}
    D -- Yes\nbackoff active --> C
    D -- No --> E{_redis_client\nnot None?}
    E -- Yes --> F[Return existing client]
    E -- No --> G[Create new client\nhealth_check_interval=15\nretry_on_timeout=False]
    G --> H[client.ping]
    H -- Success --> I[_redis_unavailable_until = 0\nReturn client]
    H -- Exception --> J[_redis_unavailable_until =\nnow + 30s\n_redis_client = None\nReturn None]

    K[cache_result wrapper] --> L[get_redis_client]
    L -- None --> M[Execute func directly]
    L -- client --> N[client.get cache_key]
    N -- Hit --> O[Return cached value]
    N -- ConnectionError/TimeoutError --> P[_mark_redis_unavailable\n_redis_client=None\n_redis_unavailable_until=now+30s]
    P --> Q[Execute func]
    Q --> R[get_redis_client again]
    R -- None\nstill in backoff --> S[Skip write\nReturn result]
    R -- client --> T[client.setex cache_key]
    T -- ConnectionError/TimeoutError --> U[_mark_redis_unavailable]
    T -- Success --> V[Return result]

Comments Outside Diff (1)

  1. backend/app/services/analytics_cache.py, line 58 (link)

    retry_on_timeout contradicts PR description

    The PR description explicitly states: "Add retry_on_timeout=True for transient timeout errors." However, the code sets retry_on_timeout=False.

    With retry_on_timeout=False, a TimeoutError during the health_check_interval PING (or any redis command) will immediately propagate to the application, causing _mark_redis_unavailable() to be called and entering the 30-second backoff. With retry_on_timeout=True, redis-py would attempt one automatic retry before raising — which could help with momentary network hiccups without triggering the full backoff.

    If the intent is to rely solely on the application-level backoff (fail-fast to avoid doubling the 2s socket timeout), then False may be the correct choice — but the PR description and the comment in the PR need to be corrected. If the intent was True, this is a bug.
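Given the fail-fast choice the reviewer describes, one plausible client construction looks like the fragment below. This is a hedged sketch: the URL is a placeholder, and the parameter values are taken from the review discussion (2 s socket timeout, 15 s health check), not copied from the merged diff.

```python
# Config fragment only; connection is lazy, so nothing is contacted here.
import redis

client = redis.Redis.from_url(
    "redis://localhost:6379/0",  # placeholder URL, not the production one
    socket_timeout=2,            # each command fails within ~2 s
    health_check_interval=15,    # PING before use if idle > 15 s
    retry_on_timeout=False,      # fail fast; 30 s app-level backoff handles retry
    decode_responses=True,
)
```

With `retry_on_timeout=True`, a dead connection would cost roughly two socket timeouts (~4 s) per command before the application sees the error; with `False`, it costs one.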


Last reviewed commit: b1a3099

ritwik-g and others added 4 commits March 13, 2026 11:49
…r read failure

- Add ConnectionError handling to invalidate_analytics_cache so it
  resets the stale client instead of leaving it for subsequent callers.
- Skip cache write attempt if read already marked Redis unavailable,
  avoiding a redundant 2s timeout on the dead connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Error

- Write path now calls get_redis_client() for a fresh reference instead
  of reusing the potentially stale local `client` from the read phase.
  Prevents a reconnected client from being killed by a stale reference.
- Catch redis.TimeoutError alongside ConnectionError in all three sites
  (read, write, invalidate). TimeoutError is not a ConnectionError
  subclass, so it was falling through to the generic handler without
  triggering backoff — causing 4s penalty per request under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With retry_on_timeout=True, timed-out commands block for 4s (2s × 2)
before fallback. health_check_interval=15 already detects stale
connections proactively, making the retry redundant and harmful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
shuveb merged commit c490443 into main Mar 13, 2026
3 checks passed
shuveb deleted the fix/redis-resilience-stale-connections branch March 13, 2026 07:08