
fix: MutexOwner retries on transient Redis errors instead of crashing #2131

Open
isaacrowntree wants to merge 3 commits into sequinstream:main from isaacrowntree:fix/redis-reconnect-resilience

Conversation

@isaacrowntree

@isaacrowntree isaacrowntree commented Mar 26, 2026

Summary

Fixes #2072 — When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts, Sequin enters an unrecoverable failure state requiring a full restart. This has bitten us (and others per #2072) multiple times in production.

Root cause: MutexOwner in :has_mutex state calls acquire_mutex which returns :error when Redis is unreachable. The handler immediately returns {:shutdown, :err_keeping_mutex} — an invalid GenStateMachine return that crashes the process with {:bad_return_from_state_function, ...}. Due to MutexedSupervisor's :one_for_all strategy, this cascades and takes down the entire Runtime.Supervisor including all consumers.

Changes:

  • MutexOwner: Instead of crashing on Redis errors, retry indefinitely with exponential backoff (capped at 1 hour). When Redis comes back, the error counter resets and normal operation resumes. Also fixes the GenStateMachine return value ({:stop, {:shutdown, reason}} instead of invalid {:shutdown, reason}).
  • SinkConsumersLive.Index: Handle {:error, _} from metrics calls instead of crashing with MatchError.
  • SinkConsumersLive.Show: Same defensive handling for all three metrics calls.
  • HttpEndpointsLive.Show: Same fix for get_http_endpoint_throughput/1.
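The corrected handler shape can be sketched as follows. This is an illustrative sketch only, not the actual Sequin source: the helper names (acquire_mutex/1, renew_interval/0, backoff/1) and the error_count field are assumptions made for the example.

```elixir
# Sketch of the :has_mutex handler after this fix (names are illustrative).
def handle_event(:state_timeout, :renew, :has_mutex, state) do
  case acquire_mutex(state) do
    :ok ->
      # Redis is healthy: reset the error counter and renew on schedule.
      {:keep_state, %{state | error_count: 0},
       [{:state_timeout, renew_interval(), :renew}]}

    :error ->
      # Transient Redis failure: keep the state and retry with exponential
      # backoff (capped at one hour) instead of crashing the process.
      delay = backoff(state.error_count)

      {:keep_state, %{state | error_count: state.error_count + 1},
       [{:state_timeout, delay, :renew}]}
  end
end

# If the process ever does need to stop, the valid GenStateMachine return is
#   {:stop, {:shutdown, :err_keeping_mutex}}
# not the bare {:shutdown, :err_keeping_mutex} that produced
# {:bad_return_from_state_function, ...}.
```

The key point is that both branches return a valid keep_state tuple with a state_timeout action, so a Redis outage never escapes the state machine as a crash.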

Context

We self-host Sequin on Railway with Dragonfly (Redis-compatible) as the backing store. Railway periodically auto-updates Dragonfly, which causes a brief restart. Every time this happens, Sequin fails to self-heal and requires a manual restart — we've hit this 3 times now. The :await_mutex state already retries correctly on Redis errors; this PR brings the same resilience to the :has_mutex state, but with exponential backoff so it doesn't spam during extended outages.
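The capped exponential backoff described above can be expressed as a one-line function. This is a minimal sketch of the arithmetic, assuming a 1-second base delay doubling per consecutive error; the module and function names are hypothetical, not the actual implementation.

```elixir
defmodule BackoffSketch do
  @base_ms :timer.seconds(1)   # 1_000 ms starting delay (assumed base)
  @max_ms :timer.hours(1)      # hard cap of one hour, per the PR summary

  # Delay doubles with each consecutive Redis error, capped at one hour.
  def backoff(error_count) when error_count >= 0 do
    min(@base_ms * Integer.pow(2, error_count), @max_ms)
  end
end

# BackoffSketch.backoff(0)  => 1_000 ms
# BackoffSketch.backoff(12) => 3_600_000 ms (cap reached: 4_096_000 > 1 hour)
```

Resetting error_count to 0 on the first successful renewal is what makes the delay drop back to the base as soon as Redis returns.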

Test plan

  • Added MutexOwnerTest with unit tests for State struct and integration tests (tagged :integration) that use iptables REJECT to simulate Redis going down and coming back
  • Integration tests verify: process survives Redis outage, never crashes, recovers when Redis returns
  • All existing tests pass (mix test)
  • mix format --check-formatted passes
  • Verify in staging by restarting Redis while Sequin is running

Note: Integration tests require NET_ADMIN capability (for iptables) and are tagged :integration so they can be excluded from normal test runs: mix test --exclude integration

🤖 Generated with Claude Code

When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts,
MutexOwner would immediately crash with {:shutdown, :err_keeping_mutex}.
Since MutexedSupervisor uses :one_for_all strategy, this cascades and
takes down the entire Runtime.Supervisor including all consumers.

The fix:
- MutexOwner now retries up to 5 times with backoff on Redis errors
  while in :has_mutex state, giving Redis time to come back
- Resets the error counter on successful reconnection
- Also fixes the GenStateMachine return value (was {:shutdown, reason}
  which is invalid - now {:stop, {:shutdown, reason}})
- LiveView pages (index.ex, show.ex) now handle Redis errors gracefully
  in metrics loading instead of crashing with MatchError

Fixes sequinstream#2072

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files), bug (Something isn't working), and reliability labels on Mar 26, 2026
isaacrowntree and others added 2 commits March 27, 2026 00:35
…s errors

Instead of giving up after N errors, MutexOwner now retries indefinitely
with exponential backoff capped at 1 hour. Redis going down should never
crash Sequin — it should degrade gracefully and self-heal when Redis returns.

Integration tests use iptables REJECT to simulate a real Redis/Dragonfly
redeploy and verify the process survives the outage and recovers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation tests

Unit tests (6, <0.1s): verify backoff math, error counter behavior, and
state struct defaults without needing Redis.

Integration tests (2, ~35s each, tagged :integration): use iptables REJECT
to simulate Redis going down, verifying the process survives and recovers.
Run with: mix test --include integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Redis lost connection does not seem to get re-created
