
fix: MutexOwner retries on transient Redis errors instead of crashing #2131

Open
isaacrowntree wants to merge 3 commits into sequinstream:main from isaacrowntree:fix/redis-reconnect-resilience

Conversation

@isaacrowntree

@isaacrowntree isaacrowntree commented Mar 26, 2026

Summary

Fixes #2072 — When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts, Sequin enters an unrecoverable failure state requiring a full restart. This has bitten us (and others per #2072) multiple times in production.

Root cause: MutexOwner in :has_mutex state calls acquire_mutex which returns :error when Redis is unreachable. The handler immediately returns {:shutdown, :err_keeping_mutex} — an invalid GenStateMachine return that crashes the process with {:bad_return_from_state_function, ...}. Due to MutexedSupervisor's :one_for_all strategy, this cascades and takes down the entire Runtime.Supervisor including all consumers.

Changes:

  • MutexOwner: Instead of crashing on Redis errors, retry indefinitely with exponential backoff (capped at 1 hour). When Redis comes back, the error counter resets and normal operation resumes. Also fixes the GenStateMachine return value ({:stop, {:shutdown, reason}} instead of invalid {:shutdown, reason}).
  • SinkConsumersLive.Index: Handle {:error, _} from metrics calls instead of crashing with MatchError.
  • SinkConsumersLive.Show: Same defensive handling for all three metrics calls.
  • HttpEndpointsLive.Show: Same fix for get_http_endpoint_throughput/1.
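The corrected handler shape can be sketched as follows. This is an illustrative sketch only, not the actual Sequin source: the helper names (acquire_mutex/1, renew_interval/0, backoff/1) and the error_count field are assumptions made for the example.

```elixir
# Sketch of the :has_mutex handler after this fix (names are illustrative).
def handle_event(:state_timeout, :renew, :has_mutex, state) do
  case acquire_mutex(state) do
    :ok ->
      # Redis is healthy: reset the error counter and renew on schedule.
      {:keep_state, %{state | error_count: 0},
       [{:state_timeout, renew_interval(), :renew}]}

    :error ->
      # Transient Redis failure: keep the state and retry with exponential
      # backoff (capped at one hour) instead of crashing the process.
      delay = backoff(state.error_count)

      {:keep_state, %{state | error_count: state.error_count + 1},
       [{:state_timeout, delay, :renew}]}
  end
end

# If the process ever does need to stop, the valid GenStateMachine return is
#   {:stop, {:shutdown, :err_keeping_mutex}}
# not the bare {:shutdown, :err_keeping_mutex} that produced
# {:bad_return_from_state_function, ...}.
```

The key point is that both branches return a valid keep_state tuple with a state_timeout action, so a Redis outage never escapes the state machine as a crash.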

Context

We self-host Sequin on Railway with Dragonfly (Redis-compatible) as the backing store. Railway periodically auto-updates Dragonfly, which causes a brief restart. Every time this happens, Sequin fails to self-heal and requires a manual restart — we've hit this 3 times now. The :await_mutex state already retries correctly on Redis errors; this PR brings the same resilience to the :has_mutex state, but with exponential backoff so it doesn't spam during extended outages.
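The capped exponential backoff described above can be expressed as a one-line function. This is a minimal sketch of the arithmetic, assuming a 1-second base delay doubling per consecutive error; the module and function names are hypothetical, not the actual implementation.

```elixir
defmodule BackoffSketch do
  @base_ms :timer.seconds(1)   # 1_000 ms starting delay (assumed base)
  @max_ms :timer.hours(1)      # hard cap of one hour, per the PR summary

  # Delay doubles with each consecutive Redis error, capped at one hour.
  def backoff(error_count) when error_count >= 0 do
    min(@base_ms * Integer.pow(2, error_count), @max_ms)
  end
end

# BackoffSketch.backoff(0)  => 1_000 ms
# BackoffSketch.backoff(12) => 3_600_000 ms (cap reached: 4_096_000 > 1 hour)
```

Resetting error_count to 0 on the first successful renewal is what makes the delay drop back to the base as soon as Redis returns.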

Test plan

  • Added MutexOwnerTest with unit tests for State struct and integration tests (tagged :integration) that use iptables REJECT to simulate Redis going down and coming back
  • Integration tests verify: process survives Redis outage, never crashes, recovers when Redis returns
  • All existing tests pass (mix test)
  • mix format --check-formatted passes
  • Verify in staging by restarting Redis while Sequin is running

Note: Integration tests require NET_ADMIN capability (for iptables) and are tagged :integration so they can be excluded from normal test runs: mix test --exclude integration

🤖 Generated with Claude Code

When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts,
MutexOwner would immediately crash with {:shutdown, :err_keeping_mutex}.
Since MutexedSupervisor uses :one_for_all strategy, this cascades and
takes down the entire Runtime.Supervisor including all consumers.

The fix:
- MutexOwner now retries up to 5 times with backoff on Redis errors
  while in :has_mutex state, giving Redis time to come back
- Resets the error counter on successful reconnection
- Also fixes the GenStateMachine return value (was {:shutdown, reason}
  which is invalid - now {:stop, {:shutdown, reason}})
- LiveView pages (index.ex, show.ex) now handle Redis errors gracefully
  in metrics loading instead of crashing with MatchError

Fixes sequinstream#2072

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files), bug (Something isn't working), and reliability labels on Mar 26, 2026
isaacrowntree and others added 2 commits March 27, 2026 00:35
…s errors

Instead of giving up after N errors, MutexOwner now retries indefinitely
with exponential backoff capped at 1 hour. Redis going down should never
crash Sequin — it should degrade gracefully and self-heal when Redis returns.

Integration tests use iptables REJECT to simulate a real Redis/Dragonfly
redeploy and verify the process survives the outage and recovers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation tests

Unit tests (6, <0.1s): verify backoff math, error counter behavior, and
state struct defaults without needing Redis.

Integration tests (2, ~35s each, tagged :integration): use iptables REJECT
to simulate Redis going down, verifying the process survives and recovers.
Run with: mix test --include integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Redis lost connection does not seem to get re-created
