fix: MutexOwner retries on transient Redis errors instead of crashing #2131
Open
isaacrowntree wants to merge 3 commits into sequinstream:main from
When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts,
MutexOwner would immediately crash with {:shutdown, :err_keeping_mutex}.
Since MutexedSupervisor uses :one_for_all strategy, this cascades and
takes down the entire Runtime.Supervisor including all consumers.
The fix:
- MutexOwner now retries up to 5 times with backoff on Redis errors
while in :has_mutex state, giving Redis time to come back
- Resets the error counter on successful reconnection
- Also fixes the GenStateMachine return value (was {:shutdown, reason}
which is invalid - now {:stop, {:shutdown, reason}})
- LiveView pages (index.ex, show.ex) now handle Redis errors gracefully
in metrics loading instead of crashing with MatchError
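The defensive pattern in the last bullet can be sketched roughly as follows. This is only an illustration: `MetricsSketch`, `fetch_throughput/1`, the `:redis_up`/`:redis_down` arguments, and the `nil` fallback are hypothetical stand-ins, not the actual Sequin functions or the values the LiveViews render.

```elixir
defmodule MetricsSketch do
  # Hypothetical stand-in for a Redis-backed metrics call.
  # The real Sequin metrics functions and their return values may differ.
  def fetch_throughput(:redis_up), do: {:ok, 42.0}
  def fetch_throughput(:redis_down), do: {:error, :closed}

  # A bare `{:ok, value} = fetch_throughput(conn)` raises MatchError when the
  # call returns {:error, _}. Matching both shapes keeps the caller alive.
  def load_throughput(conn) do
    case fetch_throughput(conn) do
      {:ok, value} -> value
      # Assumed fallback: nil, so the page can render a placeholder.
      {:error, _reason} -> nil
    end
  end
end

IO.inspect(MetricsSketch.load_throughput(:redis_up))
IO.inspect(MetricsSketch.load_throughput(:redis_down))
```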
Fixes sequinstream#2072
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
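For reviewers, here is a minimal sketch of the bounded-retry behavior this commit describes, written as a pure function that only builds the state-function return tuples. The module name, the `consecutive_errors` field, the `:keep_mutex` timeout event, and the linear 1-second delay are assumptions for illustration; only the limit of 5 retries and the `{:stop, {:shutdown, reason}}` shape come from the commit message.

```elixir
defmodule KeepMutexSketch do
  # Retry limit taken from the commit message.
  @max_retries 5
  # Assumed base delay; the PR does not state the actual value.
  @base_delay_ms 1_000

  # Success: reset the error counter and stay in the current state.
  def handle_result(:ok, data) do
    {:keep_state, %{data | consecutive_errors: 0}}
  end

  # Retries exhausted: stop with a valid return tuple. A bare
  # {:shutdown, reason} is not a valid state-function return.
  def handle_result(:error, %{consecutive_errors: n}) when n >= @max_retries do
    {:stop, {:shutdown, :err_keeping_mutex}}
  end

  # Otherwise: count the error and schedule another attempt via a state
  # timeout. Linear backoff is an assumption made for this sketch.
  def handle_result(:error, %{consecutive_errors: n} = data) do
    delay_ms = @base_delay_ms * (n + 1)

    {:keep_state, %{data | consecutive_errors: n + 1},
     [{:state_timeout, delay_ms, :keep_mutex}]}
  end
end

IO.inspect(KeepMutexSketch.handle_result(:error, %{consecutive_errors: 0}))
IO.inspect(KeepMutexSketch.handle_result(:error, %{consecutive_errors: 5}))
```

Keeping the decision in a pure function like this makes the counter and stop behavior checkable without a running Redis.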
…s errors
Instead of giving up after N errors, MutexOwner now retries indefinitely
with exponential backoff capped at 1 hour. Redis going down should never
crash Sequin; it should degrade gracefully and self-heal when Redis returns.
Integration tests use iptables REJECT to simulate a real Redis/Dragonfly
redeploy and verify the process survives the outage and recovers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
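The capped exponential backoff described in this commit could look roughly like the sketch below. The 1-hour cap is from the commit message; the 1-second base delay, the doubling factor, the exponent clamp, and the module and function names are assumptions, so the real schedule in MutexOwner may differ.

```elixir
defmodule BackoffSketch do
  # Assumed base delay; not stated in the PR.
  @base_ms 1_000
  # 1-hour cap from the commit message (3_600_000 ms).
  @cap_ms :timer.hours(1)
  # Clamp the exponent so an unbounded error count never builds a huge integer.
  @max_exponent 30

  # Delay before the next retry, given the number of consecutive errors so far.
  def delay_ms(consecutive_errors)
      when is_integer(consecutive_errors) and consecutive_errors >= 0 do
    exponent = min(consecutive_errors, @max_exponent)
    min(@base_ms * Integer.pow(2, exponent), @cap_ms)
  end
end

# Print the schedule for the first few consecutive errors.
IO.inspect(Enum.map(0..13, &BackoffSketch.delay_ms/1))
```

With these assumed values the delay doubles from 1 second and reaches the cap at the 12th consecutive error, after which every retry waits one hour.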
…ation tests
Unit tests (6, <0.1s): verify backoff math, error counter behavior, and
state struct defaults without needing Redis.
Integration tests (2, ~35s each, tagged :integration): use iptables REJECT
to simulate Redis going down, verifying the process survives and recovers.
Run with: mix test --include integration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes #2072 — When Redis (or a Redis-compatible store like Dragonfly/KeyDB) restarts, Sequin enters an unrecoverable failure state requiring a full restart. This has bitten us (and others per #2072) multiple times in production.
Root cause:
MutexOwner in :has_mutex state calls acquire_mutex, which returns :error when Redis is unreachable. The handler immediately returns {:shutdown, :err_keeping_mutex}, an invalid GenStateMachine return that crashes the process with {:bad_return_from_state_function, ...}. Due to MutexedSupervisor's :one_for_all strategy, this cascades and takes down the entire Runtime.Supervisor including all consumers.
Changes:
- MutexOwner: Instead of crashing on Redis errors, retry indefinitely with exponential backoff (capped at 1 hour). When Redis comes back, the error counter resets and normal operation resumes. Also fixes the GenStateMachine return value ({:stop, {:shutdown, reason}} instead of invalid {:shutdown, reason}).
- SinkConsumersLive.Index: Handle {:error, _} from metrics calls instead of crashing with MatchError.
- SinkConsumersLive.Show: Same defensive handling for all three metrics calls.
- HttpEndpointsLive.Show: Same fix for get_http_endpoint_throughput/1.
Context
We self-host Sequin on Railway with Dragonfly (Redis-compatible) as the backing store. Railway periodically auto-updates Dragonfly, which causes a brief restart. Every time this happens, Sequin fails to self-heal and requires a manual restart; we've hit this 3 times now.
The :await_mutex state already retries correctly on Redis errors; this PR brings the same resilience to the :has_mutex state, but with exponential backoff so it doesn't spam during extended outages.
Test plan
- MutexOwnerTest with unit tests for State struct and integration tests (tagged :integration) that use iptables REJECT to simulate Redis going down and coming back
- mix test)
- mix format --check-formatted passes
Note: Integration tests require NET_ADMIN capability (for iptables) and are tagged :integration so they can be excluded from normal test runs: mix test --exclude integration
🤖 Generated with Claude Code