Leader election recovery with generation isolation #93
Draft
ssteele110 wants to merge 1 commit into master
When the master worker dies mid-population, workers now detect the
failure and re-elect a new master automatically instead of hanging
until timeout.
Key changes:
- Master lock uses SET NX EX with TTL, renewed during population
- CAS Lua scripts verify lock ownership before push (prevents
split-brain when lock expires while master is still alive)
- Queue data namespaced by generation UUID (gen:{uuid}:*) to
isolate each population attempt
- Workers detect generation staleness in poll loop (throttled)
- Fix pre-existing bug: reserve_lost used 'completed' key instead
of 'processed', causing acknowledged tests to be re-stolen
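
The lock-ownership checks above can be sketched without a Redis server. This is a minimal in-memory illustration, not the PR's code: `FakeLockStore` and its method names are hypothetical, time is passed in explicitly, and the atomicity that the Lua scripts provide on the Redis side is assumed here because the sketch is single-threaded.

```ruby
# Hypothetical in-memory stand-in for the master lock; not the PR's implementation.
class FakeLockStore
  attr_reader :queue

  def initialize
    @owner = nil
    @expires_at = nil
    @queue = nil
  end

  # Mirrors SET key value NX EX ttl: succeeds only if no live lock exists.
  def acquire(worker_id, ttl:, now:)
    expire!(now)
    return false if @owner

    @owner = worker_id
    @expires_at = now + ttl
    true
  end

  # Mirrors a CAS renewal: extend the TTL only if the caller still owns the lock.
  def renew(worker_id, ttl:, now:)
    expire!(now)
    return false unless @owner == worker_id

    @expires_at = now + ttl
    true
  end

  # Mirrors a CAS push: commit queue data only if the caller still owns the lock.
  def push(worker_id, tests, now:)
    expire!(now)
    return false unless @owner == worker_id

    @queue = tests.dup
    true
  end

  private

  # Drops the lock once its TTL has elapsed, as key expiry would.
  def expire!(now)
    return unless @owner && now >= @expires_at

    @owner = nil
    @expires_at = nil
  end
end

store = FakeLockStore.new
store.acquire("worker-a", ttl: 30, now: 0)   # worker-a becomes master
store.renew("worker-a", ttl: 30, now: 20)    # renewal moves expiry to t=50
# worker-a stalls past t=50; worker-b is elected at t=60.
store.acquire("worker-b", ttl: 30, now: 60)
stale_push = store.push("worker-a", %w[test_1 test_2], now: 61) # rejected: no longer owner
fresh_push = store.push("worker-b", %w[test_1 test_2], now: 62) # accepted
```

The point of the ownership check on push is the `stale_push` case: a master that is still alive but has lost its lock cannot overwrite the queue of the newly elected master.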
New files:
- redis/renew_master_lock.lua — CAS lock TTL renewal
- redis/push_queue.lua — CAS queue commit with ownership check
- ruby/test/ci/queue/redis_generation_test.rb — 8 test scenarios
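
Generation namespacing and the throttled staleness check can be illustrated the same way. The sketch below is an assumption-laden example, not the PR's code: `GenerationKeys`, `StalenessCheck`, and the `queue` key suffix are hypothetical names, and the lookup of the current generation is supplied as a block rather than a real Redis read.

```ruby
require "securerandom"

# Hypothetical helper: builds keys under a per-generation namespace (gen:{uuid}:*).
class GenerationKeys
  attr_reader :generation

  def initialize(generation = SecureRandom.uuid)
    @generation = generation
  end

  def key(*parts)
    ["gen", generation, *parts].join(":")
  end
end

# Hypothetical throttled check: consults the store at most once per
# `interval` seconds and reuses the last answer in between.
class StalenessCheck
  def initialize(my_generation, interval: 5, &fetch_current)
    @my_generation = my_generation
    @interval = interval
    @fetch_current = fetch_current
    @last_checked_at = nil
    @stale = false
  end

  def stale?(now)
    if @last_checked_at.nil? || now - @last_checked_at >= @interval
      @last_checked_at = now
      @stale = @fetch_current.call != @my_generation
    end
    @stale
  end
end

keys = GenerationKeys.new("1111")
queue_key = keys.key("queue") # "gen:1111:queue"

current = "1111"
lookups = 0
check = StalenessCheck.new("1111") { lookups += 1; current }

first = check.stale?(0)  # performs a lookup; generations match
current = "2222"         # a new master starts another generation
second = check.stale?(3) # throttled: reuses the previous answer
third = check.stale?(5)  # interval elapsed: looks up again and sees the mismatch
```

With a 5 second interval, a worker may keep polling a stale generation for up to 5 seconds before it notices the mismatch. That is the trade-off for fewer reads in the poll loop.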
Summary
Redesigns master election to handle worker death mid-population.
- Master lock uses `SET NX EX` with TTL, renewed during population via a Lua CAS script. The final push uses a second Lua CAS to verify ownership before committing queue data, which prevents split-brain when the lock expires while the master is still alive.
- Queue data is namespaced by generation UUID (`gen:{uuid}:*`). Each population attempt gets its own isolated keyspace. Build-wide aggregates (error reports, flaky reports, stats) remain unscoped.
- Election attempts are capped by `max_election_attempts` (default 3).
- Workers check `current-generation` in the poll loop (throttled to 1 check/5s) and exit gracefully on mismatch.
- Bug fix: `try_to_reserve_lost_test` passed `key('completed')` to `reserve_lost.lua`, but the Lua script uses `KEYS[2]` as `processed_key`. Since `acknowledge` writes to `processed` (not `completed`), the guard was always false, so acknowledged tests were being re-stolen after the deadline.

New files
- `redis/renew_master_lock.lua`: CAS lock TTL renewal
- `redis/push_queue.lua`: CAS queue commit with ownership verification
- `ruby/test/ci/queue/redis_generation_test.rb`: 8 test scenarios

Test plan