Leader election recovery with generation isolation #93

Draft
ssteele110 wants to merge 1 commit into master from ssteele/leader-election-recovery-v2

ssteele110 commented Mar 16, 2026

Summary

Redesigns master election to handle worker death mid-population.

  • Master lock with CAS: The lock is acquired via SET NX EX (so it carries a TTL) and renewed during population by a Lua CAS script. The final push uses a second Lua CAS script to verify ownership before committing queue data, which prevents split-brain when the lock expires while the master is still alive.
  • Generation isolation: Queue-operational keys namespaced by UUID (gen:{uuid}:*). Each population attempt gets its own isolated keyspace. Build-wide aggregates (error reports, flaky reports, stats) remain unscoped.
  • Automatic re-election: Workers detect master death (the lock TTL expiring during setup), elect a new master, and retry up to max_election_attempts times (default 3).
  • Staleness detection: Polling workers periodically check current-generation (throttled to 1 check/5s) and exit gracefully on mismatch.
  • Bug fix: try_to_reserve_lost_test passed key('completed') to reserve_lost.lua, but the script reads KEYS[2] as processed_key. Since acknowledge writes to processed (not completed), the guard was always false, so acknowledged tests were being re-stolen after their deadline.
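
The lock lifecycle described above (acquire, CAS renew, CAS push) can be sketched without Redis. This is a minimal in-memory illustration, not the PR's implementation: `FakeStore`, its method names, and the explicit `now` argument are all hypothetical stand-ins for what the real code does with SET NX EX and the two Lua scripts.

```ruby
# In-memory stand-in for the Redis behaviour the master lock relies on.
# Hypothetical helper for illustration only; the PR talks to real Redis.
class FakeStore
  def initialize
    @data = {}
    @expires_at = {}
  end

  # Mirrors SET key value NX EX ttl: succeeds only if the key is absent.
  def set_nx_ex(key, value, ttl, now)
    purge(key, now)
    return false if @data.key?(key)

    @data[key] = value
    @expires_at[key] = now + ttl
    true
  end

  # Mirrors a CAS renewal: extend the TTL only if `owner` still holds the lock.
  def renew(key, owner, ttl, now)
    purge(key, now)
    return false unless @data[key] == owner

    @expires_at[key] = now + ttl
    true
  end

  # Mirrors a CAS push: commit queue data only if `owner` still holds the lock.
  def push_if_owner(lock_key, owner, queue_key, tests, now)
    purge(lock_key, now)
    return false unless @data[lock_key] == owner

    @data[queue_key] = tests.dup
    true
  end

  def get(key)
    @data[key]
  end

  private

  # Drops the key once its TTL has elapsed, like Redis key expiry.
  def purge(key, now)
    expiry = @expires_at[key]
    return unless expiry && now >= expiry

    @data.delete(key)
    @expires_at.delete(key)
  end
end
```

In this sketch, a master whose lock has lapsed and been taken over by another worker gets `false` back from `push_if_owner`, so its stale test list never reaches the queue. That is the split-brain case the ownership check is meant to cover.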

New files

  • redis/renew_master_lock.lua — CAS lock TTL renewal
  • redis/push_queue.lua — CAS queue commit with ownership verification
  • ruby/test/ci/queue/redis_generation_test.rb — 8 test scenarios

Test plan

  • 8 new generation test scenarios pass (election, CAS rejection, staleness, retry, supervisor)
  • All existing unit tests pass (0 new failures)
  • 22 pre-existing integration test failures unchanged

When the master worker dies mid-population, workers now detect the
failure and re-elect a new master automatically instead of hanging
until timeout.
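
As a rough sketch of that recovery path, the retry loop might look like the following. The method name and the injected callables are hypothetical and exist only so the example runs standalone; only the default of 3 attempts comes from this PR.

```ruby
# Hypothetical sketch of the re-election loop; not the PR's actual API.
#
# try_become_master: callable returning true if this worker won the lock
# populate:          callable that fills the queue (only run by the master)
# wait_for_queue:    callable returning :ready once the queue is populated,
#                    or :master_died if the lock TTL lapsed first
def elect_with_retry(try_become_master:, populate:, wait_for_queue:, max_attempts: 3)
  max_attempts.times do |i|
    attempt = i + 1

    if try_become_master.call
      populate.call
      return [:master, attempt]
    end

    return [:worker, attempt] if wait_for_queue.call == :ready
    # Otherwise the master died mid-population: loop and hold a new election.
  end

  raise "queue was not populated after #{max_attempts} election attempts"
end
```

The sketch bounds the number of elections so that a build where every master dies fails with an error rather than looping forever.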

Key changes:
- Master lock uses SET NX EX with TTL, renewed during population
- CAS Lua scripts verify lock ownership before push (prevents
  split-brain when lock expires while master is still alive)
- Queue data namespaced by generation UUID (gen:{uuid}:*) to
  isolate each population attempt
- Workers detect generation staleness in poll loop (throttled)
- Fix pre-existing bug: reserve_lost used 'completed' key instead
  of 'processed', causing acknowledged tests to be re-stolen
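
To make the generation scoping and the throttled staleness check more concrete, here is a small standalone sketch. `GenerationKeys` and `StalenessCheck` are hypothetical names, and the `build:<id>:` key prefix is an assumption; the PR itself only specifies the gen:{uuid}:* pattern and a throttle of one check per 5 seconds.

```ruby
require "securerandom"

# Hypothetical helper that builds keys following the gen:{uuid}:* pattern.
# The "build:<id>:" prefix is assumed for illustration.
class GenerationKeys
  attr_reader :generation

  def initialize(build_id, generation = SecureRandom.uuid)
    @build_id = build_id
    @generation = generation
  end

  # Queue-operational keys are isolated per population attempt.
  def scoped(name)
    "build:#{@build_id}:gen:#{@generation}:#{name}"
  end

  # Build-wide aggregates (error reports, stats) stay unscoped.
  def unscoped(name)
    "build:#{@build_id}:#{name}"
  end
end

# Hypothetical throttled check: fetches the current generation at most once
# per `interval` seconds and caches the verdict in between.
class StalenessCheck
  def initialize(own_generation, interval: 5, &fetch_current)
    @own_generation = own_generation
    @interval = interval
    @fetch_current = fetch_current
    @last_checked_at = nil
    @stale = false
  end

  # `now` is passed in (seconds) so the sketch is deterministic.
  def stale?(now)
    if @last_checked_at.nil? || now - @last_checked_at >= @interval
      @last_checked_at = now
      @stale = @fetch_current.call != @own_generation
    end
    @stale
  end
end
```

A polling worker would call `stale?` on each loop iteration and exit once it returns true. With the throttle, a generation change is noticed up to one interval late, in exchange for fewer reads of current-generation.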

New files:
- redis/renew_master_lock.lua — CAS lock TTL renewal
- redis/push_queue.lua — CAS queue commit with ownership check
- ruby/test/ci/queue/redis_generation_test.rb — 8 test scenarios