
feat(keeper): implement automated disaster recovery and multi-region failover #495

Draft
d3vobed wants to merge 1 commit into SoroLabs:main from d3vobed:feat/issue-452-dr-failover


d3vobed commented May 30, 2026

Summary

Implements an automated disaster recovery and failover system for the keeper service with multi-region RPC endpoint support.

What was added

  • New MultiRegionRPCClient failover layer:
    • active endpoint routing
    • automatic endpoint fallback on RPC failure
    • endpoint quarantine with configurable cooldown
    • background health checks and endpoint recovery
  • Keeper startup integration in keeper/index.js:
    • uses multi-region failover client as the primary RPC abstraction
    • exposes live failover state to metrics/health
    • lifecycle shutdown handling for failover manager
  • Configuration additions in keeper/src/config.js:
    • SOROBAN_RPC_URLS
    • RPC_FAILOVER_ENABLED
    • RPC_FAILOVER_FAILURE_THRESHOLD
    • RPC_FAILOVER_COOLDOWN_MS
    • RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS
  • Observability enhancements in keeper/src/metrics.js:
    • failover counters and gauges
    • failover state in /health and /metrics
    • Prometheus failover metrics
  • Documentation:
    • keeper/docs/disaster-recovery-failover.md
    • updates in keeper/README.md and keeper/.env.example
  • Tests:
    • keeper/__tests__/disasterRecovery.test.js
    • extended keeper/__tests__/metrics.test.js
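
The routing, fallback, and quarantine behavior listed above can be illustrated with a minimal sketch. This is not the PR's `MultiRegionRPCClient`: the class name `FailoverSketch`, its method names, the default values, and the injectable `now` clock are all hypothetical. It only assumes that `RPC_FAILOVER_FAILURE_THRESHOLD` and `RPC_FAILOVER_COOLDOWN_MS` map to a failure count and a quarantine duration.

```javascript
// Hypothetical sketch of endpoint fallback with quarantine; names and defaults are assumptions.
class FailoverSketch {
  constructor(urls, { failureThreshold = 3, cooldownMs = 30000, now = Date.now } = {}) {
    if (!Array.isArray(urls) || urls.length === 0) {
      throw new Error("at least one RPC URL is required");
    }
    this.endpoints = urls.map((url) => ({ url, failures: 0, quarantinedUntil: 0 }));
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock so the cooldown can be tested without waiting
    this.failoverCount = 0;
  }

  // Endpoints whose quarantine has expired, in configured priority order.
  available() {
    const t = this.now();
    return this.endpoints.filter((e) => e.quarantinedUntil <= t);
  }

  // Run `request(url)` against the first available endpoint, falling back on failure.
  async call(request) {
    let lastError = new Error("all RPC endpoints are quarantined");
    const candidates = this.available();
    for (let i = 0; i < candidates.length; i++) {
      const endpoint = candidates[i];
      try {
        const result = await request(endpoint.url);
        endpoint.failures = 0; // a success resets the failure streak
        return result;
      } catch (err) {
        lastError = err;
        endpoint.failures += 1;
        if (endpoint.failures >= this.failureThreshold) {
          // Quarantine the endpoint until the cooldown elapses.
          endpoint.quarantinedUntil = this.now() + this.cooldownMs;
          endpoint.failures = 0;
        }
        // Count a failover only when another candidate is about to be tried.
        if (i < candidates.length - 1) this.failoverCount += 1;
      }
    }
    throw lastError;
  }
}
```

A background health check, as described in the list, could clear `quarantinedUntil` early once an endpoint responds again. That part is omitted here to keep the sketch short.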

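For reviewers who want a feel for how the new configuration keys might be consumed, here is a hedged sketch of an environment loader. The variable names come from the list above. Everything else is an assumption and may differ from the actual `keeper/src/config.js`: the comma-separated format for `SOROBAN_RPC_URLS`, the default values, and the function names `loadFailoverConfig` and `parsePositiveInt`.

```javascript
// Hypothetical loader; formats and defaults are assumptions, not taken from the PR.
function parsePositiveInt(raw, fallback) {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function loadFailoverConfig(env = process.env) {
  // Assumed format: comma-separated URLs in priority order.
  const urls = (env.SOROBAN_RPC_URLS ?? "")
    .split(",")
    .map((u) => u.trim())
    .filter(Boolean);

  return {
    urls,
    // Assumed default: failover stays on unless explicitly set to "false".
    enabled: (env.RPC_FAILOVER_ENABLED ?? "true").toLowerCase() !== "false",
    failureThreshold: parsePositiveInt(env.RPC_FAILOVER_FAILURE_THRESHOLD, 3),
    cooldownMs: parsePositiveInt(env.RPC_FAILOVER_COOLDOWN_MS, 30000),
    healthCheckIntervalMs: parsePositiveInt(env.RPC_FAILOVER_HEALTH_CHECK_INTERVAL_MS, 15000),
  };
}
```

Falling back to a default on unparsable numbers keeps the keeper bootable with a partially filled `.env`. A stricter loader could instead throw at startup.
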
Acceptance criteria mapping

  • Feature implementation: ✅
  • Error tracking/fallback behavior: ✅
  • Infrastructure integration: ✅
  • Documentation: ✅
  • Unit coverage for failover paths: ✅

Validation notes

  • Static diagnostics reported no errors in the modified files.
  • Runtime tests could not be executed in this container because Node/npm are unavailable and package installation is not permitted in this environment.

Closes #452


drips-wave Bot commented May 30, 2026

@d3vobed Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀


d3vobed force-pushed the feat/issue-452-dr-failover branch from 6ec3471 to 680ccd8 on May 31, 2026 21:59

Development

Successfully merging this pull request may close these issues.

[Backend] Implement Automated Disaster Recovery and Failover System
