A production-grade PostgreSQL monitoring and auto-recovery system built for CERN database operations.
- Real-time Health Checks: Continuous monitoring of PostgreSQL primary and replica instances
- Advanced Metrics: Connection tracking, query performance, replication lag, resource utilization
- Prometheus Integration: Full metrics export for external monitoring systems
- Intelligent Failure Detection: Automatic detection of database failures and performance issues
- Automated Recovery: Self-healing capabilities with configurable retry limits
- Recovery Logging: Complete audit trail of all recovery actions
- Slack Integration: Real-time alerts to Slack channels
- Multi-level Alerts: Critical, warning, and informational alert categories
- Alert Cooldown: Configurable cooldown periods to prevent alert spam
- React Frontend: Modern, responsive web interface
- Real-time Updates: Live monitoring data with automatic refresh
- Grafana Integration: Advanced visualization dashboards
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    React App    │    │     FastAPI     │    │   PostgreSQL    │
│   (Port 3001)   │◄──►│   (Port 8000)   │◄──►│   (Port 5432)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Prometheus    │
                       │   (Port 9090)   │
                       └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │     Grafana     │
                       │   (Port 3000)   │
                       └─────────────────┘
- FastAPI: Modern Python web framework
- PostgreSQL: Primary database with replication
- Prometheus: Metrics collection and storage
- AsyncPG: High-performance async PostgreSQL driver
- React 18: Modern UI framework
- Tailwind CSS: Utility-first CSS framework
- Recharts: Data visualization library
- Lucide React: Icon library
- Docker: Containerization
- Docker Compose: Multi-container orchestration
- Grafana: Advanced dashboards
- AlertManager: Alert routing and management
- Docker and Docker Compose
- Git
- Clone the repository
git clone <repository-url>
cd cern-db
- Configure environment
cp src/backend/.env.example src/backend/.env
# Edit .env with your configuration
- Start the system
docker-compose up -d
- Access the applications
- Frontend Dashboard: http://localhost:3001
- API Documentation: http://localhost:8000/docs
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
- Database connectivity monitoring
- Connection pool status
- Replication lag tracking
- Resource utilization monitoring
- Active connections
- Query performance
- Database size
- CPU and memory usage
- Replication status
- Automatic database restart
- Connection pool reset
- Replica promotion
- Recovery attempt logging
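The connectivity and replication-lag checks listed above could look roughly like the following. This is a minimal sketch using asyncpg (the driver named in the stack), not the project's actual code: the names `check_instance` and `classify_lag`, the status labels, and the shape of the returned dictionary are all assumptions made for illustration.

```python
import asyncio
import time
from typing import Optional

# pg_last_xact_replay_timestamp() is a standard PostgreSQL function.
# It returns NULL on a primary, so the lag reading is None there.
LAG_QUERY = "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"


def classify_lag(lag_seconds: Optional[float], threshold: float = 10.0) -> str:
    """Map a replication lag reading to a status label (hypothetical labels)."""
    if lag_seconds is None:
        return "not_a_replica"
    return "lagging" if lag_seconds > threshold else "ok"


async def check_instance(dsn: str, timeout: float = 5.0) -> dict:
    """Connect, run a trivial query, and read the replication lag."""
    import asyncpg  # imported lazily so classify_lag works without the driver

    started = time.monotonic()
    try:
        conn = await asyncpg.connect(dsn, timeout=timeout)
    except (OSError, asyncio.TimeoutError, asyncpg.PostgresError) as exc:
        return {"healthy": False, "error": str(exc)}
    try:
        await conn.fetchval("SELECT 1")
        lag = await conn.fetchval(LAG_QUERY)
    finally:
        await conn.close()
    lag_seconds = float(lag) if lag is not None else None
    return {
        "healthy": True,
        "latency_s": time.monotonic() - started,
        "replication_lag_s": lag_seconds,
        "lag_status": classify_lag(lag_seconds),
    }
```

Note that this timestamp-based lag keeps growing on a replica when the primary is idle, even if the replica is fully caught up, so a real check may need to combine it with a WAL position comparison.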
# Database Configuration
DATABASE_URL=postgresql://admin:admin123@postgres-primary:5432/monitoring_db
REPLICA_URL=postgresql://admin:admin123@postgres-replica:5432/monitoring_db
# Monitoring Configuration
MONITOR_INTERVAL=30
HEALTH_CHECK_INTERVAL=10
AUTO_RECOVERY_ENABLED=true
# Alert Configuration
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
ALERT_COOLDOWN=300
# Thresholds
MAX_CONNECTIONS=100
SLOW_QUERY_THRESHOLD=1.0
CPU_THRESHOLD=80.0
REPLICATION_LAG_THRESHOLD=10.0
The system automatically configures Prometheus with:
- PostgreSQL exporter metrics
- Custom application metrics
- Alert rules for common issues
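To make "custom application metrics" concrete, here is a small sketch of the Prometheus text exposition format that a scrape endpoint returns. It uses only the standard library; the metric name `pg_active_connections` and the `instance` label are hypothetical, and a real service would more likely rely on a Prometheus client library than on hand-built strings.

```python
def render_gauges(gauges: dict, labels: dict) -> str:
    """Render {name: (help_text, value)} as Prometheus text exposition lines.

    Label values are not escaped here, so they must not contain quotes,
    backslashes, or newlines.
    """
    label_str = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    lines = []
    for name, (help_text, value) in gauges.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"
```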
Pre-configured dashboards include:
- Database overview
- Performance metrics
- Recovery actions
- Alert history
- Database Down: Critical alert when database becomes unavailable
- High Connections: Warning when connection count approaches limit
- Slow Queries: Warning for performance degradation
- Replication Lag: Warning for replication delays
- Recovery Actions: Info alerts for recovery attempts
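The alert rules above map naturally onto the threshold variables from the configuration. The sketch below shows one way to read those variables and turn a metrics snapshot into alerts. The `Thresholds` class, the `evaluate` function, the metric keys, and the 80% ratio used for "approaches limit" are assumptions, not the project's actual implementation.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_connections: int = 100
    slow_query_seconds: float = 1.0
    replication_lag_seconds: float = 10.0

    @classmethod
    def from_env(cls, env=os.environ) -> "Thresholds":
        """Read the documented variables, falling back to the documented values."""
        return cls(
            max_connections=int(env.get("MAX_CONNECTIONS", 100)),
            slow_query_seconds=float(env.get("SLOW_QUERY_THRESHOLD", 1.0)),
            replication_lag_seconds=float(env.get("REPLICATION_LAG_THRESHOLD", 10.0)),
        )


def evaluate(metrics: dict, t: Thresholds, connection_ratio: float = 0.8) -> list:
    """Return (severity, alert_name) pairs for one metrics snapshot."""
    if not metrics.get("database_up", False):
        # Performance alerts are meaningless for an unreachable database.
        return [("critical", "database_down")]
    alerts = []
    if metrics.get("active_connections", 0) >= connection_ratio * t.max_connections:
        alerts.append(("warning", "high_connections"))
    if metrics.get("slowest_query_seconds", 0.0) > t.slow_query_seconds:
        alerts.append(("warning", "slow_queries"))
    if metrics.get("replication_lag_seconds", 0.0) > t.replication_lag_seconds:
        alerts.append(("warning", "replication_lag"))
    return alerts
```

Passing the environment as a parameter keeps the loader easy to test: `Thresholds.from_env({"MAX_CONNECTIONS": "50"})` overrides one value without touching the process environment.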
Configure Slack alerts by setting the SLACK_WEBHOOK_URL environment variable. The system will send:
- Critical alerts to #db-alerts-critical
- Warning alerts to #db-alerts-warning
- General alerts to #db-alerts
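As an illustration of the channel routing above combined with the ALERT_COOLDOWN setting, the following sketch builds a Slack webhook payload and suppresses repeated alerts. The severity keys, the `AlertCooldown` class, and the helper names are hypothetical. Whether Slack honors the `channel` field depends on the webhook type; app-based incoming webhooks are typically bound to a single channel.

```python
import json
import time
import urllib.request

# Channel names come from the list above; the severity keys are assumed labels.
CHANNELS = {
    "critical": "#db-alerts-critical",
    "warning": "#db-alerts-warning",
    "info": "#db-alerts",
}


class AlertCooldown:
    """Suppress repeats of the same alert key within `seconds`."""

    def __init__(self, seconds: float = 300.0, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self._last_sent = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.seconds:
            return False
        self._last_sent[key] = now
        return True


def build_payload(severity: str, message: str) -> dict:
    """Build a webhook body; unknown severities fall back to #db-alerts."""
    channel = CHANNELS.get(severity, CHANNELS["info"])
    return {"channel": channel, "text": f"[{severity.upper()}] {message}"}


def send(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller would check `cooldown.allow("database_down")` before calling `send`, so that a database that stays down produces one alert per cooldown window instead of one per health check.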
The system automatically attempts recovery when:
- Database connection fails
- Performance thresholds are exceeded
- Replication lag exceeds limits
- Database service restart
- Connection pool reset
- Replica promotion
- Configuration reload
- Maximum 3 recovery attempts per failure
- 5-minute cooldown between attempts
- Manual override available through dashboard
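The retry limits above (three attempts per failure, five minutes between attempts, manual override) can be sketched as a small guard object. `RecoveryLimiter` and its method names are hypothetical; the project may track attempts differently, for example per database instance.

```python
import time


class RecoveryLimiter:
    """Allow at most `max_attempts` per failure, spaced by `cooldown` seconds."""

    def __init__(self, max_attempts: int = 3, cooldown: float = 300.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.cooldown = cooldown
        self.clock = clock
        self.attempts = 0
        self.last_attempt = None

    def may_attempt(self, manual_override: bool = False) -> bool:
        """Return True if a recovery action is currently permitted."""
        if manual_override:
            return True  # an operator request bypasses both limits
        if self.attempts >= self.max_attempts:
            return False
        if self.last_attempt is not None and self.clock() - self.last_attempt < self.cooldown:
            return False
        return True

    def record_attempt(self) -> None:
        self.attempts += 1
        self.last_attempt = self.clock()

    def reset(self) -> None:
        """Call once the failure has cleared so the next failure starts fresh."""
        self.attempts = 0
        self.last_attempt = None
```

Injecting the clock makes the cooldown testable without sleeping, and `reset` is what ties the attempt budget to a single failure instead of the process lifetime.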
- GET /api/v1/health - Basic health check
- GET /api/v1/health/detailed - Detailed health status
- POST /api/v1/health/check - Force health check
- GET /api/v1/monitoring/status - Monitoring status
- GET /api/v1/monitoring/metrics - Current metrics
- GET /api/v1/monitoring/database/stats - Database statistics
- GET /api/v1/recovery/status - Recovery status
- POST /api/v1/recovery/trigger - Manual recovery trigger
- GET /api/v1/recovery/history - Recovery history
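For scripted access to the endpoints above, a thin client built on the standard library might look like this. It assumes the API is reachable at http://localhost:8000 and returns JSON, which is FastAPI's default but is not confirmed here; `build_request` and `call` are illustrative names, and no request or response schema is implied.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_request(method: str, path: str, body=None, base_url: str = BASE_URL):
    """Create a urllib Request for one of the documented endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)


def call(method: str, path: str, body=None, timeout: float = 5.0):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(method, path, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the stack running, `call("GET", "/api/v1/health")` would fetch the basic health check and `call("GET", "/api/v1/recovery/history")` the recovery history.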
# Simulate database failure
curl -X POST http://localhost:8000/api/v1/monitoring/test/failure \
  -H "Content-Type: application/json" \
  -d '{"db_type": "primary"}'
# Test alert system
curl -X POST http://localhost:8000/api/v1/recovery/test-alert \
  -H "Content-Type: application/json" \
  -d '{"alert_type": "database_down", "message": "Test alert"}'
- Minimum: 2 CPU cores, 4GB RAM, 20GB storage
- Recommended: 4 CPU cores, 8GB RAM, 50GB storage
- CPU Usage: < 5% overhead
- Memory Usage: < 500MB additional memory
- Network: Minimal bandwidth usage
- API endpoints secured with configurable authentication
- Database connections use encrypted passwords
- Environment-based configuration management
- Internal container communication
- Configurable external access
- SSL/TLS support for external connections
# Backend development
cd src/backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
# Frontend development
cd src/frontend
npm install
npm start
src/
├── backend/
│   ├── app/
│   │   ├── core/           # Core functionality
│   │   ├── api/routes/     # API endpoints
│   │   └── models/         # Data models
│   └── main.py             # Application entry
└── frontend/
    ├── src/
    │   ├── components/     # React components
    │   ├── pages/          # Page components
    │   └── services/       # API services
    └── public/             # Static assets
- API Documentation: http://localhost:8000/docs
- Grafana Dashboards: http://localhost:3000
- Prometheus Metrics: http://localhost:9090/metrics
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in the repository
- Check the API documentation
- Review the Grafana dashboards
# Production deployment
docker-compose -f docker-compose.prod.yml up -d
- Set production environment variables
- Configure external database connections
- Set up proper SSL certificates
- Configure backup strategies
- Configure external Prometheus instance
- Set up Grafana with persistent storage
- Configure AlertManager routing
- Set up log aggregation
Built for CERN Database Operations