Purpose: Restore the YieldVault database from backup
RTO Target: 1 hour
RPO Target: 15 minutes
Last Updated: April 29, 2026
Last Tested:
Use this runbook when:
- Database corruption detected
- Accidental data deletion
- Database server failure
- Data integrity issues
- Rollback required after failed migration
- Disaster recovery scenario
- SSH access to database server
- Database admin credentials
- Backup storage access (S3 or equivalent)
- PagerDuty/Slack access for notifications
-
psql(PostgreSQL client) -
pg_restore(PostgreSQL restore utility) -
awsCLI (if using S3 for backups) -
sshclient
- Database connection string
- Backup location/path
- Target restore point (timestamp)
- Incident ticket number
- Verify backup exists for target restore point
- Notify team via Slack/PagerDuty
- Create incident ticket and document
- Stop backend services to prevent writes
- Verify disk space (need 2x database size)
- Take snapshot of current database (if possible)
- Document current state (table counts, checksums)
# Check if database is accessible
psql $DATABASE_URL -c "SELECT version();"
# Check database size
psql $DATABASE_URL -c "SELECT pg_size_pretty(pg_database_size('yieldvault_prod'));"
# Check table counts (for later verification)
psql $DATABASE_URL -c "
SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
"Expected Output:
PostgreSQL 15.x on x86_64-pc-linux-gnu
If database is inaccessible:
- Proceed to Step 2 immediately
- Document error messages
- Check database server logs
# List available backups
aws s3 ls s3://yieldvault-backups/database/ --recursive | tail -20
# Or for local backups
ls -lh /var/backups/postgresql/ | tail -20Choose restore point based on:
- Time of last known good state
- RPO requirements (15 minutes)
- Backup availability
Document:
Restore Point: 2026-04-29 14:30:00 UTC
Backup File: yieldvault_prod_20260429_1430.dump
Backup Size: 2.5 GB
# If using systemd
sudo systemctl stop yieldvault-backend
# If using Docker
docker stop yieldvault-backend
# If using PM2
pm2 stop yieldvault-backend
# Verify stopped
curl http://localhost:3000/health
# Should return connection refused or 503# Check active connections
psql $DATABASE_URL -c "
SELECT pid, usename, application_name, client_addr, state
FROM pg_stat_activity
WHERE datname = 'yieldvault_prod'
AND pid <> pg_backend_pid();
"
# Terminate active connections (if necessary)
psql $DATABASE_URL -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'yieldvault_prod'
AND pid <> pg_backend_pid();
"Expected Output:
pid | usename | application_name | client_addr | state
-----+---------+------------------+-------------+-------
(0 rows)
# Set backup filename with timestamp
SAFETY_BACKUP="yieldvault_pre_restore_$(date +%Y%m%d_%H%M%S).dump"
# Create backup (this may fail if database is corrupted)
pg_dump $DATABASE_URL \
--format=custom \
--file="/var/backups/postgresql/safety/${SAFETY_BACKUP}" \
--verbose
# If pg_dump fails, try plain SQL format
pg_dump $DATABASE_URL \
--format=plain \
--file="/var/backups/postgresql/safety/${SAFETY_BACKUP}.sql" \
--verbose
# Upload to S3 for safekeeping
aws s3 cp "/var/backups/postgresql/safety/${SAFETY_BACKUP}" \
"s3://yieldvault-backups/safety/${SAFETY_BACKUP}"If backup fails:
- Document the error
- Proceed with restore (data may be unrecoverable anyway)
- Note in incident report
# Set variables
BACKUP_FILE="yieldvault_prod_20260429_1430.dump"
LOCAL_PATH="/var/backups/postgresql/restore/${BACKUP_FILE}"
# Download backup
aws s3 cp "s3://yieldvault-backups/database/${BACKUP_FILE}" "${LOCAL_PATH}"
# Verify download
ls -lh "${LOCAL_PATH}"
md5sum "${LOCAL_PATH}"
# Compare with stored checksum
aws s3api head-object \
--bucket yieldvault-backups \
--key "database/${BACKUP_FILE}" \
--query 'Metadata.md5' \
--output textExpected Output:
-rw-r--r-- 1 postgres postgres 2.5G Apr 29 14:30 yieldvault_prod_20260429_1430.dump
a1b2c3d4e5f6... /var/backups/postgresql/restore/yieldvault_prod_20260429_1430.dump
If checksums don't match:
⚠️ STOP - Backup may be corrupted- Try downloading again
- Try alternate backup
- Escalate to senior engineer
# Connect as superuser
psql postgres://postgres:password@localhost/postgres
# Drop database (this will fail if connections exist)
DROP DATABASE yieldvault_prod;
# If drop fails, force disconnect and try again
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'yieldvault_prod';
DROP DATABASE yieldvault_prod;
# Create fresh database
CREATE DATABASE yieldvault_prod
WITH OWNER = yieldvault_user
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
TEMPLATE = template0;
# Grant privileges
GRANT ALL PRIVILEGES ON DATABASE yieldvault_prod TO yieldvault_user;
# Exit psql
\qExpected Output:
DROP DATABASE
CREATE DATABASE
GRANT
# Set variables
BACKUP_FILE="/var/backups/postgresql/restore/yieldvault_prod_20260429_1430.dump"
DATABASE_URL="postgresql://yieldvault_user:password@localhost:5432/yieldvault_prod"
# Restore database
pg_restore \
--dbname="${DATABASE_URL}" \
--verbose \
--clean \
--if-exists \
--no-owner \
--no-acl \
--jobs=4 \
"${BACKUP_FILE}" \
2>&1 | tee /var/log/postgresql/restore_$(date +%Y%m%d_%H%M%S).log
# Check exit code
echo "Restore exit code: $?"Expected Output:
pg_restore: creating TABLE "public.User"
pg_restore: creating TABLE "public.VaultState"
pg_restore: creating TABLE "public.Transaction"
...
pg_restore: creating INDEX "public.Transaction_timestamp_idx"
...
Restore exit code: 0
Common Errors:
Error: role "yieldvault_user" does not exist
# Create user
psql postgres -c "CREATE USER yieldvault_user WITH PASSWORD 'password';"
# Retry restoreError: database "yieldvault_prod" already exists
# Use --clean flag (already in command above)
# Or manually drop database firstError: out of memory
# Reduce --jobs parameter
pg_restore --jobs=2 ...
# Or restore without parallel jobs
pg_restore --jobs=1 ...# Compare table counts with pre-restore snapshot
psql $DATABASE_URL -c "
SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
"Expected Output:
schemaname | tablename | n_live_tup
------------+------------------+------------
public | Transaction | 15234
public | User | 1523
public | Referral | 342
public | AdminAuditLog | 8901
public | VaultState | 1
public | ReferralCode | 156
Verify:
- All tables present
- Row counts reasonable (within RPO window)
- No missing tables
# Check for NULL values in critical fields
psql $DATABASE_URL -c "
SELECT COUNT(*) as null_addresses
FROM \"User\"
WHERE address IS NULL;
"
# Should return 0# Check transaction data
psql $DATABASE_URL -c "
SELECT
type,
COUNT(*) as count,
SUM(CAST(amount AS NUMERIC)) as total_amount
FROM \"Transaction\"
GROUP BY type;
"Expected Output:
type | count | total_amount
-----------+-------+--------------
deposit | 8234 | 12345678.90
withdraw | 7000 | 10234567.80
# Verify indexes
psql $DATABASE_URL -c "
SELECT schemaname, tablename, indexname
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;
"
# Verify foreign keys
psql $DATABASE_URL -c "
SELECT
tc.table_name,
kcu.column_name,
ccu.table_name AS foreign_table_name,
ccu.column_name AS foreign_column_name
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu
ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"# Check migration status
cd /app/backend
npx prisma migrate status
# If migrations are pending
npx prisma migrate deploy
# Verify schema
npx prisma validate# If using systemd
sudo systemctl start yieldvault-backend
# If using Docker
docker start yieldvault-backend
# If using PM2
pm2 start yieldvault-backend
# Wait for startup
sleep 10# Check health endpoint
curl http://localhost:3000/health
# Expected response
{
"status": "healthy",
"timestamp": "2026-04-29T15:30:00.000Z",
"checks": {
"api": "up",
"database": "up",
"stellarRpc": "up"
}
}
# Check database connectivity
curl http://localhost:3000/api/v1/vault/summary
# Should return vault data without errors# Test user lookup
curl http://localhost:3000/api/v1/user/GXXX...
# Test transaction history
curl http://localhost:3000/api/v1/transactions?limit=10
# Test vault state
curl http://localhost:3000/api/v1/vault/metrics# Test audit log write (admin endpoint)
curl -X POST http://localhost:3000/admin/test-write \
-H "Authorization: ApiKey test-key" \
-H "Content-Type: application/json"
# Verify write succeeded
psql $DATABASE_URL -c "
SELECT * FROM \"AdminAuditLog\"
ORDER BY \"createdAt\" DESC
LIMIT 1;
"# Check application logs
tail -f /var/log/yieldvault/backend.log
# Check for errors
grep -i error /var/log/yieldvault/backend.log | tail -20
# Check database logs
tail -f /var/log/postgresql/postgresql-15-main.log# Clear any false alerts
# Update status page
# Notify monitoring teamCreate incident report with:
- Incident start time
- Root cause
- Restore point selected
- Data loss (if any)
- Total downtime
- Lessons learned
# Send notification via Slack
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d '{
"text": "✅ Database restore completed successfully",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Database Restore Complete*\n• Restore Point: 2026-04-29 14:30 UTC\n• Data Loss: ~5 minutes\n• Total Downtime: 45 minutes\n• Status: All systems operational"
}
}
]
}'If restore fails or causes issues:
# Use the safety backup created in Step 3
pg_restore --dbname=$DATABASE_URL \
--clean --if-exists \
/var/backups/postgresql/safety/${SAFETY_BACKUP}# List available backups
aws s3 ls s3://yieldvault-backups/database/ | tail -50
# Select earlier backup
# Repeat restore procedure from Step 4# This requires WAL archiving to be enabled
# Contact senior DBA for assistance
# See: https://www.postgresql.org/docs/current/continuous-archiving.htmlSymptoms: Restore exceeds 20-minute window
Solutions:
- Increase
--jobsparameter (more parallel workers) - Check disk I/O performance
- Verify network speed (if downloading backup)
- Consider restoring to faster storage first
Symptoms: No space left on device
Solutions:
- Check disk space:
df -h - Clean up old backups:
rm /var/backups/postgresql/old/* - Clean up logs:
find /var/log -name "*.log" -mtime +7 -delete - Expand disk volume (if cloud)
Symptoms: Backend can't connect to database
Solutions:
- Check PostgreSQL is running:
systemctl status postgresql - Check pg_hba.conf for connection permissions
- Verify DATABASE_URL is correct
- Check firewall rules
- Restart PostgreSQL:
systemctl restart postgresql
Symptoms: Missing or incorrect data
Solutions:
- Verify backup file integrity (checksum)
- Check restore logs for errors
- Try restoring from different backup
- Run data validation queries
- Consider point-in-time recovery from WAL
After restore, verify:
- All tables present and populated
- Row counts within expected range
- No NULL values in critical fields
- Indexes and constraints intact
- Foreign key relationships valid
- Backend connects successfully
- Health checks pass
- Critical endpoints respond
- Database writes work
- No errors in logs
- Monitoring shows healthy state
- Stakeholders notified
- Incident documented
Document these metrics for each restore:
| Metric | Target | Actual | Status |
|---|---|---|---|
| Detection Time | 5 min | ___ min | ___ |
| Decision Time | 10 min | ___ min | ___ |
| Backup Download | 10 min | ___ min | ___ |
| Restore Time | 20 min | ___ min | ___ |
| Verification Time | 10 min | ___ min | ___ |
| Total RTO | 60 min | ___ min | ___ |
| Data Loss | 15 min | ___ min | ___ |
| RPO Met | Yes | ___ | ___ |
Schedule post-incident review within 48 hours:
- What caused the need for restore?
- Was the restore successful?
- Did we meet RTO/RPO targets?
- What went well?
- What could be improved?
- Do runbooks need updates?
- Do we need additional monitoring?
- Should we adjust RTO/RPO targets?
- Update runbook with lessons learned
- Implement preventive measures
- Schedule follow-up testing
- Update monitoring/alerting
- Train team on improvements
| Role | Name | Phone | |
|---|---|---|---|
| Database Admin | TBD | TBD | TBD |
| DevOps Lead | TBD | TBD | TBD |
| On-Call Engineer | TBD | TBD | TBD |
| Team Lead | TBD | TBD | TBD |
PagerDuty: [Escalation Policy Link]
Slack Channel: #yieldvault-incidents
Last Updated: April 29, 2026
Next Review: July 29, 2026
Tested: