Database Restore Runbook

Purpose: Restore the YieldVault database from backup
RTO Target: 1 hour
RPO Target: 15 minutes
Last Updated: April 29, 2026
Last Tested: ⚠️ Never - Requires testing

When to Use This Runbook

Use this runbook when:

Database corruption detected
Accidental data deletion
Database server failure
Data integrity issues
Rollback required after failed migration
Disaster recovery scenario

Prerequisites

Required Access

SSH access to database server
Database admin credentials
Backup storage access (S3 or equivalent)
PagerDuty/Slack access for notifications

Required Tools

psql (PostgreSQL client)
pg_restore (PostgreSQL restore utility)
aws CLI (if using S3 for backups)
ssh client

Required Information

Database connection string
Backup location/path
Target restore point (timestamp)
Incident ticket number

Pre-Restore Checklist

Verify backup exists for target restore point
Notify team via Slack/PagerDuty
Create incident ticket and document
Stop backend services to prevent writes
Verify disk space (need 2x database size)
Take snapshot of current database (if possible)
Document current state (table counts, checksums)

Restore Procedure

Step 1: Assess the Situation (5 minutes)

1.1 Verify Database Status

# Check if database is accessible
psql $DATABASE_URL -c "SELECT version();"

# Check database size
psql $DATABASE_URL -c "SELECT pg_size_pretty(pg_database_size('yieldvault_prod'));"

# Check table counts (for later verification)
psql $DATABASE_URL -c "
  SELECT schemaname, tablename, n_live_tup 
  FROM pg_stat_user_tables 
  ORDER BY n_live_tup DESC;
"

Expected Output:

PostgreSQL 15.x on x86_64-pc-linux-gnu

If database is inaccessible:

Proceed to Step 2 immediately
Document error messages
Check database server logs

1.2 Identify Restore Point

# List available backups
aws s3 ls s3://yieldvault-backups/database/ --recursive | tail -20

# Or for local backups
ls -lh /var/backups/postgresql/ | tail -20

Choose restore point based on:

Time of last known good state
RPO requirements (15 minutes)
Backup availability

Document:

Restore Point: 2026-04-29 14:30:00 UTC
Backup File: yieldvault_prod_20260429_1430.dump
Backup Size: 2.5 GB

Step 2: Stop Backend Services (5 minutes)

2.1 Stop Application Servers

# If using systemd
sudo systemctl stop yieldvault-backend

# If using Docker
docker stop yieldvault-backend

# If using PM2
pm2 stop yieldvault-backend

# Verify stopped
curl http://localhost:3000/health
# Should return connection refused or 503

2.2 Verify No Active Connections

# Check active connections
psql $DATABASE_URL -c "
  SELECT pid, usename, application_name, client_addr, state
  FROM pg_stat_activity
  WHERE datname = 'yieldvault_prod'
  AND pid <> pg_backend_pid();
"

# Terminate active connections (if necessary)
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE datname = 'yieldvault_prod'
  AND pid <> pg_backend_pid();
"

Expected Output:

 pid | usename | application_name | client_addr | state
-----+---------+------------------+-------------+-------
(0 rows)

Step 3: Backup Current State (10 minutes)

⚠️ CRITICAL: Always backup current state before restore, even if corrupted.

3.1 Create Safety Backup

# Set backup filename with timestamp
SAFETY_BACKUP="yieldvault_pre_restore_$(date +%Y%m%d_%H%M%S).dump"

# Create backup (this may fail if database is corrupted)
pg_dump $DATABASE_URL \
  --format=custom \
  --file="/var/backups/postgresql/safety/${SAFETY_BACKUP}" \
  --verbose

# If pg_dump fails, try plain SQL format
pg_dump $DATABASE_URL \
  --format=plain \
  --file="/var/backups/postgresql/safety/${SAFETY_BACKUP}.sql" \
  --verbose

# Upload to S3 for safekeeping
aws s3 cp "/var/backups/postgresql/safety/${SAFETY_BACKUP}" \
  "s3://yieldvault-backups/safety/${SAFETY_BACKUP}"

If backup fails:

Document the error
Proceed with restore (data may be unrecoverable anyway)
Note in incident report

Step 4: Download Backup File (10 minutes)

4.1 Download from S3

# Set variables
BACKUP_FILE="yieldvault_prod_20260429_1430.dump"
LOCAL_PATH="/var/backups/postgresql/restore/${BACKUP_FILE}"

# Download backup
aws s3 cp "s3://yieldvault-backups/database/${BACKUP_FILE}" "${LOCAL_PATH}"

# Verify download
ls -lh "${LOCAL_PATH}"
md5sum "${LOCAL_PATH}"

# Compare with stored checksum
aws s3api head-object \
  --bucket yieldvault-backups \
  --key "database/${BACKUP_FILE}" \
  --query 'Metadata.md5' \
  --output text

Expected Output:

-rw-r--r-- 1 postgres postgres 2.5G Apr 29 14:30 yieldvault_prod_20260429_1430.dump
a1b2c3d4e5f6... /var/backups/postgresql/restore/yieldvault_prod_20260429_1430.dump

If checksums don't match:

⚠️ STOP - Backup may be corrupted
Try downloading again
Try alternate backup
Escalate to senior engineer

Step 5: Prepare Database (5 minutes)

5.1 Drop Existing Database (⚠️ DESTRUCTIVE)

# Connect as superuser
psql postgres://postgres:password@localhost/postgres

# Drop database (this will fail if connections exist)
DROP DATABASE yieldvault_prod;

# If drop fails, force disconnect and try again
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'yieldvault_prod';

DROP DATABASE yieldvault_prod;

# Create fresh database
CREATE DATABASE yieldvault_prod
  WITH OWNER = yieldvault_user
  ENCODING = 'UTF8'
  LC_COLLATE = 'en_US.UTF-8'
  LC_CTYPE = 'en_US.UTF-8'
  TEMPLATE = template0;

# Grant privileges
GRANT ALL PRIVILEGES ON DATABASE yieldvault_prod TO yieldvault_user;

# Exit psql
\q

Expected Output:

DROP DATABASE
CREATE DATABASE
GRANT

Step 6: Restore Database (20 minutes)

6.1 Perform Restore

# Set variables
BACKUP_FILE="/var/backups/postgresql/restore/yieldvault_prod_20260429_1430.dump"
DATABASE_URL="postgresql://yieldvault_user:password@localhost:5432/yieldvault_prod"

# Restore database
pg_restore \
  --dbname="${DATABASE_URL}" \
  --verbose \
  --clean \
  --if-exists \
  --no-owner \
  --no-acl \
  --jobs=4 \
  "${BACKUP_FILE}" \
  2>&1 | tee /var/log/postgresql/restore_$(date +%Y%m%d_%H%M%S).log

# Check exit code
echo "Restore exit code: $?"

Expected Output:

pg_restore: creating TABLE "public.User"
pg_restore: creating TABLE "public.VaultState"
pg_restore: creating TABLE "public.Transaction"
...
pg_restore: creating INDEX "public.Transaction_timestamp_idx"
...
Restore exit code: 0

Common Errors:

Error: role "yieldvault_user" does not exist

# Create user
psql postgres -c "CREATE USER yieldvault_user WITH PASSWORD 'password';"
# Retry restore

Error: database "yieldvault_prod" already exists

# Use --clean flag (already in command above)
# Or manually drop database first

Error: out of memory

# Reduce --jobs parameter
pg_restore --jobs=2 ...
# Or restore without parallel jobs
pg_restore --jobs=1 ...

Step 7: Verify Restore (10 minutes)

7.1 Check Table Counts

# Compare table counts with pre-restore snapshot
psql $DATABASE_URL -c "
  SELECT schemaname, tablename, n_live_tup 
  FROM pg_stat_user_tables 
  ORDER BY n_live_tup DESC;
"

Expected Output:

 schemaname |    tablename     | n_live_tup
------------+------------------+------------
 public     | Transaction      |      15234
 public     | User             |       1523
 public     | Referral         |        342
 public     | AdminAuditLog    |       8901
 public     | VaultState       |          1
 public     | ReferralCode     |        156

Verify:

All tables present
Row counts reasonable (within RPO window)
No missing tables

7.2 Check Data Integrity

# Check for NULL values in critical fields
psql $DATABASE_URL -c "
  SELECT COUNT(*) as null_addresses
  FROM \"User\"
  WHERE address IS NULL;
"

# Should return 0

# Check transaction data
psql $DATABASE_URL -c "
  SELECT 
    type,
    COUNT(*) as count,
    SUM(CAST(amount AS NUMERIC)) as total_amount
  FROM \"Transaction\"
  GROUP BY type;
"

Expected Output:

   type    | count | total_amount
-----------+-------+--------------
 deposit   |  8234 |  12345678.90
 withdraw  |  7000 |  10234567.80

7.3 Check Indexes and Constraints

# Verify indexes
psql $DATABASE_URL -c "
  SELECT schemaname, tablename, indexname
  FROM pg_indexes
  WHERE schemaname = 'public'
  ORDER BY tablename, indexname;
"

# Verify foreign keys
psql $DATABASE_URL -c "
  SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name AS foreign_table_name,
    ccu.column_name AS foreign_column_name
  FROM information_schema.table_constraints AS tc
  JOIN information_schema.key_column_usage AS kcu
    ON tc.constraint_name = kcu.constraint_name
  JOIN information_schema.constraint_column_usage AS ccu
    ON ccu.constraint_name = tc.constraint_name
  WHERE tc.constraint_type = 'FOREIGN KEY';
"

7.4 Run Prisma Migrations (if needed)

# Check migration status
cd /app/backend
npx prisma migrate status

# If migrations are pending
npx prisma migrate deploy

# Verify schema
npx prisma validate

Step 8: Restart Services (5 minutes)

8.1 Start Backend Services

# If using systemd
sudo systemctl start yieldvault-backend

# If using Docker
docker start yieldvault-backend

# If using PM2
pm2 start yieldvault-backend

# Wait for startup
sleep 10

8.2 Verify Service Health

# Check health endpoint
curl http://localhost:3000/health

# Expected response
{
  "status": "healthy",
  "timestamp": "2026-04-29T15:30:00.000Z",
  "checks": {
    "api": "up",
    "database": "up",
    "stellarRpc": "up"
  }
}

# Check database connectivity
curl http://localhost:3000/api/v1/vault/summary

# Should return vault data without errors

Step 9: Smoke Tests (10 minutes)

9.1 Test Critical Endpoints

# Test user lookup
curl http://localhost:3000/api/v1/user/GXXX...

# Test transaction history
curl http://localhost:3000/api/v1/transactions?limit=10

# Test vault state
curl http://localhost:3000/api/v1/vault/metrics

9.2 Test Database Writes

# Test audit log write (admin endpoint)
curl -X POST http://localhost:3000/admin/test-write \
  -H "Authorization: ApiKey test-key" \
  -H "Content-Type: application/json"

# Verify write succeeded
psql $DATABASE_URL -c "
  SELECT * FROM \"AdminAuditLog\"
  ORDER BY \"createdAt\" DESC
  LIMIT 1;
"

9.3 Monitor Logs

# Check application logs
tail -f /var/log/yieldvault/backend.log

# Check for errors
grep -i error /var/log/yieldvault/backend.log | tail -20

# Check database logs
tail -f /var/log/postgresql/postgresql-15-main.log

Step 10: Post-Restore Actions (10 minutes)

10.1 Update Monitoring

# Clear any false alerts
# Update status page
# Notify monitoring team

10.2 Document Restore

Create incident report with:

10.3 Notify Stakeholders

# Send notification via Slack
curl -X POST $SLACK_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "✅ Database restore completed successfully",
    "blocks": [
      {
        "type": "section",
        "text": {
          "type": "mrkdwn",
          "text": "*Database Restore Complete*\n• Restore Point: 2026-04-29 14:30 UTC\n• Data Loss: ~5 minutes\n• Total Downtime: 45 minutes\n• Status: All systems operational"
        }
      }
    ]
  }'

Rollback Procedure

If restore fails or causes issues:

Option 1: Restore from Safety Backup

# Use the safety backup created in Step 3
pg_restore --dbname=$DATABASE_URL \
  --clean --if-exists \
  /var/backups/postgresql/safety/${SAFETY_BACKUP}

Option 2: Try Different Backup

# List available backups
aws s3 ls s3://yieldvault-backups/database/ | tail -50

# Select earlier backup
# Repeat restore procedure from Step 4

Option 3: Restore from WAL Archive (Point-in-Time)

# This requires WAL archiving to be enabled
# Contact senior DBA for assistance
# See: https://www.postgresql.org/docs/current/continuous-archiving.html

Troubleshooting

Issue: Restore Takes Too Long

Symptoms: Restore exceeds 20-minute window

Solutions:

Increase --jobs parameter (more parallel workers)
Check disk I/O performance
Verify network speed (if downloading backup)
Consider restoring to faster storage first

Issue: Out of Disk Space

Symptoms: No space left on device

Solutions:

Check disk space: df -h
Clean up old backups: rm /var/backups/postgresql/old/*
Clean up logs: find /var/log -name "*.log" -mtime +7 -delete
Expand disk volume (if cloud)

Issue: Connection Refused After Restore

Symptoms: Backend can't connect to database

Solutions:

Check PostgreSQL is running: systemctl status postgresql
Check pg_hba.conf for connection permissions
Verify DATABASE_URL is correct
Check firewall rules
Restart PostgreSQL: systemctl restart postgresql

Issue: Data Inconsistencies

Symptoms: Missing or incorrect data

Solutions:

Verify backup file integrity (checksum)
Check restore logs for errors
Try restoring from different backup
Run data validation queries
Consider point-in-time recovery from WAL

Verification Checklist

After restore, verify:

Metrics to Track

Document these metrics for each restore:

Metric	Target	Actual	Status
Detection Time	5 min	___ min	___
Decision Time	10 min	___ min	___
Backup Download	10 min	___ min	___
Restore Time	20 min	___ min	___
Verification Time	10 min	___ min	___
Total RTO	60 min	___ min	___
Data Loss	15 min	___ min	___
RPO Met	Yes	___	___

Post-Incident Review

Schedule post-incident review within 48 hours:

Questions to Answer:

What caused the need for restore?
Was the restore successful?
Did we meet RTO/RPO targets?
What went well?
What could be improved?
Do runbooks need updates?
Do we need additional monitoring?
Should we adjust RTO/RPO targets?

Action Items:

Update runbook with lessons learned
Implement preventive measures
Schedule follow-up testing
Update monitoring/alerting
Train team on improvements

Related Runbooks

Emergency Contacts

Role	Name	Phone	Email
Database Admin	TBD	TBD	TBD
DevOps Lead	TBD	TBD	TBD
On-Call Engineer	TBD	TBD	TBD
Team Lead	TBD	TBD	TBD

PagerDuty: [Escalation Policy Link]
Slack Channel: #yieldvault-incidents

Last Updated: April 29, 2026
Next Review: July 29, 2026
Tested: ⚠️ Never - Schedule test ASAP

FilesExpand file tree

DATABASE_RESTORE.md

Latest commit

History

DATABASE_RESTORE.md

File metadata and controls

Database Restore Runbook

When to Use This Runbook

Prerequisites

Required Access

Required Tools

Required Information

Pre-Restore Checklist

Restore Procedure

Step 1: Assess the Situation (5 minutes)

1.1 Verify Database Status

1.2 Identify Restore Point

Step 2: Stop Backend Services (5 minutes)

2.1 Stop Application Servers

2.2 Verify No Active Connections

Step 3: Backup Current State (10 minutes)

3.1 Create Safety Backup

Step 4: Download Backup File (10 minutes)

4.1 Download from S3

Step 5: Prepare Database (5 minutes)

5.1 Drop Existing Database (⚠️ DESTRUCTIVE)

Step 6: Restore Database (20 minutes)

6.1 Perform Restore

Step 7: Verify Restore (10 minutes)

7.1 Check Table Counts

7.2 Check Data Integrity

7.3 Check Indexes and Constraints

7.4 Run Prisma Migrations (if needed)

Step 8: Restart Services (5 minutes)

8.1 Start Backend Services

8.2 Verify Service Health

Step 9: Smoke Tests (10 minutes)

9.1 Test Critical Endpoints

9.2 Test Database Writes

9.3 Monitor Logs

Step 10: Post-Restore Actions (10 minutes)

10.1 Update Monitoring

10.2 Document Restore

10.3 Notify Stakeholders

Rollback Procedure

Option 1: Restore from Safety Backup

Option 2: Try Different Backup

Option 3: Restore from WAL Archive (Point-in-Time)

Troubleshooting

Issue: Restore Takes Too Long

Issue: Out of Disk Space

Issue: Connection Refused After Restore

Issue: Data Inconsistencies

Verification Checklist

Metrics to Track

Post-Incident Review

Questions to Answer:

Action Items:

Related Runbooks

Emergency Contacts