fix(replica): tolerate per-statement errors in schema migration by passcod · Pull Request #39 · beyondessential/postgres-restore-operator

passcod · 2026-05-12T17:10:23Z

The persistent_schemas migration uses pg_dump | psql to copy schemas between restores. Until now psql ran with ON_ERROR_STOP=1, so the first failing statement killed the whole migration. That blocks the replica from coming up whenever some object in the persistent schema references upstream state that has since changed — e.g. dbt views referencing tamanu columns that have been renamed or dropped upstream.

Observed in production on palau-prod: an upstream column rename (notes.note_type → notes.note_type_id) made one dbt view fail to recreate, which aborted the whole migration. The restore stayed stuck in Switching for 10+ hours with ~130 migration Job pods retrying the same failure.

Replicas don't control upstream and can't keep dbt strictly in sync. Stated preference: replica availability over schema completeness. Clients can regenerate broken views afterward; what they can't do is wait days for a replica to come back.

Drop ON_ERROR_STOP, capture psql's stderr to count ^ERROR: lines, and report the outcome via the callback as either success (no errors) or partial: N statement error(s). The script always exits 0 unless something catastrophic happens upstream of psql (network, auth, pg_dump segfault, etc.).
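A minimal Python sketch of the counting-and-reporting step described above. It is illustrative only: the real migration is a shell script, and the function name `summarize_psql_stderr` is hypothetical. It assumes psql's stderr has already been captured as text, and that a statement error is any line starting with `ERROR:`.

```python
import re

# Matches "ERROR:" only at the start of a line, mirroring the ^ERROR: pattern.
ERROR_LINE = re.compile(r"^ERROR:", re.MULTILINE)


def summarize_psql_stderr(stderr_text: str) -> str:
    """Build the callback body from captured psql stderr (hypothetical helper)."""
    count = len(ERROR_LINE.findall(stderr_text))
    if count == 0:
        return "success"
    return f"partial: {count} statement error(s)"
```

Anchoring the pattern to the start of the line means that notices or context lines that merely mention the word ERROR are not counted.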

On the operator side, route the callback body: partial results log a warn, emit a Warning event (SchemaMigrationPartial) for visibility, and set schemaMigrationPhase: "partial". The sweep gate accepts partial alongside complete and None, since the migration has run and we no longer depend on the previous restore.
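The routing and gate logic above could be sketched as follows. This is a Python illustration under stated assumptions, not the operator's actual code: the function names are hypothetical, the mapping of a `success` body to the `complete` phase is assumed, and rejecting unrecognised bodies with an exception is a placeholder for whatever the operator really does.

```python
from typing import Optional, Tuple


def route_callback(body: str) -> Tuple[str, bool]:
    """Return (schemaMigrationPhase, emit_warning_event) for a callback body."""
    body = body.strip()
    if body == "success":
        # Assumption: a clean run is recorded as the "complete" phase.
        return "complete", False
    if body.startswith("partial"):
        # Partial runs are surfaced via a SchemaMigrationPartial Warning event.
        return "partial", True
    # Assumption: anything else is not a valid migration outcome.
    raise ValueError(f"unrecognised callback body: {body!r}")


def sweep_allowed(phase: Optional[str]) -> bool:
    """The sweep gate accepts None, "complete", and "partial"."""
    return phase in (None, "complete", "partial")
```

With this shape, `route_callback("partial: 3 statement error(s)")` yields the `partial` phase with the warning flag set, and the sweep of the previous restore proceeds exactly as it would after a clean migration.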

Add a unit test on the migration script content asserting the tolerance guarantees.

passcod merged commit 3d6f4f6 into main May 12, 2026
19 checks passed

passcod deleted the schema-migration-tolerant branch May 12, 2026 17:33
