fix(replica): tolerate per-statement errors in schema migration #39

Merged: passcod merged 1 commit into main from schema-migration-tolerant on May 12, 2026
@passcod passcod commented May 12, 2026

The persistent_schemas migration uses pg_dump | psql to copy schemas between restores. Until now psql ran with ON_ERROR_STOP=1, so the first failing statement killed the whole migration. That blocks the replica from coming up whenever some object in the persistent schema references upstream state that has since changed — e.g. dbt views referencing tamanu columns that have been renamed or dropped upstream.

Observed in production on palau-prod: an upstream column rename (notes.note_type → notes.note_type_id) made one dbt view fail to recreate, which aborted the whole migration. The restore stayed stuck in Switching for 10+ hours with ~130 migration Job pods retrying the same failure.

Replicas don't control upstream and can't keep dbt strictly in sync. Stated preference: replica availability over schema completeness. Clients can regenerate broken views afterward; what they can't do is wait days for a replica to come back.

Drop ON_ERROR_STOP, capture psql's stderr to count ^ERROR: lines, and report the outcome via the callback as either success (no errors) or partial: N statement error(s). The script always exits 0 unless something catastrophic happens upstream of psql (network, auth, pg_dump segfault, etc.).
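
As a rough illustration of the counting and reporting step described above, here is a minimal Python sketch. It is not the migration script itself (which pipes pg_dump into psql); the function names `count_statement_errors` and `callback_body` are hypothetical, and the only behaviour taken from this PR is the `^ERROR:` match and the two result strings.

```python
import re

# Matches lines in captured psql stderr that begin with "ERROR:".
ERROR_LINE = re.compile(r"^ERROR:", re.MULTILINE)


def count_statement_errors(stderr_text: str) -> int:
    """Count stderr lines starting with 'ERROR:' (hypothetical helper)."""
    return len(ERROR_LINE.findall(stderr_text))


def callback_body(error_count: int) -> str:
    """Build the callback body: 'success' or 'partial: N statement error(s)'."""
    if error_count == 0:
        return "success"
    return f"partial: {error_count} statement error(s)"
```

Only lines that start with `ERROR:` are counted, so continuation lines such as `LINE 1: ...` or a notice that merely mentions the word do not inflate the total.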

On the operator side, route the callback body: partial results log a warn, emit a Warning event (SchemaMigrationPartial) for visibility, and set schemaMigrationPhase: "partial". The sweep gate accepts partial alongside complete and None, since the migration has run and we no longer depend on the previous restore.
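
The routing and the sweep gate could look roughly like the following Python sketch. This is an illustration of the logic only, not the operator's code: `route_callback` and `sweep_allowed` are hypothetical names, and two mappings are assumptions not stated in this PR (that a `success` body is recorded as `complete`, and that any other body maps to a phase called `failed`).

```python
from typing import Optional, Tuple

# Phases the sweep gate accepts, per the description above.
SWEEP_OK_PHASES = (None, "complete", "partial")


def route_callback(body: str) -> Tuple[str, Optional[str]]:
    """Map a callback body to (schemaMigrationPhase, Warning event reason)."""
    text = body.strip()
    if text == "success":
        # Assumption: a clean run is recorded as "complete".
        return "complete", None
    if text.startswith("partial"):
        # Partial results also get a Warning event for visibility.
        return "partial", "SchemaMigrationPartial"
    # Assumption: "failed" is a hypothetical phase name for anything else.
    return "failed", None


def sweep_allowed(phase: Optional[str]) -> bool:
    """The sweep may proceed once the migration has run, fully or partially."""
    return phase in SWEEP_OK_PHASES
```

Accepting `partial` in the gate reflects the stated trade-off: once the migration has run, the previous restore is no longer needed even if some statements failed.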

Add a unit test on the migration script content asserting the
tolerance guarantees.
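
A content check of this kind might be sketched as below, in Python for illustration. The real test and script are not shown on this page, so `SAMPLE_SCRIPT` is an invented stand-in (its variable names and the curl callback are assumptions), and `tolerance_violations` is a hypothetical helper that checks only the three guarantees named in this PR.

```python
# Hypothetical stand-in for the migration script; not the real one.
SAMPLE_SCRIPT = r"""
pg_dump "$SOURCE_URL" --schema="$SCHEMA" | psql "$TARGET_URL" 2> "$ERR_LOG"
errors=$(grep -c '^ERROR:' "$ERR_LOG" || true)
if [ "$errors" -eq 0 ]; then
  result="success"
else
  result="partial: $errors statement error(s)"
fi
curl -fsS -X POST --data "$result" "$CALLBACK_URL"
"""


def tolerance_violations(script: str) -> list:
    """Return a list of tolerance guarantees the script text fails to meet."""
    problems = []
    if "ON_ERROR_STOP" in script:
        problems.append("psql must not run with ON_ERROR_STOP")
    if "^ERROR:" not in script:
        problems.append("script must count ^ERROR: lines from stderr")
    if "success" not in script or "partial: " not in script:
        problems.append("script must report success or partial")
    return problems
```

Checking the script text rather than running it keeps the test fast and independent of a database, at the cost of only catching regressions in wording, not in behaviour.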

@passcod passcod merged commit 3d6f4f6 into main May 12, 2026
19 checks passed
@passcod passcod deleted the schema-migration-tolerant branch May 12, 2026 17:33