feat: harden add-node workflow for data safety#385
Conversation
- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot catches up via `spock.progress.remote_lsn`, preventing data loss during failover after node addition - Advance replication origin alongside slot during add-node to ensure consistent origin tracking on the subscriber - Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node - Bump Postgres image tags to Spock 5.0.8 - Add E2E tests covering add-node replication data safety
📝 WalkthroughWalkthroughThis PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8. ChangesReplication Origin Advancement for Multi-Node Topologies
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Category | Results |
|---|---|
| Complexity | 1 medium |
🟢 Metrics 36 complexity · 0 duplication
Metric Results Complexity 36 Duplication 0
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.
In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.
In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5
📒 Files selected for processing (11)
e2e/add_node_data_safety_test.goserver/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.jsonserver/internal/database/operations/populate_nodes.goserver/internal/database/operations/populate_nodes_test.goserver/internal/database/peer_catchup_resource.goserver/internal/database/replication_origin_advance_resource.goserver/internal/database/replication_slot_advance_from_cts_resource.goserver/internal/database/resources.goserver/internal/database/wait_for_sync_event_resource.goserver/internal/orchestrator/swarm/images.goserver/internal/postgres/create_db.go
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
server/internal/orchestrator/swarm/images.go (1)
78-78:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winDefault version points to non-existent Spock 5.0.8 image.
The default version now points to
18.4/5, which uses thespock5.0.8-standard-1image that doesn't exist yet. This means any new deployments using the default version will fail until the images from postgres-images PR#20are published. Consider keeping the default at18.3/5(which uses the existingspock5.0.6-standard-2image) until the 5.0.8 images are verified in production.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/orchestrator/swarm/images.go` at line 78, The defaultVersion assignment in versions.defaultVersion using ds.MustPgEdgeVersion("18.4", "5") points to a non-existent Spock 5.0.8 image; change the call to use the verified image tag by setting ds.MustPgEdgeVersion("18.3", "5") instead (update the versions.defaultVersion initializer where ds.MustPgEdgeVersion is invoked) so new deployments continue to use the existing spock5.0.6-standard-2 image until the 5.0.8 images are published.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Around line 38-39: The code changes downgrade several image tags from
spock5.0.7 to spock5.0.6 (seen where imageTag(...) is used, e.g., PgEdgeImage),
which conflicts with the PR goal to update to Spock 5.0.8; either update these
imageTag invocations to the intended spock5.0.8-standard-<variant> (or restore
them to spock5.0.7 if that was the intent) across all occurrences, or if the
downgrade to spock5.0.6 is deliberate, add a short rationale to the PR
description documenting why 5.0.7/5.0.8 were not used; search for imageTag(...
"spock5.0.6-standard-2") and fix the tag strings or document the decision
accordingly (e.g., for PgEdgeImage and the other image variables that use
imageTag).
---
Duplicate comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Line 78: The defaultVersion assignment in versions.defaultVersion using
ds.MustPgEdgeVersion("18.4", "5") points to a non-existent Spock 5.0.8 image;
change the call to use the verified image tag by setting
ds.MustPgEdgeVersion("18.3", "5") instead (update the versions.defaultVersion
initializer where ds.MustPgEdgeVersion is invoked) so new deployments continue
to use the existing spock5.0.6-standard-2 image until the 5.0.8 images are
published.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 857aa726-a07a-4979-ad33-103cc0fefa36
📒 Files selected for processing (4)
e2e/add_node_data_safety_test.goe2e/minor_version_upgrade_test.goserver/internal/database/peer_catchup_resource.goserver/internal/orchestrator/swarm/images.go
✅ Files skipped from review due to trivial changes (1)
- e2e/minor_version_upgrade_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- e2e/add_node_data_safety_test.go
- server/internal/database/peer_catchup_resource.go
| PgEdgeImage: imageTag(cfg, "16.13-spock5.0.6-standard-2"), | ||
| }) |
There was a problem hiding this comment.
Verify the downgrade from Spock 5.0.7 to 5.0.6 is intentional.
The PR objectives state "Update Postgres/Spock to Spock 5.0.8," but these lines downgrade existing versions (16.13, 17.9, 18.3) from spock5.0.7-standard-1 to spock5.0.6-standard-2. This appears inconsistent with the stated goal. If the downgrade is intentional (e.g., 5.0.7 had critical issues), please document the rationale in the PR description.
Also applies to: 55-56, 72-73
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/internal/orchestrator/swarm/images.go` around lines 38 - 39, The code
changes downgrade several image tags from spock5.0.7 to spock5.0.6 (seen where
imageTag(...) is used, e.g., PgEdgeImage), which conflicts with the PR goal to
update to Spock 5.0.8; either update these imageTag invocations to the intended
spock5.0.8-standard-<variant> (or restore them to spock5.0.7 if that was the
intent) across all occurrences, or if the downgrade to spock5.0.6 is deliberate,
add a short rationale to the PR description documenting why 5.0.7/5.0.8 were not
used; search for imageTag(... "spock5.0.6-standard-2") and fix the tag strings
or document the decision accordingly (e.g., for PgEdgeImage and the other image
variables that use imageTag).
Summary
This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.
Changes
PeerCatchupResourceto block add-node progress until the newnode's replication slot has caught up via
spock.progress.remote_lsnconsistent origin tracking on the subscriber
wait_for_sync_eventto avoid false failures mid-add-nodeTesting
Verification:
Checklist