feat: harden add-node workflow for data safety by tsivaprasad · Pull Request #385 · pgEdge/control-plane

tsivaprasad · 2026-05-14T12:30:56Z

Summary

This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.

Changes

Add PeerCatchupResource to block add-node progress until the new
node's replication slot has caught up via spock.progress.remote_lsn
Advance replication origin alongside slot during add-node to ensure
consistent origin tracking on the subscriber
Treat disabled/down subscriptions as transient in
wait_for_sync_event to avoid false failures mid-add-node
Updated to Spock 5.0.8

Testing

Verification:

Created cluster
Created DB with 2 nodes

cp1-req create-database '{
  "id": "testdb",
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] }
    ]
  }
}'

Insert data in n1 and verify in n2

cp-psql -i testdb-n1-689qacsi -U admin

psql (18.4)
Type "help" for help.

testdb=# CREATE TABLE test (id INT PRIMARY KEY, data TEXT);
INSERT INTO test VALUES (1, 'from-n1');
INFO:  DDL statement replicated.
CREATE TABLE
INSERT 0 1

cp-psql -i testdb-n2-9ptayhma -U admin

psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
(1 row)

testdb=# INSERT INTO test VALUES (2, 'from-n2');
INSERT 0 1
testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)

cp-psql -i testdb-n1-689qacsi -U admin

psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)

Update-database

cp1-req update-database testdb '{
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] },
      { "name": "n3", "host_ids": ["host-3"] }
    ]
  }
}'

Confirming data sync , replication slots and insert data into

cp-psql -i testdb-n3-ant97dj4 -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test ORDER BY id;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)

testdb=# SELECT slot_name, failover FROM pg_replication_slots;
        slot_name        | failover 
-------------------------+----------
 spk_testdb_n3_sub_n3_n2 | f
 spk_testdb_n3_sub_n3_n1 | f
(2 rows)

testdb=# SELECT * FROM pg_replication_origin;
 roident |         roname          
---------+-------------------------
       1 | spk_testdb_n2_sub_n2_n3
       2 | spk_testdb_n1_sub_n1_n3
(2 rows)

testdb=# INSERT INTO test VALUES (3, 'from-n3');
INSERT 0 1

cp-psql -i testdb-n1-689qacsi -U admin                                     
  
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
  3 | from-n3
(3 rows)

Update-database with n1 2 hosts

cp1-req update-database testdb '{              
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-4"] },
      { "name": "n2", "host_ids": ["host-2"] },
      { "name": "n3", "host_ids": ["host-3"] }
    ]
  }
}'

switchover-database-name

cp1-req switchover-database-node testdb n1 '{}'
HTTP/1.1 200 OK
Content-Length: 215
Content-Type: application/json
Date: Thu, 14 May 2026 18:01:04 GMT

{
  task: {
    created_at: "2026-05-14T18:01:04Z"
    database_id: "testdb"
    entity_id: "testdb"
    node_name: "n1"
    scope: "database"
    status: "pending"
    task_id: "019e27a6-363c-7687-826f-10ef082903e9"
    type: "switchover"
  }
}

cp-psql -i testdb-n1-2kqnsx2a -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT slot_name, active, failover FROM pg_replication_slots;
        slot_name        | active | failover 
-------------------------+--------+----------
 spk_testdb_n1_sub_n1_n2 | t      | f
 testdb_n1_689qacsi      | t      | f
 spk_testdb_n1_sub_n1_n3 | t      | f
(3 rows)

testdb=# SELECT slot_name, active, failover FROM pg_replication_slots;
        slot_name        | active | failover 
-------------------------+--------+----------
 spk_testdb_n1_sub_n1_n2 | t      | f
 testdb_n1_689qacsi      | t      | f
 spk_testdb_n1_sub_n1_n3 | t      | f
(3 rows)

testdb=# SELECT sub_name, sub_enabled FROM spock.subscription;
 sub_name  | sub_enabled 
-----------+-------------
 sub_n2_n1 | t
 sub_n3_n1 | t
(2 rows)

testdb=# INSERT INTO test VALUES (10, 'post-switchover');
INSERT 0 1

Confriming data sync n2 & n3

cp-psql -i testdb-n2-9ptayhma -U admin
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |      data       
----+-----------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
(4 rows)

cp-psql -i testdb-n3-ant97dj4 -U admin

psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test;            
 id |      data       
----+-----------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
(4 rows)

Swtchover-database-node

cp1-req switchover-database-node testdb n1 '{"candidate_instance_id": "testdb-n1-689qacsi"}'
HTTP/1.1 200 OK
Content-Length: 250
Content-Type: application/json
Date: Thu, 14 May 2026 18:10:39 GMT

{
  task: {
    created_at: "2026-05-14T18:10:39Z"
    database_id: "testdb"
    entity_id: "testdb"
    instance_id: "testdb-n1-689qacsi"
    node_name: "n1"
    scope: "database"
    status: "pending"
    task_id: "019e27ae-fc68-7012-9fba-1c5f62f24402"
    type: "switchover"
  }
}

After switchover-database-node inserting data

cp-psql -i testdb-n1-689qacsi -U admin
psql (18.4)
Type "help" for help.

testdb=# INSERT INTO test VALUES (30, 'final-validation');
INSERT 0 1
testdb=# select * from test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)

Confirming data sync

cp-psql -i testdb-n2-9ptayhma -U admin
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)

cp-psql -i testdb-n3-ant97dj4 -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)

Checklist

Tests added and updated

- Add `PeerCatchupResource` to block add-node progress until the new node's replication slot catches up via `spock.progress.remote_lsn`, preventing data loss during failover after node addition - Advance replication origin alongside slot during add-node to ensure consistent origin tracking on the subscriber - Treat disabled/down subscriptions as transient in `wait_for_sync_event` to avoid false failures mid-add-node - Bump Postgres image tags to Spock 5.0.8 - Add E2E tests covering add-node replication data safety

coderabbitai · 2026-05-14T12:31:09Z

📝 Walkthrough

Walkthrough

This PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8.

Changes

Replication Origin Advancement for Multi-Node Topologies

Layer / File(s)	Summary
PostgreSQL replication origin helpers `server/internal/postgres/create_db.go`	New functions EnsureReplicationOriginExists, AdvanceReplicationOrigin, and SpockProgressReachedLSN provide replication origin management and peer progress tracking on the provider side.
PeerCatchupResource implementation `server/internal/database/peer_catchup_resource.go`	PeerCatchupResource polls until a peer node's apply progress reaches a sync event LSN using SpockProgressReachedLSN, with resource metadata, identifier helpers, and resource lifecycle methods.
ReplicationOriginAdvanceResource implementation `server/internal/database/replication_origin_advance_resource.go`	ReplicationOriginAdvanceResource advances the replication origin on the subscriber node to the LSN computed by ReplicationSlotAdvanceFromCTSResource, ensuring apply resumes from the correct position.
Existing resource output and status handling `server/internal/database/replication_slot_advance_from_cts_resource.go`, `server/internal/database/wait_for_sync_event_resource.go`	ReplicationSlotAdvanceFromCTSResource outputs AdvancedToLSN for subscriber consumption and ignores it during diffing. WaitForSyncEventResource treats disabled/down subscription states as transient and retryable.
Resource orchestration in PopulateNode `server/internal/database/operations/populate_nodes.go`, `server/internal/database/resources.go`	PopulateNode wires PeerCatchupResourceIdentifier into sync dependencies. addPeerResources creates both PeerCatchupResource and ReplicationOriginAdvanceResource. RegisterResourceTypes registers both new resource types.
Unit and golden test expectations `server/internal/database/operations/populate_nodes_test.go`, `server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json`	Test expectations updated with PeerCatchupResource entries, PeerCatchupResourceIdentifier dependencies, and ReplicationOriginAdvanceResource replacing ReplicationSlotAdvanceFromCTSResource in multi-node scenarios.
End-to-end replication origin test `e2e/add_node_data_safety_test.go`	New TestAddNodeOriginAdvanced verifies replication origin advancement after adding a subscriber node by provisioning a 3-node topology, writing data for WAL advancement, and asserting the new subscriber's origin progresses past zero-LSN.
Container image version update `server/internal/orchestrator/swarm/images.go`	Spock container image tag updated to 5.0.8-standard-1 for PgEdge versions 16.13/5, 17.9/5, and 18.3/5.

🐰 New resources hop into place,
Peers catch up with grace and pace,
Origins advance with care,
Multi-node replication, fair!
Spock 5.0.8 leads the way,
Safer clusters every day. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main objective: hardening the add-node workflow for data safety.
Description check	✅ Passed	The description follows the template with clear Summary, Changes, Testing, and Checklist sections; all required information is present and detailed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codacy-production · 2026-05-14T12:31:53Z

Up to standards ✅

🟢 Issues 1 medium

Results:
1 new issue

Category Results

Complexity 1 medium

View in Codacy

🟢 Metrics 36 complexity · 0 duplication

Metric Results

Complexity 36

Duplication 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.

In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.

In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5

📥 Commits

Reviewing files that changed from the base of the PR and between 5dc438d and ca0900c.

📒 Files selected for processing (11)

e2e/add_node_data_safety_test.go
server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
server/internal/database/operations/populate_nodes.go
server/internal/database/operations/populate_nodes_test.go
server/internal/database/peer_catchup_resource.go
server/internal/database/replication_origin_advance_resource.go
server/internal/database/replication_slot_advance_from_cts_resource.go
server/internal/database/resources.go
server/internal/database/wait_for_sync_event_resource.go
server/internal/orchestrator/swarm/images.go
server/internal/postgres/create_db.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

server/internal/orchestrator/swarm/images.go (1)
78-78: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Default version points to non-existent Spock 5.0.8 image.

The default version now points to 18.4/5, which uses the spock5.0.8-standard-1 image that doesn't exist yet. This means any new deployments using the default version will fail until the images from postgres-images PR #20 are published. Consider keeping the default at 18.3/5 (which uses the existing spock5.0.6-standard-2 image) until the 5.0.8 images are verified in production.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/orchestrator/swarm/images.go` at line 78, The defaultVersion
assignment in versions.defaultVersion using ds.MustPgEdgeVersion("18.4", "5")
points to a non-existent Spock 5.0.8 image; change the call to use the verified
image tag by setting ds.MustPgEdgeVersion("18.3", "5") instead (update the
versions.defaultVersion initializer where ds.MustPgEdgeVersion is invoked) so
new deployments continue to use the existing spock5.0.6-standard-2 image until
the 5.0.8 images are published.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Around line 38-39: The code changes downgrade several image tags from
spock5.0.7 to spock5.0.6 (seen where imageTag(...) is used, e.g., PgEdgeImage),
which conflicts with the PR goal to update to Spock 5.0.8; either update these
imageTag invocations to the intended spock5.0.8-standard-<variant> (or restore
them to spock5.0.7 if that was the intent) across all occurrences, or if the
downgrade to spock5.0.6 is deliberate, add a short rationale to the PR
description documenting why 5.0.7/5.0.8 were not used; search for imageTag(...
"spock5.0.6-standard-2") and fix the tag strings or document the decision
accordingly (e.g., for PgEdgeImage and the other image variables that use
imageTag).

---

Duplicate comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Line 78: The defaultVersion assignment in versions.defaultVersion using
ds.MustPgEdgeVersion("18.4", "5") points to a non-existent Spock 5.0.8 image;
change the call to use the verified image tag by setting
ds.MustPgEdgeVersion("18.3", "5") instead (update the versions.defaultVersion
initializer where ds.MustPgEdgeVersion is invoked) so new deployments continue
to use the existing spock5.0.6-standard-2 image until the 5.0.8 images are
published.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 857aa726-a07a-4979-ad33-103cc0fefa36

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee14f7 and 24c56a6.

📒 Files selected for processing (4)

e2e/add_node_data_safety_test.go
e2e/minor_version_upgrade_test.go
server/internal/database/peer_catchup_resource.go
server/internal/orchestrator/swarm/images.go

✅ Files skipped from review due to trivial changes (1)

e2e/minor_version_upgrade_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

e2e/add_node_data_safety_test.go
server/internal/database/peer_catchup_resource.go

coderabbitai · 2026-05-18T10:51:47Z

+		PgEdgeImage: imageTag(cfg, "16.13-spock5.0.6-standard-2"),
+	})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify the downgrade from Spock 5.0.7 to 5.0.6 is intentional.

The PR objectives state "Update Postgres/Spock to Spock 5.0.8," but these lines downgrade existing versions (16.13, 17.9, 18.3) from spock5.0.7-standard-1 to spock5.0.6-standard-2. This appears inconsistent with the stated goal. If the downgrade is intentional (e.g., 5.0.7 had critical issues), please document the rationale in the PR description.

Also applies to: 55-56, 72-73

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/orchestrator/swarm/images.go` around lines 38 - 39, The code changes downgrade several image tags from spock5.0.7 to spock5.0.6 (seen where imageTag(...) is used, e.g., PgEdgeImage), which conflicts with the PR goal to update to Spock 5.0.8; either update these imageTag invocations to the intended spock5.0.8-standard-<variant> (or restore them to spock5.0.7 if that was the intent) across all occurrences, or if the downgrade to spock5.0.6 is deliberate, add a short rationale to the PR description documenting why 5.0.7/5.0.8 were not used; search for imageTag(... "spock5.0.6-standard-2") and fix the tag strings or document the decision accordingly (e.g., for PgEdgeImage and the other image variables that use imageTag).

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Comment thread server/internal/database/replication_slot_advance_from_cts_resource.go

Comment thread server/internal/database/wait_for_sync_event_resource.go

Comment thread server/internal/orchestrator/swarm/images.go Outdated

addressing AI review comments

1ee14f7

tsivaprasad marked this pull request as draft May 14, 2026 13:28

tsivaprasad marked this pull request as ready for review May 15, 2026 11:41

updating spock 5.0.8

24c56a6

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

jason-lynch approved these changes May 18, 2026

View reviewed changes

tsivaprasad merged commit 1f2c4e6 into main May 18, 2026
3 checks passed

tsivaprasad deleted the PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes branch May 18, 2026 12:56

mmols mentioned this pull request May 20, 2026

feat: harden add-node workflow for data safety, add failover test pgEdge/pgedge-helm#57

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: harden add-node workflow for data safety#385

feat: harden add-node workflow for data safety#385
tsivaprasad merged 3 commits into
mainfrom
PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes

tsivaprasad commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		PgEdgeImage: imageTag(cfg, "16.13-spock5.0.6-standard-2"),
		})

Conversation

tsivaprasad commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Checklist

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codacy-production Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tsivaprasad commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading

codacy-production Bot commented May 14, 2026 •

edited

Loading