Skip to content

feat: harden add-node workflow for data safety#385

Merged
tsivaprasad merged 3 commits into
mainfrom
PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes
May 18, 2026
Merged

feat: harden add-node workflow for data safety#385
tsivaprasad merged 3 commits into
mainfrom
PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes

Conversation

@tsivaprasad
Copy link
Copy Markdown
Contributor

@tsivaprasad tsivaprasad commented May 14, 2026

Summary

This PR hardens the add-node (ZODAN) workflow to prevent data loss during
failover after a node is added, and aligns slot creation with Spock
5.0.8 behavior.

Changes

  • Add PeerCatchupResource to block add-node progress until the new
    node's replication slot has caught up via spock.progress.remote_lsn
  • Advance replication origin alongside slot during add-node to ensure
    consistent origin tracking on the subscriber
  • Treat disabled/down subscriptions as transient in
    wait_for_sync_event to avoid false failures mid-add-node
  • Updated to Spock 5.0.8

Testing

Verification:

  1. Created cluster
  2. Created DB with 2 nodes
cp1-req create-database '{
  "id": "testdb",
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] }
    ]
  }
}'
  1. Insert data in n1 and verify in n2
cp-psql -i testdb-n1-689qacsi -U admin

psql (18.4)
Type "help" for help.

testdb=# CREATE TABLE test (id INT PRIMARY KEY, data TEXT);
INSERT INTO test VALUES (1, 'from-n1');
INFO:  DDL statement replicated.
CREATE TABLE
INSERT 0 1

cp-psql -i testdb-n2-9ptayhma -U admin

psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
(1 row)

testdb=# INSERT INTO test VALUES (2, 'from-n2');
INSERT 0 1
testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)

cp-psql -i testdb-n1-689qacsi -U admin

psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)
  1. Update-database
cp1-req update-database testdb '{
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] },
      { "name": "n3", "host_ids": ["host-3"] }
    ]
  }
}'
  1. Confirming data sync , replication slots and insert data into
cp-psql -i testdb-n3-ant97dj4 -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test ORDER BY id;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
(2 rows)

testdb=# SELECT slot_name, failover FROM pg_replication_slots;
        slot_name        | failover 
-------------------------+----------
 spk_testdb_n3_sub_n3_n2 | f
 spk_testdb_n3_sub_n3_n1 | f
(2 rows)

testdb=# SELECT * FROM pg_replication_origin;
 roident |         roname          
---------+-------------------------
       1 | spk_testdb_n2_sub_n2_n3
       2 | spk_testdb_n1_sub_n1_n3
(2 rows)

testdb=# INSERT INTO test VALUES (3, 'from-n3');
INSERT 0 1

cp-psql -i testdb-n1-689qacsi -U admin                                     
  
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |  data   
----+---------
  1 | from-n1
  2 | from-n2
  3 | from-n3
(3 rows)

  1. Update-database with n1 2 hosts
cp1-req update-database testdb '{              
  "spec": {
    "database_name": "testdb",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-4"] },
      { "name": "n2", "host_ids": ["host-2"] },
      { "name": "n3", "host_ids": ["host-3"] }
    ]
  }
}'
  1. switchover-database-name
cp1-req switchover-database-node testdb n1 '{}'
HTTP/1.1 200 OK
Content-Length: 215
Content-Type: application/json
Date: Thu, 14 May 2026 18:01:04 GMT

{
  task: {
    created_at: "2026-05-14T18:01:04Z"
    database_id: "testdb"
    entity_id: "testdb"
    node_name: "n1"
    scope: "database"
    status: "pending"
    task_id: "019e27a6-363c-7687-826f-10ef082903e9"
    type: "switchover"
  }
}

cp-psql -i testdb-n1-2kqnsx2a -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT slot_name, active, failover FROM pg_replication_slots;
        slot_name        | active | failover 
-------------------------+--------+----------
 spk_testdb_n1_sub_n1_n2 | t      | f
 testdb_n1_689qacsi      | t      | f
 spk_testdb_n1_sub_n1_n3 | t      | f
(3 rows)

testdb=# SELECT slot_name, active, failover FROM pg_replication_slots;
        slot_name        | active | failover 
-------------------------+--------+----------
 spk_testdb_n1_sub_n1_n2 | t      | f
 testdb_n1_689qacsi      | t      | f
 spk_testdb_n1_sub_n1_n3 | t      | f
(3 rows)

testdb=# SELECT sub_name, sub_enabled FROM spock.subscription;
 sub_name  | sub_enabled 
-----------+-------------
 sub_n2_n1 | t
 sub_n3_n1 | t
(2 rows)

testdb=# INSERT INTO test VALUES (10, 'post-switchover');
INSERT 0 1

  1. Confriming data sync n2 & n3
cp-psql -i testdb-n2-9ptayhma -U admin
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |      data       
----+-----------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
(4 rows)

cp-psql -i testdb-n3-ant97dj4 -U admin

psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test;            
 id |      data       
----+-----------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
(4 rows)
  1. Swtchover-database-node
cp1-req switchover-database-node testdb n1 '{"candidate_instance_id": "testdb-n1-689qacsi"}'
HTTP/1.1 200 OK
Content-Length: 250
Content-Type: application/json
Date: Thu, 14 May 2026 18:10:39 GMT

{
  task: {
    created_at: "2026-05-14T18:10:39Z"
    database_id: "testdb"
    entity_id: "testdb"
    instance_id: "testdb-n1-689qacsi"
    node_name: "n1"
    scope: "database"
    status: "pending"
    task_id: "019e27ae-fc68-7012-9fba-1c5f62f24402"
    type: "switchover"
  }
}

  1. After switchover-database-node inserting data
cp-psql -i testdb-n1-689qacsi -U admin
psql (18.4)
Type "help" for help.

testdb=# INSERT INTO test VALUES (30, 'final-validation');
INSERT 0 1
testdb=# select * from test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)
  1. Confirming data sync
cp-psql -i testdb-n2-9ptayhma -U admin
psql (18.4)
Type "help" for help.

testdb=# select * from test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)

cp-psql -i testdb-n3-ant97dj4 -U admin
psql (18.4)
Type "help" for help.

testdb=# SELECT * FROM test;
 id |       data       
----+------------------
  1 | from-n1
  2 | from-n2
  3 | from-n3
 10 | post-switchover
 30 | final-validation
(5 rows)

Checklist

  • Tests added and updated

- Add `PeerCatchupResource` to block add-node progress until the
  new node's replication slot catches up via `spock.progress.remote_lsn`,
  preventing data loss during failover after node addition
- Advance replication origin alongside slot during add-node to
  ensure consistent origin tracking on the subscriber
- Treat disabled/down subscriptions as transient in
  `wait_for_sync_event` to avoid false failures mid-add-node
- Bump Postgres image tags to Spock 5.0.8
- Add E2E tests covering add-node replication data safety
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds replication-origin advancement for safely adding new subscriber nodes to multi-node PostgreSQL/Spock clusters. The changes introduce two new resources (PeerCatchupResource and ReplicationOriginAdvanceResource), modify existing resources to output and handle advancement state, integrate them into the node-population orchestration, validate via updated tests, and bump the Spock container image to 5.0.8.

Changes

Replication Origin Advancement for Multi-Node Topologies

Layer / File(s) Summary
PostgreSQL replication origin helpers
server/internal/postgres/create_db.go
New functions EnsureReplicationOriginExists, AdvanceReplicationOrigin, and SpockProgressReachedLSN provide replication origin management and peer progress tracking on the provider side.
PeerCatchupResource implementation
server/internal/database/peer_catchup_resource.go
PeerCatchupResource polls until a peer node's apply progress reaches a sync event LSN using SpockProgressReachedLSN, with resource metadata, identifier helpers, and resource lifecycle methods.
ReplicationOriginAdvanceResource implementation
server/internal/database/replication_origin_advance_resource.go
ReplicationOriginAdvanceResource advances the replication origin on the subscriber node to the LSN computed by ReplicationSlotAdvanceFromCTSResource, ensuring apply resumes from the correct position.
Existing resource output and status handling
server/internal/database/replication_slot_advance_from_cts_resource.go, server/internal/database/wait_for_sync_event_resource.go
ReplicationSlotAdvanceFromCTSResource outputs AdvancedToLSN for subscriber consumption and ignores it during diffing. WaitForSyncEventResource treats disabled/down subscription states as transient and retryable.
Resource orchestration in PopulateNode
server/internal/database/operations/populate_nodes.go, server/internal/database/resources.go
PopulateNode wires PeerCatchupResourceIdentifier into sync dependencies. addPeerResources creates both PeerCatchupResource and ReplicationOriginAdvanceResource. RegisterResourceTypes registers both new resource types.
Unit and golden test expectations
server/internal/database/operations/populate_nodes_test.go, server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
Test expectations updated with PeerCatchupResource entries, PeerCatchupResourceIdentifier dependencies, and ReplicationOriginAdvanceResource replacing ReplicationSlotAdvanceFromCTSResource in multi-node scenarios.
End-to-end replication origin test
e2e/add_node_data_safety_test.go
New TestAddNodeOriginAdvanced verifies replication origin advancement after adding a subscriber node by provisioning a 3-node topology, writing data for WAL advancement, and asserting the new subscriber's origin progresses past zero-LSN.
Container image version update
server/internal/orchestrator/swarm/images.go
Spock container image tag updated to 5.0.8-standard-1 for PgEdge versions 16.13/5, 17.9/5, and 18.3/5.

🐰 New resources hop into place,
Peers catch up with grace and pace,
Origins advance with care,
Multi-node replication, fair!
Spock 5.0.8 leads the way,
Safer clusters every day. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main objective: hardening the add-node workflow for data safety.
Description check ✅ Passed The description follows the template with clear Summary, Changes, Testing, and Checklist sections; all required information is present and detailed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 14, 2026

Up to standards ✅

🟢 Issues 1 medium

Results:
1 new issue

Category Results
Complexity 1 medium

View in Codacy

🟢 Metrics 36 complexity · 0 duplication

Metric Results
Complexity 36
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/database/replication_slot_advance_from_cts_resource.go`:
- Around line 151-153: Reset r.AdvancedToLSN to its zero value at the start of
the Create method to avoid leaking a previous run's LSN when early returns
occur; locate the Create function on the ReplicationSlotAdvanceFromCTSResource
receiver and add a single line that clears r.AdvancedToLSN before any
processing/early-return logic so only a successful path sets r.AdvancedToLSN =
targetLSN.

In `@server/internal/database/wait_for_sync_event_resource.go`:
- Around line 104-107: The transient-status branch sleeps with
time.Sleep(pollInterval), which blocks cancellation; change it to a
context-aware wait by replacing the sleep with a select that waits on
time.After(pollInterval) and ctx.Done() (e.g., select { case
<-time.After(pollInterval): continue; case <-ctx.Done(): return ctx.Err() }),
referencing the branch handling postgres.SubStatusDisabled and
postgres.SubStatusDown and using the existing pollInterval and the function's
ctx to abort promptly on cancellation.

In `@server/internal/orchestrator/swarm/images.go`:
- Line 39: The Docker image tags referencing Spock 5.0.8 are invalid and will
break deployments; update the usages of PgEdgeImage and any other imageTag(cfg,
"...-spock5.0.8-standard-1") calls to either revert to the known-working tags
(e.g., the previous non-spock tag variants) or ensure you build and push the
three images (docker.io/pgedge/pgedge-postgres:16.13-spock5.0.8-standard-1,
:17.9-spock5.0.8-standard-1, :18.3-spock5.0.8-standard-1) before merging so
imageTag(cfg, "...") points to existing images.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dca34b4e-bbb1-4f28-871f-b276c368fcf5

📥 Commits

Reviewing files that changed from the base of the PR and between 5dc438d and ca0900c.

📒 Files selected for processing (11)
  • e2e/add_node_data_safety_test.go
  • server/internal/database/operations/golden_test/TestUpdateDatabase/two_nodes_to_three_nodes_with_populate.json
  • server/internal/database/operations/populate_nodes.go
  • server/internal/database/operations/populate_nodes_test.go
  • server/internal/database/peer_catchup_resource.go
  • server/internal/database/replication_origin_advance_resource.go
  • server/internal/database/replication_slot_advance_from_cts_resource.go
  • server/internal/database/resources.go
  • server/internal/database/wait_for_sync_event_resource.go
  • server/internal/orchestrator/swarm/images.go
  • server/internal/postgres/create_db.go

Comment thread server/internal/database/wait_for_sync_event_resource.go
Comment thread server/internal/orchestrator/swarm/images.go Outdated
@tsivaprasad tsivaprasad marked this pull request as draft May 14, 2026 13:28
@tsivaprasad tsivaprasad marked this pull request as ready for review May 15, 2026 11:41
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/internal/orchestrator/swarm/images.go (1)

78-78: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Default version points to non-existent Spock 5.0.8 image.

The default version now points to 18.4/5, which uses the spock5.0.8-standard-1 image that doesn't exist yet. This means any new deployments using the default version will fail until the images from postgres-images PR #20 are published. Consider keeping the default at 18.3/5 (which uses the existing spock5.0.6-standard-2 image) until the 5.0.8 images are verified in production.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/orchestrator/swarm/images.go` at line 78, The defaultVersion
assignment in versions.defaultVersion using ds.MustPgEdgeVersion("18.4", "5")
points to a non-existent Spock 5.0.8 image; change the call to use the verified
image tag by setting ds.MustPgEdgeVersion("18.3", "5") instead (update the
versions.defaultVersion initializer where ds.MustPgEdgeVersion is invoked) so
new deployments continue to use the existing spock5.0.6-standard-2 image until
the 5.0.8 images are published.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Around line 38-39: The code changes downgrade several image tags from
spock5.0.7 to spock5.0.6 (seen where imageTag(...) is used, e.g., PgEdgeImage),
which conflicts with the PR goal to update to Spock 5.0.8; either update these
imageTag invocations to the intended spock5.0.8-standard-<variant> (or restore
them to spock5.0.7 if that was the intent) across all occurrences, or if the
downgrade to spock5.0.6 is deliberate, add a short rationale to the PR
description documenting why 5.0.7/5.0.8 were not used; search for imageTag(...
"spock5.0.6-standard-2") and fix the tag strings or document the decision
accordingly (e.g., for PgEdgeImage and the other image variables that use
imageTag).

---

Duplicate comments:
In `@server/internal/orchestrator/swarm/images.go`:
- Line 78: The defaultVersion assignment in versions.defaultVersion using
ds.MustPgEdgeVersion("18.4", "5") points to a non-existent Spock 5.0.8 image;
change the call to use the verified image tag by setting
ds.MustPgEdgeVersion("18.3", "5") instead (update the versions.defaultVersion
initializer where ds.MustPgEdgeVersion is invoked) so new deployments continue
to use the existing spock5.0.6-standard-2 image until the 5.0.8 images are
published.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 857aa726-a07a-4979-ad33-103cc0fefa36

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee14f7 and 24c56a6.

📒 Files selected for processing (4)
  • e2e/add_node_data_safety_test.go
  • e2e/minor_version_upgrade_test.go
  • server/internal/database/peer_catchup_resource.go
  • server/internal/orchestrator/swarm/images.go
✅ Files skipped from review due to trivial changes (1)
  • e2e/minor_version_upgrade_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • e2e/add_node_data_safety_test.go
  • server/internal/database/peer_catchup_resource.go

Comment on lines +38 to +39
PgEdgeImage: imageTag(cfg, "16.13-spock5.0.6-standard-2"),
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify the downgrade from Spock 5.0.7 to 5.0.6 is intentional.

The PR objectives state "Update Postgres/Spock to Spock 5.0.8," but these lines downgrade existing versions (16.13, 17.9, 18.3) from spock5.0.7-standard-1 to spock5.0.6-standard-2. This appears inconsistent with the stated goal. If the downgrade is intentional (e.g., 5.0.7 had critical issues), please document the rationale in the PR description.

Also applies to: 55-56, 72-73

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/orchestrator/swarm/images.go` around lines 38 - 39, The code
changes downgrade several image tags from spock5.0.7 to spock5.0.6 (seen where
imageTag(...) is used, e.g., PgEdgeImage), which conflicts with the PR goal to
update to Spock 5.0.8; either update these imageTag invocations to the intended
spock5.0.8-standard-<variant> (or restore them to spock5.0.7 if that was the
intent) across all occurrences, or if the downgrade to spock5.0.6 is deliberate,
add a short rationale to the PR description documenting why 5.0.7/5.0.8 were not
used; search for imageTag(... "spock5.0.6-standard-2") and fix the tag strings
or document the decision accordingly (e.g., for PgEdgeImage and the other image
variables that use imageTag).

@tsivaprasad tsivaprasad merged commit 1f2c4e6 into main May 18, 2026
3 checks passed
@tsivaprasad tsivaprasad deleted the PLAT-520-harden-zodan-workflow-to-match-spock-data-loss-fixes branch May 18, 2026 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants