feat: update spec when db versions change #377
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
📝 Walkthrough
Adds a leader-backed periodic databases monitor that reconciles instance Postgres/Spock versions into stored instance records and specs, introduces PgEdge version parsing/normalization and Parse/MustParse APIs, wires monitoring/election and logging, extends storage txn comparisons, and includes tests, docs, changelog entries, and installer/test infra tweaks.

Database Version Reconciliation & Monitoring
🚥 Pre-merge checks: ✅ 4 passed · ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Up to standards ✅

| Category | Results |
|---|---|
| 🟢 Issues | |
| 🟢 Complexity | 3 medium |
| 🟢 Metrics | 66 complexity · 3 duplication |
Actionable comments posted: 3
🧹 Nitpick comments (1)
clustertest/utils_test.go (1)
163-176: ⚡ Quick win
dockerCmd: add a context/timeout and surface captured output on failure

Two independent issues:
1. **No timeout** — `exec.Command` has no deadline. A hung Docker daemon or a blocking `docker service update` will stall the entire test suite indefinitely. Compare with `buildImage`, which wraps its `exec.Command` in a 300 s `CommandContext`. The test receives `t testing.TB`, so a `t.Context()` or an explicit background timeout can be threaded through.
2. **Docker output dropped on failure** — stderr/stdout are captured into `out`, but `require.NoError(t, err)` only surfaces the OS error string (e.g., `"exit status 1"`). The actual Docker error message is silently discarded, making CI failures much harder to diagnose.

🔧 Proposed fix
```diff
-func dockerCmd(t testing.TB, args ...string) string {
+func dockerCmd(t testing.TB, ctx context.Context, args ...string) string {
 	t.Helper()
 	tLogf(t, "executing command: docker %s", strings.Join(args, " "))
 	var out strings.Builder
-	cmd := exec.Command("docker", args...)
+	cmd := exec.CommandContext(ctx, "docker", args...)
 	cmd.Stdout = &out
 	cmd.Stderr = &out
 	err := cmd.Run()
-	require.NoError(t, err)
+	require.NoError(t, err, "docker %s output:\n%s", strings.Join(args, " "), out.String())
 	return strings.TrimSpace(out.String())
 }
```

Callers can pass `t.Context()` for a test-lifetime bound, or a tighter `context.WithTimeout(context.Background(), 60*time.Second)` for specific operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/utils_test.go` around lines 163 - 176, The dockerCmd function lacks a context/timeout and discards Docker output on failure; update dockerCmd to accept/use a context (use t.Context() when available or wrap with context.WithTimeout(ctx, 60*time.Second)) and run exec.CommandContext instead of exec.Command, keep capturing stdout/stderr into the out buffer, and when the command returns an error include the trimmed out.String() in the test failure (e.g., via require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is surfaced; reference the dockerCmd function and model the pattern after buildImage which uses CommandContext and a 300s timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 140-142: The tLogf call in the error handling for database
deletion uses the fmt.Errorf-specific %w verb which tLogf (a printf-style
wrapper around t.Logf) doesn't support; update the format string in the error
log that references databaseID and err (the call using tLogf) to use %v for the
error value instead of %w so the error message prints correctly (i.e., change
the format argument in the tLogf call that logs "failed to delete database '%s':
..." to use %v for err).
In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 18-33: The two yum_repository tasks named "Add pgEdge old
repository" and "Add pgEdge noarch old repository" disable package signature
verification by setting gpgcheck: no and hardcode OS major as 9 in baseurl;
update these tasks to (1) avoid gpgcheck: no — instead set repo_gpgcheck: no if
you only want to skip repo metadata but keep package RPM signature verification,
or keep gpgcheck and add sslverify: yes to assert HTTPS validation, and (2) make
the baseurl dynamic by replacing the literal 9 with
{{ansible_facts['distribution_major_version']}} so the repository URL matches
the host OS major version. Ensure the updated fields are applied in the tasks
with those task names.
In `@server/internal/monitor/provide.go`:
- Around line 60-61: The code unconditionally creates an election candidate (via
electionSvc.NewCandidate) which opens an etcd watch and is never stopped if
DatabasesMonitorIntervalSeconds == 0 because DatabasesMonitor is not created and
candidate.Start/Stop are never called; to fix, avoid creating the candidate when
the databases monitor is disabled by only constructing the candidate when
cfg.DatabasesMonitorIntervalSeconds > 0 (i.e., move the electionSvc.NewCandidate
call behind that conditional), or alternatively change NewCandidate to defer
starting/creating the watch until Candidate.Start is invoked and ensure
Candidate.Stop can safely close any resources even if Start was never called;
update provide.go to create/pass a nil/no-op candidate to NewService when the
monitor is disabled and reference electionSvc.NewCandidate, Candidate.Start,
Candidate.Stop, DatabasesMonitor and DatabasesMonitorIntervalSeconds in your
change.
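The first option the prompt describes (only constructing the candidate when the monitor is enabled) can be sketched as below. The `Candidate`, `electionService`, and `config` types are minimal stand-ins for the real provide.go wiring, whose actual signatures may differ:

```go
package main

import "fmt"

// Candidate stands in for the election candidate; in the real code,
// NewCandidate opens an etcd watch, which is why creating it unconditionally
// leaks when the monitor never starts it.
type Candidate struct{ started bool }

func (c *Candidate) Start() { c.started = true }
func (c *Candidate) Stop()  { c.started = false }

type electionService struct{}

func (electionService) NewCandidate() *Candidate { return &Candidate{} }

type config struct{ DatabasesMonitorIntervalSeconds int }

// newDatabasesCandidate constructs the candidate only when the databases
// monitor is enabled; a nil result acts as a no-op for NewService.
func newDatabasesCandidate(cfg config, svc electionService) *Candidate {
	if cfg.DatabasesMonitorIntervalSeconds <= 0 {
		return nil // monitor disabled: no watch is ever opened
	}
	return svc.NewCandidate()
}

func main() {
	fmt.Println(newDatabasesCandidate(config{DatabasesMonitorIntervalSeconds: 0}, electionService{}) == nil)
	fmt.Println(newDatabasesCandidate(config{DatabasesMonitorIntervalSeconds: 30}, electionService{}) != nil)
}
```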
---
Nitpick comments:
In `@clustertest/utils_test.go`:
- Around line 163-176: The dockerCmd function lacks a context/timeout and
discards Docker output on failure; update dockerCmd to accept/use a context (use
t.Context() when available or wrap with context.WithTimeout(ctx,
60*time.Second)) and run exec.CommandContext instead of exec.Command, keep
capturing stdout/stderr into the out buffer, and when the command returns an
error include the trimmed out.String() in the test failure (e.g., via
require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is
surfaced; reference the dockerCmd function and model the pattern after
buildImage which uses CommandContext and a 300s timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: da1e0e34-a22e-484a-9dde-4138a08808f5
📒 Files selected for processing (32)
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_store.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
Force-pushed from 2742e61 to 2ba5138
Actionable comments posted: 5
🧹 Nitpick comments (2)
e2e/cancel_task_test.go (1)
54-58: ⚡ Quick win
Replace fixed sleep with condition-based waiting to reduce flakiness
Line 57 introduces a hard-coded delay that can still race in CI and makes the test slower than needed. Prefer waiting until the task reaches a cancelable state (or until `CancelDatabaseTask` succeeds) with a bounded timeout.

Proposed test-stability refactor
```diff
-	// TODO: even after many attempts at fixing this, rapidly cancelling a task
-	// still occasionally causes problems. This sleep is a workaround until
-	// we're able to track down the issue.
-	time.Sleep(500 * time.Millisecond)
+	// Wait until cancellation becomes reliably accepted instead of sleeping a fixed duration.
+	require.Eventually(t, func() bool {
+		_, err := fixture.Client.CancelDatabaseTask(t.Context(), &controlplane.CancelDatabaseTaskPayload{
+			DatabaseID: database.ID,
+			TaskID:     controlplane.Identifier(creation_task.TaskID),
+		})
+		return err == nil
+	}, 10*time.Second, 200*time.Millisecond, "task did not become cancelable in time")
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/cancel_task_test.go` around lines 54 - 58, Replace the fixed time.Sleep with a bounded, condition-based wait: inside the test (cancel_task_test) poll in a loop with a short backoff until either CancelDatabaseTask returns success or the task's status indicates it is cancellable (use the existing CancelDatabaseTask and the task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus), breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail the test on timeout. Ensure the loop uses a small sleep (10–50ms) between attempts to avoid busy-waiting and that the timeout is enforced via context.WithTimeout or time.After to keep the test bounded.

server/internal/database/node_resource.go (1)
86-95: 💤 Low value
Optional: extract the comparison into a tiny helper for readability.
Not blocking — the loop body now mixes "skip" filtering with "is-newer" tracking. A small helper (e.g., `isMoreRecent(prevID string, prevAt, curAt time.Time) bool`) or a comment-anchored `latestPrimary` struct would make intent obvious if this pattern grows (e.g., once the version-reconciliation monitor lands and similar timestamp-driven picks appear elsewhere).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/node_resource.go` around lines 86 - 95, Extract the "is newer" logic into a small helper to improve readability: create a helper like isMoreRecent(prevID string, prevAt, curAt time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method) and replace the inline condition in the loop that compares instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for empty PrimaryInstanceID and call the helper to decide whether to assign the current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 264-265: The call to waitForTaskComplete (waiting on
updateResp.Task.TaskID) assigns err but never checks it; add an assertion
immediately after that call to fail the test if err is non-nil (e.g., use
t.Fatalf/t.Fatalf-like or require.NoError) so a timeout or failure is not
ignored before the subsequent assertions that depend on the task completing.
Ensure the check references the err returned from waitForTaskComplete(ctx,
cluster.Client(), databaseID, updateResp.Task.TaskID, 3*time.Minute).
- Around line 134-145: Replace the unbounded ctx := context.Background() used
for the cleanup DeleteDatabase call with a bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>); defer
cancel()) so cluster.Client().DeleteDatabase(...) cannot hang indefinitely;
update the DeleteDatabase call site that uses ctx and add the necessary time
import, referencing the ctx variable and the DeleteDatabase method and
databaseID/testConfig.skipCleanup flow.
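The bounded-context cleanup the prompt describes can be sketched as follows; the `deleter` interface, `fakeClient`, and the 2-minute duration are illustrative stand-ins for the real clustertest client and whatever timeout the author considers reasonable:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleter mirrors the shape of cluster.Client().DeleteDatabase for this sketch.
type deleter interface {
	DeleteDatabase(ctx context.Context, databaseID string) error
}

// cleanupDatabase bounds the cleanup call so a hung API cannot stall teardown.
func cleanupDatabase(d deleter, databaseID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return d.DeleteDatabase(ctx, databaseID)
}

type fakeClient struct{}

func (fakeClient) DeleteDatabase(ctx context.Context, id string) error {
	deadline, ok := ctx.Deadline()
	// The context passed down really carries a deadline.
	fmt.Println(ok, time.Until(deadline) > 0)
	return nil
}

func main() {
	_ = cleanupDatabase(fakeClient{}, "db-123")
}
```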
In `@docs/installation/systemd.md`:
- Around line 248-250: Replace the fenced code blocks inside the numbered steps
with indented code blocks (prefix each command line with four spaces) so
markdownlint MD046 no longer flags nested fenced blocks; specifically update the
snippet containing "sudo dnf upgrade pgedge-postgresql18" and the other
occurrences referenced (around the blocks at the next two locations) to use
indented code formatting instead of ``` fences.
- Around line 28-30: Change the phrase "Minor versions upgrades" to "Minor
version upgrades" in the limitations paragraph (the sentence currently beginning
"Database upgrades are not supported via the API. Minor versions upgrades can be
performed manually..."); update that sentence to read "Minor version upgrades
can be performed manually by a system administrator." to fix the grammar.
- Around line 268-269: The sentence hardcodes "30 seconds" even though that
delay is controlled by the configuration key databases_monitor_interval_seconds;
update the wording in docs/installation/systemd.md to reference the configured
value instead of a hardcoded time (e.g., say "up to the configured
databases_monitor_interval_seconds (defaults to 30s)" or "up to the configured
databases_monitor_interval_seconds value"), and ensure the doc mentions where to
change that setting so readers know it is configurable.
---
Nitpick comments:
In `@e2e/cancel_task_test.go`:
- Around line 54-58: Replace the fixed time.Sleep with a bounded,
condition-based wait: inside the test (cancel_task_test) poll in a loop with a
short backoff until either CancelDatabaseTask returns success or the task's
status indicates it is cancellable (use the existing CancelDatabaseTask and the
task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus),
breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail
the test on timeout. Ensure the loop uses a small sleep (10–50ms) between
attempts to avoid busy-waiting and that the timeout is enforced via
context.WithTimeout or time.After to keep the test bounded.
In `@server/internal/database/node_resource.go`:
- Around line 86-95: Extract the "is newer" logic into a small helper to improve
readability: create a helper like isMoreRecent(prevID string, prevAt, curAt
time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method)
and replace the inline condition in the loop that compares
instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets
primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for
empty PrimaryInstanceID and call the helper to decide whether to assign the
current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 746ee34d-9c87-4823-8c88-e174405bfca8
📒 Files selected for processing (36)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- e2e/cancel_task_test.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_resource.go
- server/internal/database/instance_store.go
- server/internal/database/node_resource.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
✅ Files skipped from review due to trivial changes (7)
- server/internal/storage/interface.go
- changes/unreleased/Changed-20260504-152348.yaml
- changes/unreleased/Added-20260504-150235.yaml
- server/internal/workflows/plan_update.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/host/host_test.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
🚧 Files skipped from review as they are similar to previous changes (15)
- server/internal/orchestrator/swarm/images.go
- server/internal/database/instance_store.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- clustertest/host_test.go
- server/internal/database/provide.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/database/reconcile_versions.go
- server/internal/database/instance.go
- server/internal/database/service.go
- server/internal/monitor/host_monitor.go
- server/internal/database/spec.go
- server/internal/database/reconcile_versions_test.go
- server/internal/ds/versions.go
Force-pushed from 2ba5138 to 25094c9
Actionable comments posted: 1
🧹 Nitpick comments (1)
clustertest/external_upgrade_test.go (1)
168-168: ⚡ Quick win
Fixed `time.Sleep` calls are fragile after async Docker service updates.
`docker service update` is asynchronous — the new task may still be starting when `sleepDuration` expires. Within 5 s the test requires: the container to restart with the new image, the instance monitor to report the new version, and the databases monitor (interval = 3 s) to reconcile the spec. In a slow CI environment this is a tight race that can produce intermittent failures.
`require.Eventually` with a generous deadline and a short poll interval would remove the dependency on a fixed sleep and give clear assertion failure messages when the deadline expires:

```go
require.Eventually(t, func() bool {
	db, err = cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{
		DatabaseID: databaseID,
	})
	if err != nil {
		return false
	}
	// inline the version checks; return false on mismatch
	return specAndInstanceVersionsMatch(db, ...)
}, 30*time.Second, 500*time.Millisecond)
```

Apply the same pattern at Lines 168, 198, 225, and 270.
Also applies to: 198-198, 225-225, 270-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/external_upgrade_test.go` at line 168, Replace the fragile time.Sleep(sleepDuration) usages with require.Eventually polling: for each occurrence (the calls using sleepDuration at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and specAndInstanceVersionsMatch), call require.Eventually with a generous timeout (e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID: databaseID}), return false on error or version mismatches, and true only when specAndInstanceVersionsMatch(...) and instance version checks pass; apply this replacement to the four locations that currently use time.Sleep (the ones around lines referencing sleepDuration, databaseID, GetDatabase and specAndInstanceVersionsMatch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 55: The assertion uses require.Equal with arguments reversed; change the
call so the expected value client.InstanceStateAvailable is the second-to-last
parameter and the runtime value instance.State is the last (i.e., call
require.Equal(t, client.InstanceStateAvailable, instance.State)) so failures
show "expected" as the known constant and "actual" as the observed
instance.State.
---
Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: Replace the fragile time.Sleep(sleepDuration) usages with
require.Eventually polling: for each occurrence (the calls using sleepDuration
at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and
specAndInstanceVersionsMatch), call require.Eventually with a generous timeout
(e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the
cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID:
databaseID}), return false on error or version mismatches, and true only when
specAndInstanceVersionsMatch(...) and instance version checks pass; apply this
replacement to the four locations that currently use time.Sleep (the ones around
lines referencing sleepDuration, databaseID, GetDatabase and
specAndInstanceVersionsMatch).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a4038ff5-5bcb-44e5-a2bc-3448c7337c6d
📒 Files selected for processing (9)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
Force-pushed from 25094c9 to e23e4e4
Actionable comments posted: 2
🧹 Nitpick comments (3)
server/internal/database/reconcile_versions.go (2)
99-148: 💤 Low value
Document/contain the in-place mutation of inputs.

`reconcileInstanceVersions` mutates `instance.PgEdgeVersion` on the caller-supplied slice elements (line 141), and `reconcileNodeVersions` mutates `node.PostgresVersion` and uses the same `spec` pointer for `updatedSpec` (line 194). The caller `ReconcileAllDatabaseVersions` then passes those same objects to `s.store.*.Update(...)`, which works, but the public `ReconcileVersions` signature suggests it returns new updated objects. A future caller (or the unit test, where `tc.spec` is shared across re-runs if subtests were retried) could be surprised.

Either rename/document the contract ("mutates inputs in place; returned values are aliases of inputs") or clone before mutating.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/reconcile_versions.go` around lines 99 - 148, reconcileInstanceVersions and reconcileNodeVersions currently mutate caller-supplied objects in place (e.g., setting instance.PgEdgeVersion in reconcileInstanceVersions and modifying node.PostgresVersion in reconcileNodeVersions), yet ReconcileVersions' signature implies it returns new updated objects; to fix, clone the inputs before changing them: deep-copy the incoming instances slice elements in reconcileInstanceVersions (create new StoredInstance values) and clone the StoredSpec (and its Node entries) inside reconcileNodeVersions (or at the start of ReconcileVersions) so mutations apply to the clones and not the caller's objects, then return those cloned/updated values from ReconcileVersions; update references to instance, node, and spec in the functions (reconcileInstanceVersions, reconcileNodeVersions, ReconcileVersions) to operate on the clones.
60-71: 💤 Low value
N+1 storage read inside the per-instance loop.

For each updated instance we issue a separate `InstanceSpec.GetByKey` call. For databases with many instances all reporting a new version simultaneously (e.g., a fleet-wide systemd upgrade), this serializes O(N) etcd round-trips before the txn commit. Consider fetching all instance specs for the database once (via a `GetByDatabaseID`-style API, mirroring the instance/status fetches above) and indexing locally by `InstanceID`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/reconcile_versions.go` around lines 60 - 71, N+1 read: the loop calls s.store.InstanceSpec.GetByKey per instance causing many etcd round-trips; instead add or use a bulk fetch (e.g., GetByDatabaseID) to load all InstanceSpec rows for the database once, build a map indexed by InstanceID, then inside the loop replace the GetByKey call with a map lookup and only append s.store.InstanceSpec.Update(instanceSpec) when the lookup yields a spec to modify; update reconcile_versions.go to call the new/existing s.store.InstanceSpec.GetByDatabaseID(ctx, instance.DatabaseID) (or equivalent), index results by InstanceID, and use that map when setting instanceSpec.Spec.PgEdgeVersion and preparing ops, falling back to storage.ErrNotFound handling only for unexpected misses.

clustertest/external_upgrade_test.go (1)
168-168: ⚡ Quick win
Test timing may be flaky relative to monitor intervals.

`sleepDuration` is 5s and `PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS=3`, but the default `InstanceMonitorRefreshInterval` is 5s. After triggering an external upgrade, a 5-second sleep may not reliably allow (a) the per-instance status monitor to observe the new version, and (b) the leader-elected databases monitor to subsequently reconcile and persist the spec update. Consider increasing `sleepDuration` (e.g., 8–10s) or polling with a deadline on the version assertions to reduce flakiness in CI.

Also applies to: 198-198, 225-225, 270-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/external_upgrade_test.go` at line 168, The 5s sleep using sleepDuration is flaky vs monitor intervals (PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS and InstanceMonitorRefreshInterval); update the test to either increase sleepDuration to ~8–10s or, preferably, replace the fixed time.Sleep(sleepDuration) calls with a polling loop that checks the instance/cluster version with a deadline (context with timeout) and retries until the expected version is observed, updating all occurrences of sleepDuration used at the points referenced (the three additional occurrences) so the per-instance status monitor and leader-elected databases monitor have time to observe and persist the spec change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 138: Update the log message in the test helper call tLogf so it reads
"skipping cleanup for database '%s'" instead of "skipping cleaning for database
'%s'"; locate the tLogf invocation in external_upgrade_test.go (the call that
currently logs databaseID) and change the string literal accordingly.
In `@server/internal/database/reconcile_versions.go`:
- Around line 169-210: reconcileNodeVersions currently only marks updatedSpec
when a Postgres version mismatch is found, so Spock-only drifts are ignored;
modify the function to track Spock changes separately (e.g., introduce a boolean
spockChanged) and set updatedSpec = spec when either a Postgres change or a
consistent Spock change is detected; keep using commonSpockVersion and
spockMatches to detect a consistent Spock value across observed nodes, and after
the loop if spockChanged && spockMatches && commonSpockVersion != "" ensure
updatedSpec is set (if not already), call NormalizePostgresVersions() if needed,
and assign updatedSpec.SpockVersion = commonSpockVersion.
---
Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: The 5s sleep using sleepDuration is flaky vs monitor intervals
(PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS and InstanceMonitorRefreshInterval);
update the test to either increase sleepDuration to ~8–10s or, preferably,
replace the fixed time.Sleep(sleepDuration) calls with a polling loop that
checks the instance/cluster version with a deadline (context with timeout) and
retries until the expected version is observed, updating all occurrences of
sleepDuration used at the points referenced (the three additional occurrences)
so the per-instance status monitor and leader-elected databases monitor have
time to observe and persist the spec change.
In `@server/internal/database/reconcile_versions.go`:
- Around line 99-148: reconcileInstanceVersions and reconcileNodeVersions
currently mutate caller-supplied objects in place (e.g., setting
instance.PgEdgeVersion in reconcileInstanceVersions and modifying
node.PostgresVersion in reconcileNodeVersions), yet ReconcileVersions' signature
implies it returns new updated objects; to fix, clone the inputs before changing
them: deep-copy the incoming instances slice elements in
reconcileInstanceVersions (create new StoredInstance values) and clone the
StoredSpec (and its Node entries) inside reconcileNodeVersions (or at the start
of ReconcileVersions) so mutations apply to the clones and not the caller's
objects, then return those cloned/updated values from ReconcileVersions; update
references to instance, node, and spec in the functions
(reconcileInstanceVersions, reconcileNodeVersions, ReconcileVersions) to operate
on the clones.
- Around lines 60-71: N+1 read — the loop calls `s.store.InstanceSpec.GetByKey` per instance, causing many etcd round-trips. Instead, add or use a bulk fetch (e.g., `GetByDatabaseID`) to load all `InstanceSpec` rows for the database once, build a map indexed by `InstanceID`, and replace the `GetByKey` call inside the loop with a map lookup, appending `s.store.InstanceSpec.Update(instanceSpec)` only when the lookup yields a spec to modify. Update reconcile_versions.go to call the new or existing `s.store.InstanceSpec.GetByDatabaseID(ctx, instance.DatabaseID)` (or equivalent), index the results by `InstanceID`, and use that map when setting `instanceSpec.Spec.PgEdgeVersion` and preparing ops, falling back to `storage.ErrNotFound` handling only for unexpected misses.
📒 Files selected for processing (36)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- e2e/cancel_task_test.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_resource.go
- server/internal/database/instance_store.go
- server/internal/database/node_resource.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
✅ Files skipped from review due to trivial changes (4)
- changes/unreleased/Added-20260504-150235.yaml
- e2e/cancel_task_test.go
- changes/unreleased/Changed-20260504-152348.yaml
- server/internal/orchestrator/common/patroni_config_generator_test.go
🚧 Files skipped from review as they are similar to previous changes (20)
- clustertest/host_test.go
- server/internal/workflows/plan_update.go
- server/internal/logging/factory.go
- Makefile
- server/internal/storage/interface.go
- server/internal/orchestrator/swarm/images.go
- server/internal/monitor/provide.go
- server/internal/database/node_resource.go
- server/internal/config/config.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/host/host_test.go
- server/internal/database/spec.go
- server/internal/ds/versions_test.go
- clustertest/utils_test.go
- server/internal/monitor/service.go
- server/internal/database/service.go
- server/internal/database/provide.go
- server/internal/ds/versions.go
- server/internal/monitor/databases_monitor.go
- server/internal/database/instance_store.go
Force-pushed from e23e4e4 to d2aa2c0
Instances are updated sequentially and switchovers can happen during that sequence. We want to use the most recently fetched primary instance ID in the node resource to account for those switchovers. This change is backward compatible and does not need any migration. PLAT-512
This test still causes issues occasionally due to how quickly it cancels the task after starting it. This sleep is a workaround until we're able to find and fix the underlying issue. PLAT-512
Improves the ds.PgEdgeVersion initializers by:
- Renaming `NewPgEdgeVersion`, which parses a `PgEdgeVersion` from strings, to `ParsePgEdgeVersion`.
- Adding a method to normalize `PgEdgeVersion`s so that the Postgres version contains major and minor components and the Spock version contains just a major component. This matches the way that versions are currently provided in our API.

PLAT-512
Adds an `AddConditions` method to `storage.Txn` to make it possible to add additional constraints to a transaction. This enables us to enforce other constraints that are outside of the values we're updating. PLAT-512
Modifies the existing host monitor to also update the host record periodically. This accounts for system changes that happen while the Control Plane is running, such as system package updates. PLAT-512
PLAT-512
Updates the existing instance monitor to query and record the Postgres and Spock versions from replica instances. We'll need this to determine when a node has been updated by an external process, such as a system package update. PLAT-512
Adds a new database service method to reconcile the Postgres and Spock versions that we observe through the instance monitors with the versions in the database spec. The domain logic of this process is split into a standalone function with unit tests to enforce its behavior. The service method uses a transaction with additional constraints to handle conflicts with any database operations that run while we're performing this check. In summary: when an instance is updated, we update its corresponding `Instance` record in our database. If all of the instances for a node have been updated, we update the node's entry in the database spec. If all of the nodes have been updated, we normalize the database spec by setting the top-level version fields and zeroing out the node-level version fields. PLAT-512
Adds a 'databases' monitor to periodically run the database version reconciliation process. The monitor's interval is configurable and can be set to '0' to disable it entirely. This setting is useful in testing and in clusters with a large number of databases where the operator knows that the database versions will never change outside of the Control Plane API. PLAT-512
Adds a cluster test to verify the behavior of the new database version reconciliation process. PLAT-512
pgEdge only makes the latest package versions available in the `release` repositories. Older versions of packages are only available in the `release_old` repositories. This commit adds the `release_old` repositories to the `dev-lima` environment so that we can install older versions of Postgres and other components in this environment. PLAT-512
Adds the new configuration setting and logging component to the configuration guide. PLAT-512
Updates the systemd document to describe the Postgres minor version upgrade process. PLAT-512
Adds changelog entries for db version reconciliation and the instance monitor change. PLAT-512
Force-pushed from d2aa2c0 to 8c65137
Summary
A database spec can fall out of sync with its running instances when a user upgrades packages outside of our system. We expect this to happen in systemd clusters from typical system package upgrade workflows. So, we need a way to update the database spec when we observe version changes in the running instances.
This PR adds a new monitor that loops over all databases and reconciles the observed versions with those in the instance record and the spec. This monitor uses the existing election system to ensure that only one host runs the reconciliation process.
Changes
This PR is split into granular commits. I highly recommend reviewing the commits separately rather than looking at the entire diff at once. The most notable of the commits are:
- fix: use most recently fetched primary instance id
- feat: update host in host monitor
- feat: record versions for replica instances
- feat: add service method to reconcile db versions
- feat: add databases monitor
- docs: minor postgres version upgrades for systemd

Testing
I've added a new cluster test that exercises this feature. You can also test this by hand using the instructions in this section.
Using the dev-lima environment
Using the compose environment
PLAT-512