feat: update spec when db versions change #377
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
📝 Walkthrough
Adds a leader-backed periodic databases monitor that reconciles instance Postgres/Spock versions into stored instance records and specs, introduces PgEdge version parsing/normalization and Parse/MustParse APIs, wires monitoring/election and logging, extends storage txn comparisons, and includes tests, docs, changelog entries, and installer/test infra tweaks.

Database Version Reconciliation & Monitoring
🚥 Pre-merge checks: ✅ 4 passed · ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Up to standards ✅

| Category | Results |
|---|---|
| 🟢 Issues | |
| 🟢 Complexity | 3 medium |
| 🟢 Metrics | 66 complexity · 3 duplication |
Actionable comments posted: 3
🧹 Nitpick comments (1)
clustertest/utils_test.go (1)
163-176: ⚡ Quick win
dockerCmd: add a context/timeout and surface captured output on failure

Two independent issues:
1. **No timeout** — `exec.Command` has no deadline. A hung Docker daemon or a blocking `docker service update` will stall the entire test suite indefinitely. Compare with `buildImage`, which wraps its `exec.Command` in a 300 s `CommandContext`. The test receives `t testing.TB`, so a `t.Context()` or an explicit background timeout can be threaded through.
2. **Docker output dropped on failure** — stderr/stdout are captured into `out`, but `require.NoError(t, err)` only surfaces the OS error string (e.g., `"exit status 1"`). The actual Docker error message is silently discarded, making CI failures much harder to diagnose.

🔧 Proposed fix
```diff
-func dockerCmd(t testing.TB, args ...string) string {
+func dockerCmd(t testing.TB, ctx context.Context, args ...string) string {
 	t.Helper()
 	tLogf(t, "executing command: docker %s", strings.Join(args, " "))
 	var out strings.Builder
-	cmd := exec.Command("docker", args...)
+	cmd := exec.CommandContext(ctx, "docker", args...)
 	cmd.Stdout = &out
 	cmd.Stderr = &out
 	err := cmd.Run()
-	require.NoError(t, err)
+	require.NoError(t, err, "docker %s output:\n%s", strings.Join(args, " "), out.String())
 	return strings.TrimSpace(out.String())
 }
```

Callers can pass `t.Context()` for a test-lifetime bound, or a tighter `context.WithTimeout(context.Background(), 60*time.Second)` for specific operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/utils_test.go` around lines 163 - 176, The dockerCmd function lacks a context/timeout and discards Docker output on failure; update dockerCmd to accept/use a context (use t.Context() when available or wrap with context.WithTimeout(ctx, 60*time.Second)) and run exec.CommandContext instead of exec.Command, keep capturing stdout/stderr into the out buffer, and when the command returns an error include the trimmed out.String() in the test failure (e.g., via require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is surfaced; reference the dockerCmd function and model the pattern after buildImage which uses CommandContext and a 300s timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 140-142: The tLogf call in the error handling for database
deletion uses the fmt.Errorf-specific %w verb which tLogf (a printf-style
wrapper around t.Logf) doesn't support; update the format string in the error
log that references databaseID and err (the call using tLogf) to use %v for the
error value instead of %w so the error message prints correctly (i.e., change
the format argument in the tLogf call that logs "failed to delete database '%s':
..." to use %v for err).
In `@lima/roles/install_prerequisites/tasks/main.yaml`:
- Around line 18-33: The two yum_repository tasks named "Add pgEdge old
repository" and "Add pgEdge noarch old repository" disable package signature
verification by setting gpgcheck: no and hardcode OS major as 9 in baseurl;
update these tasks to (1) avoid gpgcheck: no — instead set repo_gpgcheck: no if
you only want to skip repo metadata but keep package RPM signature verification,
or keep gpgcheck and add sslverify: yes to assert HTTPS validation, and (2) make
the baseurl dynamic by replacing the literal 9 with
{{ansible_facts['distribution_major_version']}} so the repository URL matches
the host OS major version. Ensure the updated fields are applied in the tasks
with those task names.
In `@server/internal/monitor/provide.go`:
- Around line 60-61: The code unconditionally creates an election candidate (via
electionSvc.NewCandidate) which opens an etcd watch and is never stopped if
DatabasesMonitorIntervalSeconds == 0 because DatabasesMonitor is not created and
candidate.Start/Stop are never called; to fix, avoid creating the candidate when
the databases monitor is disabled by only constructing the candidate when
cfg.DatabasesMonitorIntervalSeconds > 0 (i.e., move the electionSvc.NewCandidate
call behind that conditional), or alternatively change NewCandidate to defer
starting/creating the watch until Candidate.Start is invoked and ensure
Candidate.Stop can safely close any resources even if Start was never called;
update provide.go to create/pass a nil/no-op candidate to NewService when the
monitor is disabled and reference electionSvc.NewCandidate, Candidate.Start,
Candidate.Stop, DatabasesMonitor and DatabasesMonitorIntervalSeconds in your
change.
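The first option the prompt describes (only constructing the candidate when the monitor is enabled) can be sketched as below. The `Candidate`, `electionService`, and `config` types are minimal stand-ins for the real provide.go wiring, whose actual signatures may differ:

```go
package main

import "fmt"

// Candidate stands in for the election candidate; in the real code,
// NewCandidate opens an etcd watch, which is why creating it unconditionally
// leaks when the monitor never starts it.
type Candidate struct{ started bool }

func (c *Candidate) Start() { c.started = true }
func (c *Candidate) Stop()  { c.started = false }

type electionService struct{}

func (electionService) NewCandidate() *Candidate { return &Candidate{} }

type config struct{ DatabasesMonitorIntervalSeconds int }

// newDatabasesCandidate constructs the candidate only when the databases
// monitor is enabled; a nil result acts as a no-op for NewService.
func newDatabasesCandidate(cfg config, svc electionService) *Candidate {
	if cfg.DatabasesMonitorIntervalSeconds <= 0 {
		return nil // monitor disabled: no watch is ever opened
	}
	return svc.NewCandidate()
}

func main() {
	fmt.Println(newDatabasesCandidate(config{DatabasesMonitorIntervalSeconds: 0}, electionService{}) == nil)
	fmt.Println(newDatabasesCandidate(config{DatabasesMonitorIntervalSeconds: 30}, electionService{}) != nil)
}
```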
---
Nitpick comments:
In `@clustertest/utils_test.go`:
- Around line 163-176: The dockerCmd function lacks a context/timeout and
discards Docker output on failure; update dockerCmd to accept/use a context (use
t.Context() when available or wrap with context.WithTimeout(ctx,
60*time.Second)) and run exec.CommandContext instead of exec.Command, keep
capturing stdout/stderr into the out buffer, and when the command returns an
error include the trimmed out.String() in the test failure (e.g., via
require.NoErrorf/require.Failf or t.Fatalf) so the Docker error message is
surfaced; reference the dockerCmd function and model the pattern after
buildImage which uses CommandContext and a 300s timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: da1e0e34-a22e-484a-9dde-4138a08808f5
📒 Files selected for processing (32)
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_store.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
Force-pushed from 2742e61 to 2ba5138
Actionable comments posted: 5
🧹 Nitpick comments (2)
e2e/cancel_task_test.go (1)
54-58: ⚡ Quick win
Replace fixed sleep with condition-based waiting to reduce flakiness
Line 57 introduces a hard-coded delay that can still race in CI and makes the test slower than needed. Prefer waiting until the task reaches a cancelable state (or until `CancelDatabaseTask` succeeds) with a bounded timeout.

Proposed test-stability refactor
```diff
-	// TODO: even after many attempts at fixing this, rapidly cancelling a task
-	// still occasionally causes problems. This sleep is a workaround until
-	// we're able to track down the issue.
-	time.Sleep(500 * time.Millisecond)
+	// Wait until cancellation becomes reliably accepted instead of sleeping a fixed duration.
+	require.Eventually(t, func() bool {
+		_, err := fixture.Client.CancelDatabaseTask(t.Context(), &controlplane.CancelDatabaseTaskPayload{
+			DatabaseID: database.ID,
+			TaskID:     controlplane.Identifier(creation_task.TaskID),
+		})
+		return err == nil
+	}, 10*time.Second, 200*time.Millisecond, "task did not become cancelable in time")
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e/cancel_task_test.go` around lines 54 - 58, Replace the fixed time.Sleep with a bounded, condition-based wait: inside the test (cancel_task_test) poll in a loop with a short backoff until either CancelDatabaseTask returns success or the task's status indicates it is cancellable (use the existing CancelDatabaseTask and the task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus), breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail the test on timeout. Ensure the loop uses a small sleep (10–50ms) between attempts to avoid busy-waiting and that the timeout is enforced via context.WithTimeout or time.After to keep the test bounded.

server/internal/database/node_resource.go (1)
86-95: 💤 Low value
Optional: extract the comparison into a tiny helper for readability.
Not blocking — the loop body now mixes "skip" filtering with "is-newer" tracking. A small helper (e.g., `isMoreRecent(prevID string, prevAt, curAt time.Time) bool`) or a comment-anchored `latestPrimary` struct would make intent obvious if this pattern grows (e.g., once the version-reconciliation monitor lands and similar timestamp-driven picks appear elsewhere).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/node_resource.go` around lines 86 - 95, Extract the "is newer" logic into a small helper to improve readability: create a helper like isMoreRecent(prevID string, prevAt, curAt time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method) and replace the inline condition in the loop that compares instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for empty PrimaryInstanceID and call the helper to decide whether to assign the current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Around line 264-265: The call to waitForTaskComplete (waiting on
updateResp.Task.TaskID) assigns err but never checks it; add an assertion
immediately after that call to fail the test if err is non-nil (e.g., use
t.Fatalf/t.Fatalf-like or require.NoError) so a timeout or failure is not
ignored before the subsequent assertions that depend on the task completing.
Ensure the check references the err returned from waitForTaskComplete(ctx,
cluster.Client(), databaseID, updateResp.Task.TaskID, 3*time.Minute).
- Around line 134-145: Replace the unbounded ctx := context.Background() used
for the cleanup DeleteDatabase call with a bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), <reasonable duration>); defer
cancel()) so cluster.Client().DeleteDatabase(...) cannot hang indefinitely;
update the DeleteDatabase call site that uses ctx and add the necessary time
import, referencing the ctx variable and the DeleteDatabase method and
databaseID/testConfig.skipCleanup flow.
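The bounded-context cleanup the prompt describes can be sketched as follows; the `deleter` interface, `fakeClient`, and the 2-minute duration are illustrative stand-ins for the real clustertest client and whatever timeout the author considers reasonable:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleter mirrors the shape of cluster.Client().DeleteDatabase for this sketch.
type deleter interface {
	DeleteDatabase(ctx context.Context, databaseID string) error
}

// cleanupDatabase bounds the cleanup call so a hung API cannot stall teardown.
func cleanupDatabase(d deleter, databaseID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return d.DeleteDatabase(ctx, databaseID)
}

type fakeClient struct{}

func (fakeClient) DeleteDatabase(ctx context.Context, id string) error {
	deadline, ok := ctx.Deadline()
	// The context passed down really carries a deadline.
	fmt.Println(ok, time.Until(deadline) > 0)
	return nil
}

func main() {
	_ = cleanupDatabase(fakeClient{}, "db-123")
}
```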
In `@docs/installation/systemd.md`:
- Around line 248-250: Replace the fenced code blocks inside the numbered steps
with indented code blocks (prefix each command line with four spaces) so
markdownlint MD046 no longer flags nested fenced blocks; specifically update the
snippet containing "sudo dnf upgrade pgedge-postgresql18" and the other
occurrences referenced (around the blocks at the next two locations) to use
indented code formatting instead of ``` fences.
- Around line 28-30: Change the phrase "Minor versions upgrades" to "Minor
version upgrades" in the limitations paragraph (the sentence currently beginning
"Database upgrades are not supported via the API. Minor versions upgrades can be
performed manually..."); update that sentence to read "Minor version upgrades
can be performed manually by a system administrator." to fix the grammar.
- Around line 268-269: The sentence hardcodes "30 seconds" even though that
delay is controlled by the configuration key databases_monitor_interval_seconds;
update the wording in docs/installation/systemd.md to reference the configured
value instead of a hardcoded time (e.g., say "up to the configured
databases_monitor_interval_seconds (defaults to 30s)" or "up to the configured
databases_monitor_interval_seconds value"), and ensure the doc mentions where to
change that setting so readers know it is configurable.
---
Nitpick comments:
In `@e2e/cancel_task_test.go`:
- Around line 54-58: Replace the fixed time.Sleep with a bounded,
condition-based wait: inside the test (cancel_task_test) poll in a loop with a
short backoff until either CancelDatabaseTask returns success or the task's
status indicates it is cancellable (use the existing CancelDatabaseTask and the
task-status getter used elsewhere, e.g., GetDatabaseTask/GetTaskStatus),
breaking on success or when a context timeout (e.g., 2–5s) elapses; return/fail
the test on timeout. Ensure the loop uses a small sleep (10–50ms) between
attempts to avoid busy-waiting and that the timeout is enforced via
context.WithTimeout or time.After to keep the test bounded.
In `@server/internal/database/node_resource.go`:
- Around line 86-95: Extract the "is newer" logic into a small helper to improve
readability: create a helper like isMoreRecent(prevID string, prevAt, curAt
time.Time) bool (or a tiny latestPrimary struct with an UpdateIfNewer method)
and replace the inline condition in the loop that compares
instance.PrimaryInstanceIDUpdatedAt with primaryInstanceUpdatedAt and sets
primaryInstanceID/primaryInstanceUpdatedAt; keep the existing skip/continue for
empty PrimaryInstanceID and call the helper to decide whether to assign the
current instance's PrimaryInstanceID and PrimaryInstanceIDUpdatedAt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 746ee34d-9c87-4823-8c88-e174405bfca8
📒 Files selected for processing (36)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- e2e/cancel_task_test.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_resource.go
- server/internal/database/instance_store.go
- server/internal/database/node_resource.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
✅ Files skipped from review due to trivial changes (7)
- server/internal/storage/interface.go
- changes/unreleased/Changed-20260504-152348.yaml
- changes/unreleased/Added-20260504-150235.yaml
- server/internal/workflows/plan_update.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/host/host_test.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
🚧 Files skipped from review as they are similar to previous changes (15)
- server/internal/orchestrator/swarm/images.go
- server/internal/database/instance_store.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- clustertest/host_test.go
- server/internal/database/provide.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/database/reconcile_versions.go
- server/internal/database/instance.go
- server/internal/database/service.go
- server/internal/monitor/host_monitor.go
- server/internal/database/spec.go
- server/internal/database/reconcile_versions_test.go
- server/internal/ds/versions.go
Force-pushed from 2ba5138 to 25094c9
Actionable comments posted: 1
🧹 Nitpick comments (1)
clustertest/external_upgrade_test.go (1)
168-168: ⚡ Quick win
Fixed `time.Sleep` calls are fragile after async Docker service updates.
`docker service update` is asynchronous — the new task may still be starting when `sleepDuration` expires. Within 5 s the test requires: the container to restart with the new image, the instance monitor to report the new version, and the databases monitor (interval = 3 s) to reconcile the spec. In a slow CI environment this is a tight race that can produce intermittent failures.
`require.Eventually` with a generous deadline and a short poll interval would remove the dependency on a fixed sleep and give clear assertion failure messages when the deadline expires:

```go
require.Eventually(t, func() bool {
	db, err = cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{
		DatabaseID: databaseID,
	})
	if err != nil {
		return false
	}
	// inline the version checks; return false on mismatch
	return specAndInstanceVersionsMatch(db, ...)
}, 30*time.Second, 500*time.Millisecond)
```

Apply the same pattern at Lines 168, 198, 225, and 270.
Also applies to: 198-198, 225-225, 270-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/external_upgrade_test.go` at line 168, Replace the fragile time.Sleep(sleepDuration) usages with require.Eventually polling: for each occurrence (the calls using sleepDuration at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and specAndInstanceVersionsMatch), call require.Eventually with a generous timeout (e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID: databaseID}), return false on error or version mismatches, and true only when specAndInstanceVersionsMatch(...) and instance version checks pass; apply this replacement to the four locations that currently use time.Sleep (the ones around lines referencing sleepDuration, databaseID, GetDatabase and specAndInstanceVersionsMatch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 55: The assertion uses require.Equal with arguments reversed; change the
call so the expected value client.InstanceStateAvailable is the second-to-last
parameter and the runtime value instance.State is the last (i.e., call
require.Equal(t, client.InstanceStateAvailable, instance.State)) so failures
show "expected" as the known constant and "actual" as the observed
instance.State.
---
Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: Replace the fragile time.Sleep(sleepDuration) usages with
require.Eventually polling: for each occurrence (the calls using sleepDuration
at the sites referencing databaseID, ctx, cluster.Client().GetDatabase and
specAndInstanceVersionsMatch), call require.Eventually with a generous timeout
(e.g. 30s) and short tick (e.g. 500ms) and inside the predicate perform the
cluster.Client().GetDatabase(ctx, &controlplane.GetDatabasePayload{DatabaseID:
databaseID}), return false on error or version mismatches, and true only when
specAndInstanceVersionsMatch(...) and instance version checks pass; apply this
replacement to the four locations that currently use time.Sleep (the ones around
lines referencing sleepDuration, databaseID, GetDatabase and
specAndInstanceVersionsMatch).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a4038ff5-5bcb-44e5-a2bc-3448c7337c6d
📒 Files selected for processing (9)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- lima/roles/install_prerequisites/tasks/main.yaml
Force-pushed from 25094c9 to e23e4e4
Actionable comments posted: 2
🧹 Nitpick comments (3)
server/internal/database/reconcile_versions.go (2)
99-148: 💤 Low value
Document/contain the in-place mutation of inputs.

`reconcileInstanceVersions` mutates `instance.PgEdgeVersion` on the caller-supplied slice elements (line 141), and `reconcileNodeVersions` mutates `node.PostgresVersion` and uses the same `spec` pointer for `updatedSpec` (line 194). The caller `ReconcileAllDatabaseVersions` then passes those same objects to `s.store.*.Update(...)`, which works, but the public `ReconcileVersions` signature suggests it returns new updated objects. A future caller (or the unit test, where `tc.spec` is shared across re-runs if subtests were retried) could be surprised.

Either rename/document the contract ("mutates inputs in place; returned values are aliases of inputs") or clone before mutating.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/reconcile_versions.go` around lines 99 - 148, reconcileInstanceVersions and reconcileNodeVersions currently mutate caller-supplied objects in place (e.g., setting instance.PgEdgeVersion in reconcileInstanceVersions and modifying node.PostgresVersion in reconcileNodeVersions), yet ReconcileVersions' signature implies it returns new updated objects; to fix, clone the inputs before changing them: deep-copy the incoming instances slice elements in reconcileInstanceVersions (create new StoredInstance values) and clone the StoredSpec (and its Node entries) inside reconcileNodeVersions (or at the start of ReconcileVersions) so mutations apply to the clones and not the caller's objects, then return those cloned/updated values from ReconcileVersions; update references to instance, node, and spec in the functions (reconcileInstanceVersions, reconcileNodeVersions, ReconcileVersions) to operate on the clones.
60-71: 💤 Low value
N+1 storage read inside the per-instance loop.

For each updated instance we issue a separate `InstanceSpec.GetByKey` call. For databases with many instances all reporting a new version simultaneously (e.g., a fleet-wide systemd upgrade), this serializes O(N) etcd round-trips before the txn commit. Consider fetching all instance specs for the database once (via a `GetByDatabaseID`-style API, mirroring the instance/status fetches above) and indexing locally by `InstanceID`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/internal/database/reconcile_versions.go` around lines 60 - 71, N+1 read: the loop calls s.store.InstanceSpec.GetByKey per instance causing many etcd round-trips; instead add or use a bulk fetch (e.g., GetByDatabaseID) to load all InstanceSpec rows for the database once, build a map indexed by InstanceID, then inside the loop replace the GetByKey call with a map lookup and only append s.store.InstanceSpec.Update(instanceSpec) when the lookup yields a spec to modify; update reconcile_versions.go to call the new/existing s.store.InstanceSpec.GetByDatabaseID(ctx, instance.DatabaseID) (or equivalent), index results by InstanceID, and use that map when setting instanceSpec.Spec.PgEdgeVersion and preparing ops, falling back to storage.ErrNotFound handling only for unexpected misses.

clustertest/external_upgrade_test.go (1)
168-168: ⚡ Quick win
Test timing may be flaky relative to monitor intervals.

`sleepDuration` is 5s and `PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS=3`, but the default `InstanceMonitorRefreshInterval` is 5s. After triggering an external upgrade, a 5-second sleep may not reliably allow (a) the per-instance status monitor to observe the new version, and (b) the leader-elected databases monitor to subsequently reconcile and persist the spec update. Consider increasing `sleepDuration` (e.g., 8–10s) or polling with a deadline on the version assertions to reduce flakiness in CI.

Also applies to: 198-198, 225-225, 270-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clustertest/external_upgrade_test.go` at line 168, The 5s sleep using sleepDuration is flaky vs monitor intervals (PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS and InstanceMonitorRefreshInterval); update the test to either increase sleepDuration to ~8–10s or, preferably, replace the fixed time.Sleep(sleepDuration) calls with a polling loop that checks the instance/cluster version with a deadline (context with timeout) and retries until the expected version is observed, updating all occurrences of sleepDuration used at the points referenced (the three additional occurrences) so the per-instance status monitor and leader-elected databases monitor have time to observe and persist the spec change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@clustertest/external_upgrade_test.go`:
- Line 138: Update the log message in the test helper call tLogf so it reads
"skipping cleanup for database '%s'" instead of "skipping cleaning for database
'%s'"; locate the tLogf invocation in external_upgrade_test.go (the call that
currently logs databaseID) and change the string literal accordingly.
In `@server/internal/database/reconcile_versions.go`:
- Around line 169-210: reconcileNodeVersions currently only marks updatedSpec
when a Postgres version mismatch is found, so Spock-only drifts are ignored;
modify the function to track Spock changes separately (e.g., introduce a boolean
spockChanged) and set updatedSpec = spec when either a Postgres change or a
consistent Spock change is detected; keep using commonSpockVersion and
spockMatches to detect a consistent Spock value across observed nodes, and after
the loop if spockChanged && spockMatches && commonSpockVersion != "" ensure
updatedSpec is set (if not already), call NormalizePostgresVersions() if needed,
and assign updatedSpec.SpockVersion = commonSpockVersion.
---
Nitpick comments:
In `@clustertest/external_upgrade_test.go`:
- Line 168: The 5s sleep using sleepDuration is flaky vs monitor intervals
(PGEDGE_DATABASES_MONITOR_INTERVAL_SECONDS and InstanceMonitorRefreshInterval);
update the test to either increase sleepDuration to ~8–10s or, preferably,
replace the fixed time.Sleep(sleepDuration) calls with a polling loop that
checks the instance/cluster version with a deadline (context with timeout) and
retries until the expected version is observed, updating all occurrences of
sleepDuration used at the points referenced (the three additional occurrences)
so the per-instance status monitor and leader-elected databases monitor have
time to observe and persist the spec change.
In `@server/internal/database/reconcile_versions.go`:
- Around line 99-148: reconcileInstanceVersions and reconcileNodeVersions
currently mutate caller-supplied objects in place (e.g., setting
instance.PgEdgeVersion in reconcileInstanceVersions and modifying
node.PostgresVersion in reconcileNodeVersions), yet ReconcileVersions' signature
implies it returns new updated objects; to fix, clone the inputs before changing
them: deep-copy the incoming instances slice elements in
reconcileInstanceVersions (create new StoredInstance values) and clone the
StoredSpec (and its Node entries) inside reconcileNodeVersions (or at the start
of ReconcileVersions) so mutations apply to the clones and not the caller's
objects, then return those cloned/updated values from ReconcileVersions; update
references to instance, node, and spec in the functions
(reconcileInstanceVersions, reconcileNodeVersions, ReconcileVersions) to operate
on the clones.
- Around lines 60-71: N+1 read — the loop calls `s.store.InstanceSpec.GetByKey` per instance, causing many etcd round-trips. Instead, add or use a bulk fetch (e.g., `GetByDatabaseID`) to load all `InstanceSpec` rows for the database once, build a map indexed by `InstanceID`, and replace the `GetByKey` call inside the loop with a map lookup, appending `s.store.InstanceSpec.Update(instanceSpec)` only when the lookup yields a spec to modify. Update reconcile_versions.go to call the new or existing `s.store.InstanceSpec.GetByDatabaseID(ctx, instance.DatabaseID)` (or equivalent), index the results by `InstanceID`, and use that map when setting `instanceSpec.Spec.PgEdgeVersion` and preparing ops, falling back to `storage.ErrNotFound` handling only for unexpected misses.
📒 Files selected for processing (36)
- Makefile
- changes/unreleased/Added-20260504-150235.yaml
- changes/unreleased/Changed-20260504-152348.yaml
- clustertest/external_upgrade_test.go
- clustertest/host_test.go
- clustertest/utils_test.go
- docs/installation/configuration.md
- docs/installation/systemd.md
- e2e/cancel_task_test.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/config/config.go
- server/internal/database/instance.go
- server/internal/database/instance_resource.go
- server/internal/database/instance_store.go
- server/internal/database/node_resource.go
- server/internal/database/provide.go
- server/internal/database/reconcile_versions.go
- server/internal/database/reconcile_versions_test.go
- server/internal/database/service.go
- server/internal/database/spec.go
- server/internal/database/status.go
- server/internal/ds/versions.go
- server/internal/ds/versions_test.go
- server/internal/host/host_test.go
- server/internal/logging/factory.go
- server/internal/monitor/databases_monitor.go
- server/internal/monitor/host_monitor.go
- server/internal/monitor/instance_monitor.go
- server/internal/monitor/provide.go
- server/internal/monitor/service.go
- server/internal/orchestrator/common/patroni_config_generator_test.go
- server/internal/orchestrator/swarm/images.go
- server/internal/orchestrator/swarm/rag_instance_resources_test.go
- server/internal/storage/interface.go
- server/internal/storage/txn.go
- server/internal/workflows/plan_update.go
💤 Files with no reviewable changes (1)
- server/internal/database/status.go
✅ Files skipped from review due to trivial changes (4)
- changes/unreleased/Added-20260504-150235.yaml
- e2e/cancel_task_test.go
- changes/unreleased/Changed-20260504-152348.yaml
- server/internal/orchestrator/common/patroni_config_generator_test.go
🚧 Files skipped from review as they are similar to previous changes (20)
- clustertest/host_test.go
- server/internal/workflows/plan_update.go
- server/internal/logging/factory.go
- Makefile
- server/internal/storage/interface.go
- server/internal/orchestrator/swarm/images.go
- server/internal/monitor/provide.go
- server/internal/database/node_resource.go
- server/internal/config/config.go
- lima/roles/install_prerequisites/tasks/main.yaml
- server/internal/host/host_test.go
- server/internal/database/spec.go
- server/internal/ds/versions_test.go
- clustertest/utils_test.go
- server/internal/monitor/service.go
- server/internal/database/service.go
- server/internal/database/provide.go
- server/internal/ds/versions.go
- server/internal/monitor/databases_monitor.go
- server/internal/database/instance_store.go
Force-pushed from e23e4e4 to d2aa2c0
Instances are updated sequentially and switchovers can happen during that sequence. We want to use the most recently fetched primary instance ID in the node resource to account for those switchovers. This change is backward compatible and does not need any migration. PLAT-512
This test still causes issues occasionally due to how quickly it cancels the task after starting it. This sleep is a workaround until we're able to find and fix the underlying issue. PLAT-512
Improves the ds.PgEdgeVersion initializers by:
- Renaming `NewPgEdgeVersion`, which parses a `PgEdgeVersion` from strings, to `ParsePgEdgeVersion`.
- Adding a method to normalize `PgEdgeVersion`s so that the Postgres version contains major and minor components and the Spock version contains just a major component. This matches the way that versions are currently provided in our API.

PLAT-512
Adds an `AddConditions` method to `storage.Txn` to make it possible to add additional constraints to a transaction. This enables us to enforce other constraints that are outside of the values we're updating. PLAT-512
Modifies the existing host monitor to also update the host record periodically. This accounts for system changes that happen while the Control Plane is running, such as system package updates. PLAT-512
PLAT-512
Updates the existing instance monitor to query and record the Postgres and Spock versions from replica instances. We'll need this to determine when a node has been updated by an external process, such as a system package update. PLAT-512
Adds a new database service method to reconcile the Postgres and Spock versions that we observe through the instance monitors with the versions in the database spec. The domain logic of this process is split into a standalone function with unit tests to enforce its behavior. The service method uses a transaction with additional constraints to handle conflicts with any database operations that run while we're performing this check. In summary: when an instance is updated, we update its corresponding `Instance` record in our database. If all of the instances for a node have been updated, we update the node's entry in the database spec. If all of the nodes have been updated, we normalize the database spec by setting the top-level version fields and zeroing out the node-level version fields. PLAT-512
Adds a 'databases' monitor to periodically run the database version reconciliation process. The monitor's interval is configurable and can be set to '0' to disable it entirely. This setting is useful in testing and in clusters with a large number of databases where the operator knows that the database versions will never change outside of the Control Plane API. PLAT-512
Adds a cluster test to verify the behavior of the new database version reconciliation process. PLAT-512
pgEdge only makes the latest package versions available in the `release` repositories. Older versions of packages are only available in the `release_old` repositories. This commit adds the `release_old` repositories to the `dev-lima` environment so that we can install older versions of Postgres and other components in this environment. PLAT-512
Adds the new configuration setting and logging component to the configuration guide. PLAT-512
Updates the systemd document to describe the Postgres minor version upgrade process. PLAT-512
Adds changelog entries for db version reconciliation and the instance monitor change. PLAT-512
Force-pushed from d2aa2c0 to 8c65137
Summary
A database spec can fall out of sync with its running instances when a user upgrades packages outside of our system. We expect this to happen in systemd clusters from typical system package upgrade workflows. So, we need a way to update the database spec when we observe version changes in the running instances.
This PR adds a new monitor that loops over all databases and reconciles the observed versions with those in the instance record and the spec. This monitor uses the existing election system to ensure that only one host runs the reconciliation process.
Changes
This PR is split into granular commits. I highly recommend reviewing the commits separately rather than looking at the entire diff at once. The most notable of the commits are:
- fix: use most recently fetched primary instance id
- feat: update host in host monitor
- feat: record versions for replica instances
- feat: add service method to reconcile db versions
- feat: add databases monitor
- docs: minor postgres version upgrades for systemd

Testing
I've added a new cluster test that exercises this feature. You can also test this by hand using the instructions in this section.
Using the dev-lima environment
Using the compose environment
PLAT-512