
fix(nats): stream replication, file storage for jobs, and connection resilience#2858

Open
nicacioliveira wants to merge 5 commits into main from fix/nats-stream-replication

Conversation


@nicacioliveira nicacioliveira commented Mar 25, 2026

Summary

  • All JetStream streams and KV buckets were created with num_replicas: 1, meaning data only lived on the NATS leader. With a 3-node cluster, a leader change would lose in-memory content. All streams and buckets now use num_replicas: 3.
  • AUTOMATION_JOBS was using Memory storage — jobs were lost during rolling deploys. Renamed to AUTOMATION_JOBS_FILE with StorageType.File, backed by the EFS volume already mounted on each NATS pod. The old stream stays on the cluster with no consumers and can be cleaned up when convenient: nats stream delete AUTOMATION_JOBS.
  • connect() had no options configured. Added timeout: 10s to fail fast on startup, reconnect backoff with 2s wait + 1s jitter to avoid thundering herd on pod restarts, and pingInterval/maxPingOut to detect zombie TCP connections early.
  • Added statement_timeout: 30s to the Postgres pool. Without it, a slow query could hold a connection indefinitely, exhaust the pool, and cause the readiness probe to fail.
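The replication and storage changes above can be sketched as a nats.js JetStream stream configuration. This is a minimal illustration, not code from the repo: the field names mirror nats.js `StreamConfig`, but the subject filter is a placeholder.

```typescript
// Illustrative stream config for the new file-backed jobs stream.
// Field names follow nats.js StreamConfig; the subject is hypothetical.
const automationJobsConfig = {
  name: "AUTOMATION_JOBS_FILE",
  subjects: ["automation.jobs.>"], // placeholder subject filter
  storage: "file",                 // StorageType.File — survives pod restarts
  num_replicas: 3,                 // replicate across all 3 NATS nodes
};

// KV buckets take the equivalent option when created, e.g. (sketch):
// const kv = await js.views.kv("some-bucket", { replicas: 3 });
```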

Expected improvements

  • No more job loss during deploys or NATS leader changes
  • NATS connection timeouts no longer hang pod startup
  • Slow queries no longer block the readiness probe and trigger pod restarts
  • NATS cluster data correctly replicated across all 3 nodes

Summary by cubic

Improves durability and startup reliability by replicating all JetStream data to 3 nodes and persisting automation jobs on file-backed storage. Adds resilient NATS connection settings and a 30s Postgres statement timeout to prevent job loss and startup hangs.

  • Bug Fixes

    • Set 3 replicas for all JetStream streams and KV buckets to avoid data loss on leader changes.
    • Persist automation jobs with StorageType.File and 3 replicas so jobs survive rolling deploys.
    • Harden the NATS connection: 10s connect timeout, infinite reconnect with 2s wait + 1s jitter, and ping checks (30s interval, 3 max).
    • Add Postgres statement_timeout=30s to stop runaway queries and keep readiness healthy.
  • Migration

    • Stream renamed to AUTOMATION_JOBS_FILE (consumer: automation-worker-file). The old AUTOMATION_JOBS stream remains unused and can be removed when convenient: nats stream delete AUTOMATION_JOBS.

Written for commit 8e29611. Summary will update on new commits.

decobot added 5 commits March 24, 2026 21:59
- timeout: 10s to fail fast on startup if NATS unreachable
- reconnect with 2s wait + 1s jitter to avoid thundering herd on pod restarts
- pingInterval/maxPingOut to detect dead TCP connections early
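
A sketch of what these connect() options look like with the nats.js client. The option names (`timeout`, `reconnectTimeWait`, `reconnectJitter`, `maxReconnectAttempts`, `pingInterval`, `maxPingOut`) are real nats.js `ConnectionOptions` fields; the server URL is a placeholder.

```typescript
// Illustrative nats.js connection options matching the commit description.
const natsOptions = {
  servers: "nats://nats.example.internal:4222", // placeholder URL
  timeout: 10_000,          // fail fast if NATS is unreachable at startup
  reconnectTimeWait: 2_000, // base wait between reconnect attempts
  reconnectJitter: 1_000,   // jitter so pods don't reconnect in lockstep
  maxReconnectAttempts: -1, // retry forever after the initial connect
  pingInterval: 30_000,     // probe the TCP connection every 30s
  maxPingOut: 3,            // treat 3 unanswered pings as a dead connection
};

// Usage (sketch): const nc = await connect(natsOptions);
```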

fix(db): add statement_timeout 30s to prevent runaway queries from
exhausting the connection pool and blocking the readiness probe
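
The pool change can be sketched as follows, assuming node-postgres (`pg`), which accepts `statement_timeout` (in milliseconds) in the pool config and applies it to each connection. Connection details here are placeholders.

```typescript
// Illustrative pg pool config with the 30s statement timeout.
const poolConfig = {
  connectionString: "postgres://localhost:5432/app", // placeholder
  statement_timeout: 30_000, // abort any single statement running longer than 30s
};

// Usage (sketch):
// import { Pool } from "pg";
// const pool = new Pool(poolConfig);
```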
Switches AUTOMATION_JOBS from StorageType.Memory to StorageType.File so
jobs survive pod restarts during rolling deploys. Handles the storage-type
migration automatically on startup by detecting the mismatch and recreating
the stream — safe because the previous stream was in-memory with no durable data.
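
The detect-and-recreate step described above might look like this. The decision itself is a trivial pure helper; the surrounding `jsm.streams.info/delete/add` calls are real nats.js `JetStreamManager` methods, sketched in comments since they need a live cluster.

```typescript
// Decide whether a stream must be recreated because its storage type
// differs from the desired one. Safe here only because the old stream
// was in-memory, so there is no durable data to preserve.
function needsRecreate(existingStorage: string, desiredStorage: string): boolean {
  return existingStorage !== desiredStorage;
}

// Startup migration (sketch, assuming a nats.js JetStreamManager `jsm`):
// const info = await jsm.streams.info("AUTOMATION_JOBS");
// if (needsRecreate(info.config.storage, "file")) {
//   await jsm.streams.delete("AUTOMATION_JOBS");
//   await jsm.streams.add({ ...desiredConfig, storage: "file" });
// }
```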
…rage

Renames the stream and consumer to avoid conflict with the existing
Memory stream in production. The old AUTOMATION_JOBS stream stops
receiving messages and expires naturally on the NATS cluster.
@github-actions

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions

Release Options

Suggested: Patch (2.202.5) — based on fix: prefix

React with an emoji to override the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.202.5-alpha.1
🎉 | Patch | 2.202.5
❤️ | Minor | 2.203.0
🚀 | Major | 3.0.0

Current version: 2.202.4

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: minor).


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/nats/connection.ts">

<violation number="1" location="apps/mesh/src/nats/connection.ts:42">
P2: Avoid hard fail-fast behavior for NATS startup here; this timeout makes transient NATS unavailability block/abort startup instead of allowing degraded mode with later onReady recovery.

(Based on your team's feedback about treating NATS as a soft dependency during startup.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

nc = await connect({
servers: url,
// Fail fast on startup if NATS is unreachable
timeout: 10_000,

@cubic-dev-ai cubic-dev-ai bot Mar 25, 2026


P2: Avoid hard fail-fast behavior for NATS startup here; this timeout makes transient NATS unavailability block/abort startup instead of allowing degraded mode with later onReady recovery.

(Based on your team's feedback about treating NATS as a soft dependency during startup.)



<file context>
@@ -36,7 +36,20 @@ export function createNatsConnectionProvider(): NatsConnectionProvider {
+      nc = await connect({
+        servers: url,
+        // Fail fast on startup if NATS is unreachable
+        timeout: 10_000,
+        // After initial connect, reconnect automatically with backoff + jitter
+        // so pods don't all hammer NATS simultaneously after a restart
</file context>

