
fix(nats): stream replication, file storage for jobs, and connection resilience#2858

Open
nicacioliveira wants to merge 5 commits into main from fix/nats-stream-replication

Conversation


@nicacioliveira nicacioliveira commented Mar 25, 2026

Summary

  • All JetStream streams and KV buckets were created with num_replicas: 1, meaning data only lived on the NATS leader. With a 3-node cluster, a leader change would lose in-memory content. All streams and buckets now use num_replicas: 3.
  • AUTOMATION_JOBS was using Memory storage — jobs were lost during rolling deploys. Renamed to AUTOMATION_JOBS_FILE with StorageType.File, backed by the EFS volume already mounted on each NATS pod. The old stream stays on the cluster with no consumers and can be cleaned up when convenient: nats stream delete AUTOMATION_JOBS.
  • connect() had no options configured. Added timeout: 10s to fail fast on startup, reconnect backoff with 2s wait + 1s jitter to avoid thundering herd on pod restarts, and pingInterval/maxPingOut to detect zombie TCP connections early.
  • Added statement_timeout: 30s to the Postgres pool. Without it, a slow query could hold a connection indefinitely, exhaust the pool, and cause the readiness probe to fail.
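The replication and storage changes above can be sketched as a nats.js JetStream stream configuration. This is a minimal illustration, not code from the repo: the field names mirror nats.js `StreamConfig`, but the subject filter is a placeholder.

```typescript
// Illustrative stream config for the new file-backed jobs stream.
// Field names follow nats.js StreamConfig; the subject is hypothetical.
const automationJobsConfig = {
  name: "AUTOMATION_JOBS_FILE",
  subjects: ["automation.jobs.>"], // placeholder subject filter
  storage: "file",                 // StorageType.File — survives pod restarts
  num_replicas: 3,                 // replicate across all 3 NATS nodes
};

// KV buckets take the equivalent option when created, e.g. (sketch):
// const kv = await js.views.kv("some-bucket", { replicas: 3 });
```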

Expected improvements

  • No more job loss during deploys or NATS leader changes
  • NATS connection timeouts no longer hang pod startup
  • Slow queries no longer block the readiness probe and trigger pod restarts
  • NATS cluster data correctly replicated across all 3 nodes

Summary by cubic

Improves durability and startup reliability by replicating all JetStream data to 3 nodes and persisting automation jobs on file-backed storage. Adds resilient NATS connection settings and a 30s Postgres statement timeout to prevent job loss and startup hangs.

  • Bug Fixes

    • Set 3 replicas for all JetStream streams and KV buckets to avoid data loss on leader changes.
    • Persist automation jobs with StorageType.File and 3 replicas so jobs survive rolling deploys.
    • Harden the NATS connection: 10s connect timeout, infinite reconnect with 2s wait + 1s jitter, and ping checks (30s interval, 3 max).
    • Add Postgres statement_timeout=30s to stop runaway queries and keep readiness healthy.
  • Migration

    • Stream renamed to AUTOMATION_JOBS_FILE (consumer: automation-worker-file). The old AUTOMATION_JOBS stream remains unused and can be removed when convenient: nats stream delete AUTOMATION_JOBS.

Written for commit 8e29611. Summary will update on new commits.

decobot added 5 commits March 24, 2026 21:59
- timeout: 10s to fail fast on startup if NATS unreachable
- reconnect with 2s wait + 1s jitter to avoid thundering herd on pod restarts
- pingInterval/maxPingOut to detect dead TCP connections early
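
A sketch of what these connect() options look like with the nats.js client. The option names (`timeout`, `reconnectTimeWait`, `reconnectJitter`, `maxReconnectAttempts`, `pingInterval`, `maxPingOut`) are real nats.js `ConnectionOptions` fields; the server URL is a placeholder.

```typescript
// Illustrative nats.js connection options matching the commit description.
const natsOptions = {
  servers: "nats://nats.example.internal:4222", // placeholder URL
  timeout: 10_000,          // fail fast if NATS is unreachable at startup
  reconnectTimeWait: 2_000, // base wait between reconnect attempts
  reconnectJitter: 1_000,   // jitter so pods don't reconnect in lockstep
  maxReconnectAttempts: -1, // retry forever after the initial connect
  pingInterval: 30_000,     // probe the TCP connection every 30s
  maxPingOut: 3,            // treat 3 unanswered pings as a dead connection
};

// Usage (sketch): const nc = await connect(natsOptions);
```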

fix(db): add statement_timeout 30s to prevent runaway queries from
exhausting the connection pool and blocking the readiness probe
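
The pool change can be sketched as follows, assuming node-postgres (`pg`), which accepts `statement_timeout` (in milliseconds) in the pool config and applies it to each connection. Connection details here are placeholders.

```typescript
// Illustrative pg pool config with the 30s statement timeout.
const poolConfig = {
  connectionString: "postgres://localhost:5432/app", // placeholder
  statement_timeout: 30_000, // abort any single statement running longer than 30s
};

// Usage (sketch):
// import { Pool } from "pg";
// const pool = new Pool(poolConfig);
```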
Switches AUTOMATION_JOBS from StorageType.Memory to StorageType.File so
jobs survive pod restarts during rolling deploys. Handles the storage-type
migration automatically on startup by detecting the mismatch and recreating
the stream — safe because the previous stream was in-memory with no durable data.
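
The detect-and-recreate step described above might look like this. The decision itself is a trivial pure helper; the surrounding `jsm.streams.info/delete/add` calls are real nats.js `JetStreamManager` methods, sketched in comments since they need a live cluster.

```typescript
// Decide whether a stream must be recreated because its storage type
// differs from the desired one. Safe here only because the old stream
// was in-memory, so there is no durable data to preserve.
function needsRecreate(existingStorage: string, desiredStorage: string): boolean {
  return existingStorage !== desiredStorage;
}

// Startup migration (sketch, assuming a nats.js JetStreamManager `jsm`):
// const info = await jsm.streams.info("AUTOMATION_JOBS");
// if (needsRecreate(info.config.storage, "file")) {
//   await jsm.streams.delete("AUTOMATION_JOBS");
//   await jsm.streams.add({ ...desiredConfig, storage: "file" });
// }
```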
…rage

Renames the stream and consumer to avoid conflict with the existing
Memory stream in production. The old AUTOMATION_JOBS stream stops
receiving messages and expires naturally on the NATS cluster.
@github-actions

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction | Action
👍 | Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions

Release Options

Suggested: Patch (2.202.5) — based on fix: prefix

React with an emoji to override the release type:

Reaction | Type | Next Version
👍 | Prerelease | 2.202.5-alpha.1
🎉 | Patch | 2.202.5
❤️ | Minor | 2.203.0
🚀 | Major | 3.0.0

Current version: 2.202.4

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: minor).


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/nats/connection.ts">

<violation number="1" location="apps/mesh/src/nats/connection.ts:42">
P2: Avoid hard fail-fast behavior for NATS startup here; this timeout makes transient NATS unavailability block/abort startup instead of allowing degraded mode with later onReady recovery.

(Based on your team's feedback about treating NATS as a soft dependency during startup.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

nc = await connect({
servers: url,
// Fail fast on startup if NATS is unreachable
timeout: 10_000,

@cubic-dev-ai cubic-dev-ai bot Mar 25, 2026


P2: Avoid hard fail-fast behavior for NATS startup here; this timeout makes transient NATS unavailability block/abort startup instead of allowing degraded mode with later onReady recovery.

(Based on your team's feedback about treating NATS as a soft dependency during startup.)



<file context>
@@ -36,7 +36,20 @@ export function createNatsConnectionProvider(): NatsConnectionProvider {
+      nc = await connect({
+        servers: url,
+        // Fail fast on startup if NATS is unreachable
+        timeout: 10_000,
+        // After initial connect, reconnect automatically with backoff + jitter
+        // so pods don't all hammer NATS simultaneously after a restart
</file context>

