fix(nats): stream replication, file storage for jobs, and connection resilience #2858

nicacioliveira wants to merge 5 commits into main
Conversation
fix(nats):
- timeout: 10s to fail fast on startup if NATS is unreachable
- reconnect with 2s wait + 1s jitter to avoid a thundering herd on pod restarts
- pingInterval/maxPingOut to detect dead TCP connections early

fix(db): add statement_timeout 30s to prevent runaway queries from exhausting the connection pool and blocking the readiness probe
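The settings above can be sketched as plain option objects. This is a minimal sketch, not the PR's code: the option names follow the nats.js `ConnectionOptions` and node-postgres config shapes, and both connection URLs are hypothetical placeholders.

```typescript
// Sketch of the NATS connection settings described in the commit message.
// Option names follow the nats.js ConnectionOptions API; the server URL
// is a hypothetical placeholder.
const natsOptions = {
  servers: "nats://nats.internal:4222", // hypothetical address
  timeout: 10_000,            // fail fast on startup if NATS is unreachable
  maxReconnectAttempts: -1,   // after the first connect, retry forever
  reconnectTimeWait: 2_000,   // base wait between reconnect attempts
  reconnectJitter: 1_000,     // random jitter to avoid a thundering herd
  pingInterval: 30_000,       // probe the TCP connection every 30s
  maxPingOut: 3,              // declare the connection dead after 3 missed pongs
};

// Sketch of the Postgres pool change: statement_timeout aborts any single
// query that runs longer than 30s, so one slow query can't hold a pool
// connection indefinitely. The connection string is hypothetical.
const pgPoolConfig = {
  connectionString: "postgres://app:secret@db.internal:5432/mesh", // hypothetical
  statement_timeout: 30_000, // milliseconds
};
```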
Switches AUTOMATION_JOBS from StorageType.Memory to StorageType.File so jobs survive pod restarts during rolling deploys. Handles the storage-type migration automatically on startup by detecting the mismatch and recreating the stream — safe because the previous stream was in-memory with no durable data.
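The mismatch-detection step can be sketched as a small pure decision function. This is a sketch under assumed names (`migrationAction` and its types are hypothetical, not code from the PR); it captures the rule that JetStream cannot switch a stream's storage type in place, so a mismatch forces delete-and-recreate.

```typescript
type StreamConfig = {
  name: string;
  storage: "memory" | "file";
  num_replicas: number;
};

// Hypothetical helper: decide what to do with an existing stream at startup.
// JetStream can't change a stream between memory and file storage in place,
// so a storage mismatch forces a recreate. Safe here because the old stream
// was in-memory with no durable data.
function migrationAction(
  existing: StreamConfig | null,
  desired: StreamConfig,
): "create" | "recreate" | "keep" {
  if (!existing) return "create";
  if (existing.storage !== desired.storage) return "recreate";
  return "keep";
}
```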
Renames the stream and consumer to avoid conflict with the existing Memory stream in production. The old AUTOMATION_JOBS stream stops receiving messages and expires naturally on the NATS cluster.
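The renamed stream and consumer can be sketched as JetStream config objects. A sketch only: field names follow JetStream's `StreamConfig`/`ConsumerConfig` shape, and the subject `automation.jobs.>` is an assumption, not taken from the PR.

```typescript
// Sketch of the new file-backed stream and its durable consumer.
// The subject space "automation.jobs.>" is hypothetical.
const streamConfig = {
  name: "AUTOMATION_JOBS_FILE",    // renamed to avoid clashing with the old Memory stream
  storage: "file" as const,        // survives pod restarts (backed by the mounted EFS volume)
  num_replicas: 3,                 // replicated across the 3-node cluster
  subjects: ["automation.jobs.>"], // hypothetical subject space
};

const consumerConfig = {
  durable_name: "automation-worker-file", // renamed alongside the stream
};
```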
1 issue found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/nats/connection.ts">
<violation number="1" location="apps/mesh/src/nats/connection.ts:42">
P2: Avoid hard fail-fast behavior for NATS startup here; this timeout makes transient NATS unavailability block/abort startup instead of allowing degraded mode with later onReady recovery.
(Based on your team's feedback about treating NATS as a soft dependency during startup.) [FEEDBACK_USED]</violation>
</file>
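The reviewer's suggested degraded mode could be sketched roughly as follows. This is a sketch of the suggestion, not the repo's code: `connectSoft` and its parameters are hypothetical names, and the retry policy is an assumption.

```typescript
// Sketch of treating NATS as a soft dependency: never block startup on the
// first connect; on failure, keep retrying in the background and invoke
// onReady once the connection succeeds. Names here are hypothetical.
async function connectSoft<T>(
  connectFn: () => Promise<T>,
  onReady: (conn: T) => void,
  retryDelayMs = 2_000,
): Promise<void> {
  try {
    onReady(await connectFn());
  } catch {
    // Startup continues in degraded mode; retry until NATS is reachable.
    setTimeout(() => void connectSoft(connectFn, onReady, retryDelayMs), retryDelayMs);
  }
}
```

The trade-off versus the PR's fail-fast `timeout: 10_000` is that a misconfigured NATS URL surfaces as degraded runtime behavior rather than a crashed pod, so it needs good logging to stay observable.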
```ts
nc = await connect({
  servers: url,
  // Fail fast on startup if NATS is unreachable
  timeout: 10_000,
```
<file context>
```diff
@@ -36,7 +36,20 @@ export function createNatsConnectionProvider(): NatsConnectionProvider {
+      nc = await connect({
+        servers: url,
+        // Fail fast on startup if NATS is unreachable
+        timeout: 10_000,
+        // After initial connect, reconnect automatically with backoff + jitter
+        // so pods don't all hammer NATS simultaneously after a restart
```
</file context>
Summary
- Streams and buckets were created with `num_replicas: 1`, meaning data only lived on the NATS leader. With a 3-node cluster, a leader change would lose in-memory content. All streams and buckets now use `num_replicas: 3`.
- `AUTOMATION_JOBS` was using Memory storage, so jobs were lost during rolling deploys. Renamed to `AUTOMATION_JOBS_FILE` with `StorageType.File`, backed by the EFS volume already mounted on each NATS pod. The old stream stays on the cluster with no consumers and can be cleaned up when convenient: `nats stream delete AUTOMATION_JOBS`.
- `connect()` had no options configured. Added `timeout: 10s` to fail fast on startup, reconnect backoff with 2s wait + 1s jitter to avoid a thundering herd on pod restarts, and `pingInterval`/`maxPingOut` to detect zombie TCP connections early.
- Added `statement_timeout: 30s` to the Postgres pool. Without it, a slow query could hold a connection indefinitely, exhaust the pool, and cause the readiness probe to fail.

Expected improvements
Summary by cubic
Improves durability and startup reliability by replicating all JetStream data to 3 nodes and persisting automation jobs on file-backed storage. Adds resilient `nats` connection settings and a 30s Postgres statement timeout to prevent job loss and startup hangs.

Bug Fixes
- Automation jobs stream now uses `StorageType.File` and 3 replicas so jobs survive rolling deploys.
- Hardened `nats` connection: 10s connect timeout, infinite reconnect with 2s wait + 1s jitter, and ping checks (30s interval, 3 max).
- Added `statement_timeout=30s` to stop runaway queries and keep readiness healthy.

Migration
- Jobs stream renamed to `AUTOMATION_JOBS_FILE` (consumer: `automation-worker-file`). The old `AUTOMATION_JOBS` stream remains unused and can be removed when convenient: `nats stream delete AUTOMATION_JOBS`.

Written for commit 8e29611.