feat(nats): non-blocking NATS startup with /readyz endpoint #2855
Conversation
11 issues found across 14 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts">
<violation number="1" location="apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts:74">
P1: Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.
(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.) [FEEDBACK_USED]</violation>
</file>
<file name="apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts">
<violation number="1" location="apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts:35">
P1: Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>
</file>
<file name="apps/mesh/src/nats/pod-heartbeat.ts">
<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:66">
P1: Returning early in `onPodDeath` drops watcher registration permanently when NATS is unavailable at startup, so pod-death recovery never activates after reconnect.</violation>
</file>
<file name="apps/mesh/src/automations/job-stream.ts">
<violation number="1" location="apps/mesh/src/automations/job-stream.ts:102">
P1: `publish()` silently drops jobs when JetStream is unavailable, which can lose automation executions without surfacing an error.
(Based on your team's feedback about treating NATS/JetStream failures as fail-fast rather than graceful no-op fallbacks.) [FEEDBACK_USED]</violation>
</file>
<file name="apps/mesh/src/nats/connection.ts">
<violation number="1" location="apps/mesh/src/nats/connection.ts:99">
P1: `stopped` is never reset to `false`, so re-calling `init()` after `drain()` silently no-ops. Since `drain()` resets `initialized = false` (inviting reuse), add `stopped = false` at the top of `init()`.</violation>
</file>
<file name="apps/mesh/src/event-bus/nats-sse-broadcast.ts">
<violation number="1" location="apps/mesh/src/event-bus/nats-sse-broadcast.ts:45">
P1: Returning early when `getConnection()` is null makes SSE cross-pod subscription permanently skipped if NATS is down at startup. Add a retry/deferred re-start path so subscription is established once NATS becomes available.</violation>
</file>
<file name="apps/mesh/src/api/app.ts">
<violation number="1" location="apps/mesh/src/api/app.ts:263">
P1: Do not fire-and-forget NATS initialization in production startup; this should fail fast when NATS is unavailable.
(Based on your team's feedback about treating NATS/JetStream as a hard startup dependency.) [FEEDBACK_USED]</violation>
<violation number="2" location="apps/mesh/src/api/app.ts:474">
P1: `/readyz` should include NATS in the readiness gate when NATS is a required runtime dependency; otherwise pods can be marked ready in a partially broken state.
(Based on your team's feedback about treating NATS/JetStream as a hard dependency in runtime environments.) [FEEDBACK_USED]</violation>
<violation number="3" location="apps/mesh/src/api/app.ts:854">
P0: Do not swallow automation JetStream startup failures; this can leave the cron/automation worker disabled while the server appears healthy.
(Based on your team's feedback about failing fast when JetStream stream/consumer initialization fails.) [FEEDBACK_USED]</violation>
</file>
<file name="apps/mesh/src/event-bus/nats-notify.ts">
<violation number="1" location="apps/mesh/src/event-bus/nats-notify.ts:34">
P2: Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>
<violation number="2" location="apps/mesh/src/event-bus/nats-notify.ts:56">
P2: Avoid silently dropping notifications when the NATS connection is missing; this should surface as a failure so missing NATS doesn’t leave the event bus in a degraded state.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```ts
async init(): Promise<void> {
  const nc = this.options.getConnection();
  if (!nc) return; // NATS not ready — stream buffer disabled
```
P1: Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.
(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts, line 74:
<comment>Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.
(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.) </comment>
<file context>

```diff
@@ -71,6 +71,7 @@ export class NatsStreamBuffer implements StreamBuffer {
 async init(): Promise<void> {
   const nc = this.options.getConnection();
+  if (!nc) return; // NATS not ready — stream buffer disabled
   const jsm = await nc.jetstreamManager();
```

</file context>
Suggested change:

```diff
-if (!nc) return; // NATS not ready — stream buffer disabled
+if (!nc) throw new Error("NATS connection not initialized");
```
Intentional — this PR makes NATS a soft dependency. When NATS is unavailable at startup, init() returns early (stream buffering degraded). The caller in app.ts registers natsProvider.onReady(() => streamBuffer.init()) to re-initialize when NATS connects. Throwing here would crash the server at startup when NATS is down, which is exactly what this PR fixes.
Thanks for the feedback! I've saved this as a new learning to improve future reviews.
```diff
-this.sub = this.options.getConnection().subscribe(CANCEL_SUBJECT);
+const nc = this.options.getConnection();
+if (!nc) return; // NATS not ready — local cancel only
```
P1: Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts, line 35:
<comment>Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) </comment>
<file context>

```diff
@@ -30,7 +30,11 @@ export class NatsCancelBroadcast implements CancelBroadcast {
-    this.sub = this.options.getConnection().subscribe(CANCEL_SUBJECT);
+
+    const nc = this.options.getConnection();
+    if (!nc) return; // NATS not ready — local cancel only
+
+    this.sub = nc.subscribe(CANCEL_SUBJECT);
```

</file context>
Intentional — local cancellation still works without NATS. Cross-pod broadcast activates when NATS connects via the onReady() hook in app.ts. Throwing here would crash the startup sequence when NATS is temporarily unavailable.
Thanks for the feedback! I've updated an existing learning with this new information.
```diff
-this.sub = this.options.getConnection().subscribe(SUBJECT);
+const nc = this.options.getConnection();
+if (!nc) return; // NATS not ready — polling strategy is safety net
```
P2: Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/event-bus/nats-notify.ts, line 34:
<comment>Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.
(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) </comment>
<file context>

```diff
@@ -28,9 +28,12 @@ export class NatsNotifyStrategy implements NotifyStrategy {
-    this.sub = this.options.getConnection().subscribe(SUBJECT);
+
+    const nc = this.options.getConnection();
+    if (!nc) return; // NATS not ready — polling strategy is safety net
+
+    this.sub = nc.subscribe(SUBJECT);
```

</file context>
Suggested change:

```diff
-if (!nc) return; // NATS not ready — polling strategy is safety net
+if (!nc) {
+  throw new Error("NATS connection not initialized");
+}
```
Intentional — start() is called immediately as best-effort, then again via the onReady() hook in createEventBus() when NATS connects. The PollingStrategy runs alongside as a safety net for event delivery regardless of NATS availability. Throwing here would crash the server during NATS-unavailable startup.
Thanks for the feedback! I've saved this as a new learning to improve future reviews.
2 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/nats/pod-heartbeat.ts">
<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:42">
P1: A failed KV init is memoized forever, preventing future `init()` retries after transient NATS errors.</violation>
</file>
<file name="apps/mesh/src/automations/job-stream.ts">
<violation number="1" location="apps/mesh/src/automations/job-stream.ts:103">
P1: Throwing when NATS is not ready breaks soft-dependency behavior and can drop scheduled automation executions after trigger state is already advanced.</violation>
</file>
2 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/env.ts">
<violation number="1" location="apps/mesh/src/env.ts:40">
P2: Filter out empty NATS URL entries after splitting to avoid injecting invalid endpoints into runtime config.</violation>
</file>
<file name="apps/mesh/src/api/app.ts">
<violation number="1" location="apps/mesh/src/api/app.ts:1415">
P2: Shutdown uses the global `currentEventBus` instead of the app instance's `eventBus`, which can stop the wrong worker when multiple app instances exist (tests/HMR).</violation>
</file>
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/api/index.test.ts">
<violation number="1" location="apps/mesh/src/api/index.test.ts:90">
P2: The readiness test was switched to `/health/ready`, so it no longer verifies the documented `/readyz` endpoint contract.</violation>
</file>
…ook, injectable connectFn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sh() methods Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (soft) health checks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /readyz endpoint actively checks PostgreSQL connectivity (hard dependency) and reports NATS status (soft). The liveness probe stays on /health. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
drain() sets stopped=true but init() never reset it, so connectWithRetry() immediately exited its while-loop after a drain/reinit cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
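The bug and fix described in this commit message reduce to a small sketch; the class shape below is illustrative, with fields left public only so they are easy to inspect:

```typescript
// Skeleton of the lifecycle flags described above.
class ConnectionProvider {
  stopped = false;
  initialized = false;

  async init(): Promise<void> {
    this.stopped = false; // the fix: re-arm the retry loop after drain()
    if (this.initialized) return;
    this.initialized = true;
    void this.connectWithRetry();
  }

  private async connectWithRetry(): Promise<void> {
    while (!this.stopped) {
      // ...attempt to connect, back off with jitter, repeat...
      return; // placeholder so the sketch terminates
    }
  }

  async drain(): Promise<void> {
    this.stopped = true;      // halt the retry loop
    this.initialized = false; // permit a later init()
  }
}
```

Without the first line of `init()`, a drain/reinit cycle leaves `stopped === true`, so `connectWithRetry()` exits before its first iteration.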
NatsCancelBroadcast, NatsNotifyStrategy, and NatsSSEBroadcast called start() once at boot but never re-invoked it when NATS connected later. This left their subscriptions permanently inactive when NATS was unavailable at startup — the exact scenario this branch handles. Added onReady hooks so all three re-establish subscriptions once NATS connects. Made start() accept optional callbacks to support re-invocation with the previously stored handler. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… init - job-stream: publish() throws instead of silently dropping jobs - job-stream: startConsumer() guards against duplicate consumer loops - pod-heartbeat: init() deduplicates concurrent calls via stored promise - pod-heartbeat: start() guards against double start - pod-heartbeat: onPodDeath defers callback when KV not ready, activates in start() - app.ts: chain podHeartbeat.start() after init() resolves instead of sync call Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…robes - Add SIGTERM/SIGINT handlers with phased shutdown (mark unready → drain traffic → stop workers → flush telemetry → close DB) - Add /health/live (liveness) and /health/ready (readiness with shutdown gate) endpoints - Keep /readyz as legacy endpoint with same shutdown-aware behavior - Update Helm probes to use new endpoints, add terminationGracePeriodSeconds: 60 - Transform NATS_URL env to support comma-separated URLs for multi-node clusters - Add docker-compose.dev.yml with PostgreSQL + 3-node NATS JetStream cluster - Bump Helm chart to 0.1.41 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
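A minimal sketch of such a phased shutdown, with hypothetical phase bodies (the real hooks are app-specific):

```typescript
type Phase = [name: string, run: () => Promise<void>];

// Runs shutdown phases in order. A failing phase is reported but must
// not block later phases (e.g. the DB pool should still close even if
// the telemetry flush fails).
async function runShutdown(phases: Phase[]): Promise<string[]> {
  const completed: string[] = [];
  for (const [name, run] of phases) {
    try {
      await run();
      completed.push(name);
    } catch (err) {
      console.error(`shutdown phase "${name}" failed:`, err);
    }
  }
  return completed;
}

// Wiring: mark unready → drain traffic → stop workers → flush telemetry → close DB
process.once("SIGTERM", () => {
  void runShutdown([
    ["markUnready", async () => { /* readiness endpoints start returning 503 */ }],
    ["drainTraffic", async () => { /* wait out in-flight HTTP requests */ }],
    ["stopWorkers", async () => { /* stop NATS consumers and cron workers */ }],
    ["flushTelemetry", async () => { /* flush spans and metrics */ }],
    ["closeDb", async () => { /* close the PostgreSQL pool */ }],
  ]).then(() => process.exit(0));
});
```

Marking the pod unready first matters: it gives the load balancer time to stop routing new traffic before in-flight requests are drained, which is also why `terminationGracePeriodSeconds` is raised to 60.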
…ready Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd /health/ready only Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed cc2740d to ad78c06.
What is this contribution about?
Makes the Hono server start immediately even when NATS is temporarily unavailable, and adds a proper `/readyz` readiness endpoint that checks service health.

Problem: Previously, the server would crash at startup if NATS was unreachable — `createApp()` called `await natsProvider.init()`, which threw on connection failure. This blocked the HTTP server from binding to a port.

Solution:

- `NatsConnectionProvider.init()` is now fire-and-forget with background exponential backoff retry (100 ms → 3 s max, with jitter)
- Consumers handle `null` connections gracefully (pub/sub, JetStream KV, streams)
- An `onReady(callback)` hook lets consumers self-initialize when NATS connects
- Subscription components use `onReady` hooks to re-establish subscriptions after a deferred connection
- New `GET /readyz` endpoint: PostgreSQL is a hard dependency (503 if down), NATS is soft (reported but doesn't block 200)
- Readiness probe moved from `/health` to `/readyz`
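The retry schedule can be sketched as follows. The jitter style is an assumption (full jitter, uniform in [0, cap)), since the description only says "with jitter":

```typescript
// Backoff sketch matching the description: exponential from 100 ms,
// capped at 3 s, with (assumed full) jitter.
function retryDelayMs(attempt: number, baseMs = 100, maxMs = 3000): number {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.random() * cap; // full jitter: uniform in [0, cap)
}

// A fire-and-forget retry loop using it might look like:
async function connectWithRetry(tryConnect: () => Promise<boolean>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    if (await tryConnect()) return;
    await new Promise((r) => setTimeout(r, retryDelayMs(attempt)));
  }
}
```

The jitter spreads out reconnect attempts so that many pods recovering from the same NATS outage do not hammer the server in lockstep.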
N/A — backend-only changes.
How to Test
Start server without NATS running:
Expected: Server starts and binds to port. Console shows NATS retry warnings.
Hit readiness endpoint (NATS down):
Expected:
`200 {"status":"ready","services":{"postgres":{"status":"up"},"nats":{"status":"down"}}}`

Start NATS, then re-check:
Expected:
`200 {"status":"ready","services":{"postgres":{"status":"up"},"nats":{"status":"up"}}}`

Run tests:
`bun test apps/mesh/src/nats/connection.test.ts apps/mesh/src/api/index.test.ts`
Expected: All 14 tests pass.
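The expected responses above follow from a simple decision rule: PostgreSQL gates readiness, NATS is only reported. A sketch of that rule as a pure function (names are illustrative; the real handler is wired into the Hono app and runs a live database check):

```typescript
interface ProbeResult {
  httpStatus: 200 | 503;
  body: {
    status: "ready" | "unavailable";
    services: {
      postgres: { status: "up" | "down" };
      nats: { status: "up" | "down" };
    };
  };
}

function readiness(pgUp: boolean, natsUp: boolean): ProbeResult {
  return {
    httpStatus: pgUp ? 200 : 503, // only the hard dependency gates 503
    body: {
      status: pgUp ? "ready" : "unavailable",
      services: {
        postgres: { status: pgUp ? "up" : "down" }, // hard dependency
        nats: { status: natsUp ? "up" : "down" },   // soft: reported only
      },
    },
  };
}
```

Keeping the decision logic pure like this makes the 200-with-NATS-down contract trivially unit-testable, independent of the HTTP layer.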
Migration Notes
The readiness probe moved from `/health` to `/readyz`. The new endpoint checks PostgreSQL connectivity and returns 503 if the database is unreachable. The `/health` endpoint is unchanged.