Skip to content

feat(nats): non-blocking NATS startup with /readyz endpoint#2855

Open
tlgimenes wants to merge 16 commits intomainfrom
tlgimenes/nats-pg-readiness
Open

feat(nats): non-blocking NATS startup with /readyz endpoint#2855
tlgimenes wants to merge 16 commits intomainfrom
tlgimenes/nats-pg-readiness

Conversation

@tlgimenes
Copy link
Contributor

@tlgimenes tlgimenes commented Mar 24, 2026

What is this contribution about?

Makes the Hono server start immediately even when NATS is temporarily unavailable, and adds a proper /readyz readiness endpoint that checks service health.

Problem: Previously, the server would crash at startup if NATS was unreachable — createApp() called await natsProvider.init() which threw on connection failure. This blocked the HTTP server from binding to a port.

Solution:

  • NatsConnectionProvider.init() is now fire-and-forget with background exponential backoff retry (100ms → 3s max, with jitter)
  • All NATS consumers handle null connections gracefully (pub/sub, JetStream KV, streams)
  • An onReady(callback) hook lets consumers self-initialize when NATS connects
  • All NATS-dependent subscribers (notify strategy, SSE broadcast, cancel broadcast) have onReady hooks to re-establish subscriptions after deferred connection
  • Post-connection reconnect uses the NATS client's built-in mechanism
  • New GET /readyz endpoint: PostgreSQL is a hard dependency (503 if down), NATS is soft (reported but doesn't block 200)
  • Helm readiness probe updated from /health/readyz

Screenshots/Demonstration

N/A — backend-only changes.

How to Test

  1. Start server without NATS running:

    bun run --cwd=apps/mesh dev:server

    Expected: Server starts and binds to port. Console shows NATS retry warnings.

  2. Hit readiness endpoint (NATS down):

    curl http://localhost:3000/readyz

    Expected: 200 {"status":"ready","services":{"postgres":{"status":"up"},"nats":{"status":"down"}}}

  3. Start NATS, then re-check:

    curl http://localhost:3000/readyz

    Expected: 200 {"status":"ready","services":{"postgres":{"status":"up"},"nats":{"status":"up"}}}

  4. Run tests:

    bun test apps/mesh/src/nats/connection.test.ts apps/mesh/src/api/index.test.ts

    Expected: All 14 tests pass.

Migration Notes

  • Helm chart: Readiness probe path changed from /health to /readyz. The new endpoint checks PostgreSQL connectivity and returns 503 if the database is unreachable.
  • No database migrations required.
  • The /health endpoint is unchanged.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • Documentation is updated (if needed)
  • No breaking changes

@github-actions
Copy link
Contributor

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction Action
👍 Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 24, 2026

Release Options

Suggested: Minor (2.203.0) — based on feat: prefix

React with an emoji to override the release type:

Reaction Type Next Version
👍 Prerelease 2.202.5-alpha.1
🎉 Patch 2.202.5
❤️ Minor 2.203.0
🚀 Major 3.0.0

Current version: 2.202.4

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: minor).

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 issues found across 14 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts:74">
P1: Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.

(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts">

<violation number="1" location="apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts:35">
P1: Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:66">
P1: Returning early in `onPodDeath` drops watcher registration permanently when NATS is unavailable at startup, so pod-death recovery never activates after reconnect.</violation>
</file>

<file name="apps/mesh/src/automations/job-stream.ts">

<violation number="1" location="apps/mesh/src/automations/job-stream.ts:102">
P1: `publish()` silently drops jobs when JetStream is unavailable, which can lose automation executions without surfacing an error.

(Based on your team's feedback about treating NATS/JetStream failures as fail-fast rather than graceful no-op fallbacks.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/nats/connection.ts">

<violation number="1" location="apps/mesh/src/nats/connection.ts:99">
P1: `stopped` is never reset to `false`, so re-calling `init()` after `drain()` silently no-ops. Since `drain()` resets `initialized = false` (inviting reuse), add `stopped = false` at the top of `init()`.</violation>
</file>

<file name="apps/mesh/src/event-bus/nats-sse-broadcast.ts">

<violation number="1" location="apps/mesh/src/event-bus/nats-sse-broadcast.ts:45">
P1: Returning early when `getConnection()` is null makes SSE cross-pod subscription permanently skipped if NATS is down at startup. Add a retry/deferred re-start path so subscription is established once NATS becomes available.</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:263">
P1: Do not fire-and-forget NATS initialization in production startup; this should fail fast when NATS is unavailable.

(Based on your team's feedback about treating NATS/JetStream as a hard startup dependency.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/api/app.ts:474">
P1: `/readyz` should include NATS in the readiness gate when NATS is a required runtime dependency; otherwise pods can be marked ready in a partially broken state.

(Based on your team's feedback about treating NATS/JetStream as a hard dependency in runtime environments.) [FEEDBACK_USED]</violation>

<violation number="3" location="apps/mesh/src/api/app.ts:854">
P0: Do not swallow automation JetStream startup failures; this can leave the cron/automation worker disabled while the server appears healthy.

(Based on your team's feedback about failing fast when JetStream stream/consumer initialization fails.) [FEEDBACK_USED]</violation>
</file>

<file name="apps/mesh/src/event-bus/nats-notify.ts">

<violation number="1" location="apps/mesh/src/event-bus/nats-notify.ts:34">
P2: Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>

<violation number="2" location="apps/mesh/src/event-bus/nats-notify.ts:56">
P2: Avoid silently dropping notifications when the NATS connection is missing; this should surface as a failure so missing NATS doesn’t leave the event bus in a degraded state.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


async init(): Promise<void> {
const nc = this.options.getConnection();
if (!nc) return; // NATS not ready — stream buffer disabled
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.

(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.)

View Feedback: 1 2

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/api/routes/decopilot/nats-stream-buffer.ts, line 74:

<comment>Do not silently return when NATS is unavailable here; fail initialization so JetStream-dependent stream buffering is not disabled in a partially running state.

(Based on your team's feedback about treating NATS/JetStream startup as fail-fast for dependent components.) </comment>

<file context>
@@ -71,6 +71,7 @@ export class NatsStreamBuffer implements StreamBuffer {
 
   async init(): Promise<void> {
     const nc = this.options.getConnection();
+    if (!nc) return; // NATS not ready — stream buffer disabled
     const jsm = await nc.jetstreamManager();
 
</file context>
Suggested change
if (!nc) return; // NATS not ready — stream buffer disabled
if (!nc) throw new Error("NATS connection not initialized");
Fix with Cubic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — this PR makes NATS a soft dependency. When NATS is unavailable at startup, init() returns early (stream buffering degraded). The caller in app.ts registers natsProvider.onReady(() => streamBuffer.init()) to re-initialize when NATS connects. Throwing here would crash the server at startup when NATS is down, which is exactly what this PR fixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I've saved this as a new learning to improve future reviews.

this.sub = this.options.getConnection().subscribe(CANCEL_SUBJECT);

const nc = this.options.getConnection();
if (!nc) return; // NATS not ready — local cancel only
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.)

View Feedback

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/api/routes/decopilot/nats-cancel-broadcast.ts, line 35:

<comment>Do not silently no-op when NATS is missing; this component should fail fast instead of degrading to local-only cancellation.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) </comment>

<file context>
@@ -30,7 +30,11 @@ export class NatsCancelBroadcast implements CancelBroadcast {
-    this.sub = this.options.getConnection().subscribe(CANCEL_SUBJECT);
+
+    const nc = this.options.getConnection();
+    if (!nc) return; // NATS not ready — local cancel only
+
+    this.sub = nc.subscribe(CANCEL_SUBJECT);
</file context>
Fix with Cubic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — local cancellation still works without NATS. Cross-pod broadcast activates when NATS connects via the onReady() hook in app.ts. Throwing here would crash the startup sequence when NATS is temporarily unavailable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I've updated an existing learning with this new information.

this.sub = this.options.getConnection().subscribe(SUBJECT);

const nc = this.options.getConnection();
if (!nc) return; // NATS not ready — polling strategy is safety net
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.)

View Feedback

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/event-bus/nats-notify.ts, line 34:

<comment>Do not silently skip startup when the NATS connection is missing. This should fail fast so missing/misconfigured NATS is surfaced immediately instead of disabling the notify strategy.

(Based on your team's feedback about treating NATS/JetStream as a required fail-fast dependency.) </comment>

<file context>
@@ -28,9 +28,12 @@ export class NatsNotifyStrategy implements NotifyStrategy {
-    this.sub = this.options.getConnection().subscribe(SUBJECT);
+
+    const nc = this.options.getConnection();
+    if (!nc) return; // NATS not ready — polling strategy is safety net
+
+    this.sub = nc.subscribe(SUBJECT);
</file context>
Suggested change
if (!nc) return; // NATS not ready — polling strategy is safety net
if (!nc) {
throw new Error("NATS connection not initialized");
}
Fix with Cubic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — start() is called immediately as best-effort, then again via the onReady() hook in createEventBus() when NATS connects. The PollingStrategy runs alongside as a safety net for event delivery regardless of NATS availability. Throwing here would crash the server during NATS-unavailable startup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I've saved this as a new learning to improve future reviews.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:42">
P1: A failed KV init is memoized forever, preventing future `init()` retries after transient NATS errors.</violation>
</file>

<file name="apps/mesh/src/automations/job-stream.ts">

<violation number="1" location="apps/mesh/src/automations/job-stream.ts:103">
P1: Throwing when NATS is not ready breaks soft-dependency behavior and can drop scheduled automation executions after trigger state is already advanced.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/env.ts">

<violation number="1" location="apps/mesh/src/env.ts:40">
P2: Filter out empty NATS URL entries after splitting to avoid injecting invalid endpoints into runtime config.</violation>
</file>

<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:1415">
P2: Shutdown uses the global `currentEventBus` instead of the app instance's `eventBus`, which can stop the wrong worker when multiple app instances exist (tests/HMR).</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/index.test.ts">

<violation number="1" location="apps/mesh/src/api/index.test.ts:90">
P2: The readiness test was switched to `/health/ready`, so it no longer verifies the documented `/readyz` endpoint contract.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

tlgimenes and others added 15 commits March 25, 2026 07:12
…ook, injectable connectFn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ook, injectable connectFn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sh() methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (soft) health checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /readyz endpoint actively checks PostgreSQL connectivity (hard dependency)
and reports NATS status (soft). The liveness probe stays on /health.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
drain() sets stopped=true but init() never reset it, so connectWithRetry()
immediately exited its while-loop after a drain/reinit cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NatsCancelBroadcast, NatsNotifyStrategy, and NatsSSEBroadcast called
start() once at boot but never re-invoked it when NATS connected later.
This left their subscriptions permanently inactive when NATS was
unavailable at startup — the exact scenario this branch handles.

Added onReady hooks so all three re-establish subscriptions once NATS
connects. Made start() accept optional callbacks to support re-invocation
with the previously stored handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… init

- job-stream: publish() throws instead of silently dropping jobs
- job-stream: startConsumer() guards against duplicate consumer loops
- pod-heartbeat: init() deduplicates concurrent calls via stored promise
- pod-heartbeat: start() guards against double start
- pod-heartbeat: onPodDeath defers callback when KV not ready, activates in start()
- app.ts: chain podHeartbeat.start() after init() resolves instead of sync call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…robes

- Add SIGTERM/SIGINT handlers with phased shutdown (mark unready → drain traffic → stop workers → flush telemetry → close DB)
- Add /health/live (liveness) and /health/ready (readiness with shutdown gate) endpoints
- Keep /readyz as legacy endpoint with same shutdown-aware behavior
- Update Helm probes to use new endpoints, add terminationGracePeriodSeconds: 60
- Transform NATS_URL env to support comma-separated URLs for multi-node clusters
- Add docker-compose.dev.yml with PostgreSQL + 3-node NATS JetStream cluster
- Bump Helm chart to 0.1.41

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ready

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd /health/ready only

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tlgimenes tlgimenes force-pushed the tlgimenes/nats-pg-readiness branch from cc2740d to ad78c06 Compare March 25, 2026 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant