feat(server): add graceful shutdown on SIGTERM/SIGINT #2787

Closed
tlgimenes wants to merge 2 commits into main from tlgimenes/graceful-shutdown
Conversation

@tlgimenes tlgimenes commented Mar 20, 2026

What is this contribution about?

Adds graceful shutdown handling when the MCP Mesh server receives SIGTERM (K8s pod termination) or SIGINT (Ctrl+C). Previously, all resources — database connections, NATS subscriptions, event bus workers, in-flight telemetry — were abandoned without cleanup.

The shutdown sequence follows a strict order: stop HTTP servers → stop workers in parallel (EventBus, SSE hub, cron, decopilot) → drain NATS (after all consumers stopped) → flush telemetry → close database. A 10-second force-exit timeout prevents the process from hanging indefinitely.

Two-file change using Object.assign(app, { shutdown }) to preserve backward compatibility with existing tests.
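
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ShutdownSteps`, `createShutdown`, `withDeadline`, and the stand-in `baseApp` object are hypothetical names, and the real cleanup callbacks would wrap the server's own HTTP, worker, NATS, telemetry, and database handles.

```typescript
// A cleanup step may be synchronous or asynchronous.
type Cleanup = () => void | Promise<void>;

// Hypothetical dependency bag; the field names are illustrative only.
interface ShutdownSteps {
  stopHttp: Cleanup;
  workers: Cleanup[]; // e.g. EventBus, SSE hub, cron, decopilot
  drainNats: Cleanup;
  flushTelemetry: Cleanup;
  closeDatabase: Cleanup;
}

function createShutdown(steps: ShutdownSteps): () => Promise<void> {
  let inFlight: Promise<void> | undefined;

  async function run(): Promise<void> {
    // 1. Stop accepting new requests.
    await steps.stopHttp();
    // 2. Stop independent workers in parallel; one failing worker
    //    must not prevent the others from stopping.
    await Promise.allSettled(
      steps.workers.map((stop) => Promise.resolve().then(stop)),
    );
    // 3. Drain NATS only after every consumer has stopped.
    await steps.drainNats();
    // 4. Flush telemetry, then 5. close the database last.
    await steps.flushTelemetry();
    await steps.closeDatabase();
  }

  // A repeated signal reuses the same run instead of starting a second one.
  return () => {
    if (!inFlight) inFlight = run();
    return inFlight;
  };
}

// Calls onTimeout if `work` has not settled within timeoutMs.
function withDeadline(
  work: Promise<void>,
  timeoutMs: number,
  onTimeout: () => void,
): Promise<void> {
  const timer = setTimeout(onTimeout, timeoutMs);
  return work.finally(() => clearTimeout(timer));
}

// Attaching shutdown without changing the existing shape of the app object.
const baseApp = { name: "mesh" }; // stand-in for the real app
const noop: Cleanup = () => {};
const app = Object.assign(baseApp, {
  shutdown: createShutdown({
    stopHttp: noop,
    workers: [],
    drainNats: noop,
    flushTelemetry: noop,
    closeDatabase: noop,
  }),
});
// In a signal handler (assumed wiring):
//   withDeadline(app.shutdown(), 10_000, () => process.exit(1));
```

Because `Object.assign` mutates and returns the same object, code that only uses the original app surface keeps working, while callers that know about `shutdown` get a typed extra method.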

How to Test

  1. bun run dev, then press Ctrl+C
  2. You should see, in order: [shutdown] Received SIGINT... → [shutdown] Stopping workers... → [shutdown] Cleanup complete.
  3. bun run dev, then in another terminal kill -TERM <pid> — same clean shutdown behavior
  4. bun run check passes, bun test apps/mesh/src/api/ passes (317 tests)

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working
  • No breaking changes

Summary by cubic

Add graceful shutdown on SIGTERM/SIGINT and separate liveness/readiness probes. Ensures clean shutdown, prevents lost data, and improves K8s draining during rollouts.

  • New Features
    • Shutdown sequence: stop HTTP server first (force-close SSE), stop workers in parallel (EventBus, SSE hub, cron, decopilot), drain NATS, flush telemetry, then close the DB.
    • Added /health/live and /health/ready; readiness checks DB and NATS, returns 503 during shutdown via app.markShuttingDown(). /health kept for compatibility.
    • Installed SIGTERM/SIGINT handlers with a 2s pre-stop delay and a 55s force-exit timeout; exposes app.shutdown().
    • NATS_URL now accepts comma-separated URLs for cluster failover.
    • Helm chart bumped to 0.1.41 with new probe paths and terminationGracePeriodSeconds=60; added deploy/docker-compose/docker-compose.dev.yml for local Postgres and a 3-node NATS cluster.
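
As an illustration of the `NATS_URL` item above, a comma-separated value could be split into a server list along these lines. This is a sketch only: `parseNatsServers` is a hypothetical helper name, and how the PR actually parses the variable is not shown in this conversation.

```typescript
// Splits a comma-separated NATS_URL value into individual server URLs.
// Whitespace around entries and empty entries (e.g. a trailing comma) are dropped.
function parseNatsServers(natsUrl: string): string[] {
  return natsUrl
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
}

// Example value for a 3-node cluster (hostnames are made up for illustration).
const servers = parseNatsServers(
  "nats://nats-1:4222, nats://nats-2:4222,nats://nats-3:4222",
);
console.log(servers.length);
```

The resulting array can then be handed to the NATS client as its server list, which lets the client try the remaining nodes when one becomes unreachable.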

Written for commit fee6fc4. Summary will update on new commits.

@github-actions
🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

| Reaction | Action |
|---|---|
| 👍 | Run quick benchmark (10 & 128 tools) |

Benchmark will run on the next push after you react.

@github-actions
github-actions bot commented Mar 20, 2026

Release Options

Suggested: Minor (2.203.0) — based on feat: prefix

React with an emoji to override the release type:

| Reaction | Type | Next Version |
|---|---|---|
| 👍 | Prerelease | 2.202.5-alpha.1 |
| 🎉 | Patch | 2.202.5 |
| ❤️ | Minor | 2.203.0 |
| 🚀 | Major | 3.0.0 |

Current version: 2.202.4

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: minor).

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/api/app.ts">

<violation number="1" location="apps/mesh/src/api/app.ts:1156">
P2: Wrap cleanup callbacks in a deferred promise (`Promise.resolve().then(...)`) so synchronous throws are captured by `Promise.allSettled` instead of aborting shutdown early.</violation>
</file>
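
To illustrate the point of this finding, the sketch below contrasts the two ways of building the promise array. The function names `settleNaive` and `settleDeferred` are hypothetical and do not come from `app.ts`.

```typescript
type Cleanup = () => void | Promise<void>;

// Naive: cb() runs while the array is still being built, so a synchronous
// throw escapes before Promise.allSettled is ever called, and later
// callbacks never run.
function settleNaive(callbacks: Cleanup[]): Promise<PromiseSettledResult<void>[]> {
  return Promise.allSettled(callbacks.map((cb) => cb()));
}

// Deferred: each callback runs inside a promise chain, so a synchronous
// throw becomes a rejection that Promise.allSettled records like any other.
function settleDeferred(callbacks: Cleanup[]): Promise<PromiseSettledResult<void>[]> {
  return Promise.allSettled(callbacks.map((cb) => Promise.resolve().then(cb)));
}
```

With the deferred form, every callback is attempted and the caller can inspect which ones were rejected, instead of the shutdown sequence stopping at the first synchronous error.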

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

tlgimenes and others added 2 commits March 24, 2026 20:28
Clean up all resources in order when the process receives a termination
signal: stop workers, drain NATS, flush telemetry, close database.
Prevents orphaned connections and lost telemetry on K8s pod termination.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… shutdown

- Add /health/live (liveness) and /health/ready (readiness) endpoints
- /health kept for backwards compatibility
- /health/ready checks DB and NATS connectivity, returns 503 during shutdown
- Expose markShuttingDown() separately so readiness returns 503 before
  server.stop() is called — gives K8s ~2s to drain traffic
- NATS_URL now accepts comma-separated URLs for cluster failover
- Bump helm chart to 0.1.41 with updated probe paths and terminationGracePeriodSeconds=60
- Add deploy/docker-compose/docker-compose.dev.yml with 3-node NATS cluster
  and PostgreSQL for local development
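
The probe behaviour in this commit message could look roughly like the sketch below. It is written under assumptions: `createLifecycle`, the `Probe` shape, the response bodies, and the injected `checkDatabase`/`checkNats`/`shutdown` callbacks are all hypothetical, and liveness is assumed to keep returning 200 during shutdown.

```typescript
type Probe = { status: number; body: Record<string, unknown> };

// Hypothetical dependencies; the real server supplies its own checks.
interface LifecycleDeps {
  checkDatabase: () => Promise<boolean>;
  checkNats: () => Promise<boolean>;
  shutdown: () => Promise<void>;
  preStopDelayMs: number; // the commit message mentions roughly 2s
}

function createLifecycle(deps: LifecycleDeps) {
  let shuttingDown = false;

  const markShuttingDown = (): void => {
    shuttingDown = true;
  };

  // Liveness: the process is up and able to answer.
  const live = (): Probe => ({ status: 200, body: { status: "ok" } });

  // Readiness: 503 while shutting down or when a dependency is unreachable.
  async function ready(): Promise<Probe> {
    if (shuttingDown) return { status: 503, body: { status: "shutting_down" } };
    const [db, nats] = await Promise.all([
      deps.checkDatabase().catch(() => false),
      deps.checkNats().catch(() => false),
    ]);
    const ok = db && nats;
    return { status: ok ? 200 : 503, body: { status: ok ? "ok" : "unavailable", db, nats } };
  }

  // Signal flow: fail readiness first, wait so the orchestrator can stop
  // routing traffic, then run the actual shutdown.
  async function onSignal(): Promise<void> {
    markShuttingDown();
    await new Promise<void>((resolve) => setTimeout(resolve, deps.preStopDelayMs));
    await deps.shutdown();
  }

  return { markShuttingDown, live, ready, onSignal };
}

// Assumed wiring in the entry point:
//   process.on("SIGTERM", () => void lifecycle.onSignal());
```

Keeping `markShuttingDown()` separate from the shutdown itself is what allows readiness to report 503 while the HTTP server is still accepting the last in-flight requests.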
@nicacioliveira nicacioliveira force-pushed the tlgimenes/graceful-shutdown branch from 480d49e to fee6fc4 Compare March 25, 2026 00:53
@tlgimenes
merged by #2855

@tlgimenes tlgimenes closed this Mar 25, 2026