Problem
Production deployments need to distinguish between:
- Liveness: Is the gateway process responsive? (restart if not)
- Readiness: Can the gateway serve traffic? (remove from load balancer if not)
This applies across all orchestration platforms — Kubernetes readiness probes, ECS health checks, docker-compose healthcheck, Consul/Nomad service health, and any load balancer that routes based on HTTP health endpoints. Without these, operators must rely on TCP checks or wait for client timeouts to detect backend failures.
Currently, agentic-api has no health check endpoints.
Proposal
GET /health
- Returns 200 OK unconditionally when the gateway process is listening
- Use for liveness checks (detect crashed/hung gateway process)
- ECS: container health check; K8s: liveness probe; ALB/NLB: target group health
GET /ready
- Probes the configured LLM backend's /health endpoint (with 2s timeout)
- Returns 200 OK if backend is reachable and healthy
- Returns 503 Service Unavailable if backend is down/unreachable
- Use for readiness checks (prevent routing to gateway when backend is unavailable)
- ECS: ALB target group health check path; K8s: readiness probe; Consul: service check
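The two endpoints above can be sketched with the Python standard library alone. This is an illustrative sketch, not the proposed implementation: the names `BACKEND_HEALTH_URL`, `backend_is_healthy`, `route`, and `HealthHandler` are hypothetical, the backend address is a placeholder, and the real gateway would register these routes in whatever web framework it already uses.

```python
import http.client
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical placeholder; the real value would come from gateway config.
BACKEND_HEALTH_URL = "http://localhost:9000/health"
READY_TIMEOUT_S = 2.0  # the 2s probe timeout from the proposal


def backend_is_healthy(url=BACKEND_HEALTH_URL, timeout=READY_TIMEOUT_S):
    """Return True only if the backend answers its health URL with a 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def route(path, probe=backend_is_healthy):
    """Map a request path to (status code, body)."""
    if path == "/health":
        # Liveness: answering at all proves the process is responsive.
        return 200, "ok"
    if path == "/ready":
        # Readiness: depends on the backend probe result.
        return (200, "ready") if probe() else (503, "backend unavailable")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```

Keeping /health free of any backend call matters: if liveness depended on the backend, a backend outage would cause the orchestrator to restart healthy gateway processes instead of just draining them.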
Deployment Examples
Kubernetes:
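livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  # Kubernetes defaults timeoutSeconds to 1, which is shorter than the
  # 2s backend probe inside /ready; allow the gateway time to answer.
  timeoutSeconds: 3
  failureThreshold: 2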
ECS (task definition):
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8000/ready || exit 1"],
    "interval": 10,
    "timeout": 3,
    "retries": 3,
    "startPeriod": 30
  }
}
docker-compose:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/ready"]
  interval: 10s
  timeout: 3s
  retries: 3
  start_period: 30s
Background
This follows the design discussion in vllm-project/vllm#36960, where the community identified the need for separate liveness/readiness checks after observing that GPU page faults can leave processes alive but unable to serve inference requests. While vLLM is addressing backend health detection, the gateway layer needs its own checks to handle network failures, misconfigurations, and backend unavailability — regardless of the orchestration platform.
Dependencies
Implementation
I have a working implementation (~30 LOC + integration tests) ready to submit as a PR once #29 merges. Happy to open the PR now for early review if preferred.