RFC: Prometheus Metrics & Health Endpoints
Phase: 1 — Security Hardening & Enterprise Foundation
Priority: P0 — Critical
Estimated Effort: Low
Problem Statement
Authorizer's health endpoint (/health) returns a plain "OK" string with no component status. There are no observability metrics: no Prometheus endpoint, no request latency tracking, no auth-specific counters. Keycloak, by comparison, has full Prometheus/Grafana support. Component-level health and metrics are essential for production deployments and Kubernetes environments.
Current health handler (internal/http_handlers/health.go):
func (h *httpProvider) HealthHandler() gin.HandlerFunc {
	return func(c *gin.Context) {
		c.String(http.StatusOK, "OK")
	}
}
Current Architecture Context
- HTTP framework: Gin on port 8080 (configurable via --http-port)
- Config already has --metrics-port=8081 flag defined but unused
- No Prometheus/OpenMetrics library in go.mod
- Memory store has Redis and DB-backed implementations
- Storage provider has no health-check methods
- Routes defined in internal/server/http_routes.go
Proposed Solution
1. Prometheus Metrics
Library: prometheus/client_golang — the de-facto standard Go Prometheus client.
New package: internal/metrics/
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Auth counters
	LoginTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "authorizer_login_total",
		Help: "Total login attempts by method and status",
	}, []string{"method", "status"}) // method=password|otp|magic_link|social, status=success|failure
	SignupTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "authorizer_signup_total",
		Help: "Total signup attempts by method and status",
	}, []string{"method", "status"})
	TokenIssuedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "authorizer_token_issued_total",
		Help: "Total tokens issued by type",
	}, []string{"type"}) // type=access_token|refresh_token|id_token
	ActiveSessions = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "authorizer_active_sessions",
		Help: "Current number of active sessions",
	})
	FailedLoginTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "authorizer_failed_login_total",
		Help: "Total failed login attempts (for alerting)",
	})
	AccountLockoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "authorizer_account_lockouts_total",
		Help: "Total account lockout events",
	})
	// Request metrics
	RequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "authorizer_request_duration_seconds",
		Help:    "HTTP request latency by endpoint and method",
		Buckets: prometheus.DefBuckets,
	}, []string{"endpoint", "method", "status_code"})
	DBQueryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "authorizer_db_query_duration_seconds",
		Help:    "Database query latency by operation",
		Buckets: prometheus.DefBuckets,
	}, []string{"operation"}) // operation=add_user|get_user_by_email|list_users|...
	// MFA metrics
	MFAVerificationTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "authorizer_mfa_verification_total",
		Help: "MFA verification attempts by type and status",
	}, []string{"type", "status"}) // type=totp, status=success|failure
	// Webhook metrics
	WebhookDeliveryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "authorizer_webhook_delivery_total",
		Help: "Webhook delivery attempts by event and status",
	}, []string{"event", "status"}) // status=success|failure
	WebhookDeliveryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "authorizer_webhook_delivery_duration_seconds",
		Help:    "Webhook delivery latency",
		Buckets: prometheus.DefBuckets,
	}, []string{"event"})
)
Metrics middleware for Gin (internal/http_handlers/metrics_middleware.go):
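func MetricsMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		c.Next()
		duration := time.Since(start).Seconds()
		// FullPath returns the route pattern, not the actual path (avoids
		// cardinality explosion). It is empty for unmatched routes (404s).
		endpoint := c.FullPath()
		if endpoint == "" {
			endpoint = "unmatched" // placeholder label value for unknown routes
		}
		metrics.RequestDuration.WithLabelValues(
			endpoint,
			c.Request.Method,
			strconv.Itoa(c.Writer.Status()),
		).Observe(duration)
	}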
}
Instrumentation points. Add counter increments to:
- internal/graphql/login.go: LoginTotal, FailedLoginTotal
- internal/graphql/signup.go: SignupTotal
- internal/token/token.go: TokenIssuedTotal
- internal/events/events.go: WebhookDeliveryTotal, WebhookDeliveryDuration
- Auth handler functions: MFAVerificationTotal
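As a rough illustration of what an instrumentation point might compute before incrementing a counter, the sketch below derives the (method, status) label pair for a login attempt. The helper name loginLabels, the "other" fallback value, and the sample error are assumptions for illustration and are not part of the RFC; the sketch uses only the standard library so it runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// knownLoginMethods mirrors the method values listed for authorizer_login_total.
var knownLoginMethods = map[string]bool{
	"password":   true,
	"otp":        true,
	"magic_link": true,
	"social":     true,
}

// errBadCredentials is a sample error used only by this sketch.
var errBadCredentials = errors.New("bad credentials")

// loginLabels is a hypothetical helper. It maps a login attempt to the
// (method, status) label pair and folds unexpected methods into "other"
// so that the number of distinct label values stays bounded.
func loginLabels(method string, err error) (string, string) {
	if !knownLoginMethods[method] {
		method = "other"
	}
	if err != nil {
		return method, "failure"
	}
	return method, "success"
}

func main() {
	m, s := loginLabels("password", nil)
	fmt.Println(m, s)
	m, s = loginLabels("saml", errBadCredentials)
	fmt.Println(m, s)
}
```

At a real call site the pair would presumably feed metrics.LoginTotal.WithLabelValues(method, status).Inc(), with FailedLoginTotal.Inc() added on the failure path.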
2. Metrics Server
Separate port — metrics served on --metrics-port=8081 (already defined in config, just unused).
Why a separate port: as a security best practice, /metrics should not be exposed on the public-facing port. In Kubernetes, the metrics port is typically reachable only from inside the cluster, where Prometheus scrapes it (for example through a Prometheus Operator ServiceMonitor).
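// In cmd/root.go, after main server setup:
if cfg.MetricsPort > 0 {
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go func() {
		// Log bind or serve failures instead of discarding the error
		// (standard library log shown; swap in the project's logger).
		if err := http.ListenAndServe(fmt.Sprintf(":%d", cfg.MetricsPort), metricsMux); err != nil {
			log.Printf("metrics server stopped: %v", err)
		}
	}()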
}
3. Enhanced Health Endpoint
Replace the current /health with a JSON response:
func (h *httpProvider) HealthHandler() gin.HandlerFunc {
	return func(c *gin.Context) {
		// Use the request context so checks stop if the client disconnects.
		ctx := c.Request.Context()
		// startTime and version are assumed to be package-level values set at startup.
		health := map[string]interface{}{
			"status":  "healthy",
			"uptime":  time.Since(startTime).String(),
			"version": version,
		}
		// Check database
		if err := h.deps.StorageProvider.HealthCheck(ctx); err != nil {
			health["status"] = "degraded"
			health["db"] = "error"
		} else {
			health["db"] = "ok"
		}
		// Check Redis (if configured)
		if h.deps.MemoryStore != nil {
			if err := h.deps.MemoryStore.HealthCheck(ctx); err != nil {
				health["status"] = "degraded"
				health["redis"] = "error"
			} else {
				health["redis"] = "ok"
			}
		}
		statusCode := http.StatusOK
		if health["status"] == "degraded" {
			statusCode = http.StatusServiceUnavailable
		}
		c.JSON(statusCode, health)
	}
}
Response example:
{
  "status": "healthy",
  "uptime": "72h15m30s",
  "version": "2.0.0",
  "db": "ok",
  "redis": "ok"
}
New interface methods needed:
// On storage.Provider:
HealthCheck(ctx context.Context) error
// On memory_store.Provider:
HealthCheck(ctx context.Context) error
- SQL providers: db.WithContext(ctx).Exec("SELECT 1").Error (with GORM, db.Raw alone only builds the statement and does not run it)
- MongoDB: client.Ping(ctx, nil)
- Redis: client.Ping(ctx).Err()
- Other NoSQL: provider-specific ping
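One detail the provider pings above leave open is how long a check may block. A minimal sketch, assuming each HealthCheck honors context cancellation, wraps the call in a deadline so a hung dependency cannot stall the health endpoint. The names checkWithTimeout, probeTimeout, okCheck, and hungCheck, and the 200 ms budget, are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probeTimeout is an assumed per-dependency budget; tune it to the probe period.
const probeTimeout = 200 * time.Millisecond

// checkWithTimeout runs one HealthCheck-style function under a deadline.
// It only helps if the check itself respects context cancellation.
func checkWithTimeout(parent context.Context, check func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, probeTimeout)
	defer cancel()
	return check(ctx)
}

// okCheck stands in for a healthy dependency.
func okCheck(ctx context.Context) error { return nil }

// hungCheck stands in for a dependency that never answers; it returns only
// once the context is cancelled or its deadline passes.
func hungCheck(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// probe is a small convenience wrapper used by main.
func probe(check func(context.Context) error) error {
	return checkWithTimeout(context.Background(), check)
}

func main() {
	fmt.Println("ok dependency:", probe(okCheck))
	fmt.Println("hung dependency:", probe(hungCheck))
}
```

In the handler, the parent context would be the request context rather than context.Background().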
4. Kubernetes Probes
New endpoints on the main server port:
GET /healthz → Liveness probe (is the process alive?)
GET /readyz → Readiness probe (can it serve traffic?)
Liveness (/healthz): Always returns 200 if the process is running. There are no dependency checks: if the process can respond, it is alive. Kubernetes restarts the container only if this probe fails.
Readiness (/readyz): Returns 200 only if all dependencies (DB, Redis) are healthy. If this probe fails, Kubernetes removes the pod from the Service endpoints, so no traffic is routed to unhealthy pods.
func (h *httpProvider) LivenessHandler() gin.HandlerFunc {
	return func(c *gin.Context) {
		c.JSON(http.StatusOK, gin.H{"status": "alive"})
	}
}
func (h *httpProvider) ReadinessHandler() gin.HandlerFunc {
	return func(c *gin.Context) {
		// Same logic as enhanced HealthHandler
		// Returns 503 if any dependency is unhealthy
	}
}
CLI Configuration Flags
--metrics-port=8081 # Port for /metrics endpoint (0 = disabled)
--enable-health-check-details=true # Include component status in /health (disable for minimal response)
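The proposed handler does not yet show how --enable-health-check-details would shape the response. One possible reading, sketched below with the standard library only, keeps the status code identical in both modes and omits the per-component keys when details are disabled. The function name healthResponse and its signature are assumptions, not part of the RFC.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errUnreachable is a sample failure used only by this sketch.
var errUnreachable = errors.New("unreachable")

// healthResponse builds the /health status code and body from per-component
// check results. With details disabled, only the overall status is exposed.
func healthResponse(details bool, checks map[string]error) (int, map[string]string) {
	body := map[string]string{"status": "healthy"}
	code := http.StatusOK
	for name, err := range checks {
		state := "ok"
		if err != nil {
			state = "error"
			body["status"] = "degraded"
			code = http.StatusServiceUnavailable
		}
		if details {
			body[name] = state
		}
	}
	return code, body
}

func main() {
	checks := map[string]error{"db": nil, "redis": errUnreachable}
	code, body := healthResponse(true, checks)
	fmt.Println(code, body)
	code, body = healthResponse(false, checks)
	fmt.Println(code, body)
}
```

Keeping the status code independent of the flag means load balancers and probes behave the same whether or not component names are disclosed.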
Migration Strategy
- Add prometheus/client_golang to go.mod
- Create internal/metrics/ package with metric definitions
- Add HealthCheck() method to storage and memory store provider interfaces (all 13+ DB implementations)
- Add metrics middleware to Gin router
- Instrument auth handlers with counter increments
- Add /healthz, /readyz routes
- Start metrics server on --metrics-port
Grafana Dashboard
Ship a reference Grafana dashboard JSON (deploy/grafana/authorizer-dashboard.json) with panels for:
- Login success/failure rate over time
- Signup rate
- Request latency percentiles (p50, p95, p99)
- Active sessions gauge
- Failed login alerts
- Account lockout events
- DB query latency
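Percentile panels are computed from histogram buckets, so their accuracy depends on the bucket boundaries. The sketch below is a simplified approximation of bucket-based quantile estimation (linear interpolation inside the bucket that contains the target rank). It is not Prometheus's own implementation, it ignores the +Inf bucket, and the sample counts are made up.

```go
package main

import "fmt"

// estimateQuantile approximates the q-quantile (0 < q <= 1) from cumulative
// histogram buckets. upper[i] is a bucket's upper bound in seconds and
// cumulative[i] is the number of observations <= upper[i].
// Simplifications: no +Inf bucket, and the first bucket is assumed to start at 0.
func estimateQuantile(q float64, upper, cumulative []float64) float64 {
	if len(upper) == 0 || len(upper) != len(cumulative) {
		return 0
	}
	total := cumulative[len(cumulative)-1]
	rank := q * total
	lower, prev := 0.0, 0.0
	for i, count := range cumulative {
		// Pick the first non-empty bucket whose cumulative count reaches the rank.
		if count >= rank && count > prev {
			return lower + (upper[i]-lower)*(rank-prev)/(count-prev)
		}
		lower, prev = upper[i], count
	}
	return upper[len(upper)-1]
}

func main() {
	// Made-up sample: 100 requests, 50 under 0.1s, 90 under 0.5s, all under 1s.
	upper := []float64{0.1, 0.5, 1.0}
	cumulative := []float64{50, 90, 100}
	fmt.Printf("p50=%.3fs p95=%.3fs\n",
		estimateQuantile(0.5, upper, cumulative),
		estimateQuantile(0.95, upper, cumulative))
}
```

With these sample counts the p95 estimate falls midway through the 0.5s to 1.0s bucket, which shows how coarse buckets limit precision. If the default buckets prove too coarse around the latencies of interest, custom Buckets on RequestDuration may be worth considering.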
Testing Plan
- Unit tests for metric increments on auth events
- Integration test: verify /metrics endpoint returns Prometheus format
- Integration test: /health returns component status
- Integration test: /readyz returns 503 when DB is down
- Verify no cardinality explosion (use c.FullPath() not c.Request.URL.Path)
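The "/readyz returns 503 when DB is down" case can be exercised without a real database by injecting a failing check. The following is a minimal sketch using net/http and httptest as a stand-in for the Gin handler; readyzHandler, readyzStatus, and errDBDown are hypothetical names, and the real test would target the project's handler and storage provider instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errDBDown simulates an unavailable database.
var errDBDown = errors.New("db down")

// readyzHandler is a net/http stand-in for the Gin readiness handler: it
// returns 503 when the injected dbCheck reports an error, 200 otherwise.
func readyzHandler(dbCheck func(context.Context) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := dbCheck(r.Context()); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// readyzStatus issues GET /readyz against an in-memory recorder and returns
// the status code; dbErr controls the simulated database state.
func readyzStatus(dbErr error) int {
	h := readyzHandler(func(context.Context) error { return dbErr })
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	fmt.Println("db up:", readyzStatus(nil))
	fmt.Println("db down:", readyzStatus(errDBDown))
}
```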
References
- Prometheus Go Client
- Kubernetes Probes
- Keycloak Metrics
- RED Method (Rate, Errors, Duration)