
RFC: Prometheus Metrics & Health Endpoints #506

@lakhansamani

Description


Phase: 1 — Security Hardening & Enterprise Foundation
Priority: P0 — Critical
Estimated Effort: Low


Problem Statement

Authorizer's health endpoint (/health) returns a plain "OK" string with no component status. There are no observability metrics — no Prometheus endpoint, no request latency tracking, no auth-specific counters. Keycloak, by comparison, has full Prometheus/Grafana support. Metrics and component-level health reporting are essential for production deployments and Kubernetes environments.

Current health handler (internal/http_handlers/health.go):

func (h *httpProvider) HealthHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        c.String(http.StatusOK, "OK")
    }
}

Current Architecture Context

  • HTTP framework: Gin on port 8080 (configurable via --http-port)
  • Config already has --metrics-port=8081 flag defined but unused
  • No Prometheus/OpenMetrics library in go.mod
  • Memory store has Redis and DB-backed implementations
  • Storage provider has no health-check methods
  • Routes defined in internal/server/http_routes.go

Proposed Solution

1. Prometheus Metrics

Library: prometheus/client_golang — the de-facto standard Go Prometheus client.

New package: internal/metrics/

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Auth counters
    LoginTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_login_total",
        Help: "Total login attempts by method and status",
    }, []string{"method", "status"})  // method=password|otp|magic_link|social, status=success|failure

    SignupTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_signup_total",
        Help: "Total signup attempts by method and status",
    }, []string{"method", "status"})

    TokenIssuedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_token_issued_total",
        Help: "Total tokens issued by type",
    }, []string{"type"})  // type=access_token|refresh_token|id_token

    ActiveSessions = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "authorizer_active_sessions",
        Help: "Current number of active sessions",
    })

    FailedLoginTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "authorizer_failed_login_total",
        Help: "Total failed login attempts (for alerting)",
    })

    AccountLockoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "authorizer_account_lockouts_total",
        Help: "Total account lockout events",
    })

    // Request metrics
    RequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_request_duration_seconds",
        Help:    "HTTP request latency by endpoint and method",
        Buckets: prometheus.DefBuckets,
    }, []string{"endpoint", "method", "status_code"})

    DBQueryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_db_query_duration_seconds",
        Help:    "Database query latency by operation",
        Buckets: prometheus.DefBuckets,
    }, []string{"operation"})  // operation=add_user|get_user_by_email|list_users|...

    // MFA metrics
    MFAVerificationTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_mfa_verification_total",
        Help: "MFA verification attempts by type and status",
    }, []string{"type", "status"})  // type=totp, status=success|failure

    // Webhook metrics
    WebhookDeliveryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_webhook_delivery_total",
        Help: "Webhook delivery attempts by event and status",
    }, []string{"event", "status"})  // status=success|failure

    WebhookDeliveryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_webhook_delivery_duration_seconds",
        Help:    "Webhook delivery latency",
        Buckets: prometheus.DefBuckets,
    }, []string{"event"})
)

Metrics middleware for Gin (internal/http_handlers/metrics_middleware.go):

func MetricsMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        c.Next()
        duration := time.Since(start).Seconds()
        
        metrics.RequestDuration.WithLabelValues(
            c.FullPath(),      // endpoint pattern, not actual path (avoids cardinality explosion)
            c.Request.Method,
            strconv.Itoa(c.Writer.Status()),
        ).Observe(duration)
    }
}

Instrumentation points — add counter increments to:

  • internal/graphql/login.go — LoginTotal, FailedLoginTotal
  • internal/graphql/signup.go — SignupTotal
  • internal/token/token.go — TokenIssuedTotal
  • internal/events/events.go — WebhookDeliveryTotal, WebhookDeliveryDuration
  • Auth handler functions — MFAVerificationTotal

2. Metrics Server

Separate port — metrics served on --metrics-port=8081 (already defined in config, just unused).

Why separate port: Security best practice — /metrics should not be exposed on the public-facing port. In Kubernetes, the metrics port is typically only accessible within the cluster via ServiceMonitor.

// In cmd/root.go, after main server setup:
if cfg.MetricsPort > 0 {
    metricsMux := http.NewServeMux()
    metricsMux.Handle("/metrics", promhttp.Handler())
    go func() {
        // Log instead of silently dropping the error (e.g. port already in use).
        if err := http.ListenAndServe(fmt.Sprintf(":%d", cfg.MetricsPort), metricsMux); err != nil {
            log.Printf("metrics server exited: %v", err) // or the project's logger
        }
    }()
}

3. Enhanced Health Endpoint

Replace the current /health with a JSON response:

func (h *httpProvider) HealthHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        ctx := c.Request.Context()
        health := map[string]interface{}{
            "status": "healthy",
            "uptime": time.Since(startTime).String(),
            "version": version,
        }
        
        // Check database
        if err := h.deps.StorageProvider.HealthCheck(ctx); err != nil {
            health["status"] = "degraded"
            health["db"] = "error"
        } else {
            health["db"] = "ok"
        }
        
        // Check Redis (if configured)
        if h.deps.MemoryStore != nil {
            if err := h.deps.MemoryStore.HealthCheck(ctx); err != nil {
                health["status"] = "degraded"
                health["redis"] = "error"
            } else {
                health["redis"] = "ok"
            }
        }
        
        statusCode := http.StatusOK
        if health["status"] == "degraded" {
            statusCode = http.StatusServiceUnavailable
        }
        c.JSON(statusCode, health)
    }
}

Response example:

{
    "status": "healthy",
    "uptime": "72h15m30s",
    "version": "2.0.0",
    "db": "ok",
    "redis": "ok"
}

New interface methods needed:

// On storage.Provider:
HealthCheck(ctx context.Context) error

// On memory_store.Provider:
HealthCheck(ctx context.Context) error

Suggested implementations per provider:
  • SQL providers: db.Raw("SELECT 1").Error
  • MongoDB: client.Ping(ctx, nil)
  • Redis: client.Ping(ctx).Err()
  • Other NoSQL: provider-specific ping

4. Kubernetes Probes

New endpoints on the main server port:

GET /healthz   → Liveness probe (is the process alive?)
GET /readyz    → Readiness probe (can it serve traffic?)

Liveness (/healthz): Always returns 200 if the process is running. No dependency checks — if the process can respond, it's alive. Kubernetes restarts the pod only if this fails.

Readiness (/readyz): Returns 200 only if all dependencies (DB, Redis) are healthy. Kubernetes removes the pod from the Service endpoints if this fails — no traffic routed to unhealthy pods.

func (h *httpProvider) LivenessHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{"status": "alive"})
    }
}

func (h *httpProvider) ReadinessHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        // Same dependency checks as the enhanced HealthHandler
        // (DB here; Redis check added the same way when configured).
        if err := h.deps.StorageProvider.HealthCheck(c.Request.Context()); err != nil {
            c.JSON(http.StatusServiceUnavailable, gin.H{"status": "not ready"})
            return
        }
        c.JSON(http.StatusOK, gin.H{"status": "ready"})
    }
}

CLI Configuration Flags

--metrics-port=8081                        # Port for /metrics endpoint (0 = disabled)
--enable-health-check-details=true         # Include component status in /health (disable for minimal response)

Migration Strategy

  1. Add prometheus/client_golang to go.mod
  2. Create internal/metrics/ package with metric definitions
  3. Add HealthCheck() method to storage and memory store provider interfaces (all 13+ DB implementations)
  4. Add metrics middleware to Gin router
  5. Instrument auth handlers with counter increments
  6. Add /healthz, /readyz routes
  7. Start metrics server on --metrics-port

Grafana Dashboard

Ship a reference Grafana dashboard JSON (deploy/grafana/authorizer-dashboard.json) with panels for:

  • Login success/failure rate over time
  • Signup rate
  • Request latency percentiles (p50, p95, p99)
  • Active sessions gauge
  • Failed login alerts
  • Account lockout events
  • DB query latency
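
As a starting point, the latency and login panels could be driven by PromQL along these lines (metric names as defined above; window and label choices are assumptions to tune per deployment):

# p95 request latency per endpoint
histogram_quantile(0.95,
  sum(rate(authorizer_request_duration_seconds_bucket[5m])) by (le, endpoint))

# login failure rate (per second, averaged over 5m)
sum(rate(authorizer_login_total{status="failure"}[5m]))

# current active sessions
authorizer_active_sessions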

Testing Plan

  • Unit tests for metric increments on auth events
  • Integration test: verify /metrics endpoint returns Prometheus format
  • Integration test: /health returns component status
  • Integration test: /readyz returns 503 when DB is down
  • Verify no cardinality explosion (use c.FullPath() not c.Request.URL.Path)
