
RFC: Prometheus Metrics & Health Endpoints #506

@lakhansamani

Description


Phase: 1 — Security Hardening & Enterprise Foundation
Priority: P0 — Critical
Estimated Effort: Low


Problem Statement

Authorizer's health endpoint (/health) returns a plain "OK" string with no component status. There are no observability metrics — no Prometheus endpoint, no request latency tracking, no auth-specific counters. Keycloak, by comparison, has full Prometheus/Grafana support. Metrics and component-level health reporting are essential for production deployments and Kubernetes environments.

Current health handler (internal/http_handlers/health.go):

func (h *httpProvider) HealthHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        c.String(http.StatusOK, "OK")
    }
}

Current Architecture Context

  • HTTP framework: Gin on port 8080 (configurable via --http-port)
  • Config already has --metrics-port=8081 flag defined but unused
  • No Prometheus/OpenMetrics library in go.mod
  • Memory store has Redis and DB-backed implementations
  • Storage provider has no health-check methods
  • Routes defined in internal/server/http_routes.go

Proposed Solution

1. Prometheus Metrics

Library: prometheus/client_golang — the de-facto standard Go Prometheus client.

New package: internal/metrics/

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Auth counters
    LoginTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_login_total",
        Help: "Total login attempts by method and status",
    }, []string{"method", "status"})  // method=password|otp|magic_link|social, status=success|failure

    SignupTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_signup_total",
        Help: "Total signup attempts by method and status",
    }, []string{"method", "status"})

    TokenIssuedTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_token_issued_total",
        Help: "Total tokens issued by type",
    }, []string{"type"})  // type=access_token|refresh_token|id_token

    ActiveSessions = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "authorizer_active_sessions",
        Help: "Current number of active sessions",
    })

    FailedLoginTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "authorizer_failed_login_total",
        Help: "Total failed login attempts (for alerting)",
    })

    AccountLockoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "authorizer_account_lockouts_total",
        Help: "Total account lockout events",
    })

    // Request metrics
    RequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_request_duration_seconds",
        Help:    "HTTP request latency by endpoint and method",
        Buckets: prometheus.DefBuckets,
    }, []string{"endpoint", "method", "status_code"})

    DBQueryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_db_query_duration_seconds",
        Help:    "Database query latency by operation",
        Buckets: prometheus.DefBuckets,
    }, []string{"operation"})  // operation=add_user|get_user_by_email|list_users|...

    // MFA metrics
    MFAVerificationTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_mfa_verification_total",
        Help: "MFA verification attempts by type and status",
    }, []string{"type", "status"})  // type=totp, status=success|failure

    // Webhook metrics
    WebhookDeliveryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "authorizer_webhook_delivery_total",
        Help: "Webhook delivery attempts by event and status",
    }, []string{"event", "status"})  // status=success|failure

    WebhookDeliveryDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "authorizer_webhook_delivery_duration_seconds",
        Help:    "Webhook delivery latency",
        Buckets: prometheus.DefBuckets,
    }, []string{"event"})
)

Metrics middleware for Gin (internal/http_handlers/metrics_middleware.go):

func MetricsMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        c.Next()
        duration := time.Since(start).Seconds()
        
        metrics.RequestDuration.WithLabelValues(
            c.FullPath(),      // endpoint pattern, not actual path (avoids cardinality explosion)
            c.Request.Method,
            strconv.Itoa(c.Writer.Status()),
        ).Observe(duration)
    }
}

Instrumentation points — add counter increments to:

  • internal/graphql/login.go — LoginTotal, FailedLoginTotal
  • internal/graphql/signup.go — SignupTotal
  • internal/token/token.go — TokenIssuedTotal
  • internal/events/events.go — WebhookDeliveryTotal, WebhookDeliveryDuration
  • Auth handler functions — MFAVerificationTotal

2. Metrics Server

Separate port — metrics served on --metrics-port=8081 (already defined in config, just unused).

Why separate port: Security best practice — /metrics should not be exposed on the public-facing port. In Kubernetes, the metrics port is typically only accessible within the cluster via ServiceMonitor.

// In cmd/root.go, after main server setup:
if cfg.MetricsPort > 0 {
    metricsMux := http.NewServeMux()
    metricsMux.Handle("/metrics", promhttp.Handler())
    go func() {
        // Log instead of silently dropping the error (e.g. port already in use).
        if err := http.ListenAndServe(fmt.Sprintf(":%d", cfg.MetricsPort), metricsMux); err != nil {
            log.Printf("metrics server exited: %v", err) // or the project's logger
        }
    }()
}

3. Enhanced Health Endpoint

Replace the current /health with a JSON response:

func (h *httpProvider) HealthHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        ctx := c.Request.Context()
        health := map[string]interface{}{
            "status": "healthy",
            "uptime": time.Since(startTime).String(),
            "version": version,
        }
        
        // Check database
        if err := h.deps.StorageProvider.HealthCheck(ctx); err != nil {
            health["status"] = "degraded"
            health["db"] = "error"
        } else {
            health["db"] = "ok"
        }
        
        // Check Redis (if configured)
        if h.deps.MemoryStore != nil {
            if err := h.deps.MemoryStore.HealthCheck(ctx); err != nil {
                health["status"] = "degraded"
                health["redis"] = "error"
            } else {
                health["redis"] = "ok"
            }
        }
        
        statusCode := http.StatusOK
        if health["status"] == "degraded" {
            statusCode = http.StatusServiceUnavailable
        }
        c.JSON(statusCode, health)
    }
}

Response example:

{
    "status": "healthy",
    "uptime": "72h15m30s",
    "version": "2.0.0",
    "db": "ok",
    "redis": "ok"
}

New interface methods needed:

// On storage.Provider:
HealthCheck(ctx context.Context) error

// On memory_store.Provider:
HealthCheck(ctx context.Context) error

Suggested implementations per provider:
  • SQL providers: db.Raw("SELECT 1").Error
  • MongoDB: client.Ping(ctx, nil)
  • Redis: client.Ping(ctx).Err()
  • Other NoSQL: provider-specific ping

4. Kubernetes Probes

New endpoints on the main server port:

GET /healthz   → Liveness probe (is the process alive?)
GET /readyz    → Readiness probe (can it serve traffic?)

Liveness (/healthz): Always returns 200 if the process is running. No dependency checks — if the process can respond, it's alive. Kubernetes restarts the pod only if this fails.

Readiness (/readyz): Returns 200 only if all dependencies (DB, Redis) are healthy. Kubernetes removes the pod from the Service endpoints if this fails — no traffic routed to unhealthy pods.

func (h *httpProvider) LivenessHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{"status": "alive"})
    }
}

func (h *httpProvider) ReadinessHandler() gin.HandlerFunc {
    return func(c *gin.Context) {
        // Same dependency checks as the enhanced HealthHandler
        // (DB here; Redis check added the same way when configured).
        if err := h.deps.StorageProvider.HealthCheck(c.Request.Context()); err != nil {
            c.JSON(http.StatusServiceUnavailable, gin.H{"status": "not ready"})
            return
        }
        c.JSON(http.StatusOK, gin.H{"status": "ready"})
    }
}

CLI Configuration Flags

--metrics-port=8081                        # Port for /metrics endpoint (0 = disabled)
--enable-health-check-details=true         # Include component status in /health (disable for minimal response)

Migration Strategy

  1. Add prometheus/client_golang to go.mod
  2. Create internal/metrics/ package with metric definitions
  3. Add HealthCheck() method to storage and memory store provider interfaces (all 13+ DB implementations)
  4. Add metrics middleware to Gin router
  5. Instrument auth handlers with counter increments
  6. Add /healthz, /readyz routes
  7. Start metrics server on --metrics-port

Grafana Dashboard

Ship a reference Grafana dashboard JSON (deploy/grafana/authorizer-dashboard.json) with panels for:

  • Login success/failure rate over time
  • Signup rate
  • Request latency percentiles (p50, p95, p99)
  • Active sessions gauge
  • Failed login alerts
  • Account lockout events
  • DB query latency
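
As a starting point, the latency and login panels could be driven by PromQL along these lines (metric names as defined above; window and label choices are assumptions to tune per deployment):

# p95 request latency per endpoint
histogram_quantile(0.95,
  sum(rate(authorizer_request_duration_seconds_bucket[5m])) by (le, endpoint))

# login failure rate (per second, averaged over 5m)
sum(rate(authorizer_login_total{status="failure"}[5m]))

# current active sessions
authorizer_active_sessions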

Testing Plan

  • Unit tests for metric increments on auth events
  • Integration test: verify /metrics endpoint returns Prometheus format
  • Integration test: /health returns component status
  • Integration test: /readyz returns 503 when DB is down
  • Verify no cardinality explosion (use c.FullPath() not c.Request.URL.Path)
