feat(observability): make the stack production-ready by devantler · Pull Request #1604 · devantler-tech/platform

devantler · 2026-05-27T21:49:25Z

Why

The observability stack was deliberately minimal: alerting-only, no Grafana, no remote-write, a single ephemeral Prometheus, and in-cluster-only alerting (documented in docs/dr/alerting.md). This PR makes it production-ready while staying self-hosted (no SaaS metrics tier). The direction was agreed interactively; the work reverses several of those documented choices, so docs/dr/alerting.md is rewritten to match.

Done in phases; all phases are in this PR. Happy to split into separate PRs if preferred.

Phase 0 — Capacity

ksail.prod.yaml: static workers: 3 → 4. The new always-on tier (Grafana/Loki/persistent Prometheus+Alertmanager) belongs on guaranteed static capacity, not on nodes the autoscaler can reclaim. The 4th worker auto-joins Longhorn via the uniform worker node-label; longhorn_replica_count stays 3.
Restore alertmanager_replicas 1 → 2 (was a #1585 stabilization trim; prod now has headroom per #1601).

Phase 1 — Resilient alerting

Alerts → Slack via native slack_configs (api_url_file).
Dead-man's-switch: the always-firing Watchdog is routed to a heartbeat receiver that POSTs to an external monitor every ~50s. If the cluster/alerting pipeline dies, the monitor (e.g. healthchecks.io) notifies Slack out-of-band — the one failure in-cluster alerting can't cover. URL is Flux-substituted with an invalid default, so local/CI stay quiet.
Re-enable curated defaultRules (Watchdog + self-monitoring + kubernetesApps/storage/node), disabling groups for unscraped control-plane components and the noisy overcommit alerts. platform-critical.yaml (Velero/CNPG/Flux/cert/autoscaler) is unchanged.
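
For reviewers who want the shape of the Phase 1 changes without opening helm-release.yaml, here is a rough sketch of the kube-prometheus-stack values involved. It is not copied from the diff. The Secret name `alertmanager-slack`, its `webhook-url` key, the `heartbeat.invalid` default, and the exact set of disabled rule groups and alert names are assumptions for illustration. The value keys themselves (`defaultRules.rules`, `defaultRules.disabled`, `alertmanagerSpec.secrets`, `api_url_file`, `webhook_configs`) are standard chart and Alertmanager options.

```yaml
# Sketch only: kube-prometheus-stack values, names marked "assumed" are illustrative.
defaultRules:
  create: true
  rules:
    # Assumed list of control-plane components that are not scraped here.
    etcd: false
    kubeControllerManager: false
    kubeSchedulerAlerting: false
    kubeProxy: false
  disabled:
    # Assumed to be the two overcommit alerts the description refers to.
    KubeCPUOvercommit: true
    KubeMemoryOvercommit: true

alertmanager:
  alertmanagerSpec:
    # Each listed Secret is mounted at /etc/alertmanager/secrets/<name>/.
    secrets:
      - alertmanager-slack # assumed Secret name (created by ESO)
  config:
    route:
      receiver: slack
      routes:
        # Watchdog always fires, so this route acts as the heartbeat.
        - receiver: heartbeat
          matchers:
            - alertname = "Watchdog"
          group_wait: 0s
          group_interval: 50s
          repeat_interval: 50s
    receivers:
      - name: slack
        slack_configs:
          - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook-url
            channel: "#platform-alerts"
            send_resolved: true
      - name: heartbeat
        webhook_configs:
          # Flux post-build substitution; the default is deliberately unreachable
          # so local and CI clusters never ping a real monitor.
          - url: "${alertmanager_heartbeat_url:=http://heartbeat.invalid/}"
            send_resolved: false
```

Because the Watchdog child route matches first and does not set `continue`, the heartbeat never reaches the Slack receiver; only real alerts do.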

Phase 2 — Durability

hetzner overlay: persistent hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus memory limit → 1.5Gi (VPA manages requests only, so the limit is the real OOM guard).
Off-cluster backup: Velero's daily all-namespace backup already covers the monitoring namespace, so the new PVCs ship to R2 at 24h RPO with no Velero change.
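
As a rough illustration of what the hetzner overlay patch sets, the values below show persistent volumes and the memory limit under the standard kube-prometheus-stack keys. The storage class name `hcloud-volumes` is an assumption (it is the usual default of the Hetzner CSI driver); the sizes and the limit come from the description above.

```yaml
# Sketch only: values a hetzner overlay patch might merge into the HelmRelease.
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 1.5Gi # VPA adjusts requests only, so this is the OOM guard
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes # assumed class name
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes # assumed class name
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
```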

Phase 3 — Centralized logs

Loki (single-binary, 7d retention; hcloud PVC in prod, ephemeral in local) + Alloy DaemonSet shipper (node-local discovery → no log duplication; tails via the API, no privileged hostPath).
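
The single-binary mode and 7d retention can be sketched with Loki chart values roughly like the fragment below. It is an incomplete illustration, not the file in this PR: it omits `schemaConfig` and other keys the chart requires. The persistence block reflects the prod overlay (10Gi per the file summary), and `hcloud-volumes` is an assumed storage class name.

```yaml
# Sketch only: partial values for the grafana/loki chart in single-binary mode.
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  limits_config:
    retention_period: 168h # 7 days
  compactor:
    # Retention is only enforced when the compactor has it enabled.
    retention_enabled: true
    delete_request_store: filesystem
singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 10Gi
    storageClass: hcloud-volumes # assumed class name
# The scalable-mode components are not used in single-binary mode.
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```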

Phase 4 — Visibility & access

Grafana enabled (anonymous Admin behind the SSO gate, Prometheus + Loki datasources, provisioned dashboards, ephemeral).
Expose Grafana / Prometheus / Alertmanager / OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.
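
To make the cross-namespace wiring concrete, the sketch below shows one HTTPRoute in `monitoring` pointing at the oauth2-proxy Service, plus the ReferenceGrant that permits it. The Gateway name and namespace, the hostname, the `oauth2-proxy` namespace and Service name, and port 4180 are assumptions for illustration; only the `monitoring` and `opencost` namespaces and the resource kinds come from the description.

```yaml
# Sketch only: one route plus the grant that allows its cross-namespace backend.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: platform-gateway # assumed Gateway name
      namespace: gateway-system # assumed Gateway namespace
  hostnames:
    - grafana.example.com # assumed hostname
  rules:
    - backendRefs:
        # Traffic goes to oauth2-proxy first, never straight to Grafana.
        - name: oauth2-proxy
          namespace: oauth2-proxy
          port: 4180
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-observability-routes # assumed name
  namespace: oauth2-proxy # grants live in the namespace of the target Service
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: monitoring
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: opencost
  to:
    - group: ""
      kind: Service
```

Without the grant, a Gateway API implementation rejects the backendRef because it crosses a namespace boundary, which is why the PR extends the existing ReferenceGrant rather than only adding routes.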

Manual steps required before this fully works in prod

The agent cannot edit *.enc.yaml. After merge, set the per-cluster secrets (see docs/dr/alerting.md):

sops --set '["stringData"]["alertmanager_webhook_url"] "<slack-incoming-webhook>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["alertmanager_heartbeat_url"] "<healthchecks.io-ping-url>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml

The two URLs come from the Slack #platform-alerts incoming webhook and the healthchecks.io check (period ~5m, grace ~10m, Slack integration), so create those first. Until the secrets are set, alerts and the heartbeat fall back to invalid URLs and degrade gracefully (no breakage).

Validation

kubectl kustomize k8s/clusters/local/ and …/prod/ build.
ksail workload validate and ksail --config ksail.prod.yaml workload validate → 259 files validated.
Loki 6.55.0 / Alloy 1.8.2 values verified with helm template (service names/ports, RBAC for pods/log).
Full Talos+Docker system test runs in CI.

Risks

Memory: heaviest additions (Grafana/Loki/Alloy) land on the new 4th worker; CI's system test will surface scheduling pressure.
healthchecks.io is the one external dependency (in-cluster Grafana can't cover full-cluster-down). A self-hosted GitHub Actions probe is noted as an alternative in the docs.

🤖 Generated with Claude Code

Harden the per-cluster observability stack across alerting, durability, logs and access, replacing the previous alerting-only / no-Grafana / no-remote-write posture documented in docs/dr/alerting.md.

Capacity:
- ksail.prod.yaml: 4th static worker for the always-on observability tier.
- Restore alertmanager_replicas 1 -> 2 (was a stabilization trim).

Alerting:
- Route alerts to Slack via native slack_configs (api_url_file).
- Add a Watchdog dead-man's-switch: the always-firing alert is pushed to an external heartbeat monitor (Flux-substituted URL, invalid default so local/CI stay quiet) that notifies Slack if the cluster goes down.
- Re-enable curated chart defaultRules (Watchdog + self-monitoring + workload health), disabling groups for unscraped control-plane components and the two overcommit alerts; keep platform-critical.yaml.

Durability (hetzner overlay):
- Persistent hcloud PVCs for Prometheus (20Gi) and Alertmanager (2Gi); raise the Prometheus memory limit to 1.5Gi. Velero's daily all-namespace backup already ships the monitoring namespace to R2 (24h RPO).

Logs:
- Add Loki (single-binary, 7d retention; hcloud PVC in prod) and Alloy (DaemonSet log shipper, node-local discovery to avoid duplication).

Visibility & access:
- Enable Grafana (anonymous Admin behind SSO, Prometheus + Loki datasources, provisioned dashboards).
- Expose Grafana/Prometheus/Alertmanager/OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.

Docs:
- Rewrite docs/dr/alerting.md for the new architecture, incl. on-call and the manual SOPS steps for the Slack webhook + heartbeat URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Converts the previously alerting-only observability stack into a production-ready self-hosted tier: adds Grafana, Loki + Alloy, persistent storage for Prometheus/Alertmanager/Loki on Hetzner, Slack alerts via native slack_configs, an external dead-man's-switch heartbeat (Watchdog → healthchecks.io), oauth2-proxy SSO HTTPRoutes for Grafana/Prometheus/Alertmanager/OpenCost, capacity bump to 4 static workers, and a full rewrite of docs/dr/alerting.md. Reverses several deliberately-minimal choices documented earlier.

Changes:

Phase 0/1 — capacity (workers: 3 → 4, alertmanager_replicas 1 → 2) and resilient alerting (Slack slack_configs, Watchdog heartbeat route, curated defaultRules).
Phase 2/3 — durable storage for Prometheus/Alertmanager/Loki via hetzner overlay PVCs; new Loki single-binary + Alloy DaemonSet log pipeline.
Phase 4 — Grafana enabled (anonymous Admin behind SSO, Loki datasource); HTTPRoutes + auth-proxy router entries + ReferenceGrant + NetworkPolicy expansions for Grafana/Prometheus/Alertmanager/OpenCost.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
ksail.prod.yaml	Bumps static workers 3→4 to host always-on observability tier.
k8s/clusters/prod/variables/variables-cluster-config-map.yaml	Restores `alertmanager_replicas` to `2`.
k8s/bases/infrastructure/external-secrets/external-secrets.yaml	Comment clarifying split between webhook URL (ESO) and heartbeat URL (Flux substitution).
k8s/bases/infrastructure/controllers/kustomization.yaml	Registers new `alloy/` and `loki/` bases.
k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml	Enables Grafana, switches receivers to Slack + heartbeat, enables curated `defaultRules`.
k8s/bases/infrastructure/controllers/kube-prometheus-stack/httproute.yaml	New HTTPRoutes for Grafana/Prometheus/Alertmanager via oauth2-proxy.
k8s/bases/infrastructure/controllers/kube-prometheus-stack/kustomization.yaml	Includes new `httproute.yaml`.
k8s/bases/infrastructure/controllers/kube-prometheus-stack/networkpolicy.yaml	Allows oauth2-proxy ingress on 3000/9090/9093.
k8s/bases/infrastructure/controllers/loki/{helm-release,helm-repository,kustomization}.yaml	New Loki single-binary release (7d retention, ServiceMonitor on).
k8s/bases/infrastructure/controllers/alloy/{helm-release,kustomization}.yaml	New Alloy DaemonSet with node-local pod-log discovery, push to Loki.
k8s/bases/infrastructure/controllers/opencost/{httproute,kustomization,networkpolicy}.yaml	Exposes OpenCost UI via SSO and allows oauth2-proxy ingress.
k8s/bases/infrastructure/controllers/oauth2-proxy/reference-grant.yaml	Extends grant to `monitoring` and `opencost` HTTPRoutes.
k8s/bases/infrastructure/controllers/auth-proxy/config-map.yaml	Adds Traefik routers/services for grafana/prometheus/alertmanager/opencost.
k8s/bases/infrastructure/controllers/auth-proxy/networkpolicy.yaml	Adds egress from auth-proxy to monitoring/opencost upstreams.
k8s/providers/hetzner/infrastructure/controllers/kustomization.yaml	Wires in new kube-prometheus-stack and loki patches.
k8s/providers/hetzner/infrastructure/controllers/kube-prometheus-stack/patches/helm-release-patch.yaml	hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus mem limit 1.5Gi.
k8s/providers/hetzner/infrastructure/controllers/loki/patches/helm-release-patch.yaml	hcloud 10Gi PVC for Loki.
docs/dr/alerting.md	Rewrites docs to match the new production-ready posture.

Copilot AI review requested due to automatic review settings May 27, 2026 21:49

github-project-automation Bot added this to 🌊 Project Board May 27, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 27, 2026

Copilot started reviewing on behalf of devantler May 27, 2026 21:49

Copilot AI reviewed May 27, 2026

devantler wants to merge 1 commit into main from claude/musing-kowalevski-8cb4ed
