feat(observability): make the stack production-ready #1604
Draft
devantler wants to merge 1 commit into
Harden the per-cluster observability stack across alerting, durability, logs and access, replacing the previous alerting-only / no-Grafana / no-remote-write posture documented in docs/dr/alerting.md.

Capacity:
- ksail.prod.yaml: 4th static worker for the always-on observability tier.
- Restore alertmanager_replicas 1 -> 2 (was a stabilization trim).

Alerting:
- Route alerts to Slack via native slack_configs (api_url_file).
- Add a Watchdog dead-man's-switch: the always-firing alert is pushed to an external heartbeat monitor (Flux-substituted URL, invalid default so local/CI stay quiet) that notifies Slack if the cluster goes down.
- Re-enable curated chart defaultRules (Watchdog + self-monitoring + workload health), disabling groups for unscraped control-plane components and the two overcommit alerts; keep platform-critical.yaml.

Durability (hetzner overlay):
- Persistent hcloud PVCs for Prometheus (20Gi) and Alertmanager (2Gi); raise the Prometheus memory limit to 1.5Gi. Velero's daily all-namespace backup already ships the monitoring namespace to R2 (24h RPO).

Logs:
- Add Loki (single-binary, 7d retention; hcloud PVC in prod) and Alloy (DaemonSet log shipper, node-local discovery to avoid duplication).

Visibility & access:
- Enable Grafana (anonymous Admin behind SSO, Prometheus + Loki datasources, provisioned dashboards).
- Expose Grafana/Prometheus/Alertmanager/OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.

Docs:
- Rewrite docs/dr/alerting.md for the new architecture, incl. on-call and the manual SOPS steps for the Slack webhook + heartbeat URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Converts the previously alerting-only observability stack into a production-ready self-hosted tier: adds Grafana, Loki + Alloy, persistent storage for Prometheus/Alertmanager/Loki on Hetzner, Slack alerts via native slack_configs, an external dead-man's-switch heartbeat (Watchdog → healthchecks.io), oauth2-proxy SSO HTTPRoutes for Grafana/Prometheus/Alertmanager/OpenCost, capacity bump to 4 static workers, and a full rewrite of docs/dr/alerting.md. Reverses several deliberately-minimal choices documented earlier.
Changes:
- Phase 0/1: capacity (workers: 3 → 4, alertmanager_replicas 1 → 2) and resilient alerting (Slack slack_configs, Watchdog heartbeat route, curated defaultRules).
- Phase 2/3: durable storage for Prometheus/Alertmanager/Loki via hetzner overlay PVCs; new Loki single-binary + Alloy DaemonSet log pipeline.
- Phase 4: Grafana enabled (anonymous Admin behind SSO, Loki datasource); HTTPRoutes + auth-proxy router entries + ReferenceGrant + NetworkPolicy expansions for Grafana/Prometheus/Alertmanager/OpenCost.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ksail.prod.yaml | Bumps static workers 3→4 to host always-on observability tier. |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Restores alertmanager_replicas to 2. |
| k8s/bases/infrastructure/external-secrets/external-secrets.yaml | Comment clarifying split between webhook URL (ESO) and heartbeat URL (Flux substitution). |
| k8s/bases/infrastructure/controllers/kustomization.yaml | Registers new alloy/ and loki/ bases. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml | Enables Grafana, switches receivers to Slack + heartbeat, enables curated defaultRules. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/httproute.yaml | New HTTPRoutes for Grafana/Prometheus/Alertmanager via oauth2-proxy. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/kustomization.yaml | Includes new httproute.yaml. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/networkpolicy.yaml | Allows oauth2-proxy ingress on 3000/9090/9093. |
| k8s/bases/infrastructure/controllers/loki/{helm-release,helm-repository,kustomization}.yaml | New Loki single-binary release (7d retention, ServiceMonitor on). |
| k8s/bases/infrastructure/controllers/alloy/{helm-release,kustomization}.yaml | New Alloy DaemonSet with node-local pod-log discovery, push to Loki. |
| k8s/bases/infrastructure/controllers/opencost/{httproute,kustomization,networkpolicy}.yaml | Exposes OpenCost UI via SSO and allows oauth2-proxy ingress. |
| k8s/bases/infrastructure/controllers/oauth2-proxy/reference-grant.yaml | Extends grant to monitoring and opencost HTTPRoutes. |
| k8s/bases/infrastructure/controllers/auth-proxy/config-map.yaml | Adds Traefik routers/services for grafana/prometheus/alertmanager/opencost. |
| k8s/bases/infrastructure/controllers/auth-proxy/networkpolicy.yaml | Adds egress from auth-proxy to monitoring/opencost upstreams. |
| k8s/providers/hetzner/infrastructure/controllers/kustomization.yaml | Wires in new kube-prometheus-stack and loki patches. |
| k8s/providers/hetzner/infrastructure/controllers/kube-prometheus-stack/patches/helm-release-patch.yaml | hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus mem limit 1.5Gi. |
| k8s/providers/hetzner/infrastructure/controllers/loki/patches/helm-release-patch.yaml | hcloud 10Gi PVC for Loki. |
| docs/dr/alerting.md | Rewrites docs to match the new production-ready posture. |
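The kube-prometheus-stack helm-release.yaml row above describes Slack receivers plus a heartbeat. As a minimal sketch of what such Alertmanager routing could look like (this is not the PR's actual values file): slack_configs with api_url_file, the Watchdog alert, the ~50s cadence and the #platform-alerts channel come from the PR text, while the receiver names, the Secret name alertmanager-slack, the file key webhook-url, the heartbeat_url variable and its .invalid default are assumptions chosen for illustration.

```yaml
# Sketch of kube-prometheus-stack Helm values. Anything marked "assumed" is
# illustrative and not taken from the PR.
alertmanager:
  alertmanagerSpec:
    secrets:
      - alertmanager-slack   # assumed Secret name; mounted at /etc/alertmanager/secrets/<name>/
  config:
    route:
      receiver: slack        # default receiver for everything except Watchdog
      routes:
        - receiver: heartbeat
          matchers:
            - 'alertname = "Watchdog"'
          group_wait: 0s
          group_interval: 50s
          repeat_interval: 50s   # re-notify roughly every 50s while Watchdog fires
    receivers:
      - name: slack
        slack_configs:
          - api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook-url  # assumed key
            channel: "#platform-alerts"
            send_resolved: true
      - name: heartbeat
        webhook_configs:
          # Flux post-build substitution with a default; the variable name and
          # the unresolvable default host are assumptions.
          - url: "${heartbeat_url:=http://heartbeat.invalid/}"
            send_resolved: false
```

Because Watchdog fires continuously, the webhook keeps being re-notified at about the repeat interval. If the cluster or Alertmanager stops, the pings stop, and the external monitor's period and grace settings decide when it raises the alarm.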
Why
The observability stack was deliberately minimal: alerting-only, no Grafana, no remote-write, single ephemeral Prometheus, in-cluster-only alerting (documented in docs/dr/alerting.md). This PR makes it production-ready while staying self-hosted (no SaaS metrics tier). Direction was agreed interactively; the work reverses several of those documented choices, so docs/dr/alerting.md is rewritten to match.
Done in phases; all phases are in this PR. Happy to split into separate PRs if preferred.
Phase 0 — Capacity
- ksail.prod.yaml: static workers: 3 → 4. The new always-on tier (Grafana/Loki/persistent Prometheus+Alertmanager) belongs on guaranteed static capacity, not on nodes the autoscaler can reclaim. The 4th worker auto-joins Longhorn via the uniform worker node-label; longhorn_replica_count stays 3.
- alertmanager_replicas 1 → 2 (was a #1585 stabilization trim; prod now has headroom per #1601).
Phase 1 — Resilient alerting
- Alerts route to Slack via native slack_configs (api_url_file).
- Watchdog is routed to a heartbeat receiver that POSTs to an external monitor every ~50s. If the cluster or alerting pipeline dies, the monitor (e.g. healthchecks.io) notifies Slack out-of-band, which is the one failure in-cluster alerting can't cover. The URL is Flux-substituted with an invalid default, so local/CI stay quiet.
- Curated chart defaultRules re-enabled (Watchdog + self-monitoring + kubernetesApps/storage/node), disabling groups for unscraped control-plane components and the noisy overcommit alerts. platform-critical.yaml (Velero/CNPG/Flux/cert/autoscaler) is unchanged.
Phase 2 — Durability
- hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus memory limit → 1.5Gi (VPA manages requests only, so the limit is the real OOM guard).
- Velero's daily *-namespace backup already covers monitoring, so the new PVCs ship to R2 at 24h RPO with no Velero change.
Phase 3 — Centralized logs
- Loki (single-binary, 7d retention; hcloud PVC in prod, ephemeral local) + Alloy DaemonSet shipper (node-local discovery → no log duplication; tails via the API, no privileged hostPath).
Phase 4 — Visibility & access
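As a rough illustration of the access path summarized in the review table (HTTPRoutes that send traffic to oauth2-proxy, plus a ReferenceGrant in the oauth2-proxy namespace extended to the monitoring and opencost HTTPRoutes), a sketch might look like the following. The Gateway name and namespace, the hostname, the route and grant names, and the oauth2-proxy Service name and port are all assumptions for illustration, not values taken from the PR.

```yaml
# Hypothetical sketch: an HTTPRoute in the monitoring namespace whose backend
# is the oauth2-proxy Service in another namespace.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana                  # assumed name
  namespace: monitoring
spec:
  parentRefs:
    - name: platform-gateway     # assumed Gateway name
      namespace: gateway-system  # assumed Gateway namespace
  hostnames:
    - grafana.example.com        # placeholder hostname
  rules:
    - backendRefs:
        - name: oauth2-proxy     # assumed Service name
          namespace: oauth2-proxy
          port: 4180             # assumed port
---
# A cross-namespace backendRef is only honored if the target namespace grants
# it, which is why the grant lists both source namespaces.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-httproutes         # assumed name
  namespace: oauth2-proxy
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: monitoring
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: opencost
  to:
    - group: ""
      kind: Service
```

In this shape the upstream services are never referenced directly by the routes; requests reach oauth2-proxy first, and the auth-proxy routers listed in the review table would then forward authenticated traffic to Grafana, Prometheus, Alertmanager or OpenCost.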
Manual steps required before this fully works in prod
The agent cannot edit *.enc.yaml. After merge, set the per-cluster secrets (see docs/dr/alerting.md).
Then create the Slack #platform-alerts incoming webhook and the healthchecks.io check (period ~5m, grace ~10m, Slack integration). Until set, alerts/heartbeat degrade gracefully to invalid URLs (no breakage).
Validation
- kubectl kustomize k8s/clusters/local/ and …/prod/ build.
- ksail workload validate and ksail --config ksail.prod.yaml workload validate → 259 files validated.
- helm template-verified (service names/ports, RBAC for pods/log).
Risks
🤖 Generated with Claude Code