feat(observability): make the stack production-ready #1604

Draft: devantler wants to merge 1 commit into main from claude/musing-kowalevski-8cb4ed

@devantler
Contributor

Why

The observability stack was deliberately minimal — alerting-only, no Grafana, no remote-write, single ephemeral Prometheus, in-cluster-only alerting (documented in docs/dr/alerting.md). This makes it production-ready while staying self-hosted (no SaaS metrics tier). Direction was agreed interactively; the work reverses several of those documented choices, so docs/dr/alerting.md is rewritten to match.

Done in phases; all phases are in this PR. Happy to split into separate PRs if preferred.

Phase 0 — Capacity

  • ksail.prod.yaml: static workers: 3 → 4. The new always-on tier (Grafana/Loki/persistent Prometheus+Alertmanager) belongs on guaranteed static capacity, not autoscaler nodes the autoscaler reclaims. The 4th worker auto-joins Longhorn via the uniform worker node-label; longhorn_replica_count stays 3.
  • Restore alertmanager_replicas 1 → 2 (was a #1585 stabilization trim; prod now has headroom per #1601).

Phase 1 — Resilient alerting

  • Alerts → Slack via native slack_configs (api_url_file).
  • Dead-man's-switch: the always-firing Watchdog is routed to a heartbeat receiver that POSTs to an external monitor every ~50s. If the cluster/alerting pipeline dies, the monitor (e.g. healthchecks.io) notifies Slack out-of-band — the one failure in-cluster alerting can't cover. URL is Flux-substituted with an invalid default, so local/CI stay quiet.
  • Re-enable curated defaultRules (Watchdog + self-monitoring + kubernetesApps/storage/node), disabling groups for unscraped control-plane components and the noisy overcommit alerts. platform-critical.yaml (Velero/CNPG/Flux/cert/autoscaler) is unchanged.
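
The alerting changes above could look roughly like the following kube-prometheus-stack values fragment. This is a minimal sketch, not the PR's actual configuration: the receiver names, the mounted secret name and path, the timing values, the `.invalid` default host, and the exact set of disabled rule groups and alerts are all assumptions made for illustration.

```yaml
# Hypothetical kube-prometheus-stack values fragment (not taken from the PR).
alertmanager:
  alertmanagerSpec:
    # Secrets listed here are mounted under /etc/alertmanager/secrets/<name>/.
    secrets:
      - alertmanager-slack-webhook          # assumed secret name
  config:
    route:
      receiver: slack
      routes:
        # Watchdog always fires; re-send it frequently so the external
        # monitor sees a steady heartbeat.
        - receiver: heartbeat
          matchers:
            - 'alertname = "Watchdog"'
          group_wait: 0s
          group_interval: 50s               # assumed, matches the ~50s cadence
          repeat_interval: 50s
    receivers:
      - name: slack
        slack_configs:
          - channel: "#platform-alerts"
            send_resolved: true
            # Read the webhook URL from a file instead of inlining the secret.
            api_url_file: /etc/alertmanager/secrets/alertmanager-slack-webhook/url
      - name: heartbeat
        webhook_configs:
          # Flux post-build substitution with a deliberately unroutable default,
          # so clusters without the variable stay quiet.
          - url: "${alertmanager_heartbeat_url:=http://heartbeat.invalid}"
            send_resolved: false
defaultRules:
  create: true
  rules:
    # Groups for control-plane components that are not scraped (assumed list).
    etcd: false
    kubeControllerManager: false
    kubeSchedulerAlerting: false
    kubeSchedulerRecording: false
    kubeProxy: false
  disabled:
    # Individual alerts switched off (assumed to be the "overcommit" pair).
    KubeCPUOvercommit: true
    KubeMemoryOvercommit: true
```

Routing Watchdog to its own receiver keeps the heartbeat out of Slack while letting every other alert fall through to the default `slack` receiver.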

Phase 2 — Durability

  • hetzner overlay: persistent hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus memory limit → 1.5Gi (VPA manages requests only, so the limit is the real OOM guard).
  • Off-cluster backup: Velero's daily all-namespace backup already covers monitoring, so the new PVCs ship to R2 at 24h RPO with no Velero change.
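
As an illustration of the durability settings, a Hetzner overlay patch might set values along these lines. It is a sketch under assumptions: the storage class name `hcloud-volumes` (the hcloud CSI driver's usual default) and the access mode are not confirmed by the PR text, only the sizes and the memory limit are.

```yaml
# Hypothetical values patch for the Hetzner overlay (not taken from the PR).
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes   # assumed storage class name
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    resources:
      limits:
        # VPA only manages requests, so this limit is the OOM guard.
        memory: 1536Mi                       # 1.5Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: hcloud-volumes   # assumed storage class name
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
```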

Phase 3 — Centralized logs

  • Loki (single-binary, 7d retention; hcloud PVC in prod, ephemeral local) + Alloy DaemonSet shipper (node-local discovery → no log duplication; tails via the API, no privileged hostPath).
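
The log pipeline described above could be sketched with two Helm values fragments, one for Loki and one for Alloy. Both are illustrative and incomplete (for example, the Loki schema config is omitted); the Loki service URL, the component labels, and the filesystem storage choice are assumptions rather than details from the PR.

```yaml
# Hypothetical Loki chart values: single-binary mode with 7-day retention.
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem                 # assumed storage backend
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 168h           # 7 days
singleBinary:
  replicas: 1
  persistence:
    enabled: true                    # the prod overlay would set size and class
# The scalable-mode components are not used in single-binary mode.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
---
# Hypothetical Alloy chart values: one shipper per node.
controller:
  type: daemonset
alloy:
  configMap:
    content: |
      // Discover only the pods scheduled on this node, so each pod's
      // logs are shipped exactly once across the DaemonSet.
      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role  = "pod"
          field = "spec.nodeName=" + coalesce(sys.env("HOSTNAME"), constants.hostname)
        }
      }

      // Tail logs through the Kubernetes API (needs RBAC on pods/log),
      // which avoids mounting host paths.
      loki.source.kubernetes "pods" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [loki.write.default.receiver]
      }

      loki.write "default" {
        endpoint {
          // Assumed in-cluster service name and namespace.
          url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
        }
      }
```

Restricting discovery to the local node is what prevents duplication: without the field selector, every Alloy pod would tail every pod's logs.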

Phase 4 — Visibility & access

  • Grafana enabled (anonymous Admin behind the SSO gate, Prometheus + Loki datasources, provisioned dashboards, ephemeral).
  • Expose Grafana / Prometheus / Alertmanager / OpenCost behind oauth2-proxy via HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the monitoring/opencost/auth-proxy CiliumNetworkPolicies.
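
A rough sketch of the access pattern for one of these services follows: Grafana trusts the SSO gate in front of it, an HTTPRoute sends traffic to oauth2-proxy, and a ReferenceGrant permits the cross-namespace backend reference. All names, namespaces, the hostname, and the oauth2-proxy port are placeholders chosen for illustration, not values from the PR.

```yaml
# Hypothetical kube-prometheus-stack values: anonymous Admin, relying on
# the SSO proxy in front of Grafana for authentication.
grafana:
  enabled: true
  grafana.ini:
    auth.anonymous:
      enabled: true
      org_role: Admin
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.monitoring.svc:3100   # assumed service URL
---
# Hypothetical HTTPRoute: requests for Grafana go to oauth2-proxy first.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: gateway                # assumed Gateway name
      namespace: kube-system       # assumed Gateway namespace
  hostnames:
    - grafana.example.com          # placeholder hostname
  rules:
    - backendRefs:
        - name: oauth2-proxy
          namespace: oauth2-proxy  # assumed namespace
          port: 4180               # assumed port
---
# Hypothetical ReferenceGrant: lets HTTPRoutes in "monitoring" reference
# Services in the oauth2-proxy namespace.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-monitoring-routes
  namespace: oauth2-proxy
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: monitoring
  to:
    - group: ""
      kind: Service
```

Anonymous Admin is only reasonable here because the network policies restrict ingress to the proxy path; without them, anything that can reach Grafana's port directly would bypass SSO.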

Manual steps required before this fully works in prod

The agent cannot edit *.enc.yaml. After merge, set the per-cluster secrets (see docs/dr/alerting.md):

sops --set '["stringData"]["alertmanager_webhook_url"] "<slack-incoming-webhook>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
sops --set '["stringData"]["alertmanager_heartbeat_url"] "<healthchecks.io-ping-url>"' \
  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml

Then create the Slack #platform-alerts incoming webhook and the healthchecks.io check (period ~5m, grace ~10m, Slack integration). Until set, alerts/heartbeat degrade gracefully to invalid URLs (no breakage).
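
For orientation, the decrypted shape those two `sops --set` commands target would look something like the fragment below. Only the `stringData` keys come from the commands above; the Secret name and namespace are assumptions.

```yaml
# Hypothetical decrypted view of variables-cluster-secret.enc.yaml.
apiVersion: v1
kind: Secret
metadata:
  name: variables-cluster-secret   # assumed from the file name
  namespace: flux-system           # assumed namespace
stringData:
  alertmanager_webhook_url: "<slack-incoming-webhook>"
  alertmanager_heartbeat_url: "<healthchecks.io-ping-url>"
```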

Validation

  • kubectl kustomize k8s/clusters/local/ and …/prod/ build.
  • ksail workload validate and ksail --config ksail.prod.yaml workload validate: 259 files validated.
  • Loki 6.55.0 / Alloy 1.8.2 values helm template-verified (service names/ports, RBAC for pods/log).
  • Full Talos+Docker system test runs in CI.

Risks

  • Memory: heaviest additions (Grafana/Loki/Alloy) land on the new 4th worker; CI's system test will surface scheduling pressure.
  • healthchecks.io is the one external dependency (in-cluster Grafana can't cover full-cluster-down). A self-hosted GitHub Actions probe is noted as an alternative in the docs.

🤖 Generated with Claude Code

Harden the per-cluster observability stack across alerting, durability,
logs and access, replacing the previous alerting-only / no-Grafana /
no-remote-write posture documented in docs/dr/alerting.md.

Capacity:
- ksail.prod.yaml: 4th static worker for the always-on observability tier.
- Restore alertmanager_replicas 1 -> 2 (was a stabilization trim).

Alerting:
- Route alerts to Slack via native slack_configs (api_url_file).
- Add a Watchdog dead-man's-switch: the always-firing alert is pushed to
  an external heartbeat monitor (Flux-substituted URL, invalid default so
  local/CI stay quiet) that notifies Slack if the cluster goes down.
- Re-enable curated chart defaultRules (Watchdog + self-monitoring +
  workload health), disabling groups for unscraped control-plane
  components and the two overcommit alerts; keep platform-critical.yaml.

Durability (hetzner overlay):
- Persistent hcloud PVCs for Prometheus (20Gi) and Alertmanager (2Gi);
  raise the Prometheus memory limit to 1.5Gi. Velero's daily all-namespace
  backup already ships the monitoring namespace to R2 (24h RPO).

Logs:
- Add Loki (single-binary, 7d retention; hcloud PVC in prod) and Alloy
  (DaemonSet log shipper, node-local discovery to avoid duplication).

Visibility & access:
- Enable Grafana (anonymous Admin behind SSO, Prometheus + Loki
  datasources, provisioned dashboards).
- Expose Grafana/Prometheus/Alertmanager/OpenCost behind oauth2-proxy via
  HTTPRoutes + auth-proxy routers; extend the ReferenceGrant and the
  monitoring/opencost/auth-proxy CiliumNetworkPolicies.

Docs:
- Rewrite docs/dr/alerting.md for the new architecture, incl. on-call and
  the manual SOPS steps for the Slack webhook + heartbeat URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 21:49
Contributor

Copilot AI left a comment

Pull request overview

Converts the previously alerting-only observability stack into a production-ready self-hosted tier: adds Grafana, Loki + Alloy, persistent storage for Prometheus/Alertmanager/Loki on Hetzner, Slack alerts via native slack_configs, an external dead-man's-switch heartbeat (Watchdog → healthchecks.io), oauth2-proxy SSO HTTPRoutes for Grafana/Prometheus/Alertmanager/OpenCost, capacity bump to 4 static workers, and a full rewrite of docs/dr/alerting.md. Reverses several deliberately-minimal choices documented earlier.

Changes:

  • Phase 0/1 — capacity (workers: 3 → 4, alertmanager_replicas 1 → 2) and resilient alerting (Slack slack_configs, Watchdog heartbeat route, curated defaultRules).
  • Phase 2/3 — durable storage for Prometheus/Alertmanager/Loki via hetzner overlay PVCs; new Loki single-binary + Alloy DaemonSet log pipeline.
  • Phase 4 — Grafana enabled (anonymous Admin behind SSO, Loki datasource); HTTPRoutes + auth-proxy router entries + ReferenceGrant + NetworkPolicy expansions for Grafana/Prometheus/Alertmanager/OpenCost.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| ksail.prod.yaml | Bumps static workers 3→4 to host always-on observability tier. |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Restores alertmanager_replicas to 2. |
| k8s/bases/infrastructure/external-secrets/external-secrets.yaml | Comment clarifying split between webhook URL (ESO) and heartbeat URL (Flux substitution). |
| k8s/bases/infrastructure/controllers/kustomization.yaml | Registers new alloy/ and loki/ bases. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/helm-release.yaml | Enables Grafana, switches receivers to Slack + heartbeat, enables curated defaultRules. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/httproute.yaml | New HTTPRoutes for Grafana/Prometheus/Alertmanager via oauth2-proxy. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/kustomization.yaml | Includes new httproute.yaml. |
| k8s/bases/infrastructure/controllers/kube-prometheus-stack/networkpolicy.yaml | Allows oauth2-proxy ingress on 3000/9090/9093. |
| k8s/bases/infrastructure/controllers/loki/{helm-release,helm-repository,kustomization}.yaml | New Loki single-binary release (7d retention, ServiceMonitor on). |
| k8s/bases/infrastructure/controllers/alloy/{helm-release,kustomization}.yaml | New Alloy DaemonSet with node-local pod-log discovery, push to Loki. |
| k8s/bases/infrastructure/controllers/opencost/{httproute,kustomization,networkpolicy}.yaml | Exposes OpenCost UI via SSO and allows oauth2-proxy ingress. |
| k8s/bases/infrastructure/controllers/oauth2-proxy/reference-grant.yaml | Extends grant to monitoring and opencost HTTPRoutes. |
| k8s/bases/infrastructure/controllers/auth-proxy/config-map.yaml | Adds Traefik routers/services for grafana/prometheus/alertmanager/opencost. |
| k8s/bases/infrastructure/controllers/auth-proxy/networkpolicy.yaml | Adds egress from auth-proxy to monitoring/opencost upstreams. |
| k8s/providers/hetzner/infrastructure/controllers/kustomization.yaml | Wires in new kube-prometheus-stack and loki patches. |
| k8s/providers/hetzner/infrastructure/controllers/kube-prometheus-stack/patches/helm-release-patch.yaml | hcloud PVCs for Prometheus (20Gi) + Alertmanager (2Gi); Prometheus mem limit 1.5Gi. |
| k8s/providers/hetzner/infrastructure/controllers/loki/patches/helm-release-patch.yaml | hcloud 10Gi PVC for Loki. |
| docs/dr/alerting.md | Rewrites docs to match the new production-ready posture. |
