devantler-tech · devantler · May 27, 2026
@@ -1,117 +1,175 @@
-# In-cluster alerting
+# Observability
 
 `kube-prometheus-stack` (Prometheus + Alertmanager + node-exporter +
-kube-state-metrics) running per cluster, no Grafana, no remote-write, no
-SaaS. Alerts ship via webhook to a free destination (Discord channel /
-email-to-webhook bridge) — the URL is per-cluster and SOPS-encrypted.
+kube-state-metrics + Grafana) plus **Loki** (logs), **Alloy** (log
+shipper) and **OpenCost** (cost), running per cluster. Self-hosted, no
+SaaS metrics tier. The stack is production-hardened along four axes:
+
+1. **Alerts go to Slack** via Alertmanager's native `slack_configs`.
+2. **A dead-man's-switch** (the always-firing `Watchdog` alert) is pushed
+   to an external heartbeat monitor — the one failure mode in-cluster
+   alerting can never cover (the whole cluster being down).
+3. **State is persistent** — Prometheus, Alertmanager and Loki use Hetzner
+   Cloud block volumes in prod, and Velero ships the `monitoring`
+   namespace to R2 daily (24 h RPO).
+4. **Everything is reachable** behind oauth2-proxy SSO: Grafana,
+   Prometheus, Alertmanager and OpenCost.
+
+## Components
+
+| Component          | Role                                                | Persistence (prod)        |
+| ------------------ | --------------------------------------------------- | ------------------------- |
+| Prometheus         | Metrics + alert evaluation, 14 d / 5 GiB retention  | `hcloud` PVC, 20 Gi       |
+| Alertmanager       | Routing, grouping, silences (2 replicas, gossiped)  | `hcloud` PVC, 2 Gi        |
+| node-exporter      | Node metrics (DaemonSet)                            | n/a                       |
+| kube-state-metrics | Kubernetes object metrics                           | n/a                       |
+| Grafana            | Dashboards + log exploration (Prometheus + Loki DS) | ephemeral (provisioned)   |
+| Loki               | Log store, single-binary, 7 d retention             | `hcloud` PVC, 10 Gi       |
+| Alloy              | Per-node log shipper → Loki (DaemonSet)             | n/a                       |
+| OpenCost           | Cost allocation against Prometheus                  | n/a                       |
+
+Local/CI runs the same stack with persistence **off** (emptyDir) — losing
+metrics/logs on a restart is fine there. The `hcloud` PVCs are added only
+in the hetzner overlay (`k8s/providers/hetzner/.../*/patches/`), the same
+way OpenBao gets block storage.
+
+## Alert routing → Slack
+
+Alertmanager sends to a Slack incoming webhook (`slack_configs` with
+`api_url_file`). The URL is per-cluster and SOPS-encrypted; local
+clusters are given an invalid URL, so they stay quiet by design.
+
+- `critical` → Slack immediately, repeat every 12 h.
+- `warning`  → Slack, repeat every 24 h.
+- `critical` inhibits matching `warning` (same alertname/cluster/namespace).
+
+## Dead-man's-switch (off-cluster heartbeat)
+
+In-cluster Alertmanager cannot tell you the cluster is down — it's down
+too. To cover that, the chart's always-firing `Watchdog` alert is routed
+to a dedicated `heartbeat` receiver that POSTs to an **external** monitor
+on a tight cadence (`repeat_interval: 50s`). If the cluster — or the
+Prometheus → Alertmanager pipeline — dies, the monitor stops receiving
+pings and notifies Slack out-of-band.
+
+Recommended monitor: [healthchecks.io](https://healthchecks.io) (free,
+open-source, native Slack integration). Create a check with a ~5 min
+period and ~10 min grace, connect it to Slack, and put its ping URL in
+`alertmanager_heartbeat_url` (below). A self-hosted alternative is a
+scheduled GitHub Actions workflow that probes the public Gateway and posts
+to Slack — fully under your control, no third-party monitor.
+
+The heartbeat URL is injected by Flux substitution
+(`${alertmanager_heartbeat_url}`); unset, it defaults to an invalid URL,
+so local/CI simply never heartbeat — harmless.
+
+## Off-cluster metric/log backup
+
+There is no remote-write or SaaS mirror. Instead, the persistent
+Prometheus, Alertmanager and Loki volumes live in the `monitoring`
+namespace, which Velero's `daily-full` schedule backs up to R2 every day
+(`includedNamespaces: ["*"]`, Kopia fs-backup). Restore is the standard
+Velero flow in [runbook.md](./runbook.md). Backups are filesystem-level
+and crash-consistent (Prometheus/Loki recover via their WAL on restore);
+fine for a 24 h RPO.
+
+## Grafana
+
+Self-hosted, exposed at `grafana.${domain}` behind oauth2-proxy SSO. Since
+the route is already gated to a single GitHub user, Grafana runs with
+anonymous **Admin** and the login form disabled — whoever clears the SSO
+gate is the operator. Datasources: Prometheus (auto-wired by the chart)
+and Loki. Default Kubernetes dashboards are provisioned; the pod stays
+ephemeral because dashboards are config, not state.
 
-## Why no Grafana
-
-This is **alerting only**. Operators look at logs and `kubectl` for
-debugging; we don't run a dashboard tier on the homelab to keep the
-resource budget small (Grafana adds ~512 MiB and another HelmRelease to
-keep current).
-
-## Why no remote-write
+## What gets alerted
 
-Same reason — no external dashboard tier. Critical alerts route directly
-out of Alertmanager.
+Two sources:
 
-## What gets alerted
+1. **Curated chart default rules** (`defaultRules.create: true`). We keep
+   the well-tested groups — `general` (incl. `Watchdog`), `alertmanager`,
+   `prometheus`, `prometheusOperator`, `kubernetesApps`
+   (CrashLooping/ReplicasMismatch/…), `kubernetesStorage`, `node`,
+   `kubeStateMetrics` — and disable the groups for control-plane
+   components we don't scrape (etcd, kube-apiserver, kube-scheduler,
+   kube-controller-manager, kube-proxy, windows). `KubeCPUOvercommit` and
+   `KubeMemoryOvercommit` are disabled: guaranteed noise on a cluster that runs hot on purpose.
 
-See `k8s/bases/infrastructure/alerts/platform-critical.yaml`.
+2. **Platform-specific rules** in
+   `k8s/bases/infrastructure/alerts/platform-critical.yaml` (not in the
+   chart): Velero/CNPG backups, Flux reconciliation, cert-manager expiry,
+   cluster-autoscaler and resource-pressure.
 
 | Alert                       | Severity | Why                                       |
 | --------------------------- | -------- | ----------------------------------------- |
-| `NodeNotReady`              | critical | Single node loss; PDBs cover but you should still know |
-| `NodeDiskFillingUp`         | warning  | >90% root fs                              |
+| `Watchdog`                  | none     | Always firing → external heartbeat        |
+| `NodeNotReady`              | critical | Single node loss                          |
 | `PersistentVolumeFillingUp` | critical | >90% PVC                                  |
 | `CertificateExpiringSoon`   | warning  | <14 d to expiry, cert-manager not renewing |
 | `FluxKustomizationNotReady` | critical | Reconciliation broken >15 min             |
-| `VeleroBackupFailed`        | critical | Any failure in last hour                  |
-| `VeleroNoRecentBackup`      | critical | RPO breach -- no successful backup in 30h |
-| `CNPGNoRecentBackup`        | critical | Same, for Postgres                        |
+| `VeleroNoRecentBackup`      | critical | RPO breach — no successful backup in 30h  |
 | `CNPGClusterDegraded`       | critical | Primary alone, no streaming replica       |
 
-`defaultRules.create: false` is set on the chart so we don't drown in the
-~200 generic chart-bundled alerts that aren't useful at homelab scale.
-
-## Caveat: in-cluster Alertmanager won't fire if the whole cluster is down
-
-This is the deliberate tradeoff for "no SaaS". Mitigations:
-
-1. **Daily Velero schedule runs independently.** On next recovery,
-   you'll see the missed backup in R2.
-2. **CI restore drill** validates that `PrometheusRule` manifests are
-   accepted and the monitoring stack reconciles on every PR — so a
-   regression in the alert spec is caught before merge
-   (see [restore-drill.md](./restore-drill.md)).
-3. If true off-cluster alerting becomes necessary later, the documented
-   follow-up is to add Grafana Cloud free tier (10k metrics, ample for
-   these alerts) and configure a remote-write target in
-   `prometheus.prometheusSpec.remoteWrite`. No code restructure required.
+(plus the chart's workload/storage/self-monitoring alerts.)
 
-## Per-environment webhook URL
+## Per-environment setup (manual SOPS steps)
 
-Stored in `variables-cluster-secret.enc.yaml` as `alertmanager_webhook_url`,
-substituted into the `alertmanager-webhook` Secret at apply time.
-
-| Env   | Where to set                                  | Suggestion             |
-| ----- | --------------------------------------------- | ---------------------- |
-| local | `k8s/clusters/local/variables/variables-cluster-secret.enc.yaml` (already filled with a non-resolvable invalid URL — alerts fail to send, on purpose) | n/a |
-| prod  | same path under `clusters/prod/`              | Discord #prod-alerts   |
-
-To set:
+The Slack webhook and heartbeat URL are secrets, so they live in the
+per-cluster `variables-cluster-secret.enc.yaml` and must be set by hand
+(the agent cannot edit `*.enc.yaml`). Both are read from the `Secret`
+`variables-cluster`, which is a Flux `substituteFrom` source.
 
 ```bash
-sops --set '["stringData"]["alertmanager_webhook_url"] "<url>"' \
-  k8s/clusters/<env>/variables/variables-cluster-secret.enc.yaml
-```
+# 1. Slack incoming webhook for alert notifications.
+sops --set '["stringData"]["alertmanager_webhook_url"] "https://hooks.slack.com/services/XXX/YYY/ZZZ"' \
+  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
 
-### Discord webhook recipe
-
-1. Server settings → Integrations → Webhooks → New Webhook → copy URL.
-2. Append `/slack` to the URL — Discord accepts Slack-formatted payloads
-   natively, and Alertmanager's `slack_configs` is a closer match. Or use
-   a tiny shim (e.g. `alertmanager-discord`) — tracked as a possible
-   follow-up but not required.
-3. Drop the URL into the SOPS secret per the command above.
-
-### Email-to-webhook bridge
+# 2. External heartbeat-monitor ping URL (e.g. healthchecks.io).
+sops --set '["stringData"]["alertmanager_heartbeat_url"] "https://hc-ping.com/<uuid>"' \
+  k8s/clusters/prod/variables/variables-cluster-secret.enc.yaml
+```
 
-Free options: Mailgun (5k/mo free), Resend (3k/mo free), or AWS SES via
-its HTTPS API. Configure the same way — paste the webhook URL into the
-encrypted secret.
+Slack side: create an incoming webhook for the `#platform-alerts` channel
+(the channel in the config is cosmetic — an incoming webhook posts to the
+channel it was created for). healthchecks.io side: create the check,
+connect its Slack integration, copy the ping URL.
 
-## Local clusters
+| Env   | `alertmanager_webhook_url`        | `alertmanager_heartbeat_url`       |
+| ----- | --------------------------------- | ---------------------------------- |
+| local | invalid URL (alerts stay local)   | unset → invalid (no heartbeat)     |
+| prod  | Slack `#platform-alerts` webhook  | healthchecks.io ping URL           |
 
-Identical install, with:
+## On-call: silence and inspect
 
-- Webhook URL pointed at `http://example.invalid/no-webhook-on-local`
-  (deliberately fails). CI asserts this fail mode is acceptable — the
-  alerts still fire inside Alertmanager, the webhook just can't reach
-  anywhere. The CI restore drill verifies the monitoring stack reconciles
-  and `PrometheusRule` manifests are accepted; the lack of an external
-  destination is by design.
+- **Silence an alert** while you work: Alertmanager UI at
+  `https://alertmanager.${domain}` → Silences → New.
+- **Check why an alert fired / query metrics**: Prometheus at
+  `https://prometheus.${domain}` (Graph / Alerts / Targets), or a Grafana
+  dashboard at `https://grafana.${domain}`.
+- **Read logs**: Grafana → Explore → Loki datasource, e.g.
+  `{namespace="velero"} |= "error"`.
+- **Cost**: OpenCost at `https://opencost.${domain}`.
 
-## Tuning resource footprint
+All four are behind GitHub SSO (oauth2-proxy, `devantler` only).
 
-Current chart values:
+## Resource footprint (prod)
 
-| Component      | Requests              | Limits        |
-| -------------- | --------------------- | ------------- |
-| Prometheus     | 100m CPU / 512 Mi     | — / 1 Gi      |
-| Alertmanager   | 50m CPU / 64 Mi       | — / 128 Mi    |
-| Operator       | 50m CPU / 128 Mi      | — / 256 Mi    |
-| node-exporter  | (chart defaults)      | (chart defaults) |
-| kube-state-metrics | (chart defaults)  | (chart defaults) |
+| Component    | Requests (CPU / memory) | Limits (CPU / memory) |
+| ------------ | ----------------------- | --------------------- |
+| Prometheus   | 50m / 256 Mi            | — / 1.5 Gi            |
+| Alertmanager | 50m / 64 Mi (×2)        | — / 128 Mi            |
+| Grafana      | 50m / 128 Mi            | — / 256 Mi            |
+| Loki         | 50m / 128 Mi            | — / 512 Mi            |
+| Alloy        | 25m / 96 Mi (×node)     | — / 256 Mi            |
+| Operator     | 50m / 128 Mi            | — / 256 Mi            |
 
-Total ~1 GiB committed memory. If
-this becomes too heavy, the first thing to drop is `nodeExporter` and
-the related node-level alerts.
+VPA right-sizes the requests at runtime (RequestsOnly), so the limits are
+the real ceilings. The 4th static worker (`ksail.prod.yaml workers: 4`)
+was added to host this always-on tier.
 
 ## Related
 
-- [DR runbook](./runbook.md) — what to do when an alert fires
-- [Velero + CNPG](./velero-cnpg.md) — the systems whose health is being checked
+- [DR runbook](./runbook.md) — what to do when an alert fires, and restore
+- [Velero + CNPG](./velero-cnpg.md) — the systems whose health is checked
+- [restore-drill.md](./restore-drill.md) — CI validation of the stack
 - [HA primitives](../../README.md) — cluster environments and topology
@@ -0,0 +1,102 @@
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: alloy
+  namespace: monitoring
+  labels:
+    helm.toolkit.fluxcd.io/remediation: enabled
+spec:
+  # Needs Loki's push endpoint and shares the grafana HelmRepository +
+  # monitoring namespace defined by the loki release.
+  dependsOn:
+    - name: loki
+      namespace: monitoring
+  interval: 5m
+  timeout: 10m
+  install:
+    remediation:
+      retries: -1
+  upgrade:
+    remediation:
+      retries: -1
+      remediateLastFailure: true
+  chart:
+    spec:
+      chart: alloy
+      version: 1.8.2
+      sourceRef:
+        kind: HelmRepository
+        name: grafana
+  # https://github.com/grafana/alloy/blob/main/operations/helm/charts/alloy/values.yaml
+  #
+  # Log shipper for the cluster: a DaemonSet that tails pod logs via the
+  # Kubernetes API and pushes them to Loki. Discovery is filtered to the
+  # node the pod runs on (spec.nodeName == $NODE_NAME) so each DaemonSet
+  # replica only ships its own node's logs -- otherwise every replica
+  # would tail every pod and store each log line once per node. Uses the
+  # API (not hostPath), so no privileged mounts are needed; the chart's default
+  # RBAC already grants pods/log.
+  values:
+    controller:
+      type: daemonset
+    alloy:
+      # Pure log forwarding -- no clustering needed.
+      clustering:
+        enabled: false
+      extraEnv:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: spec.nodeName
+      resources:
+        requests:
+          cpu: 25m
+          memory: 96Mi
+        limits:
+          memory: 256Mi
+      configMap:
+        content: |
+          discovery.kubernetes "pods" {
+            role = "pod"
+            selectors {
+              role  = "pod"
+              field = "spec.nodeName=" + sys.env("NODE_NAME")
+            }
+          }
+
+          discovery.relabel "pod_logs" {
+            targets = discovery.kubernetes.pods.targets
+            rule {
+              source_labels = ["__meta_kubernetes_namespace"]
+              target_label  = "namespace"
+            }
+            rule {
+              source_labels = ["__meta_kubernetes_pod_name"]
+              target_label  = "pod"
+            }
+            rule {
+              source_labels = ["__meta_kubernetes_pod_container_name"]
+              target_label  = "container"
+            }
+            rule {
+              source_labels = ["__meta_kubernetes_pod_node_name"]
+              target_label  = "node"
+            }
+            rule {
+              source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
+              target_label  = "app"
+            }
+          }
+
+          loki.source.kubernetes "pods" {
+            targets    = discovery.relabel.pod_logs.output
+            forward_to = [loki.write.default.receiver]
+          }
+
+          loki.write "default" {
+            endpoint {
+              url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
+            }
+          }
+    serviceMonitor:
+      enabled: true
@@ -0,0 +1,5 @@
+---
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+  - helm-release.yaml