[feature]: Add log aggregation to the kube-prometheus-stack observability setup #1507

@devantler

Summary

Extend the existing kube-prometheus-stack deployment to include centralized log aggregation, enabling Kubernetes audit logs and container logs to be queryable in a self-hosted Grafana instance.

Motivation

  • The monitoring stack is currently metrics-only (Prometheus + Alertmanager). Grafana is disabled (grafana.enabled: false).
  • Kubernetes audit logs (added in feat: add Talos security hardening patches #1506) are written to files on the EPHEMERAL partition — only accessible via talosctl on individual nodes.
  • No centralized log search or correlation between logs and metrics exists.
  • A self-hosted Grafana with Loki would enable audit log investigation, pod log search, and metric/log correlation — all from one UI.

Architecture

kube-prometheus-stack does not include Loki or Alloy as sub-charts. However, Grafana is a sub-chart that can be enabled. The approach is:

  1. Enable Grafana in kube-prometheus-stack via grafana.enabled: true + datasource configuration
  2. Deploy Loki as a separate HelmRelease (Grafana Helm repo) for log storage/query
  3. Deploy Alloy as a separate HelmRelease (Grafana Helm repo) as the log collector

Note: Promtail is deprecated (EOL March 2026). Grafana Alloy is the official successor — an OpenTelemetry-based collector for logs, metrics, and traces.

┌─────────────────────────────────────────────────┐
│              kube-prometheus-stack               │
│  ┌───────────┐  ┌─────────────┐  ┌───────────┐ │
│  │ Prometheus │  │ Alertmanager│  │  Grafana   │ │
│  │  (metrics) │  │  (webhooks) │  │  (UI/viz)  │ │
│  └─────┬─────┘  └─────────────┘  └──┬────┬───┘ │
│        │                             │    │     │
└────────│─────────────────────────────│────│─────┘
         │  ┌──────────────────────────┘    │
         │  │  datasource: prometheus       │ datasource: loki
         │  │                               │
   ┌─────┴──┴───┐                    ┌──────┴──────┐
   │  Prometheus │◄── scrape ──┐     │    Loki     │
   │   (metrics) │             │     │   (logs)    │
   └─────────────┘             │     └──────▲──────┘
                               │            │ push
                         ┌─────┴────────────┴─────┐
                         │      Grafana Alloy      │
                         │  (DaemonSet collector)  │
                         │  • container logs       │
                         │  • audit log files      │
                         └─────────────────────────┘

Proposed Solution

1. Enable Grafana in kube-prometheus-stack

Update helm-release.yaml values:

grafana:
  enabled: true
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.monitoring.svc.cluster.local
      access: proxy

2. Deploy Loki (separate HelmRelease)

Deploy Loki in the monitoring namespace as a companion to kube-prometheus-stack. Use the grafana/loki Helm chart in single-binary (monolithic) or simple-scalable mode; either is sufficient at homelab scale, and single-binary has the fewest moving parts.
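A minimal sketch of what the Loki release could look like, assuming the HelmRelease resources here are Flux resources and that a HelmRepository for the Grafana charts exists. The repository name `grafana`, its `flux-system` namespace, the filesystem storage choice, and the schema start date are all assumptions for illustration, not settings taken from this repository.

```yaml
# Sketch only: names, intervals, and storage settings are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
  namespace: monitoring
spec:
  interval: 1h
  chart:
    spec:
      chart: loki
      sourceRef:
        kind: HelmRepository
        name: grafana            # hypothetical HelmRepository name
        namespace: flux-system   # hypothetical namespace
  values:
    deploymentMode: SingleBinary
    loki:
      auth_enabled: false        # single-tenant; no X-Scope-OrgID header needed
      commonConfig:
        replication_factor: 1
      storage:
        type: filesystem         # assumption: local PVC instead of object storage
      schemaConfig:
        configs:
          - from: "2024-04-01"   # placeholder start date
            store: tsdb
            object_store: filesystem
            schema: v13
            index:
              prefix: loki_index_
              period: 24h
    singleBinary:
      replicas: 1
    # Scale the simple-scalable targets to zero in single-binary mode.
    read:
      replicas: 0
    write:
      replicas: 0
    backend:
      replicas: 0
```

With the chart's gateway left enabled and a release named `loki`, the gateway Service would be `loki-gateway`, which matches the datasource URL in step 1.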

3. Deploy Alloy (separate HelmRelease)

Deploy Grafana Alloy as a DaemonSet to collect:

  • Container logs from all pods (standard Kubernetes log collection)
  • Audit log files from control-plane nodes by mounting /var/log/kubernetes/audit/ (host path from Talos audit-logging patch)
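The two collection paths above might be sketched as values for the grafana/alloy chart, roughly as follows. This is an illustration under several assumptions: the audit file glob `*.log`, the `job` label value, and the use of the chart's `varlog` mount (rather than a dedicated hostPath volume) are not taken from this repository, and the `sys.env` function requires a reasonably recent Alloy release (older releases use `env`). Chart keys should be verified against the chart version in use.

```yaml
# Sketch only: values for the grafana/alloy chart; verify keys against the chart version.
controller:
  type: daemonset
alloy:
  mounts:
    varlog: true   # mounts the host's /var/log into the Alloy container
  configMap:
    content: |
      // Discover only pods scheduled on this node, so each DaemonSet
      // replica does not tail every pod in the cluster.
      // Assumption: HOSTNAME is set to the node name in the Alloy pod.
      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role  = "pod"
          field = "spec.nodeName=" + coalesce(sys.env("HOSTNAME"), constants.hostname)
        }
      }

      // Container logs, read through the Kubernetes API.
      loki.source.kubernetes "pods" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [loki.write.default.receiver]
      }

      // Audit log files; the glob and the job label are assumptions.
      local.file_match "audit" {
        path_targets = [{"__path__" = "/var/log/kubernetes/audit/*.log", "job" = "kubernetes-audit"}]
      }

      loki.source.file "audit" {
        targets    = local.file_match.audit.targets
        forward_to = [loki.write.default.receiver]
      }

      // Push everything to the Loki gateway used by the Grafana datasource.
      loki.write "default" {
        endpoint {
          url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"
        }
      }
```

On worker nodes the audit directory does not exist, so the file matcher simply finds nothing there. With the assumed label, audit entries would be selectable in Grafana with a LogQL selector such as `{job="kubernetes-audit"}`.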

4. Optional: Audit webhook backend

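Add audit-webhook-* API server flags to talos/cluster/audit-logging.yaml for real-time streaming toward Loki (in addition to the file backend). The webhook backend posts batches of audit events as JSON, which is not the format of Loki's push API, so this path would also need a receiver in between to forward the events. This is optional: Alloy tailing the audit log files achieves the same result with better resilience.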

Acceptance Criteria

  • Grafana enabled in kube-prometheus-stack with Prometheus + Loki datasources
  • Loki deployed as a separate HelmRelease in the monitoring namespace
  • Alloy deployed as a DaemonSet HelmRelease collecting container logs + audit logs
  • Kubernetes audit logs queryable via LogQL in Grafana
  • Container logs from all namespaces searchable in Grafana
  • Existing Prometheus metrics remain accessible in Grafana
  • CiliumNetworkPolicy allows Alloy → Loki and Grafana → Loki traffic
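For the last criterion, a policy along these lines could be a starting point. It is a sketch under assumptions: all three workloads run in the monitoring namespace, the pods carry the `app.kubernetes.io/name` labels that the respective charts usually set, and the listed ports (8080 for the Loki gateway, 3100 for Loki's HTTP API) match the deployed chart. None of these were checked against this cluster.

```yaml
# Sketch only: labels and ports are assumptions; verify against the running pods.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-loki-ingress   # hypothetical name
  namespace: monitoring
spec:
  # Applies to the Loki pods, including the gateway.
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  ingress:
    - fromEndpoints:
        - matchLabels:
            app.kubernetes.io/name: alloy     # Alloy → Loki (push)
        - matchLabels:
            app.kubernetes.io/name: grafana   # Grafana → Loki (query)
        - matchLabels:
            app.kubernetes.io/name: loki      # gateway → Loki backend
      toPorts:
        - ports:
            - port: "8080"   # assumed gateway container port
              protocol: TCP
            - port: "3100"   # assumed Loki HTTP port
              protocol: TCP
```

If egress from Alloy or Grafana is also restricted by policy, matching egress rules would be needed on those workloads as well.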

Additional Context

  • Current kube-prometheus-stack chart: v84.5.0 (prometheus-community)
  • Current audit logging setup: talos/cluster/audit-logging.yaml (file backend, 30-day rotation)
  • Helm charts needed: grafana/loki, grafana/alloy
  • The file-based audit backend should be kept as the reliable primary — Loki provides the search/query layer on top
  • Reference: Grafana Alloy docs, Loki docs
