A GitOps-managed Kubernetes playground cluster using FluxCD, built to learn Kubernetes and get hands-on experience with running a production-grade cluster. This repo serves as a reference for a fully automated cluster setup with secrets management, observability, storage, and networking: no manual kubectl apply after bootstrapping.
The cluster itself is built with Talos Linux and provisioned in k8s-cluster-talos.
One component covers the full networking stack, so there is no need to combine multiple tools:
- CNI: pod networking
- kube-proxy replacement: eBPF-based, with lower overhead
- L2 Announcements: announces LoadBalancer IPs via ARP to the local network, replacing MetalLB
- Gateway API: ingress/routing without a separate ingress controller
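As a rough sketch, the Helm values that enable all of these roles in one chart might look like this (key names are from the Cilium chart; everything else about the cluster is assumed):

```yaml
# Cilium Helm values (sketch): one chart covering the whole stack
kubeProxyReplacement: true   # eBPF-based replacement for kube-proxy
l2announcements:
  enabled: true              # announce LoadBalancer IPs via ARP (MetalLB replacement)
gatewayAPI:
  enabled: true              # Cilium serves as the Gateway API implementation
```

Note that L2 announcements additionally need a CiliumLoadBalancerIPPool and a CiliumL2AnnouncementPolicy resource to actually allocate and announce addresses.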
Sealed Secrets encrypts secrets per-cluster and stores the ciphertext in Git. OpenBao (an open-source Vault fork) keeps secrets completely out of Git and provides a central UI to manage them. The tradeoff is more setup complexity upfront, but day-to-day handling is simpler: add or update a secret in one place, and External Secrets syncs it into the cluster automatically. External Secrets is the bridge that makes this work: it reads from OpenBao and creates the Kubernetes Secrets that workloads actually consume.
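A minimal sketch of that bridge: a ClusterSecretStore pointing at OpenBao (which speaks the Vault API, so External Secrets' vault provider is used) plus an ExternalSecret that materializes a Kubernetes Secret. All names, paths, and the auth method here are illustrative, not this repo's actual config:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: openbao                # hypothetical store name
spec:
  provider:
    vault:                     # OpenBao is API-compatible with Vault
      server: http://openbao.openbao.svc:8200
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: openbao
  target:
    name: app-credentials      # the Kubernetes Secret workloads consume
  data:
    - secretKey: password
      remoteRef:
        key: apps/my-app       # illustrative OpenBao path
        property: password
```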
This repo went through both alternatives first: it started with HashiCorp Vault, but switched after HashiCorp moved Vault to the source-available BSL license; OpenBao is the community fork that stayed open-source. Sealed Secrets was also tried along the way. It works fine, but having a central UI and managing secrets completely outside of Git won in the end.
Setting up OpenBao on Kubernetes has quite a few gotchas (raft config, auto-unseal, HTTP vs HTTPS mismatches...). If you're doing this yourself: read the docs first.
On-prem cluster running on VMs, so no cloud storage provider is available. Longhorn is the simplest way to expose local node storage as a proper CSI-backed StorageClass with replication across nodes. No external storage infrastructure needed.
The standard solution for Kubernetes monitoring. Prometheus + Grafana + Alertmanager in one Helm chart, with pre-built dashboards for the whole cluster out of the box.
Two annotations on a Gateway and you get a DNS record and a valid TLS certificate, fully automated and fully GitOps. No manual DNS or certificate management needed when deploying a new app.
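A sketch of what those two annotations can look like on a Gateway, assuming cert-manager's Gateway API support is enabled (hostname, issuer name, and gateway class are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-app
  annotations:
    # External DNS creates the DNS record for this hostname
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    # cert-manager issues the certificate referenced in the listener below
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: app.example.com
      tls:
        certificateRefs:
          - name: my-app-tls   # Secret populated by cert-manager
```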
Declarative database provisioning via GitOps: define a PostgreSQL cluster or Redis instance as a YAML manifest and the operator handles the rest. No manual database setup outside of Git.
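For example, a minimal CloudNative PG cluster is just a few lines (name and sizes are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-app-db        # illustrative name
spec:
  instances: 3           # one primary, two replicas
  storage:
    size: 10Gi
    storageClass: longhorn
```

The operator handles initialization, replication, failover, and the credentials Secret from this single manifest.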
- Lower memory footprint
- Great GUI integration via Headlamp (FluxCD plugin) and the Weaveworks GitOps VSCode extension
- ArgoCD got annoying
| Component | Purpose | Why |
|---|---|---|
| FluxCD | GitOps continuous delivery | Declarative, pull-based; the cluster reconciles itself from this repo |
| Cilium | CNI, L2 LoadBalancer, Gateway API | Replaces kube-proxy, handles L2 LB for bare-metal LoadBalancer IPs, and serves as the Gateway API implementation |
| cert-manager | TLS certificates via Let's Encrypt | DNS-01 challenge via Cloudflare, so it works regardless of which nameservers the rest of the cluster uses |
| Longhorn | Distributed block storage | Replicated block storage for stateful workloads |
| OpenBao | Secrets management | Open-source Vault fork that stores all secrets; auto-unseal via a static key so the cluster recovers after restarts without manual intervention |
| External Secrets | Sync secrets from OpenBao into Kubernetes | Secrets live in OpenBao, not in this repo; External Secrets syncs them into Kubernetes at runtime |
| External DNS | Automatic DNS records | Creates DNS entries in Pi-hole automatically when a Gateway or Service is created |
| kube-prometheus-stack | Prometheus + Grafana monitoring | Full cluster observability out of the box |
| Loki | Log aggregation | SingleBinary mode with filesystem storage on Longhorn, queried through Grafana |
| Grafana Alloy | Log collector | DaemonSet that collects logs from all pods and ships them to Loki |
| Flagger | Progressive delivery | Canary deployments with automated rollback based on Hubble HTTP metrics |
| OpenCost | Kubernetes cost monitoring | Track resource cost per workload |
| CloudNative PG | PostgreSQL operator | Manages PostgreSQL clusters declaratively |
| Redis Operator | Redis operator | Manages Redis instances declaratively |
| App | Stack | Why |
|---|---|---|
| Nextcloud | PostgreSQL (CloudNative PG) + Redis | Self-hosted file sync, but mainly to test CloudNative PG and Redis Operator in practice |
| podinfo | Flagger Canary | Smoke test with progressive delivery; verifies DNS, certs, routing, canary deployments, and the overall stack work end-to-end |
clusters/ # Flux entrypoint; reconciles infra and apps
infra/
controller/ # Helm releases for infrastructure components
config/ # Config that depends on controllers (issuers, gateways, secret stores...)
apps/ # Application Helm releases
docs/ # Notes and setup guides for specific components
Flux reconciles three Kustomizations in strict order via dependsOn:
infra-controller → infra-config → apps
infra-controller and infra-config are split because config resources (e.g. ClusterIssuer, ClusterSecretStore, Gateways) depend on the CRDs that controllers install. Without the split, Flux would try to apply config before the CRDs exist and fail.
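The dependency looks roughly like this in the Kustomization for the config layer (path and interval are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-config
  namespace: flux-system
spec:
  dependsOn:
    - name: infra-controller   # wait until controllers (and their CRDs) are ready
  interval: 10m
  path: ./infra/config
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```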
Flux deploys infrastructure in dependency order:
- cert-manager
- Cilium
- Longhorn
- OpenBao
- External Secrets
- kube-prometheus-stack
- Loki
- Grafana Alloy
- Flagger + Loadtester
- OpenCost
- CloudNative PG
- Redis Operator
External DNS is deployed independently, outside this order: it requires a Pi-hole secret that must be created manually before bootstrapping (see docs/external-dns.md).
No secrets are stored in this repo. The flow is:
- OpenBao holds all secrets (Cloudflare API token, app passwords, etc.)
- External Secrets reads from OpenBao and creates Kubernetes Secrets at runtime
- cert-manager uses the Cloudflare token secret for DNS-01 challenges
- External DNS uses a manually bootstrapped secret for Pi-hole access (chicken-and-egg: External Secrets can't run before OpenBao is up)
OpenBao uses a static key for auto-unseal, stored as a Kubernetes Secret, so the cluster fully recovers after restarts without manual unsealing. See docs/openbao.md for the full setup guide.
Things that broke and why, in case this repo saves someone else the debugging time.
Added CPU limits to everything early on; Longhorn and Prometheus kept hanging and getting throttled, especially under load. Infrastructure components have spiky CPU usage: requests are fine, limits are not. Removed all CPU limits, kept only memory requests.
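The resulting resources block for an infrastructure component looks something like this (values are illustrative):

```yaml
# Requests give the scheduler a signal without ever throttling the pod.
resources:
  requests:
    memory: 256Mi   # illustrative value
  # no cpu limits: CFS throttling hurts spiky infra components like
  # Longhorn and Prometheus far more than occasional CPU contention does
```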
Initially all HelmReleases deployed in parallel, and the controllers collided and failed. Added dependsOn between them so only one deploys at a time. Later simplified by relying on the Kustomization-level dependsOn chain (infra-controller → infra-config → apps) instead of chaining every single HelmRelease.
cert-manager couldn't validate DNS-01 challenges because it was using the cluster's internal DNS, which couldn't reach the authoritative nameserver for the domain. Fix: set dns01RecursiveNameservers: "1.1.1.1:53,9.9.9.9:53" in the cert-manager Helm values so the self-check queries public resolvers directly instead of the cluster DNS.
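In Helm values form (both keys exist in the cert-manager chart; the second forces these resolvers for all DNS-01 self-checks and is an addition beyond the fix described above):

```yaml
# cert-manager Helm values: bypass cluster DNS for DNS-01 propagation checks
dns01RecursiveNameservers: "1.1.1.1:53,9.9.9.9:53"
dns01RecursiveNameserversOnly: true
```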
Tried passing the static unseal key via extraSecretEnvironmentVars, but OpenBao wouldn't start at all. The key needs to be mounted as a volume instead. See docs/openbao.md for the full list of OpenBao gotchas.
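A sketch of the volume approach in the OpenBao Helm values, assuming the chart keeps the Vault chart's server.volumes/server.volumeMounts structure (Secret name and mount path are hypothetical):

```yaml
# OpenBao Helm values (sketch): mount the unseal key Secret as a file
server:
  volumes:
    - name: unseal-key
      secret:
        secretName: openbao-unseal-key   # hypothetical Secret name
  volumeMounts:
    - name: unseal-key
      mountPath: /openbao/unseal
      readOnly: true
```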
This cluster runs control-plane-only nodes (no dedicated workers), so Longhorn runs on control-planes by default. On clusters with dedicated workers, Longhorn should be restricted to worker nodes. nodeSelector with a worker label didn't work reliably; a node affinity rule that excludes control-plane nodes via DoesNotExist is more robust.
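The affinity rule in question, as a sketch:

```yaml
# Exclude control-plane nodes instead of selecting worker nodes:
# workers often carry no distinct label, but control-planes always do.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
```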
Prometheus components use privileged operations (node exporters etc.) which violate the default baseline PodSecurity policy. The namespace needs a pod-security.kubernetes.io/enforce: privileged label, otherwise pods get blocked silently.
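The label goes on the namespace manifest (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    # allow privileged pods (node exporters need host access)
    pod-security.kubernetes.io/enforce: privileged
```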
DNS lookups were taking unusually long. Fix: bpf.hostLegacyRouting: true in the Cilium Helm values, which makes Cilium compatible with Talos DNS routing.
Every HelmRelease needed the same boilerplate: interval, timeout, driftDetection, install, upgrade, test. Instead of copying it into every file, used Kustomization-level patches in clusters/infra.yml to inject these into all HelmReleases at once. Much cleaner.
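A sketch of how such a Kustomization-level patch could look; the exact selectors and defaults in clusters/infra.yml will differ, and the patch's metadata.name is just a placeholder required by the strategic merge format (the target selector does the actual matching):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controller
  namespace: flux-system
spec:
  # ...sourceRef, path, etc.
  patches:
    - target:
        kind: HelmRelease    # apply to every HelmRelease in this Kustomization
      patch: |
        apiVersion: helm.toolkit.fluxcd.io/v2
        kind: HelmRelease
        metadata:
          name: all          # placeholder; overridden by the target selector
        spec:
          interval: 1h
          timeout: 10m
          driftDetection:
            mode: warn
          install:
            remediation:
              retries: 3
          upgrade:
            remediation:
              retries: 3
```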
CRD handling needs different strategies depending on the operation: crds: CreateReplace on install (to actually apply them), crds: Skip on upgrade (to avoid overwriting CRDs that other controllers may depend on). Getting this wrong causes silent CRD drift.
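In HelmRelease terms (these are the Flux HelmRelease CRD policy fields):

```yaml
spec:
  install:
    crds: CreateReplace   # apply and update CRDs on first install
  upgrade:
    crds: Skip            # leave CRDs untouched on upgrades
```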
Enabling driftDetection immediately caused false positives: Flux flagged legitimate changes as drift, such as /spec/replicas modified by the HPA and /status on cert-manager Certificates written back by the controller. Fix: add ignore rules for both, and set mode: warn instead of enabled so drift is logged but doesn't block reconciliation. Both rules are now applied globally via a Kustomization patch in clusters/infra.yml instead of being repeated in every HelmRelease.
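The ignore rules use the HelmRelease driftDetection API, roughly like this:

```yaml
driftDetection:
  mode: warn                        # log drift, don't block reconciliation
  ignore:
    - paths: ["/spec/replicas"]     # the HPA owns replica counts
    - paths: ["/status"]            # the controller writes status back
      target:
        kind: Certificate
```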
Longhorn PVCs with ReadWriteOnce fail when pods get rescheduled to a different node. Use ReadWriteMany; Longhorn supports it and it's required for stable operation across multiple nodes.
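A sketch of such a claim (Longhorn serves ReadWriteMany volumes through its NFS-based share-manager component; name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data               # illustrative name
spec:
  accessModes:
    - ReadWriteMany        # survives pod rescheduling to another node
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi
```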
The transparent DNS proxy in Cilium interfered with DNS routing. Needs to be explicitly disabled: dnsproxy.enableTransparentMode: false.
Started with HashiCorp Vault, switched after the BSL license change. Tried Sealed Secrets in between; it encrypts secrets and stores them in Git and works fine, but managing everything through a central UI outside of Git is nicer day-to-day. Landed on OpenBao (open-source Vault fork) + External Secrets.