feat(zookeeper): decouple metrics and fluentd in zk-exporter#61

Open
gnanirahulnutakki wants to merge 3 commits into master from feature/zk-decouple-metrics-fluentd

@gnanirahulnutakki
Member

Summary

Makes the zk-exporter metrics subsystem and the fluentd logging subsystem independently configurable, so either one can be enabled without the other. Previously the whole exporter sidecar was gated on a single flag (`metrics.enabled`), making fluentd-only deployments impossible.

Also changes the `pushMode` default from `true` to `false` to prevent `CrashLoopBackOff` in environments without a reachable Pushgateway.

Why

Observed in `qa-self-managed` / `duploservices-saasops1` namespace:

  • With the chart default `metrics.pushMode: true` but no Pushgateway at `http://prometheus-pushgateway:9091`, the zk-exporter sidecar runs for 300s, times out, and exits with code 1 → CrashLoopBackOff on all three ZK pods.
  • No way to enable fluentd logging alone without also enabling metrics (and hitting the above crash).

The underlying `zk-exporter` image entrypoint already handles `PUSH_MODE` and `FLUENTD_ENABLE` as independent env vars — the chart template was the bottleneck.

Changes

`templates/statefulset.yaml`

  • Sidecar container now spawns on `or metrics.enabled metrics.fluentd.enabled` instead of `metrics.enabled` alone
  • Metrics-specific env/ports/volumes gated on `metrics.enabled`
  • Fluentd-specific env/volumes gated on `metrics.fluentd.enabled`
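
For reviewers, the gating described above can be sketched roughly as follows. This is an illustrative excerpt, not the chart's actual template: the container name and the `metrics.image` values key are assumptions, and only a subset of the env vars is shown.

```yaml
# templates/statefulset.yaml (illustrative excerpt, not the real template)
      containers:
        # ... main zookeeper container omitted ...
        {{- if or .Values.metrics.enabled .Values.metrics.fluentd.enabled }}
        - name: zk-exporter                      # assumed container name
          image: "{{ .Values.metrics.image }}"   # assumed values key
          env:
            {{- if .Values.metrics.enabled }}
            # Metrics-only settings
            - name: PUSH_MODE
              value: {{ .Values.metrics.pushMode | quote }}
            - name: METRICS_PORT
              value: "9095"
            {{- end }}
            {{- if .Values.metrics.fluentd.enabled }}
            # Fluentd-only settings
            - name: FLUENTD_ENABLE
              value: "true"
            {{- end }}
          {{- if .Values.metrics.enabled }}
          ports:
            - name: metrics
              containerPort: 9095
          {{- end }}
        {{- end }}
```

Because the outer `or` guarantees at least one of the inner blocks renders, the `env:` list is never empty when the container is present.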

`values.yaml`

  • `metrics.pushMode` default: `true` → `false`
  • Added documentation explaining the independence
  • Noted that `metrics.securityContext` is currently unused by the template
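
A minimal sketch of how the relevant `values.yaml` section might look after this change. Key names come from this PR; the comment wording and the empty `securityContext` placeholder are assumptions.

```yaml
# values.yaml (illustrative excerpt)
metrics:
  # Enables the metrics exporter subsystem (port 9095, PUSH_MODE, ...).
  enabled: false
  # Default changed from true to false in this PR. Set to true only when
  # a Pushgateway is actually reachable from the pod.
  pushMode: false
  # Currently not consumed by templates/statefulset.yaml.
  securityContext: {}   # placeholder value, assumed for illustration
  fluentd:
    # Enables fluentd log forwarding, independently of metrics.enabled.
    enabled: false
```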

Rendering matrix verified

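| `metrics.enabled` | `metrics.fluentd.enabled` | Container spawned? | Metrics port 9095 | `PUSH_MODE` env | `FLUENTD_ENABLE` env |
|---|---|---|---|---|---|
| true | false | Yes | Yes | Yes | No |
| false | true | Yes | No | No | Yes |
| true | true | Yes | Yes | Yes | Yes |
| false | false | No | n/a | n/a | n/a |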

Test plan

  • Deploy with `metrics.enabled=true, pushMode=false` → verify pull-mode metrics scrape works on :9095, no Pushgateway errors
  • Deploy with `metrics.enabled=false, metrics.fluentd.enabled=true` → verify fluentd forwards logs, no zookeeper-exporter Go binary running
  • Deploy with both enabled → verify both subsystems run
  • Deploy with both disabled (default) → verify no sidecar container created
  • Verify no regression when upgrading an existing release with `metrics.enabled: true, pushMode: true` explicitly set
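
Two of the scenarios above could be driven by small override files like the ones sketched below. The file names are hypothetical, and each YAML document is meant to live in its own file passed to Helm with `-f`.

```yaml
# fluentd-only.values.yaml (hypothetical file name)
# Expectation per the rendering matrix: sidecar present, no port 9095,
# FLUENTD_ENABLE set, no PUSH_MODE.
metrics:
  enabled: false
  fluentd:
    enabled: true
---
# upgrade-regression.values.yaml (hypothetical file name)
# Mirrors an existing release that set push mode explicitly, so the new
# pushMode default should not change its rendered output.
metrics:
  enabled: true
  pushMode: true
```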

Previously the entire zk-exporter sidecar container was gated on
metrics.enabled, and FLUENTD_ENABLE env var was nested inside the same
conditional. This made it impossible to enable fluentd logging without
also enabling the metrics exporter.

Changes:
- Sidecar container spawns if EITHER metrics.enabled OR fluentd.enabled
- ZK_CONN, PUSH_MODE, PUSHGATEWAY_URI, METRICS_PORT, port 9095 rendered
  only when metrics.enabled=true
- FLUENTD_ENABLE, FLUENTD_CONF, ELASTICSEARCH_HOST, ELASTICSEARCH_TYPE,
  fluentd-config volume mount rendered only when fluentd.enabled=true
- Both subsystems now fully independent (matches entrypoint.sh behavior)

values.yaml:
- Default pushMode flipped from true to false. When true with an
  unreachable Pushgateway, the exporter exits with code 1 after 300s
  causing CrashLoopBackOff. Observed in production-like environments
  where the default pushgateway URL does not exist.
- Added documentation comments explaining the independence and the
  dead metrics.securityContext config (template doesn't use it).

Rendering matrix verified:
  metrics=T fluentd=F -> sidecar, port 9095, ZK_CONN, PUSH_MODE, no FLUENTD
  metrics=F fluentd=T -> sidecar, no port 9095, FLUENTD_ENABLE only
  metrics=T fluentd=T -> sidecar with both subsystems
  metrics=F fluentd=F -> no sidecar container (default)
gnanirahulnutakki self-assigned this Apr 13, 2026
Adds a new workflow that publishes the zookeeper chart to
ghcr.io/radiantlogic-devops/helm/zookeeper-dev as an OCI artifact.

This mirrors the pattern used by helm-v8's release-charts.yml and enables
quick feature-branch testing without waiting for a master merge.

Triggers:
- push to master (when charts/zookeeper/** or workflow file changes)
- workflow_dispatch (manual run from any branch)

Consumers pull with:
  helm pull oci://ghcr.io/radiantlogic-devops/helm/zookeeper-dev --version <ver>

To test a feature branch:
1. Bump charts/zookeeper/Chart.yaml version (e.g. 0.1.7-my-feature)
2. Push the branch
3. Run the workflow manually (workflow_dispatch) on that branch
4. Pull the chart by the new version

Does NOT replace the existing GitHub Pages release workflow (release.yml)
which continues to publish to https://radiantlogic-devops.github.io/helm/
on master push.
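
A rough sketch of what such a workflow could look like. The workflow file name, the action versions, and in particular the step that renames the chart to `zookeeper-dev` before packaging are assumptions made for illustration; the actual workflow in this PR may differ.

```yaml
# .github/workflows/release-charts-dev.yml (hypothetical file name)
name: Publish zookeeper chart (OCI, dev)

on:
  push:
    branches: [master]
    paths:
      - "charts/zookeeper/**"
      - ".github/workflows/release-charts-dev.yml"
  workflow_dispatch: {}

permissions:
  contents: read
  packages: write   # needed to push to ghcr.io

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | helm registry login ghcr.io --username "${{ github.actor }}" --password-stdin
      - name: Package and push
        run: |
          # Assumption: helm push appends the chart name to the OCI path,
          # so the chart is renamed to zookeeper-dev before packaging.
          sed -i 's/^name: .*/name: zookeeper-dev/' charts/zookeeper/Chart.yaml
          helm package charts/zookeeper --destination dist
          helm push dist/zookeeper-dev-*.tgz oci://ghcr.io/radiantlogic-devops/helm
```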
The headless service governing the ZK StatefulSet was missing
publishNotReadyAddresses: true. With K8s 1.22+ EndpointSlice handling
(strict in 1.33+ where the legacy Endpoints API is deprecated),
CoreDNS withholds DNS records for pods whose readiness probe has not
yet passed.

For a StatefulSet that needs sibling DNS to bootstrap a quorum (as ZK
does), this creates a chicken-and-egg deadlock:
  zk-0 starts -> looks up DNS for zk-1, zk-2 -> NXDOMAIN
       -> can't form quorum -> not Ready -> own DNS not published
       -> same for zk-1 and zk-2 -> cluster never forms.

Symptoms in CI logs:
  WARN [WorkerSender] Failed to resolve address:
  zookeeper-1.zookeeper.<ns>.svc.cluster.local
  java.net.UnknownHostException

Reproduced locally on a kind v0.30.0 / k8s 1.34 cluster; verified
patching the live service with publishNotReadyAddresses fixed it.

This bug went unnoticed for years because pre-K8s-1.22 CoreDNS was
lenient about the legacy Endpoints API's notReadyAddresses path.
The lint-test-fid CI started failing in April 2026 when
helm/kind-action@v1 (a floating tag) rolled forward to a release
using kindest/node:v1.35.0, which strictly enforces the readiness
condition on EndpointSlice entries.

Bumps zookeeper chart version 0.1.6 -> 0.1.7. No app changes.
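
For reference, a minimal sketch of a headless service with the fix applied. The service name follows the `zookeeper-1.zookeeper.<ns>` hostnames in the log above; the selector label, port names, and the conventional ZooKeeper port numbers are assumptions and may not match the chart.

```yaml
# Headless service for the ZK StatefulSet (illustrative, not the chart's template)
apiVersion: v1
kind: Service
metadata:
  name: zookeeper
spec:
  clusterIP: None                  # headless: per-pod DNS records
  publishNotReadyAddresses: true   # publish DNS before readiness passes
  selector:
    app: zookeeper                 # assumed label
  ports:
    - name: client
      port: 2181
    - name: follower
      port: 2888
    - name: election
      port: 3888
```

With this field set, each pod's DNS record is published before its readiness probe passes, which lets the peers resolve each other during quorum bootstrap.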