feat(zookeeper): decouple metrics and fluentd in zk-exporter#61
Open
gnanirahulnutakki wants to merge 3 commits into
Previously the entire zk-exporter sidecar container was gated on
metrics.enabled, and FLUENTD_ENABLE env var was nested inside the same
conditional. This made it impossible to enable fluentd logging without
also enabling the metrics exporter.

Changes:
- Sidecar container spawns if EITHER metrics.enabled OR fluentd.enabled
- ZK_CONN, PUSH_MODE, PUSHGATEWAY_URI, METRICS_PORT, port 9095 rendered
  only when metrics.enabled=true
- FLUENTD_ENABLE, FLUENTD_CONF, ELASTICSEARCH_HOST, ELASTICSEARCH_TYPE,
  fluentd-config volume mount rendered only when fluentd.enabled=true
- Both subsystems now fully independent (matches entrypoint.sh behavior)

values.yaml:
- Default pushMode flipped from true to false. When true with an
  unreachable Pushgateway, the exporter exits with code 1 after 300s
  causing CrashLoopBackOff. Observed in production-like environments
  where the default pushgateway URL does not exist.
- Added documentation comments explaining the independence and the dead
  metrics.securityContext config (template doesn't use it).

Rendering matrix verified:
  metrics=T fluentd=F -> sidecar, port 9095, ZK_CONN, PUSH_MODE, no FLUENTD
  metrics=F fluentd=T -> sidecar, no port 9095, FLUENTD_ENABLE only
  metrics=T fluentd=T -> sidecar with both subsystems
  metrics=F fluentd=F -> no sidecar container (default)
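The gating described above could look roughly like the following Helm template fragment. This is a minimal sketch, not the actual templates/statefulset.yaml from this PR: the container name, the `.Values.metrics.image` and `.Values.metrics.pushMode` value paths, the port name, and the literal "true" for FLUENTD_ENABLE are assumptions made for illustration. Only `metrics.enabled`, `fluentd.enabled`, the env var names, and port 9095 come from the commit message.

```yaml
# Sketch only: container entry inside the StatefulSet pod spec.
# The outer condition renders the sidecar when EITHER subsystem is on.
{{- if or .Values.metrics.enabled .Values.fluentd.enabled }}
- name: zk-exporter                                  # assumed container name
  image: {{ .Values.metrics.image | quote }}         # hypothetical value path
  env:
    {{- if .Values.metrics.enabled }}
    # Metrics-only settings
    - name: PUSH_MODE
      value: {{ .Values.metrics.pushMode | quote }}  # hypothetical value path
    {{- end }}
    {{- if .Values.fluentd.enabled }}
    # Fluentd-only settings
    - name: FLUENTD_ENABLE
      value: "true"                                  # assumed literal
    {{- end }}
  {{- if .Values.metrics.enabled }}
  ports:
    - name: metrics                                  # hypothetical port name
      containerPort: 9095
  {{- end }}
{{- end }}
```

Because the outer `or` guarantees at least one inner block renders, the `env:` list is never empty in this sketch.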
Adds a new workflow that publishes the zookeeper chart to
ghcr.io/radiantlogic-devops/helm/zookeeper-dev as an OCI artifact.
This mirrors the pattern used by helm-v8's release-charts.yml and
enables quick feature-branch testing without waiting for a master merge.

Triggers:
- push to master (when charts/zookeeper/** or workflow file changes)
- workflow_dispatch (manual run from any branch)

Consumers pull with:
  helm pull oci://ghcr.io/radiantlogic-devops/helm/zookeeper-dev --version <ver>

To test a feature branch:
1. Bump charts/zookeeper/Chart.yaml version (e.g. 0.1.7-my-feature)
2. Push the branch
3. Run the workflow manually (workflow_dispatch) on that branch
4. Pull the chart by the new version

Does NOT replace the existing GitHub Pages release workflow (release.yml)
which continues to publish to https://radiantlogic-devops.github.io/helm/
on master push.
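The two triggers listed above correspond to a GitHub Actions `on:` block along these lines. This is a sketch of the trigger section only. The workflow name and the workflow file path are placeholders, because the commit message does not name the file, and the packaging and push steps are omitted rather than guessed.

```yaml
# Sketch only: trigger section of the dev-publish workflow.
name: publish-zookeeper-dev            # hypothetical workflow name
on:
  push:
    branches:
      - master
    paths:
      - "charts/zookeeper/**"
      - ".github/workflows/<workflow-file>.yml"   # placeholder, actual file name not given
  workflow_dispatch: {}                # manual run, selectable on any branch
```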
The headless service governing the ZK StatefulSet was missing
publishNotReadyAddresses: true. With K8s 1.22+ EndpointSlice handling
(strict in 1.33+ where the legacy Endpoints API is deprecated),
CoreDNS withholds DNS records for pods whose readiness probe has not
yet passed.
For a StatefulSet that needs sibling DNS to bootstrap a quorum (as ZK
does), this creates a chicken-and-egg deadlock:
  zk-0 starts -> looks up DNS for zk-1, zk-2 -> NXDOMAIN
    -> can't form quorum -> not Ready -> own DNS not published
    -> same for zk-1 and zk-2 -> cluster never forms.
Symptoms in CI logs:
  WARN [WorkerSender] Failed to resolve address:
    zookeeper-1.zookeeper.<ns>.svc.cluster.local
  java.net.UnknownHostException
Reproduced locally on a kind v0.30.0 / k8s 1.34 cluster; verified
patching the live service with publishNotReadyAddresses fixed it.
This bug went unnoticed for years because pre-K8s-1.22 CoreDNS was
lenient about the legacy Endpoints API's notReadyAddresses path.
The lint-test-fid CI started failing in April 2026 when
helm/kind-action@v1 (a floating tag) rolled forward to a release
using kindest/node:v1.35.0, which strictly enforces the readiness
condition on EndpointSlice entries.
Bumps zookeeper chart version 0.1.6 -> 0.1.7. No app changes.
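For reference, a headless Service carrying the fix described above might look like the sketch below. The Service name is inferred from the hostname in the log excerpt, the selector label is hypothetical, and the ports are ZooKeeper's conventional defaults rather than values confirmed by this chart. The real template in the chart may differ.

```yaml
# Sketch only: headless Service governing the ZK StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: zookeeper                   # inferred from zookeeper-1.zookeeper.<ns>.svc...
spec:
  clusterIP: None                   # headless: DNS returns per-pod records
  publishNotReadyAddresses: true    # publish pod DNS before readiness passes
  selector:
    app: zookeeper                  # hypothetical label
  ports:
    - name: client
      port: 2181                    # conventional ZK client port (assumed)
    - name: server
      port: 2888                    # conventional ZK peer port (assumed)
    - name: election
      port: 3888                    # conventional ZK election port (assumed)
```

With `publishNotReadyAddresses: true`, each pod's DNS record exists as soon as the pod has an IP, so peers can resolve one another while still forming a quorum.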
Summary
Makes the zk-exporter metrics subsystem and the fluentd logging subsystem independently enableable. Previously the whole exporter sidecar was gated on a single flag (`metrics.enabled`), making fluentd-only deployments impossible.
Also changes the `pushMode` default from `true` to `false` to prevent `CrashLoopBackOff` in environments without a reachable Pushgateway.
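As an illustration, the relevant defaults in `values.yaml` after this change might look like the fragment below. The nesting of `pushMode` under `metrics` is an assumption. The rendering matrix in the commit message confirms only that both subsystems are disabled by default and that `pushMode` now defaults to `false`.

```yaml
# Sketch only: key layout is assumed, not copied from the chart.
metrics:
  enabled: false    # default: no metrics exporter
  pushMode: false   # was true; avoids exiting when no Pushgateway is reachable
fluentd:
  enabled: false    # default: no fluentd logging; can now be enabled on its own
```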
Why
Observed in `qa-self-managed` / `duploservices-saasops1` namespace:
The underlying `zk-exporter` image entrypoint already handles `PUSH_MODE` and `FLUENTD_ENABLE` as independent env vars; the chart template was the bottleneck.
Changes
`templates/statefulset.yaml`
`values.yaml`
Rendering matrix verified
Test plan