Troubleshooting

Common issues encountered during deployment and their solutions.

ClickHouse

IPv6 listen failure (GKE / clusters without IPv6)

Error: Listen [::]:9009 failed: Address family for hostname not supported

Fix: Add listen_host: "0.0.0.0" and interserver_listen_host: "0.0.0.0" to ClickHouse extraConfig. The default chart values already include this for GKE compatibility.

Prometheus port conflict

Error: Listen [0.0.0.0]:9363 failed: Address already in use

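Cause: The ClickHouse Operator configures the Prometheus endpoint on its own. A prometheus block in extraConfig conflicts with the operator's configuration, which produces the address-in-use error on port 9363.

Fix: Do not add a prometheus block to extraConfig; remove it if present.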

Service DNS name mismatch

Error: getaddrinfo ENOTFOUND countly-clickhouse.clickhouse.svc

Cause: The official ClickHouse Operator creates services with the pattern <cr-name>-clickhouse-headless, not <cr-name>. The correct service name is countly-clickhouse-clickhouse-headless.clickhouse.svc.

Fix: Already handled in the chart helpers. If you override secrets.clickhouse.host, use the full service name.
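
If you do override secrets.clickhouse.host, the naming pattern above can be sketched as a small shell helper. This is an illustration only: the function name is hypothetical, and the CR name and namespace are taken from the example error message, so substitute your own.

```shell
# Build the headless service hostname the operator creates for a ClickHouse CR.
# Arguments: <cr-name> <namespace>. The function name is hypothetical.
clickhouse_headless_host() {
  cr_name="$1"
  namespace="$2"
  printf '%s-clickhouse-headless.%s.svc\n' "$cr_name" "$namespace"
}

# Example with the names used in this section.
clickhouse_headless_host countly-clickhouse clickhouse
```

The printed value is the hostname to place in secrets.clickhouse.host.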

MongoDB

Exporter container bootstrap deadlock

Error: The exporter container fails with CreateContainerConfigError because its secret is not found.

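Cause: The MongoDB operator creates connection string secrets only after the replica set is ready, but pod readiness requires all containers to be running. An exporter that cannot start without the secret keeps the pod from becoming ready, so the replica set never becomes ready and the secret is never created.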

Fix: The chart uses optional: true on the exporter's secretKeyRef. The exporter container starts without the secret and stabilizes once the operator creates it.

User roles too restrictive

Symptom: Countly fails to create or access databases beyond countly and countly_drill.

Fix: The app user needs readWriteAnyDatabase on admin (default in the chart). If you've overridden the roles, ensure they match:

users:
  app:
    roles:
      - { name: readWriteAnyDatabase, db: admin }
      - { name: dbAdmin, db: countly }

Kafka

Topic creation fails — replication factor exceeds broker count

Error: Failed to create topic drill-events: Topic creation errors

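Cause: The default is COUNTLY_CONFIG__KAFKA_REPLICATIONFACTOR: "2", but the local profile deploys only one broker. Kafka cannot create a topic whose replication factor exceeds the number of available brokers.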

Fix: The local profile already sets config.kafka.COUNTLY_CONFIG__KAFKA_REPLICATIONFACTOR: "1". If you use a custom profile with fewer than 2 brokers, override this value.
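
For a custom profile, the rule can be checked before deploying. The sketch below assumes you already know the broker count of your profile; the function name is hypothetical and not part of the chart.

```shell
# Return 0 when the replication factor fits the broker count, 1 otherwise.
# Arguments: <replication-factor> <broker-count>. The function name is hypothetical.
check_replication_factor() {
  rf="$1"
  brokers="$2"
  if [ "$rf" -gt "$brokers" ]; then
    echo "replication factor $rf exceeds broker count $brokers" >&2
    return 1
  fi
  echo "replication factor $rf is valid for $brokers broker(s)"
}

# Example: the local profile (one broker, replication factor 1).
check_replication_factor 1 1
```

A non-zero exit status means COUNTLY_CONFIG__KAFKA_REPLICATIONFACTOR should be lowered for that profile.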

Kafka Connect connector 400 — consumer override policy

Error: The 'None' policy does not allow 'max.poll.records' to be overridden

Cause: connector.client.config.override.policy was set to None, which blocks per-connector consumer overrides such as max.poll.records. The ClickHouse sink connector relies on these overrides to tune consumer settings for high-throughput batching.

Fix: Already set to All in chart defaults. If you've overridden kafkaConnect.workerConfig, ensure:

kafkaConnect:
  workerConfig:
    connector.client.config.override.policy: All

Countly cannot reach Kafka Connect REST API

Symptom: Countly health manager reports Kafka Connect unreachable on port 8083.

Cause: Strimzi auto-creates a NetworkPolicy on Kafka Connect pods that only allows traffic from other Connect pods and the cluster-operator.

Fix: The chart creates an additional NetworkPolicy allowing the countly namespace to reach port 8083. Ensure networkPolicy.allowedNamespaces includes your Countly namespace.

Countly Application

Pods fail health probes — listening on localhost

Error: Startup probe failed: dial tcp <pod-ip>:3020: connect: connection refused

Cause: The component is missing its HOST: "0.0.0.0" setting and listens only on localhost. Kubernetes probes connect via the pod IP, not localhost.

Fix: Already included in chart defaults. All components have COUNTLY_CONFIG__<COMPONENT>_HOST: "0.0.0.0" in their config sections.

Cross-namespace secret reference fails

Symptom: Countly pods reference a secret that exists in a different namespace, so the reference cannot be resolved.

Cause: Kubernetes secrets are namespace-scoped. The MongoDB operator creates secrets in the mongodb namespace, but Countly runs in the countly namespace.

Fix: The chart computes the MongoDB connection string from service DNS and creates its own secret in the countly namespace. Do not set secrets.mongodb.existingSecret to a cross-namespace secret — instead provide secrets.mongodb.password or secrets.mongodb.connectionString.

F5 NGINX Ingress Controller

Ingress rejected — invalid annotations

Symptom: kubectl describe ingress countly -n countly shows AddedOrUpdatedWithError events.

Cause: F5 NIC validates annotations strictly and rejects invalid ones. Common mistakes:

- Using "on"/"off" instead of "True"/"False" for nginx.org/proxy-buffering
- Missing s suffix on timeouts (e.g., "60" instead of "60s")
- Using old nginx.ingress.kubernetes.io/* annotations (community ingress-nginx)

Fix: Check the events section of kubectl describe ingress — F5 NIC logs the reason for rejection. Update annotations to use nginx.org/* format.
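
The three mistakes listed above can also be caught before applying the Ingress. The following is a rough sketch, not an F5 NIC tool: the function name is hypothetical, and matching timeouts by the "*-timeout" key suffix is an assumption made for illustration.

```shell
# Print a warning for one annotation if it matches a mistake listed above.
# Arguments: <annotation-key> <annotation-value>. Prints nothing when no rule matches.
# The function name and the "*-timeout" key pattern are illustrative assumptions.
lint_nic_annotation() {
  key="$1"
  value="$2"
  case "$key" in
    nginx.ingress.kubernetes.io/*)
      echo "$key: community ingress-nginx annotation, use nginx.org/* instead" ;;
    nginx.org/proxy-buffering)
      case "$value" in
        on|off) echo "$key: use \"True\" or \"False\" instead of \"$value\"" ;;
      esac ;;
    nginx.org/*-timeout)
      case "$value" in
        ''|*[!0-9]*) ;;  # empty, or already contains a non-digit such as a unit
        *) echo "$key: \"$value\" has no unit, did you mean \"${value}s\"?" ;;
      esac ;;
  esac
}

# Example: a timeout written without its unit.
lint_nic_annotation nginx.org/proxy-read-timeout 60
```

The events from kubectl describe ingress remain the authoritative source; this only flags the patterns named in this section.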

Duplicate proxy_http_version directive

Error: nginx reload failed: "proxy_http_version" directive is duplicate

Cause: F5 NIC auto-injects proxy_http_version 1.1 when nginx.org/keepalive > 0. If your location-snippets also include proxy_http_version 1.1, the directive appears twice in the generated config and the reload fails.

Fix: Remove proxy_http_version 1.1 from nginx.org/location-snippets. The chart defaults already exclude it.
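
A quick way to spot the conflict in custom values is to test the snippet text against the keepalive setting. This is a minimal sketch under the assumption that you pass both values in by hand; the function name is hypothetical.

```shell
# Return 0 (and print a hint) when a snippet would duplicate the directive
# that F5 NIC injects for keepalive > 0. Arguments: <keepalive> <snippet-text>.
# The function name is hypothetical.
snippet_duplicates_http_version() {
  keepalive="$1"
  snippet="$2"
  if [ "$keepalive" -gt 0 ] && printf '%s\n' "$snippet" | grep -q 'proxy_http_version'; then
    echo "remove proxy_http_version from nginx.org/location-snippets"
    return 0
  fi
  return 1
}

# Example: keepalive is enabled and the snippet repeats the directive.
snippet_duplicates_http_version 32 'proxy_set_header Connection ""; proxy_http_version 1.1;'
```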

OTEL export failures

Error: OTel export failure: DNS resolution failed for alloy-otlp.observability.svc.cluster.local:4317

Cause: The OTEL exporter endpoint is configured in f5-nginx-values.yaml but the Alloy-OTLP collector is not deployed.

Fix: Either deploy the observability stack (helm install countly-observability ...) or remove the otel-exporter-endpoint from f5-nginx-values.yaml. This error is benign and does not affect traffic.

TLS secret missing

Error: TLS secret countly-tls is invalid: secret doesn't exist or of an unsupported type

Cause: The ingress references a TLS secret that doesn't exist. By default, TLS is disabled (ingress.tls.mode: http). If you enabled TLS but the secret hasn't been created yet, this error appears.

Fix: Set ingress.tls.mode: letsencrypt for automatic certificate provisioning via cert-manager, ingress.tls.mode: existingSecret with a pre-created TLS secret, or ingress.tls.mode: http to disable TLS. See DEPLOYMENT-MODES.md.

Observability

Grafana shows "No data" for datasources

Cause: Backend service not ready or datasource URL incorrect.

Fix: Verify the backend pods are running:

kubectl get pods -n observability

Check Grafana datasource configuration:

kubectl get configmap -n observability -l app.kubernetes.io/component=grafana -o yaml | grep -A5 "url:"

Alloy DaemonSet cannot read container logs

Error: permission denied reading /var/log/pods

Cause: The Alloy DaemonSet runs as root to read host log files. Some hardened clusters block this.

Fix: Ensure the Alloy pods are running as root (default in the chart). Check pod security policies/standards:

kubectl get pods -n observability -l app.kubernetes.io/component=alloy -o yaml | grep -A3 securityContext

Duplicate metrics in Prometheus

Cause: Both Alloy DaemonSet and Alloy-Metrics scraping the same targets.

Fix: By design, the chart enforces a clean split — Alloy DaemonSet handles logs only, Alloy-OTLP handles traces/profiles, and Alloy-Metrics handles ALL Prometheus scraping. If you see duplicates, check for custom scrape configs.

Dashboard ConfigMaps not loading

Error: Grafana shows empty dashboard folder.

Cause: Dashboard JSON files are loaded via .Files.Get in the chart. If the dashboards/ directory is missing from the chart package, ConfigMaps will be empty.

Fix: Ensure the chart was packaged with helm package (which includes the dashboards/ directory) or deployed from the local chart directory.

External mode — no data flowing

Cause: External endpoint URLs not configured or auth missing.

Fix: Verify external URLs are set:

helm get values countly-observability -n observability | grep -E "(remoteWriteUrl|pushUrl|otlpGrpcEndpoint|ingestUrl)"

Check Alloy logs for connection errors:

kubectl logs -n observability "$(kubectl get ds -n observability -l app.kubernetes.io/component=alloy -o name)" | grep -i error

Tempo metrics_generator errors when metrics disabled

Cause: This should not happen — the chart conditionally omits metrics_generator from Tempo config when metrics.enabled=false.

Fix: If you see this error, verify your values:

helm get values countly-observability -n observability | grep -A2 metrics

Ensure metrics.enabled is explicitly set to false. A missing Prometheus deployment alone is not enough.

General

cert-manager must be installed before ClickHouse Operator

Error: no matches for kind "Certificate" in version "cert-manager.io/v1"

Fix: Install cert-manager first. See PREREQUISITES.md.

Helm install order matters

Charts must be installed in this order:

1. countly-mongodb and countly-clickhouse (no dependencies between them)
2. countly-kafka (depends on ClickHouse for the sink connector)
3. countly (depends on all three)

Helmfile's needs: configuration enforces this automatically.
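
If you install the charts by hand instead of through Helmfile, the order listed above can be expressed as a small lookup. This is only a sketch of the dependency list in this section; the function name is hypothetical and it does not call Helm.

```shell
# Print the charts that must be installed before the given chart,
# following the order listed above. The function name is hypothetical.
chart_needs() {
  case "$1" in
    countly-mongodb|countly-clickhouse) ;;  # no dependencies
    countly-kafka) echo "countly-clickhouse" ;;
    countly) echo "countly-mongodb countly-clickhouse countly-kafka" ;;
    *) echo "unknown chart: $1" >&2; return 1 ;;
  esac
}

# Example: what has to be in place before installing countly.
chart_needs countly
```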
