
Operator does not heal a ClickHouse host stuck at Ready=False while CHI is Completed #1994

@dashashutosh80


Summary

When a ClickHouse server pod managed by a Completed ClickHouseInstallation (CHI) flips from Ready=True to Ready=False due to a sustained application-layer issue (e.g. intermittent network loss, slow probe response, transient backend stall) — but without crashing badly enough to trip the kubelet liveness threshold — the operator does not re-evaluate the host or attempt any remedial action. The pod remains Ready=False indefinitely. There is no operator-side auto-recovery path that fires on Ready=True → Ready=False for a Completed CHI.

The only way out is operator-external intervention: deleting the pod, bumping a field in .spec to force a generation change, or restarting the operator Deployment itself (which only works because some unrelated branch of shouldForceRestartHost happens to fire on cold-start — e.g. an IsRollingUpdate classification, or the "different operator IP" config-rebuild path — not because the operator detected an unhealthy host).

This is true on release-0.24.2 (where the incident was observed in production) and the bug surface is essentially identical on current HEAD; only the cosmetics of the worker code have moved around.

Operator version

clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 BuiltAt:2024-12-06T14:45:56

Re-verified on current HEAD (b5b826eb6) and on release-0.24.2. The bug-relevant code paths are present in both. The kind reproduction below was run on 0.26.3 and exhibited identical behaviour.

Environment

  • Kubernetes 1.25.x (production) and 1.25.3 (kind reproduction).
  • Single-shard, single-replica CHI in Status=Completed.
  • ClickHouse 24.8.0 server image (production used a 25.x build; behaviour does not depend on the server image).
  • Standard chart-default probe configuration on the ClickHouse pod:
    Liveness:   http-get http://:http/ping  delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:  http-get http://:http/ping  delay=10s timeout=1s period=3s #success=1 #failure=3
    
    Readiness trips after ≈9 s of failed probes; liveness needs ≈30 s of uninterrupted failures. Any flap that lets even one probe succeed inside any 30 s window keeps the container alive while readiness stays False for as long as the underlying fault persists.

Production incident

Originally observed during an HA test where worker nodes were brought up and down in random sequences. After one such cycle, the ClickHouse pod was rescheduled successfully but its readiness probe began returning non-200 intermittently at the application layer. The pod stayed Ready=False continuously for ~26 hours. During that entire window:

  • The CHI stayed at Status=Completed, with no generation change and no operator-side activity.
  • The container's RESTARTS counter did not increment — liveness never reached failure=10 consecutively, because the fault was flaky enough to let occasional probes succeed.
  • StatefulSet, EndpointSlice, kube-proxy, scheduler and kubelet all behaved correctly; the pod was simply not ready and nothing reacted to that fact at the operator layer.
  • Recovery only happened after the operator Deployment was rolled (kubectl rollout restart deploy/<operator>). The restarted operator's "different IP since previous reconcile" branch forced a full reconcile, and the RollingUpdate check inside that reconcile restarted the host. No health-aware logic was involved.

Key log lines from the recovery (production, sanitised):

05:00:45  main.go:67  Starting clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 ...
05:00:56  worker.go:364  Operator IPs are different. Operator was restarted on another
                         IP since previous reconcile of the CHI: clickhouse
05:00:59  worker.go:166  shouldForceRestartHost(): RollingUpdate requires force restart. Host: 0-0
05:00:59  worker-chi-reconciler.go:329  reconcileHostStatefulSet(): Reconcile host: 0-0.
                                        Shutting host down due to force restart
05:01:12  worker-boilerplate.go:141  processReconcilePod(): Delete Pod. .../chi-clickhouse-c1-0-0-0
05:01:19  worker-boilerplate.go:132  processReconcilePod(): Add Pod.    .../chi-clickhouse-c1-0-0-0

The remediating decision is the third line: RollingUpdate requires force restart. That's the first branch of shouldForceRestartHost and it triggered because the freshly-rolled operator's normalisation classified the CHI as IsRollingUpdate() (it had reconciling.policy: rolling in the spec). Recovery was not health-aware — the operator did not detect that the pod was unhealthy. It happened to take the IsRollingUpdate path on first reconcile after restart, which is unrelated to the pod's Ready=False state. If the CHI had not had that policy set, the operator would have logged Host force restart is not required and gone right back to sleep, leaving the pod stuck.

Equivalently: if the operator had not been restarted manually, the pod would have stayed Ready=False indefinitely. This is the user-visible bug.

The full pre-recovery 26-hour window was unfortunately rotated out of our log retention before the incident was investigated — the operator pod and ClickHouse pod were both still on their original incarnation, so kubelet, kube-apiserver, kube-controller-manager and kube-scheduler logs from that period are also unavailable. What we do have:

  • Recovery operator logs (Version:0.24.2 GitSHA:7fbf704), sanitised excerpt in evidence/production-recovery-excerpt.log.
  • kubectl describe of the affected pod with the readiness/liveness configuration shown above.
  • A clean reproduction on kind (below) that exercises the same code paths and demonstrates the exact failure mode end-to-end.

Hypothesis (then) → Reproduction (now) → Confirmed root cause

Hypothesis

For a Status=Completed CHI, no operator code path translates a child pod's Ready=True → Ready=False transition into either:

  1. an enqueued CHI reconcile, or
  2. a host-health decision once a reconcile is running.

Therefore a long-running Ready=False pod is invisible to the operator until something external bumps .spec, deletes the pod, or restarts the operator with a new IP.

Reproduction on kind (1 node, Kubernetes 1.25.3, operator 0.26.3)

The setup uses a single-shard single-replica CHI in Status=Completed, identical probe configuration to production. Fault is injected at the node's iptables OUTPUT/INPUT chains, scoped to the ClickHouse pod IP and port 8123, with random per-packet drop. This is the production-realistic equivalent of intermittent network loss on the kubelet ↔ pod path.

Step-by-step:

  1. Baseline:

    $ kubectl -n clickhouse-operator get chi clickhouse \
        -o jsonpath='Gen={.metadata.generation} Status={.status.status} Task={.status.taskID}'
    Gen=13 Status=Completed Task=auto-65723cb8-1293-4137-995e-22d7d171f0c7
    
    $ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
        -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
    Ready=True Restarts=1
    
  2. Tail operator logs to a file in the background:

    kubectl -n clickhouse-operator logs deploy/altinity-clickhouse-operator -c altinity-clickhouse-operator -f > /tmp/op.log &
    
  3. Inject intermittent drops on the kind node (POD_IP=10.244.0.6 here). After ramping the drop ratio I converged on a value that reliably trips readiness without quite tripping liveness for the duration of the experiment:

    # On the kind node
    iptables -I OUTPUT -p tcp -d 10.244.0.6 --dport 8123 \
        -m statistic --mode random --probability 0.65 -j DROP
    iptables -I INPUT  -p tcp -s 10.244.0.6 --sport 8123 \
        -m statistic --mode random --probability 0.65 -j DROP
    

    (For the very first window I briefly used 0.9/0.9, which did push liveness above its failure=10 threshold and accumulated four container restarts. That was useful — it shows that even when kubelet recreates the container 4× back-to-back, the operator's pod-update handler still does nothing — but it's stronger than the production scenario, so I dropped probability back to 65 % for the actual confirmation window. See Stronger variant below.)

  4. Observe the pod flip and the CHI stay still:

    $ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
        -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
    Ready=False Restarts=1
    
    $ kubectl -n clickhouse-operator get chi clickhouse \
        -o jsonpath='Gen={.metadata.generation} Status={.status.status}'
    Gen=13 Status=Completed
    
  5. Operator activity during the 7-minute observation window (02:35 to 02:42): the only log output attributable to this CHI was a single burst from one EndpointSlice event at the moment readiness first flipped:

    I0528 02:34:58.210516  processReconcileEndpointSlice():  Transition: '10.244.0.6'=>''
    I0528 02:34:58.224027  prepareListOfTemplates():         Found applicable templates num: 0
    I0528 02:34:58.234132  CHI:clickhouse-operator/clickhouse: IPs of the CR ... len: 1 [10.244.0.6]
    I0528 02:34:58.244363  worker-reconciler-chi.go:172      CHI:clickhouse-operator/clickhouse:
    I0528 02:34:58.248387  worker-config-map.go:70           Update ConfigMap chi-clickhouse-common-usersd
    

    That path lands in updateEndpoints, which only refreshes the users ConfigMap with the current set of pod IPs and updates CHI status; it never reaches reconcileCR, never calls shouldForceRestartHost, never inspects pod readiness.

    After that one burst at 02:34:58, zero further log lines were produced for this CHI for the rest of the observation window, despite the pod staying Ready=False the entire time.

  6. Control A — annotation-only update (no .spec change). Bump metadata.annotations.repro/touch while the pod is still Ready=False:

    $ kubectl -n clickhouse-operator annotate chi clickhouse repro/touch="$(date +%s)" --overwrite
    $ kubectl -n clickhouse-operator get chi clickhouse \
        -o jsonpath='Gen={.metadata.generation} RV={.metadata.resourceVersion}'
    Gen=13 RV=1175174     # generation unchanged, resourceVersion bumped
    

    The CHI informer's UpdateFunc fires (resourceVersion changed) and enqueues a ReconcileCHI{Update}. The worker immediately drops it because isGenerationTheSame(old, new) == true. Zero log lines, zero side effects. This proves the informer's periodic resync — which also looks like an annotation-only update at the worker — cannot rescue an unhealthy pod either.

  7. Control B — force a generation bump by editing .spec while the pod is still Ready=False. I used .spec.taskID:

    $ kubectl -n clickhouse-operator patch chi clickhouse --type=merge \
        -p '{"spec":{"taskID":"repro-1779936148"}}'
    $ kubectl -n clickhouse-operator get chi clickhouse \
        -o jsonpath='Gen={.metadata.generation} Status={.status.status}'
    Gen=14 Status=InProgress
    

    The operator immediately runs a full reconcile. Critically:

    I0528 02:42:30.253804  worker.go:197  shouldForceRestartHost():Host:0-0[0/0]:
                                          Host force restart is not required. Host: 0-0
    

    The pod was Ready=False at this exact moment, with the iptables drop still in place. shouldForceRestartHost was called, looked at the host, and returned false. The pod was only healed as a side-effect — .spec.taskID is hashed into the StatefulSet's object-version label, which made ReconcileStatefulSet see a label diff and roll the StatefulSet:

    I0528 02:42:30.261025  GetObjectStatusFromMetas():  cur and new objects ARE DIFFERENT
                                                        based on object version label
    I0528 02:42:30.261076  reconcileHostStatefulSet():  Reconcile host STS: 0-0. Reconcile StatefulSet
    I0528 02:42:30.265067  ReconcileStatefulSet():      Need to reconcile MODIFIED StatefulSet
    

    Note that on a pure .spec no-op (e.g. annotation change to the pod template), this side-effect would not have fired and the operator would have logged the same "Host force restart is not required" decision and gone back to sleep with the pod still Ready=False.
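Controls A and B both hinge on the generation comparison. The following is a minimal sketch of that gate, not the operator's code: objectMeta and generationUnchanged are hypothetical stand-ins (the real signature of isGenerationTheSame is not shown in this issue), and the resourceVersion values other than 1175174 are made up for illustration.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the two metadata fields that
// matter here; it is not the operator's or Kubernetes' real type.
type objectMeta struct {
	Generation      int64  // moved by the .spec patch in Control B, not by the annotation in Control A
	ResourceVersion string // moved by every write, including annotation-only ones
}

// generationUnchanged mirrors the idea behind isGenerationTheSame as
// described in this issue: an update whose generation did not move is
// treated as "nothing to do" and dropped.
func generationUnchanged(old, cur objectMeta) bool {
	return old.Generation == cur.Generation
}

func main() {
	before := objectMeta{Generation: 13, ResourceVersion: "1175000"}

	// Control A: annotation-only update.
	annotated := objectMeta{Generation: 13, ResourceVersion: "1175174"}
	// Control B: .spec.taskID patched.
	patched := objectMeta{Generation: 14, ResourceVersion: "1175200"}

	fmt.Println("Control A dropped:", generationUnchanged(before, annotated)) // true
	fmt.Println("Control B dropped:", generationUnchanged(before, patched))   // false
}
```

Nothing in this comparison looks at pod state, which is why only a write that moves the generation gets past it.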

What's actually broken

There are two independent gaps:

Gap 1 — the wake-up side. The operator never enqueues a CHI reconcile in response to a Ready=True → Ready=False transition on a child pod for a Completed CHI.

  • The pod informer registers AddFunc, UpdateFunc, DeleteFunc (pkg/controller/chi/controller.go:434-460). The UpdateFunc enqueues a ReconcilePod{ReconcileUpdate} carrying both oldPod and newPod.
  • The worker handler processReconcilePod (pkg/controller/chi/worker-boilerplate.go:129-150 on release-0.24.2) explicitly discards ReconcileUpdate:
    case cmd_queue.ReconcileUpdate:
        //ignore
        //w.a.V(1).M(cmd.new).F().Info("Update Pod. %s/%s", cmd.new.Namespace, cmd.new.Name)
        //metricsPodUpdate(ctx)
        return nil
    On current HEAD this branch was replaced by a call to recoverAbortedReconcileOnPodReady, which is gated to oldPod = NotReady → newPod = Ready and CHI.Status == StatusAborted. Neither condition matches the production incident (transition is the wrong direction; status is Completed, not Aborted).
  • The EndpointSlice informer does fire when readiness flips and the pod IP is removed from the service endpoint set, but its worker handler updateEndpoints (v0.24.2 worker.go:264-294) is intentionally scoped to "rebuild users ConfigMap with the new set of pod IPs and update CHI status" — it never calls reconcileCR or shouldForceRestartHost. From the user's perspective the operator is awake but is choosing not to look at host health.
  • The CHI informer's periodic resync (chopInformerFactoryResyncPeriod = 60 * time.Second, cmd/operator/app/thread_chi.go:31-40) does fire every minute, but the resulting UpdateFunc calls hit isGenerationTheSame() in reconcileCR (v0.24.2 worker-chi-reconciler.go:40-54) and return early ("nothing to do here, exit"). Periodic resync is therefore not a safety net for this class of failure.
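To make the direction mismatch in the pod handler concrete, here is a minimal sketch of classifying the readiness edge between oldPod and newPod. All names (podSnapshot, classifyTransition, handledOnHead) are hypothetical, and handledOnHead only models the gate as described in this issue, not the actual HEAD code.

```go
package main

import "fmt"

// podSnapshot is a hypothetical, minimal view of a pod: only whether its
// Ready condition is True. The real handler receives full Pod objects.
type podSnapshot struct {
	Ready bool
}

type transition string

const (
	noChange      transition = "no-change"
	becameReady   transition = "NotReady->Ready"
	becameUnready transition = "Ready->NotReady"
)

// classifyTransition compares the old and new pod from an informer
// UpdateFunc and names the readiness edge, if any.
func classifyTransition(oldPod, newPod podSnapshot) transition {
	switch {
	case !oldPod.Ready && newPod.Ready:
		return becameReady
	case oldPod.Ready && !newPod.Ready:
		return becameUnready
	default:
		return noChange
	}
}

// handledOnHead models the gate described above for current HEAD: only
// NotReady->Ready on an Aborted CHI leads to any action.
func handledOnHead(t transition, chiStatus string) bool {
	return t == becameReady && chiStatus == "Aborted"
}

func main() {
	// The production incident: Ready -> NotReady on a Completed CHI.
	t := classifyTransition(podSnapshot{Ready: true}, podSnapshot{Ready: false})
	fmt.Println(t, "handled:", handledOnHead(t, "Completed")) // Ready->NotReady handled: false
}
```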

Gap 2 — the decision side. Even when a reconcile does run with the pod in Ready=False, shouldForceRestartHost returns false. The function (v0.24.2 worker.go:162-193) only considers:

// 1. Rolling update purpose
if host.GetCR().IsRollingUpdate() { return true }

// 2. New host
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusNew { return false }

// 3. Existing host without ancestor
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusSame && !host.HasAncestor() { return false }

// 4. Config-driven reboot rules
if model.IsConfigurationChangeRequiresReboot(host) { return true }

// 5. Unknown version AND CrashLoopBackOff
if host.Runtime.Version.IsUnknown() && w.isPodCrushed(host) { return true }

return false

There is no case for "host's pod has been Ready=False for longer than threshold T". Note also that case 5 requires host.Runtime.Version.IsUnknown() — i.e. the operator was never able to read the version in the first place — and isPodCrushed checks specifically for ContainerStatus.State.Waiting.Reason == "CrashLoopBackOff". A pod that is happily Running but Ready=False matches neither sub-condition.
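The five cases can be modelled as a pure function to show that readiness never enters the decision. This is a simplified sketch under stated assumptions: hostState and its field names are hypothetical, and the status strings "New" and "Same" are illustrative placeholders for api.ObjectStatusNew and api.ObjectStatusSame.

```go
package main

import "fmt"

// hostState is a hypothetical flattening of the inputs that the quoted
// function consults; field names are illustrative, not the operator's.
type hostState struct {
	RollingUpdate        bool   // stands in for host.GetCR().IsRollingUpdate()
	Status               string // "New", "Same", or anything else
	HasAncestor          bool
	ConfigRequiresReboot bool
	VersionUnknown       bool
	WaitingReason        string // container State.Waiting.Reason, "" when Running
	Ready                bool   // pod Ready condition; never consulted below
}

// shouldForceRestart is a simplified model of the five cases listed above.
func shouldForceRestart(h hostState) bool {
	switch {
	case h.RollingUpdate:
		return true
	case h.Status == "New":
		return false
	case h.Status == "Same" && !h.HasAncestor:
		return false
	case h.ConfigRequiresReboot:
		return true
	case h.VersionUnknown && h.WaitingReason == "CrashLoopBackOff":
		return true
	}
	return false
}

func main() {
	// Running container, Ready=False, no config change, no rolling policy.
	stuck := hostState{Status: "Same", HasAncestor: true, Ready: false}
	fmt.Println("restart stuck host:", shouldForceRestart(stuck)) // false

	// Same host, but the CHI is classified as a rolling update.
	stuck.RollingUpdate = true
	fmt.Println("restart with rolling policy:", shouldForceRestart(stuck)) // true
}
```

The second call corresponds to the production recovery: the host was restarted because of the rolling-update classification, with the Ready field playing no part.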

Either gap alone would already be enough to keep an unhealthy pod stuck:

  • Closing only Gap 1 (enqueue on Ready→NotReady) without closing Gap 2 would just have the operator wake up, run a full reconcile, log Host force restart is not required, and go back to sleep — every 60 s instead of never. Same user-visible outcome.
  • Closing only Gap 2 (let shouldForceRestartHost react to long-running Ready=False) without closing Gap 1 would still require something else to enqueue the CHI in the first place, and on a steady-state Completed CHI nothing does.

Stronger variant: even container restarts don't help

During the early part of the kind run I unintentionally used a 90 % packet-drop ratio. This pushed liveness over its failure=10 threshold within a few minutes, and kubectl get pod showed RESTARTS=5 (up from the baseline of 1) while readiness still flapped to False between restarts. The operator's pod informer saw the resulting pod Update events (on every container restart, kubelet writes a new ContainerStatus), and the worker dropped all of them at the ReconcileUpdate //ignore branch above. Zero operator log lines were produced for the CHI throughout the four extra container restarts, confirming that the path is broken at the worker level, not at the informer level.

Why this matters in production

In an HA test that randomly brings nodes up and down, it is common for a freshly-rescheduled ClickHouse pod to land in a state where:

  • the underlying network or backing storage is slow but not catastrophically broken,
  • the /ping endpoint serves slowly enough or fails often enough that probes time out at the readiness threshold,
  • but liveness's failure=10 never accumulates, so kubelet does not restart the container.

The same shape can be produced by any of:

  • intermittent CNI loss between kubelet and the pod (the scenario actually used here),
  • a slow clickhouse-keeper/ZooKeeper backend that pushes /ping latency above the probe timeout of 1 s,
  • transient disk latency on the data PVC large enough to delay HTTP responses past 1 s without crashing the server,
  • a partial GC stall or thread pool stall inside clickhouse-server,
  • a probe-path config that's too strict relative to clickhouse-server's tail latency under load.

Proposed fix

I'd like to take a stab at this if the maintainers agree on the direction. Two complementary changes, both small:

  1. processReconcilePod (HEAD): in addition to the existing recoverAbortedReconcileOnPodReady call for NotReady→Ready on Aborted CHIs, add a symmetric path for Ready→NotReady on Completed CHIs that enqueues a ReconcileCHI{Update} for the CHI that owns the pod. This re-uses the existing cmd_queue.NewReconcileCHI(ReconcileAdd, nil, cr) path, with a debounce so a single flapping pod doesn't generate work-queue storms.

  2. shouldForceRestartHost: add a new case immediately after the isPodCrushed check:

    if w.isPodSustainedNotReady(ctx, host, threshold) {
        w.a.V(1).M(host).F().Info("Host pod has been NotReady for %s. Restart required.", threshold)
        return true
    }

    where isPodSustainedNotReady returns true iff Pod.Status.Conditions[Ready].Status == "False" and time.Since(LastTransitionTime) >= threshold. The threshold should be configurable in Config (suggested default: 5 min). This keeps the operator conservative — it explicitly does not restart on every readiness flap, which would be dangerous for a stateful workload — but ensures a sustained, externally-visible Ready=False eventually gets remediated.

To address the stateful-workload-safety concern up front: the proposed restart only triggers once the pod has been Ready=False for at least the configured threshold, which is much longer than any reasonable transient (probe glitches, GC pauses, deploy rollouts) but much shorter than 26 hours. The pod is by definition already out of the service endpoint slice, so restarting it costs no additional client traffic; it has been serving none since readiness flipped.

Happy to put a PR up against latest release and a separate backport against release-0.24.x or release-0.26.x once the approach is confirmed.

Workaround in the meantime

Until the operator gains a sustained-Ready=False reaction, the practical workaround for operators of clusters managed by 0.24.x is to add an external alert that fires when kube_pod_status_ready{condition="false",pod=~"chi-.*"} == 1 has held for 5 minutes (for example, for: 5m in a Prometheus alerting rule) and remediate manually with one of:

  • kubectl delete pod chi-...-0-0-0 — the StatefulSet will recreate it; this is the safest cluster-side action.
  • kubectl rollout restart deploy/<operator> — works for the same reason it worked in production, via the operator's "different IP" → reconfig branch.
  • Bumping any .spec field on the CHI to force a generation change.

None of these is something an auto-remediation system should drive without human intervention, which is the gap this issue is asking the operator to close.

Attachments / artifacts

  • evidence/kind-repro-operator.log — full operator log capture during the kind reproduction (588 lines).
  • evidence/kind-repro-baseline.txt — baseline state snapshot taken at the start of the experiment.
  • Production operator log around the recovery is available on request (we have ~19 k lines starting at 2026-04-22 05:00:45, i.e. the moment the manual rollout kicked in). The pre-recovery 26-hour window had already rotated out by the time the incident was investigated, which is one of the reasons this issue is being filed: there is no way to retroactively see what the stalled operator was thinking, only the recovered one.

issue.zip
