Summary
When a ClickHouse server pod managed by a Completed ClickHouseInstallation (CHI) flips from Ready=True to Ready=False due to a sustained application-layer issue (e.g. intermittent network loss, slow probe response, transient backend stall) — but without crashing badly enough to trip the kubelet liveness threshold — the operator does not re-evaluate the host or attempt any remedial action. The pod remains Ready=False indefinitely. There is no operator-side auto-recovery path that fires on Ready=True → Ready=False for a Completed CHI.
The only way out is operator-external intervention: deleting the pod, bumping a field in .spec to force a generation change, or restarting the operator Deployment itself (which only works because some unrelated branch of shouldForceRestartHost happens to fire on cold-start — e.g. an IsRollingUpdate classification, or the "different operator IP" config-rebuild path — not because the operator detected an unhealthy host).
This is true on release-0.24.2 (where the incident was observed in production) and the bug surface is essentially identical on current HEAD; only the cosmetics of the worker code have moved around.
Operator version
clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 BuiltAt:2024-12-06T14:45:56
Re-verified on current HEAD (b5b826eb6) and on release-0.24.2. The bug-relevant code paths are present in both. The kind reproduction below was run on 0.26.3 and exhibited identical behaviour.
Environment
- Kubernetes 1.25.x (production) and 1.25.3 (kind reproduction).
- Single-shard, single-replica CHI in Status=Completed.
- ClickHouse 24.8.0 server image (production used a 25.x build; behaviour does not depend on the server image).
- Standard chart-default probe configuration on the ClickHouse pod:
Liveness: http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
Readiness: http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
Readiness trips after ≈9 s of failed probes; liveness needs ≈30 s of uninterrupted failures. Any flap that lets even one probe succeed inside any 30 s window keeps the container alive while readiness stays False for as long as the underlying fault persists.
Production incident
Originally observed during an HA test where worker nodes were brought up and down in random sequences. After one such cycle, the ClickHouse pod was rescheduled successfully but its readiness probe began returning non-200 intermittently at the application layer. The pod stayed Ready=False continuously for ~26 hours. During that entire window:
- The CHI stayed at Status=Completed, with no generation change and no operator-side activity.
- The container's RESTARTS counter did not increment: liveness never reached failure=10 consecutively, because the fault was flaky enough to let occasional probes succeed.
- StatefulSet, EndpointSlice, kube-proxy, scheduler and kubelet all behaved correctly; the pod was simply not ready and nothing reacted to that fact at the operator layer.
- Recovery only happened after the operator Deployment was rolled (kubectl rollout restart deploy/<operator>). The operator's "different IP since previous reconcile" branch forced a full reconcile, and within that reconcile the RollingUpdate branch of shouldForceRestartHost restarted the host. None of this was health-aware logic.
Key log lines from the recovery (production, sanitised):
05:00:45 main.go:67 Starting clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 ...
05:00:56 worker.go:364 Operator IPs are different. Operator was restarted on another
IP since previous reconcile of the CHI: clickhouse
05:00:59 worker.go:166 shouldForceRestartHost(): RollingUpdate requires force restart. Host: 0-0
05:00:59 worker-chi-reconciler.go:329 reconcileHostStatefulSet(): Reconcile host: 0-0.
Shutting host down due to force restart
05:01:12 worker-boilerplate.go:141 processReconcilePod(): Delete Pod. .../chi-clickhouse-c1-0-0-0
05:01:19 worker-boilerplate.go:132 processReconcilePod(): Add Pod. .../chi-clickhouse-c1-0-0-0
The remediating decision is the third line: RollingUpdate requires force restart. That's the first branch of shouldForceRestartHost and it triggered because the freshly-rolled operator's normalisation classified the CHI as IsRollingUpdate() (it had reconciling.policy: rolling in the spec). Recovery was not health-aware — the operator did not detect that the pod was unhealthy. It happened to take the IsRollingUpdate path on first reconcile after restart, which is unrelated to the pod's Ready=False state. If the CHI had not had that policy set, the operator would have logged Host force restart is not required and gone right back to sleep, leaving the pod stuck.
Equivalently: if the operator had not been restarted manually, the pod would have stayed Ready=False indefinitely. This is the user-visible bug.
The full pre-recovery 26-hour window was unfortunately rotated out of our log retention before the incident was investigated — the operator pod and ClickHouse pod were both still on their original incarnation, so kubelet, kube-apiserver, kube-controller-manager and kube-scheduler logs from that period are also unavailable. What we do have:
- Recovery operator logs (Version:0.24.2 GitSHA:7fbf704), sanitised excerpt in evidence/production-recovery-excerpt.log.
- kubectl describe of the affected pod with the readiness/liveness configuration shown above.
- A clean reproduction on kind (below) that exercises the same code paths and demonstrates the exact failure mode end-to-end.
Hypothesis (then) → Reproduction (now) → Confirmed root cause
Hypothesis
For a Status=Completed CHI, no operator code path translates a child pod's Ready=True → Ready=False transition into either:
- an enqueued CHI reconcile, or
- a host-health decision once a reconcile is running.
Therefore a long-running Ready=False pod is invisible to the operator until something external bumps .spec, deletes the pod, or restarts the operator with a new IP.
Reproduction on kind (1 node, KinD 1.25.3, operator 0.26.3)
The setup uses a single-shard single-replica CHI in Status=Completed, identical probe configuration to production. Fault is injected at the node's iptables OUTPUT/INPUT chains, scoped to the ClickHouse pod IP and port 8123, with random per-packet drop. This is the production-realistic equivalent of intermittent network loss on the kubelet ↔ pod path.
Step-by-step:
- Baseline:
$ kubectl -n clickhouse-operator get chi clickhouse \
-o jsonpath='Gen={.metadata.generation} Status={.status.status} Task={.status.taskID}'
Gen=13 Status=Completed Task=auto-65723cb8-1293-4137-995e-22d7d171f0c7
$ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
-o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
Ready=True Restarts=1
- Tail operator logs to a file in the background:
kubectl -n clickhouse-operator logs deploy/altinity-clickhouse-operator -c altinity-clickhouse-operator -f > /tmp/op.log &
- Inject intermittent drops on the kind node (POD_IP=10.244.0.6 here). After ramping the drop ratio I converged on a value that reliably trips readiness without quite tripping liveness for the duration of the experiment:
# On the kind node
iptables -I OUTPUT -p tcp -d 10.244.0.6 --dport 8123 \
-m statistic --mode random --probability 0.65 -j DROP
iptables -I INPUT -p tcp -s 10.244.0.6 --sport 8123 \
-m statistic --mode random --probability 0.65 -j DROP
(For the very first window I briefly used 0.9/0.9, which did push liveness above its failure=10 threshold and accumulated four container restarts. That was useful — it shows that even when kubelet recreates the container 4× back-to-back, the operator's pod-update handler still does nothing — but it's stronger than the production scenario, so I dropped probability back to 65 % for the actual confirmation window. See Stronger variant below.)
- Observe the pod flip and the CHI stay still:
$ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
-o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
Ready=False Restarts=1
$ kubectl -n clickhouse-operator get chi clickhouse \
-o jsonpath='Gen={.metadata.generation} Status={.status.status}'
Gen=13 Status=Completed
- Operator activity during the 7-minute observation window (02:35–02:42): the only log output attributable to this CHI was a single burst triggered by one EndpointSlice event at 02:34:58, the moment readiness first flipped:
I0528 02:34:58.210516 processReconcileEndpointSlice(): Transition: '10.244.0.6'=>''
I0528 02:34:58.224027 prepareListOfTemplates(): Found applicable templates num: 0
I0528 02:34:58.234132 CHI:clickhouse-operator/clickhouse: IPs of the CR ... len: 1 [10.244.0.6]
I0528 02:34:58.244363 worker-reconciler-chi.go:172 CHI:clickhouse-operator/clickhouse:
I0528 02:34:58.248387 worker-config-map.go:70 Update ConfigMap chi-clickhouse-common-usersd
That path lands in updateEndpoints, which only refreshes the users ConfigMap with the current set of pod IPs and updates CHI status; it never reaches reconcileCR, never calls shouldForceRestartHost, never inspects pod readiness.
After that one burst at 02:34:58, zero further log lines were produced for this CHI for the rest of the observation window, despite the pod staying Ready=False the entire time.
- Control A: annotation-only update (no .spec change). Bump metadata.annotations.repro/touch while the pod is still Ready=False:
$ kubectl -n clickhouse-operator annotate chi clickhouse repro/touch="$(date +%s)" --overwrite
$ kubectl -n clickhouse-operator get chi clickhouse \
-o jsonpath='Gen={.metadata.generation} RV={.metadata.resourceVersion}'
Gen=13 RV=1175174 # generation unchanged, resourceVersion bumped
The CHI informer's UpdateFunc fires (resourceVersion changed) and enqueues a ReconcileCHI{Update}. The worker immediately drops it because isGenerationTheSame(old, new) == true. Zero log lines, zero side effects. This proves the informer's periodic resync — which also looks like an annotation-only update at the worker — cannot rescue an unhealthy pod either.
- Control B: force a generation bump by editing .spec while the pod is still Ready=False. I used .spec.taskID:
$ kubectl -n clickhouse-operator patch chi clickhouse --type=merge \
-p '{"spec":{"taskID":"repro-1779936148"}}'
$ kubectl -n clickhouse-operator get chi clickhouse \
-o jsonpath='Gen={.metadata.generation} Status={.status.status}'
Gen=14 Status=InProgress
The operator immediately runs a full reconcile. Critically:
I0528 02:42:30.253804 worker.go:197 shouldForceRestartHost():Host:0-0[0/0]:
Host force restart is not required. Host: 0-0
The pod was Ready=False at this exact moment, with the iptables drop still in place. shouldForceRestartHost was called, looked at the host, and returned false. The pod was only healed as a side-effect — .spec.taskID is hashed into the StatefulSet's object-version label, which made ReconcileStatefulSet see a label diff and roll the StatefulSet:
I0528 02:42:30.261025 GetObjectStatusFromMetas(): cur and new objects ARE DIFFERENT
based on object version label
I0528 02:42:30.261076 reconcileHostStatefulSet(): Reconcile host STS: 0-0. Reconcile StatefulSet
I0528 02:42:30.265067 ReconcileStatefulSet(): Need to reconcile MODIFIED StatefulSet
Note that on a pure .spec no-op (e.g. annotation change to the pod template), this side-effect would not have fired and the operator would have logged the same "Host force restart is not required" decision and gone back to sleep with the pod still Ready=False.
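The difference between Control A and Control B comes down to the generation gate. Here is a minimal model of it, assuming the check compares only metadata.generation as described above. objectMeta is a hypothetical stand-in for the real object metadata, the isGenerationTheSame body is my simplification rather than the operator's source, and the resourceVersion values are illustrative.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the two metadata fields that
// matter for this gate.
type objectMeta struct {
	Generation      int64
	ResourceVersion string
}

// isGenerationTheSame is a simplified model of the operator check of the
// same name: only the generation is compared.
func isGenerationTheSame(old, cur objectMeta) bool {
	return old.Generation == cur.Generation
}

// reconcileRuns reports whether an Update event would get past the gate.
func reconcileRuns(old, cur objectMeta) bool {
	return !isGenerationTheSame(old, cur)
}

func main() {
	before := objectMeta{Generation: 13, ResourceVersion: "100"}

	// Control A: an annotation bumps resourceVersion but not generation.
	annotated := objectMeta{Generation: 13, ResourceVersion: "101"}
	fmt.Println("annotation-only update reconciles:", reconcileRuns(before, annotated))

	// Periodic resync: the same object is delivered again, unchanged.
	fmt.Println("periodic resync reconciles:", reconcileRuns(before, before))

	// Control B: a .spec edit bumps generation.
	specEdited := objectMeta{Generation: 14, ResourceVersion: "102"}
	fmt.Println("spec edit reconciles:", reconcileRuns(before, specEdited))
}
```

Pod readiness lives in the pod's status, not in the CHI's spec, so a readiness flip can never change the CHI generation and therefore never passes this gate on its own.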
What's actually broken
There are two independent gaps:
Gap 1 — the wake-up side. The operator never enqueues a CHI reconcile in response to a Ready=True → Ready=False transition on a child pod for a Completed CHI.
- The pod informer registers AddFunc, UpdateFunc, DeleteFunc (pkg/controller/chi/controller.go:434-460). The UpdateFunc enqueues a ReconcilePod{ReconcileUpdate} carrying both oldPod and newPod.
- The worker handler processReconcilePod (pkg/controller/chi/worker-boilerplate.go:129-150 on release-0.24.2) explicitly discards ReconcileUpdate:
case cmd_queue.ReconcileUpdate:
//ignore
//w.a.V(1).M(cmd.new).F().Info("Update Pod. %s/%s", cmd.new.Namespace, cmd.new.Name)
//metricsPodUpdate(ctx)
return nil
On current HEAD this branch was replaced by a call to recoverAbortedReconcileOnPodReady, which is gated to oldPod = NotReady → newPod = Ready and CHI.Status == StatusAborted. Neither condition matches the production incident (transition is the wrong direction; status is Completed, not Aborted).
- The EndpointSlice informer does fire when readiness flips and the pod IP is removed from the service endpoint set, but its worker handler updateEndpoints (v0.24.2 worker.go:264-294) is intentionally scoped to "rebuild users ConfigMap with the new set of pod IPs and update CHI status". It never calls reconcileCR or shouldForceRestartHost. From the user's perspective the operator is awake but is choosing not to look at host health.
- The CHI informer's periodic resync (chopInformerFactoryResyncPeriod = 60 * time.Second, cmd/operator/app/thread_chi.go:31-40) does fire every minute, but the resulting UpdateFunc calls hit isGenerationTheSame() in reconcileCR (v0.24.2 worker-chi-reconciler.go:40-54) and exit with "nothing to do here, exit". Periodic resync is therefore not a safety net for this class of failure.
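To make the missing wake-up concrete, here is a sketch of how the (oldPod, newPod) pair that the pod UpdateFunc already carries could be classified. The pod and podCondition types are hypothetical stand-ins for the Kubernetes pod status types, and classify is not an existing operator function. Per the description above, release-0.24.2 ignores every update, and HEAD acts only on the NotReady to Ready direction for Aborted CHIs.

```go
package main

import "fmt"

// podCondition and pod are hypothetical stand-ins for the relevant parts
// of a Kubernetes pod status.
type podCondition struct {
	Type   string
	Status string
}

type pod struct {
	Conditions []podCondition
}

// isReady reports whether the pod has a Ready condition with status "True".
// A pod without a Ready condition counts as not ready.
func isReady(p pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

type transition int

const (
	noChange transition = iota
	becameReady
	becameNotReady
)

func (t transition) String() string {
	switch t {
	case becameReady:
		return "NotReady->Ready"
	case becameNotReady:
		return "Ready->NotReady"
	default:
		return "no change"
	}
}

// classify compares the readiness of the old and new pod objects.
func classify(oldPod, newPod pod) transition {
	was, is := isReady(oldPod), isReady(newPod)
	switch {
	case !was && is:
		return becameReady
	case was && !is:
		return becameNotReady
	default:
		return noChange
	}
}

func main() {
	ready := pod{Conditions: []podCondition{{Type: "Ready", Status: "True"}}}
	notReady := pod{Conditions: []podCondition{{Type: "Ready", Status: "False"}}}

	fmt.Println(classify(ready, notReady))    // the direction nothing reacts to
	fmt.Println(classify(notReady, ready))    // handled on HEAD for Aborted CHIs only
	fmt.Println(classify(notReady, notReady)) // e.g. a container restart while still not ready
}
```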
Gap 2 — the decision side. Even when a reconcile does run with the pod in Ready=False, shouldForceRestartHost returns false. The function (v0.24.2 worker.go:162-193) only considers:
// 1. Rolling update purpose
if host.GetCR().IsRollingUpdate() { return true }
// 2. New host
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusNew { return false }
// 3. Existing host without ancestor
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusSame && !host.HasAncestor() { return false }
// 4. Config-driven reboot rules
if model.IsConfigurationChangeRequiresReboot(host) { return true }
// 5. Unknown version AND CrashLoopBackOff
if host.Runtime.Version.IsUnknown() && w.isPodCrushed(host) { return true }
return false
There is no case for "host's pod has been Ready=False for longer than threshold T". Note also that case 5 requires host.Runtime.Version.IsUnknown() — i.e. the operator was never able to read the version in the first place — and isPodCrushed checks specifically for ContainerStatus.State.Waiting.Reason == "CrashLoopBackOff". A pod that is happily Running but Ready=False matches neither sub-condition.
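The decision order above can be modelled as a pure function to show that readiness is never an input. This is a sketch only: hostState and its fields are hypothetical flattenings of the calls in the excerpt, not operator types, and PodReady is included purely to demonstrate that no case reads it.

```go
package main

import "fmt"

// hostState is a hypothetical flattening of the inputs that the excerpted
// shouldForceRestartHost cases consult, plus PodReady, which none of them do.
type hostState struct {
	RollingUpdate              bool // case 1
	StatusNew                  bool // case 2
	StatusSame                 bool // case 3
	HasAncestor                bool // case 3
	ConfigChangeRequiresReboot bool // case 4
	VersionUnknown             bool // case 5
	CrashLoopBackOff           bool // case 5
	PodReady                   bool // not consulted by any case
}

// shouldForceRestart mirrors the order of the five cases in the excerpt.
func shouldForceRestart(h hostState) bool {
	if h.RollingUpdate {
		return true
	}
	if h.StatusNew {
		return false
	}
	if h.StatusSame && !h.HasAncestor {
		return false
	}
	if h.ConfigChangeRequiresReboot {
		return true
	}
	if h.VersionUnknown && h.CrashLoopBackOff {
		return true
	}
	return false
}

func main() {
	// The stuck host: known version, container Running, pod not ready.
	stuck := hostState{StatusSame: true, HasAncestor: true, PodReady: false}
	fmt.Println("stuck host restarted:", shouldForceRestart(stuck))

	// The production recovery: same host, but the CHI is classified as a
	// rolling update on the first reconcile after the operator restart.
	rolled := stuck
	rolled.RollingUpdate = true
	fmt.Println("rolling-update host restarted:", shouldForceRestart(rolled))
}
```

Flipping PodReady in either example does not change the result, which is the whole of Gap 2.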
Either gap alone would already be enough to keep an unhealthy pod stuck:
- Closing only Gap 1 (enqueue on Ready→NotReady) without closing Gap 2 would just have the operator wake up, run a full reconcile, log Host force restart is not required, and go back to sleep, every 60 s instead of never. Same user-visible outcome.
- Closing only Gap 2 (let shouldForceRestartHost react to long-running Ready=False) without closing Gap 1 would still require something else to enqueue the CHI in the first place, and on a steady-state Completed CHI nothing does.
Stronger variant: even container restarts don't help
During the early part of the kind run I unintentionally used a 90 % packet-drop ratio. This pushed liveness over its failure=10 threshold within a few minutes, and kubectl get pod showed RESTARTS=5 while readiness still flapped False between restarts. The operator's pod informer saw the resulting pod Update events (every container restart, kubelet writes a new ContainerStatus), and the worker dropped all of them at the ReconcileUpdate //ignore branch above. Zero operator log lines for the CHI throughout the four extra container restarts, confirming that the path is broken at the worker level, not at the informer level.
Why this matters in production
In an HA test that randomly brings nodes up and down, it is common for a freshly-rescheduled ClickHouse pod to land in a state where:
- the underlying network or backing storage is slow but not catastrophically broken,
- the /ping endpoint serves slowly enough or fails often enough that probes time out at the readiness threshold,
- but liveness's failure=10 never accumulates, so kubelet does not restart the container.
The same shape can be produced by any of:
- intermittent CNI loss between kubelet and the pod (the scenario actually used here),
- a slow clickhouse-keeper/ZooKeeper backend that makes /ping slow above the probe timeout=1s,
- transient disk latency on the data PVC large enough to delay HTTP responses past 1 s without crashing the server,
- a partial GC stall or thread pool stall inside clickhouse-server,
- a probe-path config that's strict enough relative to clickhouse-server's tail latency under load.
Proposed fix
I'd like to take a stab at this if the maintainers agree on the direction. Two complementary changes, both small:
- processReconcilePod (HEAD): in addition to the existing recoverAbortedReconcileOnPodReady call for NotReady→Ready on Aborted CHIs, add a symmetric path for Ready→NotReady on Completed CHIs that enqueues a reconcile for the CHI that owns the pod. This re-uses the existing cmd_queue.NewReconcileCHI(ReconcileAdd, nil, cr) path rather than a ReconcileCHI{Update}, which the isGenerationTheSame check would drop, and adds a debounce so a single flapping pod doesn't generate work-queue storms.
- shouldForceRestartHost: add a new case immediately after the isPodCrushed check:
if w.isPodSustainedNotReady(ctx, host, threshold) {
w.a.V(1).M(host).F().Info("Host pod has been NotReady for %s. Restart required.", threshold)
return true
}
where isPodSustainedNotReady returns true iff Pod.Status.Conditions[Ready].Status == "False" and time.Since(LastTransitionTime) >= threshold. The threshold should be configurable in Config (suggested default: 5 min). This keeps the operator conservative — it explicitly does not restart on every readiness flap, which would be dangerous for a stateful workload — but ensures a sustained, externally-visible Ready=False eventually gets remediated.
To address the stateful-workload-safety concern up front: the proposed restart only triggers when the pod has already been Ready=False for at least the configured threshold, which is much longer than any reasonable transient (probe glitches, GC pauses, deploy rollouts) but much shorter than 26 hours. The pod is by definition already out of the service endpoint slice, so restarting it costs no additional client traffic: it has been receiving none through the Service since readiness flipped.
Happy to put a PR up against latest release and a separate backport against release-0.24.x or release-0.26.x once the approach is confirmed.
Workaround in the meantime
Until the operator gains a sustained-Ready=False reaction, the practical workaround for operators of clusters managed by 0.24.x is to add an external alert that fires when kube_pod_status_ready{condition="false",pod=~"chi-.*"} == 1 has held for 5 minutes (for example, a Prometheus alerting rule with for: 5m) and remediate manually with one of:
- kubectl delete pod chi-...-0-0-0: the StatefulSet will recreate it; this is the safest cluster-side action.
- kubectl rollout restart deploy/<operator>: works for the same reason it worked in production. The "different IP" branch forces a reconcile on cold start, which restarts the host only if a force-restart branch such as IsRollingUpdate happens to fire.
- Bumping any .spec field on the CHI to force a generation change.
None of these are appropriate for an auto-remediation system to drive without human intervention, which is the gap this issue is asking the operator to close.
Attachments / artifacts
- evidence/kind-repro-operator.log: full operator log capture during the kind reproduction (588 lines).
- evidence/kind-repro-baseline.txt: baseline state snapshot taken at the start of the experiment.
- Production operator log around the recovery is available on request (we have ~19 k lines starting at 2026-04-22 05:00:45, i.e. the moment the manual rollout kicked in). The pre-recovery 26-hour window had already rotated out by the time the incident was investigated, which is one of the reasons this issue is being filed: there is no way to retroactively see what the stalled operator was thinking, only the recovered one.
issue.zip