Operator does not heal a ClickHouse host stuck at `Ready=False` while CHI is `Completed` #1994

## Summary

When a ClickHouse server pod managed by a `Completed` `ClickHouseInstallation` (CHI) flips from `Ready=True` to `Ready=False` due to a sustained application-layer issue (e.g. intermittent network loss, slow probe response, transient backend stall) — but **without** crashing badly enough to trip the kubelet liveness threshold — the operator does **not** re-evaluate the host or attempt any remedial action. The pod remains `Ready=False` indefinitely. There is no operator-side auto-recovery path that fires on `Ready=True → Ready=False` for a `Completed` CHI.

The only way out is operator-external intervention: deleting the pod, bumping a field in `.spec` to force a generation change, or restarting the operator Deployment itself (which only works because some *unrelated* branch of `shouldForceRestartHost` happens to fire on cold-start — e.g. an `IsRollingUpdate` classification, or the "different operator IP" config-rebuild path — not because the operator detected an unhealthy host).

This is true on `release-0.24.2` (where the incident was observed in production) and the bug surface is essentially identical on current HEAD; only the cosmetics of the worker code have moved around.

## Operator version

```
clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 BuiltAt:2024-12-06T14:45:56
```

Re-verified on current HEAD (`b5b826eb6`) and on `release-0.24.2`. The bug-relevant code paths are present in both. The kind reproduction below was run on `0.26.3` and exhibited identical behaviour.

## Environment

- Kubernetes 1.25.x (production) and 1.25.3 (kind reproduction).
- Single-shard, single-replica CHI in `Status=Completed`.
- ClickHouse 24.8.0 server image (production used a 25.x build; behaviour does not depend on the server image).
- Standard chart-default probe configuration on the ClickHouse pod:
  ```
  Liveness:   http-get http://:http/ping  delay=60s timeout=1s period=3s #success=1 #failure=10
  Readiness:  http-get http://:http/ping  delay=10s timeout=1s period=3s #success=1 #failure=3
  ```
  Readiness trips after ≈9 s of failed probes; liveness needs ≈30 s of *uninterrupted* failures. Any flap that lets even one probe succeed inside any 30 s window keeps the container alive while readiness stays `False` for as long as the underlying fault persists.

## Production incident

Originally observed during an HA test where worker nodes were brought up and down in random sequences. After one such cycle, the ClickHouse pod was rescheduled successfully but its readiness probe began returning non-`200` intermittently at the application layer. The pod stayed `Ready=False` continuously for **~26 hours**. During that entire window:

- The CHI stayed at `Status=Completed`, with no generation change and no operator-side activity.
- The container's `RESTARTS` counter did **not** increment — liveness never reached `failure=10` consecutively, because the fault was flaky enough to let occasional probes succeed.
- StatefulSet, EndpointSlice, kube-proxy, scheduler and kubelet all behaved correctly; the pod was simply not ready and nothing reacted to that fact at the operator layer.
- Recovery only happened after the operator Deployment was rolled (`kubectl rollout restart deploy/<operator>`). The restart took the operator's "different IP since previous reconcile" branch, which forced a full reconcile, and within that reconcile the `IsRollingUpdate` branch of `shouldForceRestartHost` restarted the host. Neither step involved health-aware logic.

Key log lines from the recovery (production, sanitised):

```
05:00:45  main.go:67  Starting clickhouse-operator. Version:0.24.2 GitSHA:7fbf704 ...
05:00:56  worker.go:364  Operator IPs are different. Operator was restarted on another
                         IP since previous reconcile of the CHI: clickhouse
05:00:59  worker.go:166  shouldForceRestartHost(): RollingUpdate requires force restart. Host: 0-0
05:00:59  worker-chi-reconciler.go:329  reconcileHostStatefulSet(): Reconcile host: 0-0.
                                        Shutting host down due to force restart
05:01:12  worker-boilerplate.go:141  processReconcilePod(): Delete Pod. .../chi-clickhouse-c1-0-0-0
05:01:19  worker-boilerplate.go:132  processReconcilePod(): Add Pod.    .../chi-clickhouse-c1-0-0-0
```

The remediating decision is the third line: `RollingUpdate requires force restart`. That's the first branch of `shouldForceRestartHost` and it triggered because the freshly-rolled operator's normalisation classified the CHI as `IsRollingUpdate()` (it had `reconciling.policy: rolling` in the spec). **Recovery was not health-aware** — the operator did not detect that the pod was unhealthy. It happened to take the `IsRollingUpdate` path on first reconcile after restart, which is unrelated to the pod's `Ready=False` state. If the CHI had not had that policy set, the operator would have logged `Host force restart is not required` and gone right back to sleep, leaving the pod stuck.

**Equivalently: if the operator had not been restarted manually, the pod would have stayed `Ready=False` indefinitely.** This is the user-visible bug.

The full pre-recovery 26-hour window was unfortunately rotated out of our log retention before the incident was investigated — the operator pod and ClickHouse pod were both still on their original incarnation, so kubelet, kube-apiserver, kube-controller-manager and kube-scheduler logs from that period are also unavailable. What we *do* have:

- Recovery operator logs (`Version:0.24.2 GitSHA:7fbf704`), sanitised excerpt in `evidence/production-recovery-excerpt.log`.
- `kubectl describe` of the affected pod with the readiness/liveness configuration shown above.
- A clean reproduction on `kind` (below) that exercises the same code paths and demonstrates the exact failure mode end-to-end.

## Hypothesis (then) → Reproduction (now) → Confirmed root cause

### Hypothesis

For a `Status=Completed` CHI, no operator code path translates a child pod's `Ready=True → Ready=False` transition into either:

1. an enqueued CHI reconcile, *or*
2. a host-health decision once a reconcile *is* running.

Therefore a long-running `Ready=False` pod is invisible to the operator until something external bumps `.spec`, deletes the pod, or restarts the operator with a new IP.

### Reproduction on `kind` (1 node, KinD 1.25.3, operator 0.26.3)

The setup uses a single-shard single-replica CHI in `Status=Completed`, identical probe configuration to production. Fault is injected at the node's `iptables` `OUTPUT`/`INPUT` chains, scoped to the ClickHouse pod IP and port 8123, with random per-packet drop. This is the production-realistic equivalent of intermittent network loss on the kubelet ↔ pod path.

Step-by-step:

1. Baseline:
   ```
   $ kubectl -n clickhouse-operator get chi clickhouse \
       -o jsonpath='Gen={.metadata.generation} Status={.status.status} Task={.status.taskID}'
   Gen=13 Status=Completed Task=auto-65723cb8-1293-4137-995e-22d7d171f0c7

   $ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
       -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
   Ready=True Restarts=1
   ```

2. Tail operator logs to a file in the background:
   ```
   kubectl -n clickhouse-operator logs deploy/altinity-clickhouse-operator -c altinity-clickhouse-operator -f > /tmp/op.log &
   ```

3. Inject intermittent drops on the kind node (POD_IP=`10.244.0.6` here). After ramping the drop ratio I converged on a value that reliably trips readiness without quite tripping liveness for the duration of the experiment:
   ```
   # On the kind node
   iptables -I OUTPUT -p tcp -d 10.244.0.6 --dport 8123 \
       -m statistic --mode random --probability 0.65 -j DROP
   iptables -I INPUT  -p tcp -s 10.244.0.6 --sport 8123 \
       -m statistic --mode random --probability 0.65 -j DROP
   ```
   *(For the very first window I briefly used `0.9/0.9`, which did push liveness above its `failure=10` threshold and accumulated four container restarts. That was useful, because it shows that **even when kubelet restarts the container 4× back-to-back, the operator's pod-update handler still does nothing**, but it is stronger than the production scenario, so I dropped the probability back to 0.65 for the actual confirmation window. See [Stronger variant](#stronger-variant-even-container-restarts-dont-help) below.)*

4. Observe the pod flip and the CHI stay still:
   ```
   $ kubectl -n clickhouse-operator get pod chi-clickhouse-c1-0-0-0 \
       -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status} Restarts={.status.containerStatuses[0].restartCount}'
   Ready=False Restarts=1

   $ kubectl -n clickhouse-operator get chi clickhouse \
       -o jsonpath='Gen={.metadata.generation} Status={.status.status}'
   Gen=13 Status=Completed
   ```

5. Operator activity during the 7-minute observation window (`02:35`–`02:42`): the **only** log activity attributable to this CHI was a single burst triggered by one EndpointSlice event at `02:34:58`, the moment readiness first flipped, just before the window opened:
   ```
   I0528 02:34:58.210516  processReconcileEndpointSlice():  Transition: '10.244.0.6'=>''
   I0528 02:34:58.224027  prepareListOfTemplates():         Found applicable templates num: 0
   I0528 02:34:58.234132  CHI:clickhouse-operator/clickhouse: IPs of the CR ... len: 1 [10.244.0.6]
   I0528 02:34:58.244363  worker-reconciler-chi.go:172      CHI:clickhouse-operator/clickhouse:
   I0528 02:34:58.248387  worker-config-map.go:70           Update ConfigMap chi-clickhouse-common-usersd
   ```
   That path lands in `updateEndpoints`, which only refreshes the users ConfigMap with the current set of pod IPs and updates CHI status; it never reaches `reconcileCR`, never calls `shouldForceRestartHost`, never inspects pod readiness.

   After that one burst at `02:34:58`, **zero further log lines were produced for this CHI for the rest of the observation window**, despite the pod staying `Ready=False` the entire time.

6. Control A — annotation-only update (no `.spec` change). Bump `metadata.annotations.repro/touch` while the pod is still `Ready=False`:
   ```
   $ kubectl -n clickhouse-operator annotate chi clickhouse repro/touch="$(date +%s)" --overwrite
   $ kubectl -n clickhouse-operator get chi clickhouse \
       -o jsonpath='Gen={.metadata.generation} RV={.metadata.resourceVersion}'
   Gen=13 RV=1175174     # generation unchanged, resourceVersion bumped
   ```
   The CHI informer's `UpdateFunc` fires (resourceVersion changed) and enqueues a `ReconcileCHI{Update}`. The worker immediately drops it because `isGenerationTheSame(old, new) == true`. Zero log lines, zero side effects. **This proves the informer's periodic resync — which also looks like an annotation-only update at the worker — cannot rescue an unhealthy pod either.**

7. Control B — force a generation bump by editing `.spec` while the pod is still `Ready=False`. I used `.spec.taskID`:
   ```
   $ kubectl -n clickhouse-operator patch chi clickhouse --type=merge \
       -p '{"spec":{"taskID":"repro-1779936148"}}'
   $ kubectl -n clickhouse-operator get chi clickhouse \
       -o jsonpath='Gen={.metadata.generation} Status={.status.status}'
   Gen=14 Status=InProgress
   ```
   The operator immediately runs a full reconcile. Critically:
   ```
   I0528 02:42:30.253804  worker.go:197  shouldForceRestartHost():Host:0-0[0/0]:
                                         Host force restart is not required. Host: 0-0
   ```
   The pod *was* `Ready=False` at this exact moment, with the iptables drop still in place. `shouldForceRestartHost` was called, looked at the host, and returned `false`. The pod was only healed as a side-effect — `.spec.taskID` is hashed into the StatefulSet's object-version label, which made `ReconcileStatefulSet` see a label diff and roll the StatefulSet:
   ```
   I0528 02:42:30.261025  GetObjectStatusFromMetas():  cur and new objects ARE DIFFERENT
                                                       based on object version label
   I0528 02:42:30.261076  reconcileHostStatefulSet():  Reconcile host STS: 0-0. Reconcile StatefulSet
   I0528 02:42:30.265067  ReconcileStatefulSet():      Need to reconcile MODIFIED StatefulSet
   ```
   Note that on a pure `.spec` no-op (e.g. annotation change to the pod template), this side-effect would not have fired and the operator would have logged the same "Host force restart is not required" decision and gone back to sleep with the pod still `Ready=False`.
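
Controls A and B can be summarised with a small model of the generation gate. This is a sketch under my own assumptions, not the operator's code: `ObjectMeta` and `sameGeneration` are hypothetical stand-ins, and the resourceVersion values other than `1175174` are invented.

```go
package main

import "fmt"

// ObjectMeta is a hypothetical stand-in for the two metadata fields involved.
type ObjectMeta struct {
	Generation      int64  // changed by the .spec patch in Control B, not by the annotation in Control A
	ResourceVersion string // changed by every write, including annotation-only updates
}

// sameGeneration models the gate described above: an update whose generation
// did not change is treated as "nothing to do". A nil side never matches.
func sameGeneration(old, cur *ObjectMeta) bool {
	if old == nil || cur == nil {
		return false
	}
	return old.Generation == cur.Generation
}

func main() {
	before := &ObjectMeta{Generation: 13, ResourceVersion: "1175000"} // invented RV
	annotated := &ObjectMeta{Generation: 13, ResourceVersion: "1175174"}
	patched := &ObjectMeta{Generation: 14, ResourceVersion: "1175200"} // invented RV

	fmt.Println("resync dropped:", sameGeneration(before, before))        // true
	fmt.Println("Control A dropped:", sameGeneration(before, annotated))  // true
	fmt.Println("Control B dropped:", sameGeneration(annotated, patched)) // false
}
```

Pod readiness never enters this comparison, so no amount of resyncing or annotating can make an unhealthy pod visible through this gate.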

### What's actually broken

There are **two** independent gaps:

**Gap 1 — the wake-up side.** The operator never enqueues a CHI reconcile in response to a `Ready=True → Ready=False` transition on a child pod for a `Completed` CHI.

- The pod informer registers `AddFunc`, `UpdateFunc`, `DeleteFunc` ([`pkg/controller/chi/controller.go:434-460`](../../clickhouse-operator/pkg/controller/chi/controller.go)). The `UpdateFunc` enqueues a `ReconcilePod{ReconcileUpdate}` carrying both `oldPod` and `newPod`.
- The worker handler `processReconcilePod` ([`pkg/controller/chi/worker-boilerplate.go:129-150` on `release-0.24.2`](https://github.com/Altinity/clickhouse-operator/blob/release-0.24.2/pkg/controller/chi/worker-boilerplate.go#L129-L150)) explicitly discards `ReconcileUpdate`:
  ```go
  case cmd_queue.ReconcileUpdate:
      //ignore
      //w.a.V(1).M(cmd.new).F().Info("Update Pod. %s/%s", cmd.new.Namespace, cmd.new.Name)
      //metricsPodUpdate(ctx)
      return nil
  ```
  On current HEAD this branch was replaced by a call to `recoverAbortedReconcileOnPodReady`, which is gated to `oldPod = NotReady → newPod = Ready` **and** `CHI.Status == StatusAborted`. Neither condition matches the production incident (transition is the wrong direction; status is `Completed`, not `Aborted`).
- The EndpointSlice informer *does* fire when readiness flips and the pod IP is removed from the service endpoint set, but its worker handler `updateEndpoints` ([v0.24.2 worker.go:264-294](https://github.com/Altinity/clickhouse-operator/blob/release-0.24.2/pkg/controller/chi/worker.go#L264-L294)) is intentionally scoped to "rebuild users ConfigMap with the new set of pod IPs and update CHI status" — it never calls `reconcileCR` or `shouldForceRestartHost`. From the user's perspective the operator is awake but is choosing not to look at host health.
- The CHI informer's periodic resync (`chopInformerFactoryResyncPeriod = 60 * time.Second`, [`cmd/operator/app/thread_chi.go:31-40`](https://github.com/Altinity/clickhouse-operator/blob/release-0.24.2/cmd/operator/app/thread_chi.go#L31-L40)) does fire every minute, but the resulting `UpdateFunc` calls hit `isGenerationTheSame()` in `reconcileCR` ([v0.24.2 worker-chi-reconciler.go:40-54](https://github.com/Altinity/clickhouse-operator/blob/release-0.24.2/pkg/controller/chi/worker-chi-reconciler.go#L40-L54)) and exit with `nothing to do here, exit`. Periodic resync is therefore not a safety net for this class of failure.
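
For reference, the edge that none of the handlers above reacts to is cheap to detect from the `oldPod`/`newPod` pair that `UpdateFunc` already carries. A minimal sketch, using local stand-in types instead of the real `corev1` ones so it compiles on its own (`isReady` and `becameNotReady` are hypothetical names):

```go
package main

import "fmt"

// PodCondition and Pod are minimal local stand-ins for the corev1 types,
// so the sketch compiles without Kubernetes dependencies.
type PodCondition struct {
	Type   string
	Status string
}

type Pod struct {
	Name       string
	Conditions []PodCondition
}

// isReady reports whether the pod carries a Ready condition with status "True".
func isReady(p *Pod) bool {
	if p == nil {
		return false
	}
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// becameNotReady is true only on the Ready=True -> not-Ready edge.
func becameNotReady(oldPod, newPod *Pod) bool {
	return isReady(oldPod) && !isReady(newPod)
}

func main() {
	ready := &Pod{Name: "chi-clickhouse-c1-0-0-0", Conditions: []PodCondition{{Type: "Ready", Status: "True"}}}
	notReady := &Pod{Name: "chi-clickhouse-c1-0-0-0", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}}

	fmt.Println(becameNotReady(ready, notReady))    // true: the transition from the incident
	fmt.Println(becameNotReady(notReady, notReady)) // false: already NotReady, no new edge
	fmt.Println(becameNotReady(notReady, ready))    // false: recovery direction
}
```

The third case is the only direction the HEAD handler currently looks at, and then only for `Aborted` CHIs.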

**Gap 2 — the decision side.** Even when a reconcile *does* run with the pod in `Ready=False`, `shouldForceRestartHost` returns `false`. The function ([v0.24.2 worker.go:162-193](https://github.com/Altinity/clickhouse-operator/blob/release-0.24.2/pkg/controller/chi/worker.go#L162-L193)) only considers:

```go
// 1. Rolling update purpose
if host.GetCR().IsRollingUpdate() { return true }

// 2. New host
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusNew { return false }

// 3. Existing host without ancestor
if host.GetReconcileAttributes().GetStatus() == api.ObjectStatusSame && !host.HasAncestor() { return false }

// 4. Config-driven reboot rules
if model.IsConfigurationChangeRequiresReboot(host) { return true }

// 5. Unknown version AND CrashLoopBackOff
if host.Runtime.Version.IsUnknown() && w.isPodCrushed(host) { return true }

return false
```

There is **no case for "host's pod has been `Ready=False` for longer than threshold T"**. Note also that case 5 requires `host.Runtime.Version.IsUnknown()` — i.e. the operator was never able to read the version in the first place — *and* `isPodCrushed` checks specifically for `ContainerStatus.State.Waiting.Reason == "CrashLoopBackOff"`. A pod that is happily `Running` but `Ready=False` matches neither sub-condition.
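
To make the gap concrete, here is a compact model of the branch order quoted above. `HostState` is a hypothetical struct of booleans standing in for the operator's real host type, so this is an illustration of the decision table rather than the actual function:

```go
package main

import "fmt"

// HostState is a hypothetical flattening of the inputs the quoted branches read.
type HostState struct {
	RollingUpdate        bool
	New                  bool
	SameWithoutAncestor  bool
	ConfigRequiresReboot bool
	VersionUnknown       bool
	CrashLoopBackOff     bool
	PodReady             bool // present for illustration; no branch reads it
}

// shouldForceRestart mirrors the order of the five quoted cases.
func shouldForceRestart(h HostState) bool {
	switch {
	case h.RollingUpdate:
		return true
	case h.New:
		return false
	case h.SameWithoutAncestor:
		return false
	case h.ConfigRequiresReboot:
		return true
	case h.VersionUnknown && h.CrashLoopBackOff:
		return true
	}
	return false
}

func main() {
	// The incident shape: container Running, version known, pod Ready=False.
	stuck := HostState{PodReady: false}
	healthy := HostState{PodReady: true}
	fmt.Println(shouldForceRestart(stuck), shouldForceRestart(healthy)) // false false

	// Only an unrelated input changes the answer.
	stuck.RollingUpdate = true
	fmt.Println(shouldForceRestart(stuck)) // true
}
```

A stuck host and a healthy host are indistinguishable to this decision, which matches the `Host force restart is not required` line logged in Control B.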

Either gap alone would already be enough to keep an unhealthy pod stuck:

- Closing only Gap 1 (enqueue on `Ready→NotReady`) without closing Gap 2 would just have the operator wake up, run a full reconcile, log `Host force restart is not required`, and go back to sleep. The operator would now do this on each readiness transition instead of never, with the same user-visible outcome.
- Closing only Gap 2 (let `shouldForceRestartHost` react to long-running `Ready=False`) without closing Gap 1 would still require something else to enqueue the CHI in the first place, and on a steady-state `Completed` CHI nothing does.

### Stronger variant: even container restarts don't help

During the early part of the kind run I unintentionally used a 90 % packet-drop ratio. This pushed liveness over its `failure=10` threshold within a few minutes, and `kubectl get pod` showed `RESTARTS=5` while readiness still flapped `False` between restarts. The operator's pod informer saw the resulting pod `Update` events (every container restart, kubelet writes a new `ContainerStatus`), and the worker dropped all of them at the `ReconcileUpdate //ignore` branch above. **Zero operator log lines for the CHI throughout the four extra container restarts**, confirming that the path is broken at the worker level, not at the informer level.

## Why this matters in production

In an HA test that randomly brings nodes up and down, it is common for a freshly-rescheduled ClickHouse pod to land in a state where:

- the underlying network or backing storage is slow but not catastrophically broken,
- the `/ping` endpoint serves slowly enough or fails often enough that probes time out at the readiness threshold,
- but liveness's `failure=10` never accumulates, so kubelet does not restart the container.

The same shape can be produced by any of:

- intermittent CNI loss between kubelet and the pod (the scenario actually used here),
- a slow `clickhouse-keeper`/ZooKeeper backend that makes `/ping` slow above the probe `timeout=1s`,
- transient disk latency on the data PVC large enough to delay HTTP responses past 1 s without crashing the server,
- a partial GC stall or thread pool stall inside `clickhouse-server`,
- a probe-path config that's strict enough relative to `clickhouse-server`'s tail latency under load.

## Proposed fix

I'd like to take a stab at this if the maintainers agree on the direction. Two complementary changes, both small:

1. **`processReconcilePod` (HEAD):** in addition to the existing `recoverAbortedReconcileOnPodReady` call for `NotReady→Ready` on `Aborted` CHIs, add a symmetric path for `Ready→NotReady` on `Completed` CHIs that enqueues a CHI reconcile for the CHI that owns the pod. This re-uses the existing `cmd_queue.NewReconcileCHI(ReconcileAdd, nil, cr)` path rather than a `ReconcileCHI{Update}`, because an update with an unchanged generation is dropped by the `isGenerationTheSame` check described above. A debounce keeps a single flapping pod from generating work-queue storms.

2. **`shouldForceRestartHost`:** add a new case immediately after the `isPodCrushed` check:
   ```go
   if w.isPodSustainedNotReady(ctx, host, threshold) {
       w.a.V(1).M(host).F().Info("Host pod has been NotReady for %s. Restart required.", threshold)
       return true
   }
   ```
   where `isPodSustainedNotReady` returns true iff `Pod.Status.Conditions[Ready].Status == "False"` and `time.Since(LastTransitionTime) >= threshold`. The threshold should be configurable in `Config` (suggested default: 5 min). This keeps the operator conservative — it explicitly does *not* restart on every readiness flap, which would be dangerous for a stateful workload — but ensures a sustained, externally-visible `Ready=False` eventually gets remediated.

To address the stateful-workload-safety concern up front: the proposed restart only triggers when the pod has *already* been `Ready=False` for the full `threshold`, which is much longer than any reasonable transient (probe glitches, GC pauses, deploy rollouts) but much shorter than 26 hours. The pod is by definition already out of the service endpoint slice, so restarting it costs no additional client traffic: it has been serving none for the whole time it was `Ready=False`.

Happy to put a PR up against latest release and a separate backport against `release-0.24.x` or `release-0.26.x` once the approach is confirmed.

## Workaround in the meantime

Until the operator gains a sustained-`Ready=False` reaction, the practical workaround for operators of clusters managed by 0.24.x is to add an external alert that fires when `kube_pod_status_ready{condition="false",pod=~"chi-.*"} == 1` holds for more than 5 minutes (for example a Prometheus alerting rule with `for: 5m`) and remediate manually with one of:

- `kubectl delete pod chi-...-0-0-0`: the StatefulSet will recreate it; this is the safest cluster-side action.
- `kubectl rollout restart deploy/<operator>`: works only for the incidental reason it worked in production (the "different IP" reconcile followed by the `IsRollingUpdate` force-restart branch), so it is not guaranteed to help on a CHI without a rolling-update policy.
- Bumping any `.spec` field on the CHI to force a generation change.

None of these are appropriate for an auto-remediation system to drive without human intervention, which is the gap this issue is asking the operator to close.

## Attachments / artifacts

- `evidence/kind-repro-operator.log` — full operator log capture during the kind reproduction (588 lines).
- `evidence/kind-repro-baseline.txt` — baseline state snapshot taken at the start of the experiment.
- Production operator log around the recovery is available on request (we have ~19 k lines starting at `2026-04-22 05:00:45`, i.e. the moment the manual rollout kicked in). The pre-recovery 26-hour window had already rotated out by the time the incident was investigated, which is one of the reasons this issue is being filed: there is no way to retroactively see what the *stalled* operator was thinking, only the recovered one.

[issue.zip](https://github.com/user-attachments/files/28335314/issue.zip)
