StatefulSet recreate path silently leaves host at Replicas=0 when graceful scale-to-0 Update fails (statefulset-reconciler.go)

## Summary

When the operator falls into the `Update → Recreate` StatefulSet path and `r.sts.Update()` inside `doDeleteStatefulSet` returns *any* non-nil error, three compounding bugs in `pkg/controller/common/statefulset/statefulset-reconciler.go` cause the reconcile to end with the StatefulSet stuck at `spec.replicas: 0`, no further retry, and the CHI marked **`Completed`**. The only recovery is to bounce the operator pod or `kubectl scale --replicas=1` manually.

We hit this in production during a chart upgrade `0.24.2 → 0.26.3` (full sanitized trace attached as `operator-logs.txt`). The bugs are version-agnostic: the affected file is **byte-identical** between `release-0.26.1`, `release-0.26.2`, `release-0.26.3`, `release-0.27.0`, and `master`.

**This issue is intermittent.** The buggy code path is reached deterministically on every reconcile that takes the `Update → Recreate` branch, but whether the cascade pins a stuck end-state depends on whether `r.sts.Update()` at line 527 inside `doDeleteStatefulSet` returns a non-nil error in that particular call. The same trigger sequence on the same cluster will sometimes produce a clean recreate and sometimes the stuck cascade. Our reproduction (see below) reflects this intermittency by retrying until the cascade fires; production hit it on the first attempt during a chart upgrade `0.24.2 → 0.26.3`.

## Affected versions

- Reproduced: **0.26.3**
- Code-path verified unchanged in: **0.27.0**, **master** (HEAD)
- Likely present since `recreateStatefulSet` was introduced in this form (≥ 0.24.x)

## Root cause: three compounding bugs

All line numbers are against `pkg/controller/common/statefulset/statefulset-reconciler.go` at tag `release-0.26.3`. Same lines on `master`.

The three bugs **compound**: any single one in isolation would either be benign or self-recover. Together, they produce silent end-of-reconcile at `Replicas=0` with `CHI=Completed`. **Bug 2 is the chain's first link** — fixing Bug 2 alone breaks the cascade and resolves the production failure (see `Proposed fix` below). Bugs 1 and 3 are described here for completeness.

### Bug 1 — `recreateStatefulSet` discards the error from `doDeleteStatefulSet` (line 195)

```go
func (r *Reconciler) recreateStatefulSet(ctx context.Context, host *api.Host, register bool, opts *ReconcileOptions) error {
    ...
    _ = r.doDeleteStatefulSet(ctx, host)   // <-- error silently discarded
    _ = r.storage.ReconcilePVCs(ctx, host, api.DesiredStatefulSet)
    return r.createStatefulSet(ctx, host, register, opts)
}
```

If the graceful delete fails, the recreate proceeds anyway. The original STS is still in the cluster, so the next call will predictably fail with `AlreadyExists`.

### Bug 2 — `doDeleteStatefulSet` aborts before reaching `Delete` if the scale-to-0 `Update` returns any error (lines 525-530)

```go
var zero int32 = 0
host.Runtime.CurStatefulSet.Spec.Replicas = &zero
if _, err := r.sts.Update(ctx, host.Runtime.CurStatefulSet); err != nil {
    log.V(1).M(host).Error("UNABLE to update StatefulSet %s/%s", namespace, name)
    return err   // <-- returns here, never reaches r.sts.Delete() at line 536
}
```

The function name says `doDeleteStatefulSet`, the contract is "the STS is gone after I return", but the body abandons the goal on a transient `Update` error and never falls through to `r.sts.Delete()` — which is what would actually fulfill the contract.

### Bug 3 — `doCreateStatefulSet` collapses every `Create` error (incl. `AlreadyExists`) to `ErrCRUDRecreate`, which is silently mapped to `nil` (lines 411-414 and 388-391)

```go
// doCreateStatefulSet
if _, err := r.sts.Create(ctx, statefulSet); err != nil {
    log.V(1).M(host).F().Error("StatefulSet create failed. err: %v", err)
    return common.ErrCRUDRecreate
}
```

```go
// shouldAbortOrContinueCreateStatefulSet
case common.ErrCRUDRecreate:
    r.a.V(1).M(host).Warning("Got recreate action. Ignore and continue for now")
    return nil
```

The parent caller `reconcileHostStatefulSet` (`pkg/controller/chi/worker-reconciler-chi.go:413`) registers the STS as successfully reconciled and the CHI moves to `Completed` despite the host being at `Replicas=0`.

## Production trace

Full sanitized log: `operator-logs.txt`. Key sequence (line numbers in `[brackets]` reference the source file above):

```
hostScaleDown(): Host shutdown via scale down: 0-0
diff item [15]:'.Replicas' = '0'                                        # STS now at Replicas=0
updateStatefulSet(...) - started
updateStatefulSet(...) switch from Update to Recreate                   # recreateStatefulSet begins [186]

doDeleteStatefulSet(): .../chi-clickhouse-chcluster1-0-0                 [508]
UNABLE to update StatefulSet .../chi-clickhouse-chcluster1-0-0           [528]    # Bug 2 fires
                                                                                  # r.sts.Update() returned a non-nil error;
                                                                                  # function returned, never called Delete()

createStatefulSet(...) - started
doCreateStatefulSet(): ...
StatefulSet create failed. err: statefulsets.apps "..." already exists   [412]    # consequence of Bugs 1+2
Got recreate action. Ignore and continue for now                         [390]    # Bug 3 fires
finalizeReconcileAndMarkCompleted(): reconcile completed successfully             # CHI -> Completed
```

Note: the underlying error type returned by `r.sts.Update()` at line 527 in our production trace is **not directly visible in the log** because the logging at line 528 (`log.V(1).M(host).Error("UNABLE to update StatefulSet %s/%s", namespace, name)`) only formats `namespace/name` and silently discards the `err` value. Consequently, while the cascade signature is unambiguous (Bug 2 fired), the specific apiserver error class (409 / 422 / 403 / 503 / network) at *this* call site is unknown from the log alone. **Bug 2's `return err` is unconditional — the cascade fires identically for any non-nil error, so the proposed fix is correct regardless.** Logging the underlying error here would be a useful diagnostic improvement on its own.

What the production log *does* prove: `Operation cannot be fulfilled ... the object has been modified` (the apiserver's verbatim `409 Conflict` message) occurs on this cluster at runtime — see `cr.go:82` `statusUpdateRetry()` at log line 2065. This site catches and retries it correctly. The `doDeleteStatefulSet` site is identical machinery without retry-on-conflict logic.

End state, persisted across operator reconciles:

```
$ kubectl -n default get sts chi-clickhouse-chcluster1-0-0
NAME                            READY   AGE
chi-clickhouse-chcluster1-0-0   0/0     28d

$ kubectl -n default get chi clickhouse -o jsonpath='{.status.status}'
Completed
```

The operator never picks the CHI up again. Bouncing the operator pod recovers it.

## What triggers this in production

Two contributing factors that together produced the stuck end-state in our case:

1. **Operator chart upgrade with pod-template diff** (`0.24.2 → 0.26.3`). All existing CHIs reconcile at startup of the new operator pod; the chart upgrade introduced fields that landed in the immutable region of `StatefulSet.spec` (selector, serviceName, volumeClaimTemplates), which made `doUpdateStatefulSet` fail at line 470 and the operator switch to the recreate path. **This is directly observed in the production operator log.**
2. **A non-nil error returned by `r.sts.Update()` at line 527** during `doDeleteStatefulSet`. The log confirms this failure occurred (`UNABLE to update StatefulSet`, line 528) but does not reveal the underlying error class. The most likely candidate is a `409 Conflict` race against `kube-controller-manager` writing `.status.observedGeneration` / `.status.currentRevision` / `.status.replicas` during pod termination after `hostScaleDown` set `Replicas=0` ~9 seconds earlier. The operator's `Get → Update` window inside `doDeleteStatefulSet` (lines 512-527) was 261 ms in our production trace (10:22:20.325250 → 10:22:20.586212), wide enough for multiple status writes by `kube-controller-manager` to land inside it. **The 409-via-RV-race is a strongly-supported hypothesis, not a directly-confirmed fact.** Other plausible candidates (`422 Invalid`, `403 Forbidden` from a webhook, `503` apiserver/etcd transient, network timeout) are all consistent with the same log evidence and produce the same downstream cascade because Bug 2 doesn't differentiate.

Whatever the specific apiserver error was, factor #2 is intermittent: same trigger #1 on the same cluster sometimes produces a clean recreate and sometimes the stuck cascade, depending on the millisecond-scale interleaving inside `doDeleteStatefulSet`'s `Get → Update` window. **Our reproduction (below) confirms the cascade fires from `409 Conflict` specifically, but the proposed fix corrects Bug 2 for any non-nil error class.**

## Reproduction (kind cluster)

`repro/` — full reproduction with **real 409 Conflict**. Topology matches production: 1 CHI + 1 CHK on stock operator `0.26.3`.

The setup:
- Pre-stage `STS` to 0 replicas (simulates production's `hostSoftwareRestart`-fail → `hostScaleDown` outcome on operator startup)
- Run a Go racer (`rv-racer/`) doing 32 concurrent `Get → mutate annotations → Update` loops on the STS — faithful proxy for `kube-controller-manager`'s STS controller writing `.status` fields during pod transitions
- Patch `CHI.spec.templates.volumeClaimTemplates[0].spec.resources.requests.storage` (immutable on STS) — forces `doUpdateStatefulSet` to fail `422 Invalid`, switching operator to recreate path
- `recreateStatefulSet → doDeleteStatefulSet → Get` (RV=N) → racer's update lands (RV=N+1) → operator's `Update` → real **409 Conflict** → Bug 2 fires → cascade

Because the bug is intermittent (the racer must land an Update in the operator's ~5-30 ms `Get → Update` window), the script retries up to 8 times until the cascade fires; typically fires on attempt 1-3.

A **control test** (`run-control.sh`) runs the same trigger sequence with the racer disabled — it recovers cleanly. This A/B proves the 409 is the differentiator between healthy reconcile and the stuck cascade.

### Side-by-side cascade comparison

| Event | Source line | Production | Kind reproduction |
|-------|-------------|------------|-------------------|
| `switch from Update to Recreate` | `:292` | I0519 10:22:20.309900 | I0523 05:11:17.096780 |
| `doDeleteStatefulSet` started | `:508` | I0519 10:22:20.586211 | I0523 05:11:17.103113 |
| `UNABLE to update StatefulSet` | `:528` | E0519 10:22:20.586212 | E0523 05:11:17.131153 |
| `StatefulSet create failed ... already exists` | `:412` | E0519 10:22:20.911908 | E0523 05:11:17.175664 |
| `Got recreate action. Ignore and continue for now` | `:390` | W0519 10:22:22.067685 | W0523 05:11:17.241865 |
| `reconcile completed successfully` | `worker.go:415` | I0519 10:54:08.271354 | I0523 05:12:30.431607 |

(Full repro log: `repro/minimal/operator.log`. Cascade lines only: `repro/minimal/cascade.log`.)

## Proposed fix

The minimal correct fix is a single change to `doDeleteStatefulSet`. A second optional change to `recreateStatefulSet` adds defense-in-depth without altering existing CRUD-error semantics.

### Patch A (canonical) — `doDeleteStatefulSet` must honor its contract

The function's name and contract say "the StatefulSet is gone after I return". Today, if the courtesy scale-to-0 `Update` fails for *any* reason, the function returns at line 529 and never calls `Delete` — the very call that would have actually removed the object. **Fall through to `Delete` regardless of the `Update` outcome.** `Delete` is what guarantees the contract; the scale-to-0 is a graceful-shutdown courtesy that's safe to skip when it fails.

```go
func (r *Reconciler) doDeleteStatefulSet(ctx context.Context, host *api.Host) error {
    name := r.namer.Name(interfaces.NameStatefulSet, host)
    namespace := host.Runtime.Address.Namespace
    log.V(1).M(host).F().Info("%s/%s", namespace, name)

    cur, err := r.sts.Get(ctx, host)
    if err != nil {
        if apiErrors.IsNotFound(err) {
            log.V(1).M(host).Info("NEUTRAL not found StatefulSet %s/%s", namespace, name)
            return nil
        }
        log.V(1).M(host).F().Error("FAIL get StatefulSet %s/%s err:%v", namespace, name, err)
        return err
    }
    host.Runtime.CurStatefulSet = cur

    // Courtesy scale-to-0 for graceful pod termination.
    // Skip if already at 0; on any failure, proceed to Delete which terminates pods anyway.
    if cur.Spec.Replicas == nil || *cur.Spec.Replicas != 0 {
        var zero int32 = 0
        cur.Spec.Replicas = &zero
        if _, uerr := r.sts.Update(ctx, cur); uerr != nil {
            // INTENTIONAL: do not return here. The function contract is "STS is gone" and
            // r.sts.Delete below is what fulfills it. Update-to-zero failures are common
            // (e.g. 409 Conflict from kube-controller-manager writing .status fields during
            // pod transitions) and must not block the actual delete.
            log.V(1).M(host).Warning(
                "Scale-to-0 update failed (%v) - proceeding to Delete which will terminate pods", uerr)
        } else {
            _ = r.hostObjectsPoller.WaitHostStatefulSetReady(ctx, host)
        }
    } else {
        log.V(1).M(host).Info("StatefulSet %s/%s already at Replicas=0, skipping scale-down", namespace, name)
    }

    if derr := r.sts.Delete(ctx, namespace, name); derr != nil && !apiErrors.IsNotFound(derr) {
        log.V(1).M(host).F().Error("FAIL delete StatefulSet %s/%s err: %v", namespace, name, derr)
        return derr
    }
    log.V(1).M(host).Info("OK delete StatefulSet %s/%s", namespace, name)
    return nil
}
```

This single change resolves the bug end-to-end: `Delete` removes the StatefulSet, the subsequent `r.sts.Create()` succeeds (no `AlreadyExists`), the host completes its recreate cleanly, and the CHI legitimately moves to `Completed`. Bugs 1 and 3 become academic because their cascade preconditions are no longer met.

The fix is correct against all error types (409 Conflict, 422 Invalid, 403 Forbidden, 503 Unavailable, network timeouts) since all are handled identically by deferring to `Delete`. It preserves graceful shutdown when the courtesy `Update` succeeds, and doesn't change CRUD error propagation — no risk of accidentally flipping CHIs to `ReconcileAbort`.

### Patch B (optional, defense-in-depth) — `recreateStatefulSet` shouldn't silently swallow delete errors

With Patch A in place, `doDeleteStatefulSet` only returns a non-nil error if `Get` or `Delete` itself failed in an unrecoverable way. That signal shouldn't be discarded:

```go
func (r *Reconciler) recreateStatefulSet(ctx context.Context, host *api.Host, register bool, opts *ReconcileOptions) error {
    if util.IsContextDone(ctx) {
        return nil
    }
    if err := r.doDeleteStatefulSet(ctx, host); err != nil {
        r.a.V(1).M(host).Warning(
            "Recreate deferred: delete failed: %v - will retry on next reconcile", err)
        return err   // plain error — propagates to ReconcileStatefulSet caller, NOT ErrCRUDAbort
    }
    _ = r.storage.ReconcilePVCs(ctx, host, api.DesiredStatefulSet)
    return r.createStatefulSet(ctx, host, register, opts)
}
```

Returning a plain `error` (not `common.ErrCRUDAbort`) keeps the failure transient — the next reconcile pass will retry, matching the contract for create/update failures handled via `OnStatefulSetCreateFailed`/`OnStatefulSetUpdateFailed`. Returning `ErrCRUDAbort` here would flip the CHI to `ReconcileAbort` and require manual intervention, which is the wrong policy for transient errors.

Happy to open a PR with Patch A (and optionally Patch B) plus a unit test that constructs a fake `IKubeSTS` returning `Conflict` on the scale-to-0 `Update` and asserts that `doDeleteStatefulSet` (a) still calls `Delete`, (b) returns `nil`, and that `recreateStatefulSet` end-to-end produces a fresh StatefulSet rather than leaving the cluster at `Replicas=0`.

## Environment

- Operator: 0.26.3 (image tag pinned, upstream chart, no patches)
- ClickHouse: 25.8.17
- Kubernetes: 1.27
- `reconcile.statefulSet.recreate.onUpdateFailure: recreate` (upstream default)
- `reconcile.statefulSet.update.onFailure: abort` (upstream default)
- CHI: 1 shard × 1 replica, separate CHK with 3 keepers

## Attached files (All part of the attached zip)

- `clickhouse.yaml` — sanitized `ClickHouseInstallation` manifest reconciled in the production trace.
- `operator-values.yaml` — sanitized `values.yaml` for the operator chart.
- `operator-logs.txt` — sanitized operator pod log covering the failing reconcile pass.
- `repro/` — kind reproduction (Go racer + reproduction script + control test + captured operator logs + stuck CHI/STS yaml).

In the three top-level attachments, only environment-specific identifiers were renamed (namespace, internal user name, pod IPs, image registry, internal Secret/Volume names). Line numbers, timestamps, message text, and diff structure are preserved verbatim.

[issue.zip](https://github.com/user-attachments/files/28204442/issue.zip)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StatefulSet recreate path silently leaves host at Replicas=0 when graceful scale-to-0 Update fails (statefulset-reconciler.go) #1990

Summary

Affected versions

Root cause: three compounding bugs

Bug 1 — `recreateStatefulSet` discards the error from `doDeleteStatefulSet` (line 195)

Bug 2 — `doDeleteStatefulSet` aborts before reaching `Delete` if the scale-to-0 `Update` returns any error (lines 525-530)

Bug 3 — `doCreateStatefulSet` collapses every `Create` error (incl. `AlreadyExists`) to `ErrCRUDRecreate`, which is silently mapped to `nil` (lines 411-414 and 388-391)

Production trace

What triggers this in production

Reproduction (kind cluster)

Side-by-side cascade comparison

Proposed fix

Patch A (canonical) — `doDeleteStatefulSet` must honor its contract

Patch B (optional, defense-in-depth) — `recreateStatefulSet` shouldn't silently swallow delete errors

Environment

Attached files (All part of the attached zip)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Event	Source line	Production	Kind reproduction
`switch from Update to Recreate`	`:292`	I0519 10:22:20.309900	I0523 05:11:17.096780
`doDeleteStatefulSet` started	`:508`	I0519 10:22:20.586211	I0523 05:11:17.103113
`UNABLE to update StatefulSet`	`:528`	E0519 10:22:20.586212	E0523 05:11:17.131153
`StatefulSet create failed ... already exists`	`:412`	E0519 10:22:20.911908	E0523 05:11:17.175664
`Got recreate action. Ignore and continue for now`	`:390`	W0519 10:22:22.067685	W0523 05:11:17.241865
`reconcile completed successfully`	`worker.go:415`	I0519 10:54:08.271354	I0523 05:12:30.431607

StatefulSet recreate path silently leaves host at Replicas=0 when graceful scale-to-0 Update fails (statefulset-reconciler.go) #1990

Description

Summary

Affected versions

Root cause: three compounding bugs

Bug 1 — recreateStatefulSet discards the error from doDeleteStatefulSet (line 195)

Bug 2 — doDeleteStatefulSet aborts before reaching Delete if the scale-to-0 Update returns any error (lines 525-530)

Bug 3 — doCreateStatefulSet collapses every Create error (incl. AlreadyExists) to ErrCRUDRecreate, which is silently mapped to nil (lines 411-414 and 388-391)

Production trace

What triggers this in production

Reproduction (kind cluster)

Side-by-side cascade comparison

Proposed fix

Patch A (canonical) — doDeleteStatefulSet must honor its contract

Patch B (optional, defense-in-depth) — recreateStatefulSet shouldn't silently swallow delete errors

Environment

Attached files (All part of the attached zip)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Bug 1 — `recreateStatefulSet` discards the error from `doDeleteStatefulSet` (line 195)

Bug 2 — `doDeleteStatefulSet` aborts before reaching `Delete` if the scale-to-0 `Update` returns any error (lines 525-530)

Bug 3 — `doCreateStatefulSet` collapses every `Create` error (incl. `AlreadyExists`) to `ErrCRUDRecreate`, which is silently mapped to `nil` (lines 411-414 and 388-391)

Patch A (canonical) — `doDeleteStatefulSet` must honor its contract

Patch B (optional, defense-in-depth) — `recreateStatefulSet` shouldn't silently swallow delete errors