Skip to content

StatefulSet recreate path silently leaves host at Replicas=0 when graceful scale-to-0 Update fails (statefulset-reconciler.go) #1990

@dashashutosh80

Description

@dashashutosh80

Summary

When the operator falls into the Update → Recreate StatefulSet path and r.sts.Update() inside doDeleteStatefulSet returns any non-nil error, three compounding bugs in pkg/controller/common/statefulset/statefulset-reconciler.go cause the reconcile to end with the StatefulSet stuck at spec.replicas: 0, no further retry, and the CHI marked Completed. The only recovery is to bounce the operator pod or kubectl scale --replicas=1 manually.

We hit this in production during a chart upgrade 0.24.2 → 0.26.3 (full sanitized trace attached as operator-logs.txt). The bugs are version-agnostic: the affected file is byte-identical between release-0.26.1, release-0.26.2, release-0.26.3, release-0.27.0, and master.

This issue is intermittent. The buggy code path is reached deterministically on every reconcile that takes the Update → Recreate branch, but whether the cascade pins a stuck end-state depends on whether r.sts.Update() at line 527 inside doDeleteStatefulSet returns a non-nil error in that particular call. The same trigger sequence on the same cluster will sometimes produce a clean recreate and sometimes the stuck cascade. Our reproduction (see below) reflects this intermittency by retrying until the cascade fires; production hit it on the first attempt during a chart upgrade 0.24.2 → 0.26.3.

Affected versions

  • Reproduced: 0.26.3
  • Code-path verified unchanged in: 0.27.0, master (HEAD)
  • Likely present since recreateStatefulSet was introduced in this form (≥ 0.24.x)

Root cause: three compounding bugs

All line numbers are against pkg/controller/common/statefulset/statefulset-reconciler.go at tag release-0.26.3. Same lines on master.

The three bugs compound: any single one in isolation would either be benign or self-recover. Together, they produce silent end-of-reconcile at Replicas=0 with CHI=Completed. Bug 2 is the chain's first link — fixing Bug 2 alone breaks the cascade and resolves the production failure (see Proposed fix below). Bugs 1 and 3 are described here for completeness.

Bug 1 — recreateStatefulSet discards the error from doDeleteStatefulSet (line 195)

func (r *Reconciler) recreateStatefulSet(ctx context.Context, host *api.Host, register bool, opts *ReconcileOptions) error {
    ...
    _ = r.doDeleteStatefulSet(ctx, host)   // <-- error silently discarded
    _ = r.storage.ReconcilePVCs(ctx, host, api.DesiredStatefulSet)
    return r.createStatefulSet(ctx, host, register, opts)
}

If the graceful delete fails, the recreate proceeds anyway. The original STS is still in the cluster, so the next call will predictably fail with AlreadyExists.

Bug 2 — doDeleteStatefulSet aborts before reaching Delete if the scale-to-0 Update returns any error (lines 525-530)

var zero int32 = 0
host.Runtime.CurStatefulSet.Spec.Replicas = &zero
if _, err := r.sts.Update(ctx, host.Runtime.CurStatefulSet); err != nil {
    log.V(1).M(host).Error("UNABLE to update StatefulSet %s/%s", namespace, name)
    return err   // <-- returns here, never reaches r.sts.Delete() at line 536
}

The function name says doDeleteStatefulSet, the contract is "the STS is gone after I return", but the body abandons the goal on a transient Update error and never falls through to r.sts.Delete() — which is what would actually fulfill the contract.

Bug 3 — doCreateStatefulSet collapses every Create error (incl. AlreadyExists) to ErrCRUDRecreate, which is silently mapped to nil (lines 411-414 and 388-391)

// doCreateStatefulSet
if _, err := r.sts.Create(ctx, statefulSet); err != nil {
    log.V(1).M(host).F().Error("StatefulSet create failed. err: %v", err)
    return common.ErrCRUDRecreate
}
// shouldAbortOrContinueCreateStatefulSet
case common.ErrCRUDRecreate:
    r.a.V(1).M(host).Warning("Got recreate action. Ignore and continue for now")
    return nil

The parent caller reconcileHostStatefulSet (pkg/controller/chi/worker-reconciler-chi.go:413) registers the STS as successfully reconciled and the CHI moves to Completed despite the host being at Replicas=0.

Production trace

Full sanitized log: operator-logs.txt. Key sequence (line numbers in [brackets] reference the source file above):

hostScaleDown(): Host shutdown via scale down: 0-0
diff item [15]:'.Replicas' = '0'                                        # STS now at Replicas=0
updateStatefulSet(...) - started
updateStatefulSet(...) switch from Update to Recreate                   # recreateStatefulSet begins [186]

doDeleteStatefulSet(): .../chi-clickhouse-chcluster1-0-0                 [508]
UNABLE to update StatefulSet .../chi-clickhouse-chcluster1-0-0           [528]    # Bug 2 fires
                                                                                  # r.sts.Update() returned a non-nil error;
                                                                                  # function returned, never called Delete()

createStatefulSet(...) - started
doCreateStatefulSet(): ...
StatefulSet create failed. err: statefulsets.apps "..." already exists   [412]    # consequence of Bugs 1+2
Got recreate action. Ignore and continue for now                         [390]    # Bug 3 fires
finalizeReconcileAndMarkCompleted(): reconcile completed successfully             # CHI -> Completed

Note: the underlying error type returned by r.sts.Update() at line 527 in our production trace is not directly visible in the log because the logging at line 528 (log.V(1).M(host).Error("UNABLE to update StatefulSet %s/%s", namespace, name)) only formats namespace/name and silently discards the err value. Consequently, while the cascade signature is unambiguous (Bug 2 fired), the specific apiserver error class (409 / 422 / 403 / 503 / network) at this call site is unknown from the log alone. Bug 2's return err is unconditional — the cascade fires identically for any non-nil error, so the proposed fix is correct regardless. Logging the underlying error here would be a useful diagnostic improvement on its own.

What the production log does prove: Operation cannot be fulfilled ... the object has been modified (the apiserver's verbatim 409 Conflict message) occurs on this cluster at runtime — see cr.go:82 statusUpdateRetry() at log line 2065. This site catches and retries it correctly. The doDeleteStatefulSet site is identical machinery without retry-on-conflict logic.

End state, persisted across operator reconciles:

$ kubectl -n default get sts chi-clickhouse-chcluster1-0-0
NAME                            READY   AGE
chi-clickhouse-chcluster1-0-0   0/0     28d

$ kubectl -n default get chi clickhouse -o jsonpath='{.status.status}'
Completed

The operator never picks the CHI up again. Bouncing the operator pod recovers it.

What triggers this in production

Two contributing factors that together produced the stuck end-state in our case:

  1. Operator chart upgrade with pod-template diff (0.24.2 → 0.26.3). All existing CHIs reconcile at startup of the new operator pod; the chart upgrade introduced fields that landed in the immutable region of StatefulSet.spec (selector, serviceName, volumeClaimTemplates), which made doUpdateStatefulSet fail at line 470 and the operator switch to the recreate path. This is directly observed in the production operator log.
  2. A non-nil error returned by r.sts.Update() at line 527 during doDeleteStatefulSet. The log confirms this failure occurred (UNABLE to update StatefulSet, line 528) but does not reveal the underlying error class. The most likely candidate is a 409 Conflict race against kube-controller-manager writing .status.observedGeneration / .status.currentRevision / .status.replicas during pod termination after hostScaleDown set Replicas=0 ~9 seconds earlier. The operator's Get → Update window inside doDeleteStatefulSet (lines 512-527) was 261 ms in our production trace (10:22:20.325250 → 10:22:20.586212), wide enough for multiple status writes by kube-controller-manager to land inside it. The 409-via-RV-race is a strongly-supported hypothesis, not a directly-confirmed fact. Other plausible candidates (422 Invalid, 403 Forbidden from a webhook, 503 apiserver/etcd transient, network timeout) are all consistent with the same log evidence and produce the same downstream cascade because Bug 2 doesn't differentiate.

Whatever the specific apiserver error was, factor #2 is intermittent: same trigger #1 on the same cluster sometimes produces a clean recreate and sometimes the stuck cascade, depending on the millisecond-scale interleaving inside doDeleteStatefulSet's Get → Update window. Our reproduction (below) confirms the cascade fires from 409 Conflict specifically, but the proposed fix corrects Bug 2 for any non-nil error class.

Reproduction (kind cluster)

repro/ — full reproduction with real 409 Conflict. Topology matches production: 1 CHI + 1 CHK on stock operator 0.26.3.

The setup:

  • Pre-stage STS to 0 replicas (simulates production's hostSoftwareRestart-fail → hostScaleDown outcome on operator startup)
  • Run a Go racer (rv-racer/) doing 32 concurrent Get → mutate annotations → Update loops on the STS — faithful proxy for kube-controller-manager's STS controller writing .status fields during pod transitions
  • Patch CHI.spec.templates.volumeClaimTemplates[0].spec.resources.requests.storage (immutable on STS) — forces doUpdateStatefulSet to fail 422 Invalid, switching operator to recreate path
  • recreateStatefulSet → doDeleteStatefulSet → Get (RV=N) → racer's update lands (RV=N+1) → operator's Update → real 409 Conflict → Bug 2 fires → cascade

Because the bug is intermittent (the racer must land an Update in the operator's ~5-30 ms Get → Update window), the script retries up to 8 times until the cascade fires; typically fires on attempt 1-3.

A control test (run-control.sh) runs the same trigger sequence with the racer disabled — it recovers cleanly. This A/B proves the 409 is the differentiator between healthy reconcile and the stuck cascade.

Side-by-side cascade comparison

Event Source line Production Kind reproduction
switch from Update to Recreate :292 I0519 10:22:20.309900 I0523 05:11:17.096780
doDeleteStatefulSet started :508 I0519 10:22:20.586211 I0523 05:11:17.103113
UNABLE to update StatefulSet :528 E0519 10:22:20.586212 E0523 05:11:17.131153
StatefulSet create failed ... already exists :412 E0519 10:22:20.911908 E0523 05:11:17.175664
Got recreate action. Ignore and continue for now :390 W0519 10:22:22.067685 W0523 05:11:17.241865
reconcile completed successfully worker.go:415 I0519 10:54:08.271354 I0523 05:12:30.431607

(Full repro log: repro/minimal/operator.log. Cascade lines only: repro/minimal/cascade.log.)

Proposed fix

The minimal correct fix is a single change to doDeleteStatefulSet. A second optional change to recreateStatefulSet adds defense-in-depth without altering existing CRUD-error semantics.

Patch A (canonical) — doDeleteStatefulSet must honor its contract

The function's name and contract say "the StatefulSet is gone after I return". Today, if the courtesy scale-to-0 Update fails for any reason, the function returns at line 529 and never calls Delete — the very call that would have actually removed the object. Fall through to Delete regardless of the Update outcome. Delete is what guarantees the contract; the scale-to-0 is a graceful-shutdown courtesy that's safe to skip when it fails.

func (r *Reconciler) doDeleteStatefulSet(ctx context.Context, host *api.Host) error {
    name := r.namer.Name(interfaces.NameStatefulSet, host)
    namespace := host.Runtime.Address.Namespace
    log.V(1).M(host).F().Info("%s/%s", namespace, name)

    cur, err := r.sts.Get(ctx, host)
    if err != nil {
        if apiErrors.IsNotFound(err) {
            log.V(1).M(host).Info("NEUTRAL not found StatefulSet %s/%s", namespace, name)
            return nil
        }
        log.V(1).M(host).F().Error("FAIL get StatefulSet %s/%s err:%v", namespace, name, err)
        return err
    }
    host.Runtime.CurStatefulSet = cur

    // Courtesy scale-to-0 for graceful pod termination.
    // Skip if already at 0; on any failure, proceed to Delete which terminates pods anyway.
    if cur.Spec.Replicas == nil || *cur.Spec.Replicas != 0 {
        var zero int32 = 0
        cur.Spec.Replicas = &zero
        if _, uerr := r.sts.Update(ctx, cur); uerr != nil {
            // INTENTIONAL: do not return here. The function contract is "STS is gone" and
            // r.sts.Delete below is what fulfills it. Update-to-zero failures are common
            // (e.g. 409 Conflict from kube-controller-manager writing .status fields during
            // pod transitions) and must not block the actual delete.
            log.V(1).M(host).Warning(
                "Scale-to-0 update failed (%v) - proceeding to Delete which will terminate pods", uerr)
        } else {
            _ = r.hostObjectsPoller.WaitHostStatefulSetReady(ctx, host)
        }
    } else {
        log.V(1).M(host).Info("StatefulSet %s/%s already at Replicas=0, skipping scale-down", namespace, name)
    }

    if derr := r.sts.Delete(ctx, namespace, name); derr != nil && !apiErrors.IsNotFound(derr) {
        log.V(1).M(host).F().Error("FAIL delete StatefulSet %s/%s err: %v", namespace, name, derr)
        return derr
    }
    log.V(1).M(host).Info("OK delete StatefulSet %s/%s", namespace, name)
    return nil
}

This single change resolves the bug end-to-end: Delete removes the StatefulSet, the subsequent r.sts.Create() succeeds (no AlreadyExists), the host completes its recreate cleanly, and the CHI legitimately moves to Completed. Bugs 1 and 3 become academic because their cascade preconditions are no longer met.

The fix is correct against all error types (409 Conflict, 422 Invalid, 403 Forbidden, 503 Unavailable, network timeouts) since all are handled identically by deferring to Delete. It preserves graceful shutdown when the courtesy Update succeeds, and doesn't change CRUD error propagation — no risk of accidentally flipping CHIs to ReconcileAbort.

Patch B (optional, defense-in-depth) — recreateStatefulSet shouldn't silently swallow delete errors

With Patch A in place, doDeleteStatefulSet only returns a non-nil error if Get or Delete itself failed in an unrecoverable way. That signal shouldn't be discarded:

func (r *Reconciler) recreateStatefulSet(ctx context.Context, host *api.Host, register bool, opts *ReconcileOptions) error {
    if util.IsContextDone(ctx) {
        return nil
    }
    if err := r.doDeleteStatefulSet(ctx, host); err != nil {
        r.a.V(1).M(host).Warning(
            "Recreate deferred: delete failed: %v - will retry on next reconcile", err)
        return err   // plain error — propagates to ReconcileStatefulSet caller, NOT ErrCRUDAbort
    }
    _ = r.storage.ReconcilePVCs(ctx, host, api.DesiredStatefulSet)
    return r.createStatefulSet(ctx, host, register, opts)
}

Returning a plain error (not common.ErrCRUDAbort) keeps the failure transient — the next reconcile pass will retry, matching the contract for create/update failures handled via OnStatefulSetCreateFailed/OnStatefulSetUpdateFailed. Returning ErrCRUDAbort here would flip the CHI to ReconcileAbort and require manual intervention, which is the wrong policy for transient errors.

Happy to open a PR with Patch A (and optionally Patch B) plus a unit test that constructs a fake IKubeSTS returning Conflict on the scale-to-0 Update and asserts that doDeleteStatefulSet (a) still calls Delete, (b) returns nil, and that recreateStatefulSet end-to-end produces a fresh StatefulSet rather than leaving the cluster at Replicas=0.

Environment

  • Operator: 0.26.3 (image tag pinned, upstream chart, no patches)
  • ClickHouse: 25.8.17
  • Kubernetes: 1.27
  • reconcile.statefulSet.recreate.onUpdateFailure: recreate (upstream default)
  • reconcile.statefulSet.update.onFailure: abort (upstream default)
  • CHI: 1 shard × 1 replica, separate CHK with 3 keepers

Attached files (All part of the attached zip)

  • clickhouse.yaml — sanitized ClickHouseInstallation manifest reconciled in the production trace.
  • operator-values.yaml — sanitized values.yaml for the operator chart.
  • operator-logs.txt — sanitized operator pod log covering the failing reconcile pass.
  • repro/ — kind reproduction (Go racer + reproduction script + control test + captured operator logs + stuck CHI/STS yaml).

In the three top-level attachments, only environment-specific identifiers were renamed (namespace, internal user name, pod IPs, image registry, internal Secret/Volume names). Line numbers, timestamps, message text, and diff structure are preserved verbatim.

issue.zip

Metadata

Metadata

Assignees

No one assigned

    Labels

    plannedThis functionality is planned for implementation

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions