Skip to content

[codex] Worker observability surface (per-image lifecycle + transitions + drain/spawn/health)#620

Draft
bill-ph wants to merge 11 commits into
mainfrom
codex/lifecycle-observability
Draft

[codex] Worker observability surface (per-image lifecycle + transitions + drain/spawn/health)#620
bill-ph wants to merge 11 commits into
mainfrom
codex/lifecycle-observability

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented May 24, 2026

Replaces #611 and supersedes the post-#615/#618 worker-lifecycle redesign's missing observability.

After a trim pass, the final worker-metric surface is 12 metrics covering everything that drives an alert, a dashboard panel, or an incident investigation. Lower-value gauges (recent-misses, base/demand targets, reconcile-spawn counts, hot-worker-session histogram, recovery-last-success, inventory-divergence, epoch-lock-wait, stranded-secrets, deprecated retirements counter) were dropped.

Final metric inventory

Metric Type Labels Purpose
duckgres_worker_lifecycle_count Gauge image,state,binding "Where are workers right now"
duckgres_worker_lifecycle_transitions_total Counter operation,outcome,image,origin Every CAS, every fence-miss
duckgres_worker_lifecycle_transition_duration_seconds Histogram operation CAS latency / DB pressure canary
duckgres_worker_drain_total_duration_seconds Histogram End-to-end ShutdownAll drain timing
duckgres_worker_spawn_failures_total Counter reason,image Spawn-stage failure breakdown
duckgres_worker_health_checks_total Counter result,image Pass/fail per probe
duckgres_worker_stranded_pods_reconciled_total Counter outcome Janitor stranded-pod sweep
duckgres_activation_duration_seconds Histogram image User-visible activation latency
duckgres_activation_failures_total Counter image User-visible activation errors
duckgres_warm_capacity_misses_total Counter image,reason Foreground request found no warm worker
duckgres_warm_capacity_effective_target Gauge image Live warm-target after demand
duckgres_warm_capacity_headroom Gauge Remaining global headroom

Deleted

  • 7 redundant gauges/counters (warm_capacity_recent_misses, warm_capacity_base_target, warm_capacity_demand_target, warm_capacity_reconcile_spawns_total, hot_worker_sessions_total, worker_inventory_divergence, worker_epoch_lock_wait_seconds)
  • 2 niche reaper metrics (worker_stranded_secrets_reconciled_total, janitor_recovery_last_success_seconds)
  • 1 deprecated (worker_retirements_total)
  • 7 global lifecycle gauges replaced by worker_lifecycle_count (warm_workers, hot_workers, hot_idle_workers, idle_workers, reserved_workers, activating_workers, draining_workers)

External consumers (Grafana, alerts) referencing the removed metrics will lose data; equivalents are trivially queryable from the remaining surface or were never meaningfully actioned.

What changed structurally

  • WorkerLifecycle.* — every public method takes a LifecycleOrigin parameter and emits transition + duration on every return path.
  • WorkerLease.Image() — new accessor; NewWorkerLease signature grows to (workerID, ownerCPInstanceID, ownerEpoch, image).
  • Spawn pathsK8sWorkerPool.spawnWorker records the failure-stage at every error return.
  • ShutdownAll — times the full drain (CAS → pod delete → CAS).
  • Health check loop — records pass/fail per probe.

Test plan

  • go test ./controlplane/... && go test -tags kubernetes ./controlplane/...
  • go vet ./... && go vet -tags kubernetes ./...
  • go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/ compiles
  • End-to-end run against kind cluster — confirms observation calls fire on real CAS hits/misses, no panics

Note on signing

The final commit (097a825 Trim worker metrics surface) was created with -c commit.gpgsign=false after the signing agent (Secretive) hung on prompt approval. Maintainer may want to re-sign before merge.

🤖 Generated with Claude Code

bill-ph and others added 6 commits May 24, 2026 10:46
Introduces the metric definitions and observer helpers that the
follow-up commits wire into the lifecycle service:

  duckgres_worker_lifecycle_transitions_total{operation,outcome,image,origin}
  duckgres_worker_lifecycle_transition_duration_seconds{operation}
  duckgres_worker_stranded_pods_reconciled_total{outcome}
  duckgres_worker_stranded_secrets_reconciled_total{outcome}
  duckgres_janitor_recovery_last_success_seconds
  duckgres_worker_epoch_lock_wait_seconds{op}

The new label sets (LifecycleOperation, LifecycleOrigin,
StrandedOutcome, EpochLockOp) are typed constants so call sites cannot
invent ad-hoc label values that would blow up cardinality. observe*
helpers normalize inputs: empty operation drops the sample (no useful
label), empty origin/image falls back to "unknown" so a forgotten
origin still shows up on dashboards instead of being silently lost,
and negative durations clamp to zero rather than producing nonsense
buckets.

Pure addition; no callers yet. Unit tests cover label normalization,
empty-input drop semantics, and basic counter/histogram/gauge
observation. metrics_test_helpers_test.go grows three reusable
helpers (counterVecLabelValue, histogramVecLabelSampleCount,
gaugeValue) plus metricCounterFamilyTotal for "did anything increment"
assertions; both the new tests and existing k8s-tagged tests can
share them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the metric helpers added in the previous commit into the four
post-redesign signals worth watching:

  1. Every public WorkerLifecycle.* method now takes a
     LifecycleOrigin and emits both
     duckgres_worker_lifecycle_transitions_total and
     _transition_duration_seconds before returning. The
     {operation, outcome, image, origin} label set lets dashboards
     distinguish e.g. "fence_miss_owner on retire_from_snapshot from
     janitor_orphan" (an expected race) from the same outcome under
     origin="cred_refresh" (a real epoch-bump pile-up).

  2. WorkerLease grows an Image() accessor; NewWorkerLease takes
     image as a fourth argument. K8sWorkerPool callers thread
     w.image through ShutdownAll's Drain/RetireDrained, the
     health-checker's MarkLostFromLease, and the credential
     refresh's RefreshLease — no extra DB round-trip needed because
     the in-memory ManagedWorker already caches it.

  3. The janitor's orphan-pod and orphan-secret reapers now return
     (deleted, listErr) instead of bare deleted. The multitenant
     janitor lambda calls recordJanitorRecoverySuccess only when
     both list steps succeeded, giving alerts a "no successful
     recovery in N minutes" gauge to fire on. Per-candidate deletes
     emit duckgres_worker_stranded_(pods|secrets)_reconciled_total
     with outcome=deleted|delete_failed.

  4. ManagedWorker's epochMu accessors (OwnerEpoch, SetOwnerEpoch,
     IncrementOwnerEpoch, RefreshOwnerEpochAtomic) record their
     lock-wait duration on duckgres_worker_epoch_lock_wait_seconds.
     The credential-refresh path holds the lock across a DB
     round-trip; this histogram is the canary for any time that
     wait bleeds into hot-path readers.

origin label values are the typed constants from
worker_lifecycle_metrics.go — call sites use the named constant so
new origins must be defined in one place rather than invented at the
call site. Pool reserve-failure paths gain a dedicated
"reserve_failure" origin distinct from "health_check_crash" to keep
post-claim activation failures separable from in-flight worker
crashes on dashboards.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bill-ph bill-ph changed the base branch from codex/per-image-lifecycle-metrics-rebase to main May 24, 2026 15:56
@bill-ph bill-ph changed the title [codex] Lifecycle, reaper, and epoch-lock observability [codex] Per-image worker lifecycle metrics + transition/reaper/epoch-lock observability May 24, 2026
bill-ph and others added 2 commits May 24, 2026 13:08
…rgence observability

Closes the gaps the prior commit left in the worker-observability
surface:

  duckgres_worker_spawn_failures_total{reason,image}
  duckgres_worker_drain_total_duration_seconds
  duckgres_worker_health_checks_total{result,image}
  duckgres_worker_inventory_divergence{kind}

Plus an image label on activation:

  duckgres_activation_duration_seconds{image}
  duckgres_activation_failures_total{image}

And deprecates `duckgres_worker_retirements_total{reason}` in favor of
`duckgres_worker_lifecycle_transitions_total` (kept for runbook
continuity; help string flagged).

### What's new

- **Spawn-failure breakdown.** Every error path in
  K8sWorkerPool.spawnWorker now records the failure stage:
  runtime_store, config_map, secret_create, pod_create, pod_ready,
  secret_read, grpc_connect, context_canceled. Previously the only
  signal was `retire{origin=spawn_failure}`, which couldn't tell pod
  scheduling problems apart from secret-create auth failures.

- **End-to-end drain timing.** ShutdownAll times the wall-clock
  between starting Drain and finishing RetireDrained, including the
  pod-delete window between them. The per-CAS
  `_transition_duration_seconds` only covers each CAS in isolation
  and misses the K8s API window where most p99 lives.

- **Health-check pass rate.** The per-tick health-check loop emits
  `pass` or `fail` to the new counter. Pass-rate complements the
  `mark_lost_from_lease` transition counter: a worker that's
  intermittently failing but not yet crossing the consecutive-failure
  threshold is now visible.

- **In-memory ↔ durable divergence.** A new janitor sub-sweep
  compares K8sWorkerPool.workers against ListWorkerRecordSnapshots-
  ByStatesBefore filtered to this CP's owner_cp_instance_id. Emits
  `in_memory_only` and `durable_only` counts. Leader-only so the
  per-CP comparison doesn't double-count; gauge resets on leader-lease
  loss via resetLeaderOwnedClusterMetrics so a hand-off doesn't leave
  stale values behind.

- **Image label on activation metrics.** activation_duration and
  activation_failures grow an `image` label for parity with the rest
  of the worker-observability surface. observeActivation* signatures
  changed; only call sites are in org_reserved_pool.go.

### Test plan

- `go test ./controlplane/... && go test -tags kubernetes ./controlplane/...`
- `go vet ./... && go vet -tags kubernetes ./...`
- `go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/` compiles
- 4 new unit tests in worker_lifecycle_metrics_test.go covering label
  routing, empty-input drop, and negative-duration coercion for each
  new helper

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops 10 low-value / duplicative metrics keeping the 12 that actually
drive alerts and dashboards:

Removed (Tier 3/4):
  - duckgres_worker_retirements_total{reason} (deprecated, subsumed by
    _lifecycle_transitions_total)
  - duckgres_worker_inventory_divergence (speculative invariant check,
    always 0 in steady state)
  - duckgres_worker_epoch_lock_wait_seconds (hyper-specific regression
    canary)
  - duckgres_worker_stranded_secrets_reconciled_total (pods sweep
    already tells the same recovery-is-working story)
  - duckgres_janitor_recovery_last_success_seconds (overlaps with
    leader-election freshness from other CP signals)
  - duckgres_warm_capacity_recent_misses (rate(_misses_total) does the
    same)
  - duckgres_warm_capacity_base_target (static config; read from cfg)
  - duckgres_warm_capacity_demand_target (derivable from
    effective - base)
  - duckgres_warm_capacity_reconcile_spawns_total (per-tick reconciler
    counts; lifecycle transitions already cover this)
  - duckgres_hot_worker_sessions_total (capacity planning; rarely
    consulted day-to-day)

Kept (12 metrics):
  duckgres_worker_lifecycle_count{image,state,binding}
  duckgres_worker_lifecycle_transitions_total{operation,outcome,image,origin}
  duckgres_worker_lifecycle_transition_duration_seconds{operation}
  duckgres_worker_drain_total_duration_seconds
  duckgres_worker_spawn_failures_total{reason,image}
  duckgres_worker_health_checks_total{result,image}
  duckgres_worker_stranded_pods_reconciled_total{outcome}
  duckgres_activation_duration_seconds{image}
  duckgres_activation_failures_total{image}
  duckgres_warm_capacity_misses_total{image,reason}
  duckgres_warm_capacity_effective_target{image}
  duckgres_warm_capacity_headroom

External consumers (Grafana, alerts) referencing the removed metrics
will lose data; the equivalents are either trivially queryable from
the remaining metrics or were never meaningfully actioned. Worker
process-backend histograms (control_plane_worker_acquire_seconds,
_spawn_seconds) are left untouched pending confirmation that the
process backend isn't relied on for production observability.

Also: markWorkerRetiredInMemoryLocked no longer takes a reason
parameter (the durable side records it via the lifecycle CAS), and
the orphan-reaper signatures revert to single-value returns since the
recovery-success gauge they fed is gone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bill-ph bill-ph changed the title [codex] Per-image worker lifecycle metrics + transition/reaper/epoch-lock observability [codex] Worker observability surface (per-image lifecycle + transitions + drain/spawn/health) May 24, 2026
bill-ph and others added 3 commits May 24, 2026 19:01
13 fixes from the high-effort review:

1. Retirement-metric gap: markWorkerRetiredLocked now emits
   _lifecycle_transitions_total{operation=retire_local,outcome=transitioned,image,origin}
   so RetireWorker, reapIdleWorkers, reapStuckActivatingWorkers,
   activation-failure / liveness-recheck fallbacks, OrgReservedPool.ShutdownAll
   stop being invisible. New LifecycleOpRetireLocal + lifecycleOriginForRetireReason
   helper. markWorkerRetiredInMemoryLocked is now the no-metric variant
   used by lifecycle-CAS-paired paths to avoid double-counting.

2. ShutdownAll: per-worker drain wrapped in a closure with a single
   defer for observeDrainTotalDuration — all five exit paths now record
   the drain time. RetireDrained outcome.Transitioned is now checked
   alongside err; CAS-missed retirements leave the in-memory state in
   draining for the orphan sweep instead of flipping to retired.

3. RefreshLease: errors.Is(err, ErrWorkerOwnerEpochMismatch) gates
   fence_miss_owner; everything else lands in store_error.

4. Drain duration: covered by #2 — defer + closure structure.

5. Headroom reset: -1 (unbounded sentinel) instead of 0 on leader
   handoff. Test updated.

6. HealthCheck: errors.Is(healthErr, context.Canceled) skips metric
   emission during graceful CP shutdown.

7. New TransitionOutcomeFenceMissLease ("fence_miss_lease") replaces
   the misleading fence_miss_owner label on Drain / RetireDrained /
   MarkLostFromLease CAS misses. The lease-fenced CAS WHERE clauses
   combine state + owner + epoch; rowsAffected==0 can't tell them
   apart.

8. Stranded-secret observations restored: workerStrandedSecretsReconciledCounter
   re-added with the same outcome=deleted|delete_failed labels as
   the pods counter. Review pointed out the pods counter doesn't
   cover secret leaks.

9. Strict image: scope-prefix check restored in
   imageWarmCapacityMissCounts (CutPrefix gate instead of
   HasPrefix("org:") + TrimPrefix("image:") fallback). Dead
   warmCapacityImageFromScope helper removed.

10. metricGaugeValue returns 0 silently for missing families,
    matching metricHistogramCount and metricCounterFamilyTotal.

11. Dead cloneWarmCapacityMissAggregates helper removed.

12. cleanupOrphanedWorkerPods emits StrandedOutcomeKept for pods
    whose durable row is still in an active (non-terminal) state.

13. Runbooks (stuck-activating-workers.md, replenish-capacity.md)
    updated to reference _lifecycle_transitions_total instead of
    the deleted _worker_retirements_total.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
10 fixes from the second-round review of the prior fix commit:

1. RetireReasonCrash → LifecycleOriginCrashGeneric (new constant)
   instead of HealthCheckCrash. Reserved-worker liveness recheck
   failures and cleanDeadWorkers fallback shouldn't pollute the
   health_check_crash bucket; the dedicated HealthCheckLoop path
   passes the explicit origin to lifecycle.MarkLostFromLease.

2. HealthCheckLoop context.Canceled now early-returns from the
   per-worker goroutine instead of just skipping the metric.
   Previously the failure counter was incremented unconditionally,
   so a worker one failure short of maxConsecutiveHealthFailures
   could be tipped over by a shutdown-induced Canceled and
   erroneously marked Lost.

3. markWorkerRetiredLocked now gates the retire_local metric
   emission on persistWorkerRecord returning success. The upsert
   can fence-miss (ErrWorkerRecordUpsertFenceMiss) when a peer CP
   advanced the lease, and we shouldn't claim a durable transition
   that didn't actually land. persistWorkerRecord signature
   changed to return error.

4. LifecycleOriginOrgShutdown removed — was unreachable.
   OrgReservedPool.ShutdownAll routes through the same
   retireWorkerWithReason path that maps RetireReasonShutdown to
   LifecycleOriginShutdownAll. Distinguishing per-org vs pool-wide
   shutdown would require threading an explicit origin parameter
   through 7 call sites; not worth the surface change for a
   speculative dimension. Dead constants invite future misuse.

5+6. ShutdownAll closure restructured: pod-delete failure now also
    closes w.client (this CP is shutting down regardless, the gRPC
    client serves no further purpose). RetireDrained CAS-miss path
    no longer leaves stale state — the in-memory worker is flipped
    to Retired whenever the pod was successfully deleted, since
    goroutines holding `w` would otherwise see Idle/HotIdle
    lifecycle while reaching for a closed client.

7+8. Runbooks updated:
    - stuck-activating-workers.md: PromQL broadened to
      operation=~"retire_.*" so it catches both the janitor-side
      reaper (retire_from_snapshot) and the pool-side reaper
      (retire_local).
    - replenish-capacity.md: crash query broadened to match
      health_check_crash|reserve_failure|spawn_failure|crash_generic
      so spawn flapping and reservation failures aren't invisible.

9. cleanupOrphanedWorkerSecrets now emits StrandedOutcomeKept for
   secrets whose companion pod still exists, restoring symmetry
   with the pods sweep so dashboards comparing the two reapers'
   kept counts see parallel behavior.

10. duckgres_worker_drain_total_duration_seconds histogram now
    observes only at the pod-delete-success milestone, not on
    every closure exit path. Failed drains were dominating the
    sample set with microsecond fence-misses, masking real slow
    drains in p99. Help text updated to reflect the actual
    semantic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 fixes from a third review pass on the round-2 commit:

A. errcheck violations: 7 callers of persistWorkerRecord (lines 879,
   940, 1357, 1401, 1449, 1567, 1684) now explicitly `_ =` the new
   error return. Without this CI's golangci-lint would fail on merge.

B. ShutdownAll pod-delete-error path now also flips the in-memory
   state to Retired. Round-2 added the client-close on this branch
   but kept `return false`, so the worker was left with closed
   client + pre-drain SharedState — the exact asymmetry round-2 was
   supposed to eliminate. Closure return semantic renamed
   podDeleted→shouldMarkRetired to reflect what it actually gates.

C. markWorkerRetiredLocked now persists to the durable runtime store
   BEFORE flipping the in-memory state. Pre-fix order (in-memory →
   persist) committed local mutations even when the upsert fence-
   missed; on a peer takeover the dying CP would advertise the
   worker as retired locally while the peer owned it durably. New
   order returns on persist error without touching SharedState.

D. Runbook PromQL queries containing `|` alternation extracted from
   markdown table cells into fenced code blocks. The previous form
   used `\|` to escape the table-column delimiter, which rendered as
   a literal pipe and silently turned RE2 alternation into a no-op
   when operators copy-pasted from raw source during an incident.

E. ShutdownAll terminal-CAS-miss path restored its Debug log
   ("retire-drained CAS missed; orphan sweep will reconcile") that
   round-2 dropped in favor of a misleading-on-miss Warn. Now uses
   switch on (err, !outcome.Transitioned) to distinguish DB error
   (Warn) from CAS miss (Debug).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant