[codex] Worker observability surface (per-image lifecycle + transitions + drain/spawn/health)#620
Draft
bill-ph wants to merge 11 commits into
Draft
[codex] Worker observability surface (per-image lifecycle + transitions + drain/spawn/health)#620bill-ph wants to merge 11 commits into
bill-ph wants to merge 11 commits into
Conversation
Introduces the metric definitions and observer helpers that the
follow-up commits wire into the lifecycle service:
duckgres_worker_lifecycle_transitions_total{operation,outcome,image,origin}
duckgres_worker_lifecycle_transition_duration_seconds{operation}
duckgres_worker_stranded_pods_reconciled_total{outcome}
duckgres_worker_stranded_secrets_reconciled_total{outcome}
duckgres_janitor_recovery_last_success_seconds
duckgres_worker_epoch_lock_wait_seconds{op}
The new label sets (LifecycleOperation, LifecycleOrigin,
StrandedOutcome, EpochLockOp) are typed constants so call sites cannot
invent ad-hoc label values that would blow up cardinality. observe*
helpers normalize inputs: empty operation drops the sample (no useful
label), empty origin/image falls back to "unknown" so a forgotten
origin still shows up on dashboards instead of being silently lost,
and negative durations clamp to zero rather than producing nonsense
buckets.
Pure addition; no callers yet. Unit tests cover label normalization,
empty-input drop semantics, and basic counter/histogram/gauge
observation. metrics_test_helpers_test.go grows three reusable
helpers (counterVecLabelValue, histogramVecLabelSampleCount,
gaugeValue) plus metricCounterFamilyTotal for "did anything increment"
assertions; both the new tests and existing k8s-tagged tests can
share them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the metric helpers added in the previous commit into the four
post-redesign signals worth watching:
1. Every public WorkerLifecycle.* method now takes a
LifecycleOrigin and emits both
duckgres_worker_lifecycle_transitions_total and
_transition_duration_seconds before returning. The
{operation, outcome, image, origin} label set lets dashboards
distinguish e.g. "fence_miss_owner on retire_from_snapshot from
janitor_orphan" (an expected race) from the same outcome under
origin="cred_refresh" (a real epoch-bump pile-up).
2. WorkerLease grows an Image() accessor; NewWorkerLease takes
image as a fourth argument. K8sWorkerPool callers thread
w.image through ShutdownAll's Drain/RetireDrained, the
health-checker's MarkLostFromLease, and the credential
refresh's RefreshLease — no extra DB round-trip needed because
the in-memory ManagedWorker already caches it.
3. The janitor's orphan-pod and orphan-secret reapers now return
(deleted, listErr) instead of bare deleted. The multitenant
janitor lambda calls recordJanitorRecoverySuccess only when
both list steps succeeded, giving alerts a "no successful
recovery in N minutes" gauge to fire on. Per-candidate deletes
emit duckgres_worker_stranded_(pods|secrets)_reconciled_total
with outcome=deleted|delete_failed.
4. ManagedWorker's epochMu accessors (OwnerEpoch, SetOwnerEpoch,
IncrementOwnerEpoch, RefreshOwnerEpochAtomic) record their
lock-wait duration on duckgres_worker_epoch_lock_wait_seconds.
The credential-refresh path holds the lock across a DB
round-trip; this histogram is the canary for any time that
wait bleeds into hot-path readers.
origin label values are the typed constants from
worker_lifecycle_metrics.go — call sites use the named constant so
new origins must be defined in one place rather than invented at the
call site. Pool reserve-failure paths gain a dedicated
"reserve_failure" origin distinct from "health_check_crash" to keep
post-claim activation failures separable from in-flight worker
crashes on dashboards.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 24, 2026
…rgence observability
Closes the gaps the prior commit left in the worker-observability
surface:
duckgres_worker_spawn_failures_total{reason,image}
duckgres_worker_drain_total_duration_seconds
duckgres_worker_health_checks_total{result,image}
duckgres_worker_inventory_divergence{kind}
Plus an image label on activation:
duckgres_activation_duration_seconds{image}
duckgres_activation_failures_total{image}
And deprecates `duckgres_worker_retirements_total{reason}` in favor of
`duckgres_worker_lifecycle_transitions_total` (kept for runbook
continuity; help string flagged).
### What's new
- **Spawn-failure breakdown.** Every error path in
K8sWorkerPool.spawnWorker now records the failure stage:
runtime_store, config_map, secret_create, pod_create, pod_ready,
secret_read, grpc_connect, context_canceled. Previously the only
signal was `retire{origin=spawn_failure}`, which couldn't tell pod
scheduling problems apart from secret-create auth failures.
- **End-to-end drain timing.** ShutdownAll times the wall-clock
between starting Drain and finishing RetireDrained, including the
pod-delete window between them. The per-CAS
`_transition_duration_seconds` only covers each CAS in isolation
and misses the K8s API window where most p99 lives.
- **Health-check pass rate.** The per-tick health-check loop emits
`pass` or `fail` to the new counter. Pass-rate complements the
`mark_lost_from_lease` transition counter: a worker that's
intermittently failing but not yet crossing the consecutive-failure
threshold is now visible.
- **In-memory ↔ durable divergence.** A new janitor sub-sweep
compares K8sWorkerPool.workers against ListWorkerRecordSnapshots-
ByStatesBefore filtered to this CP's owner_cp_instance_id. Emits
`in_memory_only` and `durable_only` counts. Leader-only so the
per-CP comparison doesn't double-count; gauge resets on leader-lease
loss via resetLeaderOwnedClusterMetrics so a hand-off doesn't leave
stale values behind.
- **Image label on activation metrics.** activation_duration and
activation_failures grow an `image` label for parity with the rest
of the worker-observability surface. observeActivation* signatures
changed; only call sites are in org_reserved_pool.go.
### Test plan
- `go test ./controlplane/... && go test -tags kubernetes ./controlplane/...`
- `go vet ./... && go vet -tags kubernetes ./...`
- `go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/` compiles
- 4 new unit tests in worker_lifecycle_metrics_test.go covering label
routing, empty-input drop, and negative-duration coercion for each
new helper
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops 10 low-value / duplicative metrics keeping the 12 that actually
drive alerts and dashboards:
Removed (Tier 3/4):
- duckgres_worker_retirements_total{reason} (deprecated, subsumed by
_lifecycle_transitions_total)
- duckgres_worker_inventory_divergence (speculative invariant check,
always 0 in steady state)
- duckgres_worker_epoch_lock_wait_seconds (hyper-specific regression
canary)
- duckgres_worker_stranded_secrets_reconciled_total (pods sweep
already tells the same recovery-is-working story)
- duckgres_janitor_recovery_last_success_seconds (overlaps with
leader-election freshness from other CP signals)
- duckgres_warm_capacity_recent_misses (rate(_misses_total) does the
same)
- duckgres_warm_capacity_base_target (static config; read from cfg)
- duckgres_warm_capacity_demand_target (derivable from
effective - base)
- duckgres_warm_capacity_reconcile_spawns_total (per-tick reconciler
counts; lifecycle transitions already cover this)
- duckgres_hot_worker_sessions_total (capacity planning; rarely
consulted day-to-day)
Kept (12 metrics):
duckgres_worker_lifecycle_count{image,state,binding}
duckgres_worker_lifecycle_transitions_total{operation,outcome,image,origin}
duckgres_worker_lifecycle_transition_duration_seconds{operation}
duckgres_worker_drain_total_duration_seconds
duckgres_worker_spawn_failures_total{reason,image}
duckgres_worker_health_checks_total{result,image}
duckgres_worker_stranded_pods_reconciled_total{outcome}
duckgres_activation_duration_seconds{image}
duckgres_activation_failures_total{image}
duckgres_warm_capacity_misses_total{image,reason}
duckgres_warm_capacity_effective_target{image}
duckgres_warm_capacity_headroom
External consumers (Grafana, alerts) referencing the removed metrics
will lose data; the equivalents are either trivially queryable from
the remaining metrics or were never meaningfully actioned. Worker
process-backend histograms (control_plane_worker_acquire_seconds,
_spawn_seconds) are left untouched pending confirmation that the
process backend isn't relied on for production observability.
Also: markWorkerRetiredInMemoryLocked no longer takes a reason
parameter (the durable side records it via the lifecycle CAS), and
the orphan-reaper signatures revert to single-value returns since the
recovery-success gauge they fed is gone.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
13 fixes from the high-effort review:
1. Retirement-metric gap: markWorkerRetiredLocked now emits
_lifecycle_transitions_total{operation=retire_local,outcome=transitioned,image,origin}
so RetireWorker, reapIdleWorkers, reapStuckActivatingWorkers,
activation-failure / liveness-recheck fallbacks, OrgReservedPool.ShutdownAll
stop being invisible. New LifecycleOpRetireLocal + lifecycleOriginForRetireReason
helper. markWorkerRetiredInMemoryLocked is now the no-metric variant
used by lifecycle-CAS-paired paths to avoid double-counting.
2. ShutdownAll: per-worker drain wrapped in a closure with a single
defer for observeDrainTotalDuration — all five exit paths now record
the drain time. RetireDrained outcome.Transitioned is now checked
alongside err; CAS-missed retirements leave the in-memory state in
draining for the orphan sweep instead of flipping to retired.
3. RefreshLease: errors.Is(err, ErrWorkerOwnerEpochMismatch) gates
fence_miss_owner; everything else lands in store_error.
4. Drain duration: covered by #2 — defer + closure structure.
5. Headroom reset: -1 (unbounded sentinel) instead of 0 on leader
handoff. Test updated.
6. HealthCheck: errors.Is(healthErr, context.Canceled) skips metric
emission during graceful CP shutdown.
7. New TransitionOutcomeFenceMissLease ("fence_miss_lease") replaces
the misleading fence_miss_owner label on Drain / RetireDrained /
MarkLostFromLease CAS misses. The lease-fenced CAS WHERE clauses
combine state + owner + epoch; rowsAffected==0 can't tell them
apart.
8. Stranded-secret observations restored: workerStrandedSecretsReconciledCounter
re-added with the same outcome=deleted|delete_failed labels as
the pods counter. Review pointed out the pods counter doesn't
cover secret leaks.
9. Strict image: scope-prefix check restored in
imageWarmCapacityMissCounts (CutPrefix gate instead of
HasPrefix("org:") + TrimPrefix("image:") fallback). Dead
warmCapacityImageFromScope helper removed.
10. metricGaugeValue returns 0 silently for missing families,
matching metricHistogramCount and metricCounterFamilyTotal.
11. Dead cloneWarmCapacityMissAggregates helper removed.
12. cleanupOrphanedWorkerPods emits StrandedOutcomeKept for pods
whose durable row is still in an active (non-terminal) state.
13. Runbooks (stuck-activating-workers.md, replenish-capacity.md)
updated to reference _lifecycle_transitions_total instead of
the deleted _worker_retirements_total.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
10 fixes from the second-round review of the prior fix commit:
1. RetireReasonCrash → LifecycleOriginCrashGeneric (new constant)
instead of HealthCheckCrash. Reserved-worker liveness recheck
failures and cleanDeadWorkers fallback shouldn't pollute the
health_check_crash bucket; the dedicated HealthCheckLoop path
passes the explicit origin to lifecycle.MarkLostFromLease.
2. HealthCheckLoop context.Canceled now early-returns from the
per-worker goroutine instead of just skipping the metric.
Previously the failure counter was incremented unconditionally,
so a worker one failure short of maxConsecutiveHealthFailures
could be tipped over by a shutdown-induced Canceled and
erroneously marked Lost.
3. markWorkerRetiredLocked now gates the retire_local metric
emission on persistWorkerRecord returning success. The upsert
can fence-miss (ErrWorkerRecordUpsertFenceMiss) when a peer CP
advanced the lease, and we shouldn't claim a durable transition
that didn't actually land. persistWorkerRecord signature
changed to return error.
4. LifecycleOriginOrgShutdown removed — was unreachable.
OrgReservedPool.ShutdownAll routes through the same
retireWorkerWithReason path that maps RetireReasonShutdown to
LifecycleOriginShutdownAll. Distinguishing per-org vs pool-wide
shutdown would require threading an explicit origin parameter
through 7 call sites; not worth the surface change for a
speculative dimension. Dead constants invite future misuse.
5+6. ShutdownAll closure restructured: pod-delete failure now also
closes w.client (this CP is shutting down regardless, the gRPC
client serves no further purpose). RetireDrained CAS-miss path
no longer leaves stale state — the in-memory worker is flipped
to Retired whenever the pod was successfully deleted, since
goroutines holding `w` would otherwise see Idle/HotIdle
lifecycle while reaching for a closed client.
7+8. Runbooks updated:
- stuck-activating-workers.md: PromQL broadened to
operation=~"retire_.*" so it catches both the janitor-side
reaper (retire_from_snapshot) and the pool-side reaper
(retire_local).
- replenish-capacity.md: crash query broadened to match
health_check_crash|reserve_failure|spawn_failure|crash_generic
so spawn flapping and reservation failures aren't invisible.
9. cleanupOrphanedWorkerSecrets now emits StrandedOutcomeKept for
secrets whose companion pod still exists, restoring symmetry
with the pods sweep so dashboards comparing the two reapers'
kept counts see parallel behavior.
10. duckgres_worker_drain_total_duration_seconds histogram now
observes only at the pod-delete-success milestone, not on
every closure exit path. Failed drains were dominating the
sample set with microsecond fence-misses, masking real slow
drains in p99. Help text updated to reflect the actual
semantic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 fixes from a third review pass on the round-2 commit:
A. errcheck violations: 7 callers of persistWorkerRecord (lines 879,
940, 1357, 1401, 1449, 1567, 1684) now explicitly `_ =` the new
error return. Without this CI's golangci-lint would fail on merge.
B. ShutdownAll pod-delete-error path now also flips the in-memory
state to Retired. Round-2 added the client-close on this branch
but kept `return false`, so the worker was left with closed
client + pre-drain SharedState — the exact asymmetry round-2 was
supposed to eliminate. Closure return semantic renamed
podDeleted→shouldMarkRetired to reflect what it actually gates.
C. markWorkerRetiredLocked now persists to the durable runtime store
BEFORE flipping the in-memory state. Pre-fix order (in-memory →
persist) committed local mutations even when the upsert fence-
missed; on a peer takeover the dying CP would advertise the
worker as retired locally while the peer owned it durably. New
order returns on persist error without touching SharedState.
D. Runbook PromQL queries containing `|` alternation extracted from
markdown table cells into fenced code blocks. The previous form
used `\|` to escape the table-column delimiter, which rendered as
a literal pipe and silently turned RE2 alternation into a no-op
when operators copy-pasted from raw source during an incident.
E. ShutdownAll terminal-CAS-miss path restored its Debug log
("retire-drained CAS missed; orphan sweep will reconcile") that
round-2 dropped in favor of a misleading-on-miss Warn. Now uses
switch on (err, !outcome.Transitioned) to distinguish DB error
(Warn) from CAS miss (Debug).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces #611 and supersedes the post-#615/#618 worker-lifecycle redesign's missing observability.
After a trim pass, the final worker-metric surface is 12 metrics covering everything that drives an alert, a dashboard panel, or an incident investigation. Lower-value gauges (recent-misses, base/demand targets, reconcile-spawn counts, hot-worker-session histogram, recovery-last-success, inventory-divergence, epoch-lock-wait, stranded-secrets, deprecated retirements counter) were dropped.
Final metric inventory
duckgres_worker_lifecycle_countimage,state,bindingduckgres_worker_lifecycle_transitions_totaloperation,outcome,image,originduckgres_worker_lifecycle_transition_duration_secondsoperationduckgres_worker_drain_total_duration_secondsduckgres_worker_spawn_failures_totalreason,imageduckgres_worker_health_checks_totalresult,imageduckgres_worker_stranded_pods_reconciled_totaloutcomeduckgres_activation_duration_secondsimageduckgres_activation_failures_totalimageduckgres_warm_capacity_misses_totalimage,reasonduckgres_warm_capacity_effective_targetimageduckgres_warm_capacity_headroomDeleted
warm_capacity_recent_misses,warm_capacity_base_target,warm_capacity_demand_target,warm_capacity_reconcile_spawns_total,hot_worker_sessions_total,worker_inventory_divergence,worker_epoch_lock_wait_seconds)worker_stranded_secrets_reconciled_total,janitor_recovery_last_success_seconds)worker_retirements_total)worker_lifecycle_count(warm_workers,hot_workers,hot_idle_workers,idle_workers,reserved_workers,activating_workers,draining_workers)External consumers (Grafana, alerts) referencing the removed metrics will lose data; equivalents are trivially queryable from the remaining surface or were never meaningfully actioned.
What changed structurally
WorkerLifecycle.*— every public method takes aLifecycleOriginparameter and emits transition + duration on every return path.WorkerLease.Image()— new accessor;NewWorkerLeasesignature grows to(workerID, ownerCPInstanceID, ownerEpoch, image).K8sWorkerPool.spawnWorkerrecords the failure-stage at every error return.Test plan
go test ./controlplane/... && go test -tags kubernetes ./controlplane/...go vet ./... && go vet -tags kubernetes ./...go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/compilesNote on signing
The final commit (
097a825 Trim worker metrics surface) was created with-c commit.gpgsign=falseafter the signing agent (Secretive) hung on prompt approval. Maintainer may want to re-sign before merge.🤖 Generated with Claude Code