[codex] Add per-image worker lifecycle metrics (rebased) #621
Closed
bill-ph wants to merge 4 commits into
This was referenced May 24, 2026
Replaces #611. Same change, rebased onto post-redesign main (PRs #614/#615/#616/#617/#618/#619 landed during the original review window and conflicted with the original branch). #611 is being closed in favor of this PR; #620 is stacked on top of this branch.
Summary
- duckgres_worker_lifecycle_count{image,state,ownership} for active worker counts by image and lifecycle state.
- […] cp_runtime.worker_records, including hot and hot_idle workers while excluding terminal lost/retired rows.
- […] (duckgres_warm_workers, duckgres_hot_workers, etc.) and update runbooks to use the new metric.

Why
The current Prometheus metrics split worker observability between global lifecycle gauges and per-image warm-capacity gauges. That leaves dashboards unable to show historical per-image counts for states like hot and hot_idle.

This adds a canonical current-count gauge over the runtime worker records so Grafana can build per-image-state time series without querying config-store SQL directly.
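
For reviewers who want a mental model of the gauge, the aggregation can be pictured as a group-by over worker records keyed on the metric's three labels. The following is a minimal sketch, not the code in this PR: `WorkerRecord`, `LifecycleKey`, `activeStates`, and `CountByLifecycle` are hypothetical names, the struct is an assumed trimmed-down view of a `cp_runtime.worker_records` row, and treating the seven states listed in the Notes as the full set of counted states is an assumption.

```go
package main

import "fmt"

// WorkerRecord is a hypothetical, trimmed-down view of a row in
// cp_runtime.worker_records; the real schema is not shown in this PR.
type WorkerRecord struct {
	Image string
	State string
	OrgID string // empty means the worker has no owning org
}

// LifecycleKey mirrors the metric's label set: image, state, ownership.
type LifecycleKey struct {
	Image     string
	State     string
	Ownership string
}

// activeStates is assumed to be the set of counted (non-terminal) states.
// Terminal states such as lost and retired are deliberately absent.
var activeStates = map[string]bool{
	"spawning": true, "idle": true, "reserved": true, "activating": true,
	"hot": true, "hot_idle": true, "draining": true,
}

// ownership maps an org ID to the ownership label value:
// neutral when the org ID is empty, org_owned otherwise.
func ownership(orgID string) string {
	if orgID == "" {
		return "neutral"
	}
	return "org_owned"
}

// CountByLifecycle groups records by (image, state, ownership), skipping
// rows with no image and rows whose state is not in activeStates.
func CountByLifecycle(records []WorkerRecord) map[LifecycleKey]int {
	counts := make(map[LifecycleKey]int)
	for _, r := range records {
		if r.Image == "" || !activeStates[r.State] {
			continue
		}
		key := LifecycleKey{Image: r.Image, State: r.State, Ownership: ownership(r.OrgID)}
		counts[key]++
	}
	return counts
}

func main() {
	records := []WorkerRecord{
		{Image: "img-a", State: "hot", OrgID: "org-1"},
		{Image: "img-a", State: "hot", OrgID: "org-2"},
		{Image: "img-a", State: "idle"},
		{Image: "img-b", State: "retired", OrgID: "org-1"}, // terminal, skipped
		{Image: "", State: "hot"},                          // no image, skipped
	}
	for key, n := range CountByLifecycle(records) {
		fmt.Printf("image=%s state=%s ownership=%s count=%d\n",
			key.Image, key.State, key.Ownership, n)
	}
}
```

Each map entry would correspond to one labeled series of the gauge. Recomputing the whole map on each refresh, rather than incrementing and decrementing on transitions, is one way to keep the values consistent with the records table, though whether the PR does this is not stated here.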
Notes
- ownership is neutral when org_id is empty/null and org_owned otherwise.
- […] spawning, idle, reserved, activating, hot, hot_idle, and draining.
- […] image are skipped because the metric is specifically per-image.

Rebase delta vs #611
- multitenant.go: kept post-PR-4 lifecycle wiring; added janitor.onStop = resetLeaderOwnedClusterMetrics from this branch; dropped dead lambda assignments that #617 (PR 4: privatize unsafe lifecycle transition paths) removed.
- k8s_pool.go: removed orphan observeWarmPoolLifecycleGauges call at line 3014 (callee was deleted on main).

Validation
- go test ./controlplane/... && go test -tags kubernetes ./controlplane/...
- go vet ./... && go vet -tags kubernetes ./...
- go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/ compiles

🤖 Generated with Claude Code