
[codex] Add per-image worker lifecycle metrics (rebased) #621

Closed
bill-ph wants to merge 4 commits into main from codex/per-image-lifecycle-metrics-rebase

bill-ph commented May 24, 2026

Replaces #611. Same change, rebased onto post-redesign main (PRs #614/#615/#616/#617/#618/#619 landed during the original review window and conflicted with the original branch). #611 is being closed in favor of this PR; #620 is stacked on top of this branch.

Summary

  • Add duckgres_worker_lifecycle_count{image,state,ownership} for active worker counts by image and lifecycle state.
  • Aggregate the metric from cp_runtime.worker_records, including hot and hot_idle workers while excluding terminal lost/retired rows.
  • Remove the old global lifecycle gauges (duckgres_warm_workers, duckgres_hot_workers, etc.) and update runbooks to use the new metric.
  • Keep existing warm-capacity target/miss/headroom metrics unchanged.
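To make the label shape concrete, here is a minimal sketch of how samples of the new gauge might look in the Prometheus text exposition format. The `Sample` type and `RenderSamples` helper are hypothetical and not part of this PR; the real code presumably goes through a Prometheus client library rather than formatting lines by hand, and `%q` only approximates Prometheus label escaping for simple ASCII values.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical in-memory form of one gauge sample.
// The field names mirror the metric's three labels.
type Sample struct {
	Image     string
	State     string
	Ownership string
	Count     int
}

// RenderSamples returns one exposition-style line per sample, ordered by
// image, then state, then ownership so the output is deterministic.
func RenderSamples(samples []Sample) []string {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Sample(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Image != b.Image {
			return a.Image < b.Image
		}
		if a.State != b.State {
			return a.State < b.State
		}
		return a.Ownership < b.Ownership
	})

	lines := make([]string, 0, len(sorted))
	for _, s := range sorted {
		lines = append(lines, fmt.Sprintf(
			"duckgres_worker_lifecycle_count{image=%q,state=%q,ownership=%q} %d",
			s.Image, s.State, s.Ownership, s.Count))
	}
	return lines
}

func main() {
	// Illustrative values only; image names here are made up.
	samples := []Sample{
		{Image: "img-b", State: "idle", Ownership: "neutral", Count: 3},
		{Image: "img-a", State: "hot", Ownership: "org_owned", Count: 2},
	}
	for _, line := range RenderSamples(samples) {
		fmt.Println(line)
	}
}
```

Each distinct (image, state, ownership) combination becomes its own time series, which is what lets a dashboard plot per-image counts for states such as hot and hot_idle.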

Why

The current Prometheus metrics split worker observability between global lifecycle gauges and per-image warm-capacity gauges. That leaves dashboards unable to show historical per-image counts for states like hot and hot_idle.

This adds a canonical current-count gauge over the runtime worker records so Grafana can build per-image-state time series without querying config-store SQL directly.

Notes

  • ownership is neutral when org_id is empty/null and org_owned otherwise.
  • Active states are spawning, idle, reserved, activating, hot, hot_idle, and draining.
  • Rows with blank image are skipped because the metric is specifically per-image.
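The rules in these notes can be sketched as a small aggregation function. This is an illustration under stated assumptions, not the code in this PR: `WorkerRecord`, `LifecycleKey`, and `CountLifecycle` are hypothetical names, and the record struct is a simplified stand-in for a row of cp_runtime.worker_records, whose real schema is not shown here.

```go
package main

import "fmt"

// WorkerRecord is a hypothetical, simplified stand-in for one row of
// cp_runtime.worker_records.
type WorkerRecord struct {
	Image string
	State string
	OrgID string // empty models an empty/null org_id
}

// LifecycleKey mirrors the metric's three labels.
type LifecycleKey struct {
	Image     string
	State     string
	Ownership string
}

// activeStates holds the states the notes list as active. Terminal states
// such as lost and retired are deliberately absent.
var activeStates = map[string]bool{
	"spawning":   true,
	"idle":       true,
	"reserved":   true,
	"activating": true,
	"hot":        true,
	"hot_idle":   true,
	"draining":   true,
}

// ownershipLabel maps an org ID to the ownership label value.
func ownershipLabel(orgID string) string {
	if orgID == "" {
		return "neutral"
	}
	return "org_owned"
}

// CountLifecycle counts active workers per (image, state, ownership).
// Rows with a blank image or a non-active state are skipped.
func CountLifecycle(records []WorkerRecord) map[LifecycleKey]int {
	counts := make(map[LifecycleKey]int)
	for _, r := range records {
		if r.Image == "" || !activeStates[r.State] {
			continue
		}
		key := LifecycleKey{
			Image:     r.Image,
			State:     r.State,
			Ownership: ownershipLabel(r.OrgID),
		}
		counts[key]++
	}
	return counts
}

func main() {
	// Illustrative rows only; image and org names are made up.
	records := []WorkerRecord{
		{Image: "img-a", State: "hot", OrgID: "org-1"},
		{Image: "img-a", State: "hot", OrgID: "org-2"},
		{Image: "img-a", State: "idle"},
		{Image: "img-a", State: "lost"}, // terminal, skipped
		{Image: "", State: "hot"},       // blank image, skipped
	}
	for key, n := range CountLifecycle(records) {
		fmt.Printf("%+v => %d\n", key, n)
	}
}
```

Because the counts are rebuilt from the current records on each pass, a combination that drops to zero simply disappears from the map; a real exporter would also need to clear or zero the corresponding series so stale values are not left behind.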

Rebase delta vs #611

  • multitenant.go: kept post-PR-4 lifecycle wiring; added janitor.onStop = resetLeaderOwnedClusterMetrics from this branch; dropped dead lambda assignments that #617 (PR 4: privatize unsafe lifecycle transition paths) removed.
  • k8s_pool.go: removed orphan observeWarmPoolLifecycleGauges call at line 3014 (callee was deleted on main).

Validation

  • go test ./controlplane/... && go test -tags kubernetes ./controlplane/...
  • go vet ./... && go vet -tags kubernetes ./...
  • go test -c -tags 'k8s_integration kubernetes' ./tests/k8s/ compiles

🤖 Generated with Claude Code

bill-ph commented May 24, 2026

Consolidating into #620 — same diff plus the observability work, single PR. The codex/per-image-lifecycle-metrics-rebase branch is left in place since #620's branch is built on top of it.

bill-ph closed this May 24, 2026