xpk: add xpk workload status for team-routed workload diagnostics #1181

Draft

ultrons wants to merge 10 commits into AI-Hypercomputer:main from ultrons:workload-status-pr

Conversation

ultrons commented Apr 30, 2026

Stacked on #1180. Per @jamOne-'s feedback on #1180 (thread), splitting the focused-workload-diagnosis surface out of the team-routing PR so each can be reviewed in isolation.

What this adds

A new `xpk workload status` subcommand that answers "why is my job stuck?" with a short, plain-English diagnosis:

$ xpk workload status --cluster=foo --team=ml-perf --workload=conv-run-42
Workload : conv-run-42  ->  jobset-conv-run-42-a3f1
Age      : 12m
Status   : QUEUED — waiting for quota
Position : 3rd in line  (2 workload(s) ahead: foo-bench-01, bar-eval-09)
Team quota (poc-ml-perf):
  Quota   : 4096 chips nominal  +0 borrow  = 4096 max
  Running : 4096 chips (2 workload(s) admitted)
  Queued  : 3 workload(s) waiting for quota
Diagnosis: Things look normal — waiting behind other workloads.

This differs from `xpk inspector`: `inspector` is the kitchen-sink debug dump for SREs investigating cluster-wide bugs, while `workload status` is the focused, user-facing answer.

States the diagnosis recognises (a classification sketch follows the list):

  • RUNNING (admitted)
  • QUEUED (with queue position + workloads ahead)
  • STUCK (quota reserved but admission check failed; surfaces the underlying error and a specific fix when the failure is the well-known "slice name > 49 chars" one)
  • FINISHED (success / failure with the controller's reason)
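
For reference, a minimal sketch of such a classifier over Kueue's standard Workload condition types (`QuotaReserved`, `Admitted`, `Finished`); the `Condition` shape and the precedence here are illustrative, not this PR's actual code:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    type: str
    status: str  # "True" / "False" / "Unknown"
    reason: str = ""
    message: str = ""

def classify(conditions: list[Condition]) -> str:
    """Map Kueue Workload conditions onto the four user-facing states.

    QuotaReserved / Admitted / Finished are standard Kueue condition
    types; the precedence below is an assumption.
    """
    by_type = {c.type: c for c in conditions}
    finished = by_type.get("Finished")
    if finished and finished.status == "True":
        return f"FINISHED ({finished.reason or 'no reason given'})"
    admitted = by_type.get("Admitted")
    if admitted and admitted.status == "True":
        return "RUNNING"
    quota = by_type.get("QuotaReserved")
    if quota and quota.status == "True":
        # Quota is held but admission has not completed: surface the error.
        return f"STUCK: {quota.message or 'admission check failed'}"
    return "QUEUED"
```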

Implementation notes

Single new file `src/xpk/commands/workload_status.py` plus the parser hookup in `parser/workload.py`. Reuses the team-routing infrastructure that #1180 introduces:

  • `resolve_team_for_args(args)` to find namespace + ClusterQueue
  • `kubectl_common.parse_kubernetes_status` for typed condition handling
  • `run_command_for_value` for kubectl invocations (consistent with the rest of xpk)
  • The `team-quota-config` ConfigMap as the source of truth for the team list
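
As a rough, self-contained illustration of the last point, fetching the ConfigMap's data block could look like this (xpk routes the call through `run_command_for_value`; plain `subprocess` is used here only to keep the sketch standalone, and the `kueue-system` namespace is taken from the commit history below):

```python
import json
import subprocess

def fetch_team_quota_config() -> dict | None:
    """Read the team-quota-config ConfigMap's data block (sketch)."""
    proc = subprocess.run(
        ["kubectl", "get", "configmap", "team-quota-config",
         "-n", "kueue-system", "-o", "json"],
        capture_output=True, text=True, check=False,
    )
    if proc.returncode != 0:
        return None
    data = json.loads(proc.stdout).get("data", {})
    return data if isinstance(data, dict) else None
```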

All round-2 cleanup feedback that applied to this file is already incorporated (typed helpers, no `subprocess.run`, no lazy imports, no pylint disables).

Reviewing this PR

Please wait for #1180 to land before reviewing. This branch is currently based on #1180's tip, so the diff against `main` includes #1180's work. Once #1180 merges, I'll rebase onto `main` and the diff will shrink to just the `workload_status` delta (one new file ~370 lines + the parser hookup ~40 lines + a couple of test additions).

Marking as draft until then.

ultrons added 10 commits April 28, 2026 17:20

Adds three flags to `xpk workload create`:
  --team                       routes the job to <namespace>/<lq>/<priorityClass>
  --value-class                job-class label for audit and priority ordering
  --declared-duration-minutes  honest p90 estimate, used by a time-limit controller

When --team is set:
  - Pod template is placed in the team's namespace (poc-<team>) and labeled with
    the team's LocalQueue and PriorityClass.
  - declared-duration-minutes is propagated into the pod template's metadata
    labels (Kueue does not copy arbitrary JobSet metadata to the Workload, but
    it does copy pod-template metadata into spec.podSets[*].template).

When --team is unset, behavior is identical to upstream (default namespace,
multislice-queue).

Team routing initially uses a hardcoded POC_TEAM_CONFIG dict in
core/kueue_manager.py; a follow-up commit replaces this with discovery from a
cluster-side ConfigMap.
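
(A hedged sketch of the routing labels described above. The two kueue.x-k8s.io keys are Kueue's standard label names; the duration label key is hypothetical:)

```python
def build_team_labels(local_queue: str, priority_class: str,
                      declared_duration_minutes: int | None) -> dict[str, str]:
    """Routing labels for the team's pod template (illustrative shape)."""
    labels = {
        "kueue.x-k8s.io/queue-name": local_queue,        # team's LocalQueue
        "kueue.x-k8s.io/priority-class": priority_class,
    }
    if declared_duration_minutes is not None:
        # Pod-template labels are copied by Kueue into the Workload's
        # podSets, which is why the duration rides here, not on the JobSet.
        labels["xpk.google.com/declared-duration-minutes"] = str(
            declared_duration_minutes)
    return labels
```
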
Adds `xpk workload status --team=<t> --workload=<name>` (or omit --workload to
list all). The command tells the user, with one diagnostic, whether their
workload is:
  - QUEUED (and what's ahead of them)
  - STUCK (quota reserved but admission failing)
  - RUNNING
  - FINISHED

For STUCK workloads the command parses recent Warning events and surfaces the
most common causes (e.g. slice name length over the super-slice 49-char
limit), with a copy-pasteable shorter-name suggestion.

Looks up workloads by the xpk.google.com/workload label so the user can pass
their full display name regardless of any internal name shortening.

Cluster credentials are fetched at runtime; the user no longer needs
compute/zone in their gcloud config for this command to work.
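
(The label-based lookup, sketched with plain kubectl/subprocess for self-containment; xpk's real code uses its own command helpers:)

```python
import json
import subprocess

def find_jobset_by_display_name(namespace: str, display_name: str) -> dict | None:
    """Resolve a user-facing workload name to its JobSet via the label."""
    proc = subprocess.run(
        ["kubectl", "get", "jobsets", "-n", namespace,
         "-l", f"xpk.google.com/workload={display_name}", "-o", "json"],
        capture_output=True, text=True, check=False,
    )
    if proc.returncode != 0:
        return None
    items = json.loads(proc.stdout).get("items", [])
    return items[0] if items else None
```
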
Three related changes that move team configuration from hardcoded dicts to
runtime discovery, eliminating xpk releases as the gating step for cluster
admin changes:

1. Decouple xpk --workload from the K8s JobSet name. xpk now derives a short,
   deterministic K8s name (`{ldap_prefix}-{hex4}`) that fits the super-slice
   admission controller's 49-char limit, while keeping the user-facing display
   name on the JobSet via the xpk.google.com/workload label. Users pass the
   same display name to `workload status / delete / list` regardless of length
   (see the sketch after this list).

2. Replace POC_TEAM_CONFIG and POC_TEAM_MAX_WORKLOAD_NAME dicts (and the two
   argparse choices=[...] lists) with discovery from a kueue-system/poc-team-config
   ConfigMap. New module core/poc_discovery.py: fetch_poc_config(),
   resolve_team(), max_k8s_workload_name_len(), available_teams(),
   available_value_classes(). When --team is unset there is no cluster call —
   upstream behavior is preserved bit-for-bit. When --team is set with a value
   not on the cluster, xpk now prints the live available list as the error.

3. Add an opportunistic local cache under ~/.xpk/poc-cache/<context>.json so
   neither tab completion (argcomplete completers on --cluster, --team,
   --value-class) nor did-you-mean suggestions (`Did you mean: ml-perf?`) need
   to call the cluster on every keystroke. The cache is refreshed any time a
   live ConfigMap fetch succeeds; if it is missing or stale the worst case is a
   slightly less useful error message — never a wrong-routing decision.
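
(Sketch of the point-1 derivation. The `{ldap_prefix}-{hex4}` shape and the 49-char budget are from the message above; the hash choice and truncation rule are assumptions:)

```python
import hashlib

def derive_k8s_workload_name(display_name: str, ldap_prefix: str,
                             max_len: int = 49) -> str:
    """Deterministic short K8s name of the form {ldap_prefix}-{hex4}."""
    hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
    return f"{ldap_prefix}-{hex4}"[:max_len]
```
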
When a PoC team is in use, args.workload is the user-friendly display
name (e.g. mlperf-conv-13) and the actual K8s JobSet name is a derived
short name (e.g. mlperf-d6cc). The "Follow your workload here:" and
"Follow your worker 0, slice 0 logs here:" URLs were hardcoding
namespace="default" and using args.workload for the pod prefix, so
they pointed nowhere — the workload lives in the team namespace and
its pods are named after the JobSet, not args.workload.

This commit:
- Initializes poc_namespace/k8s_name early in workload_create so the
  PoC values are available where the URLs are emitted.
- Substitutes those values into both the dashboard URL and the Cloud
  Logging filter (namespace_name and pod_name prefix).
- Falls back to the upstream behavior (namespace=default, prefix=
  args.workload) when no PoC team is set, so non-PoC users are
  unaffected.
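
(The substituted Cloud Logging filter might take roughly this shape; `resource.labels.namespace_name` / `pod_name` are the standard k8s_container labels and `:` is Cloud Logging's substring operator, but the exact filter xpk emits is not reproduced here:)

```python
def cloud_logging_filter(namespace: str, jobset_name: str) -> str:
    """Filter scoped to the team namespace and JobSet-derived pod prefix."""
    return (
        'resource.type="k8s_container" '
        f'resource.labels.namespace_name="{namespace}" '
        f'resource.labels.pod_name:"{jobset_name}"'
    )

# e.g. cloud_logging_filter("poc-ml-perf", "mlperf-d6cc")
```
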
Renamings:
- core/poc_discovery.py -> core/quota_discovery.py
- _resolve_poc_team -> _resolve_quota_team, _load_poc_cfg -> _load_quota_cfg,
  fetch_poc_config -> fetch_quota_config
- args._poc_cfg -> args._quota_cfg
- _build_poc_labels -> _build_team_labels (and pod-template variant)
- {poc_labels} / {poc_pod_template_labels} YAML keys -> {team_labels} /
  {team_pod_template_labels}
- Local cache dir: ~/.xpk/poc-cache -> ~/.xpk/quota-cache
- All "PoC" references in user-facing strings, docstrings and parser help
  text replaced with neutral team-quota wording.

Structure:
- Replace the ambiguous (str, str, str) tuple return of resolve_team() with
  a TeamRouting dataclass (namespace, local_queue, priority_class).
- Move workload_status() and its helpers from commands/workload.py to a
  dedicated commands/workload_status.py.
- In core/kueue_manager.py, move derive_k8s_workload_name() below the
  module-level constants block so the constants stay together at the top.
- Tightened "team is set" check to a real non-empty-string test (still
  passes existing MagicMock-based tests).

UX:
- Add --no-shorten-jobset-name flag so the JobSet-name shortening is
  opt-out for users whose workload names already fit the super-slice
  admission charLimit budget. Justify the shortening with an inline
  comment that points at quota_discovery.max_k8s_workload_name_len for
  the budget math.
- Only print the "workload -> JobSet name" message when the name was
  actually shortened.

Tests:
- New core/quota_discovery_test.py (17 tests): resolve_team /
  available_teams / available_value_classes / max_k8s_workload_name_len /
  suggest / fetch_quota_config (mocked).
- Append derive_k8s_workload_name tests to core/kueue_manager_test.py
  (5 tests): determinism, length cap, ldap-prefix shape, distinct inputs,
  short-input behavior.
- All 502 tests pass (+23 new).
- Goldens regenerated for the renamed YAML template variables.

- workload.py: drop unused run_command_for_value import; add explicit
  protected-access pylint pragmas for args._quota_cfg (using args as a
  namespace bag is intentional); hoist the TeamRouting import to module
  level instead of importing it inside _resolve_quota_team.
- workload_status.py: add check=False to subprocess.run; drop unused
  re import + unused regex match; the deferred .workload import (kept
  to avoid circular import) gets an explicit import-outside-toplevel
  pragma.
- local_cache.py: add check=False to both subprocess.run calls.
- kueue_manager_test.py: hoist derive_k8s_workload_name to the grouped
  imports block and use rsplit(maxsplit=1) for the suffix-length checks.
- quota_discovery_test.py: replace `== []` empty-list comparisons with
  `not <list>` to match pylint's use-implicit-booleaness preference.

Local pylint on the touched files now scores 10.00/10. All 502 tests
still pass. Goldens unchanged.

- Narrow json.loads return to dict in local_cache.read and
  quota_discovery.fetch_quota_config to satisfy strict no-any-return.
- Tighten _load_quota_cfg cache-hit type check to isinstance(cached, dict).
- Suppress 'Action has no attribute completer' on argcomplete .completer
  assignments in parser/workload.py (existing pattern from common.py).
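
(The narrowing pattern, in miniature; names here are illustrative:)

```python
import json
from pathlib import Path

def read_cache(path: Path) -> dict | None:
    """json.loads returns Any; the isinstance check gives mypy a dict."""
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    return data if isinstance(data, dict) else None
```
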
Pyink split the prior single-line assignment across three lines, which
moved the '# type: ignore[attr-defined]' off the actual attribute access
and left mypy flagging both the original error and the now-misplaced
ignore comment. Hoist the Action into a local so the assignment fits on
one line.

Bucket 1 — small, mechanical fixes:
- kueue_manager_test.py: drop the decorative banner comments.
- workload_status.py: type-annotate every inner helper.
- workload_status.py: replace direct subprocess.run for the gcloud-project
  lookup with run_command_for_value, matching the rest of the file.
- workload_status.py: condition handling now goes through
  kubectl_common.parse_kubernetes_status (typed access for
  type/status/lastTransitionTime/message/reason).
- parser/workload.py: drop every '# type: ignore[attr-defined]' on
  argcomplete .completer assignments. New _set_completer helper uses
  setattr so the assignment is explicit and the type-checker is not
  bypassed.
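
(Roughly the shape of that helper; a sketch, not the PR's exact code:)

```python
import argparse
from typing import Any, Callable

def _set_completer(action: argparse.Action,
                   completer: Callable[..., Any]) -> None:
    # argcomplete reads a dynamically attached `.completer` attribute that
    # argparse.Action never declares; setattr keeps that dynamic assignment
    # explicit instead of hiding it behind a type-ignore comment.
    setattr(action, "completer", completer)
```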

Bucket 2 — refactor to eliminate the workload_status -> workload lazy
import:
- Move load_quota_cfg + resolve_team_for_args from commands/workload.py
  to core/quota_discovery.py (their natural home). resolve_team_for_args
  now returns Optional[TeamRouting]; callers handle the unset case
  according to their own semantics.
- workload.py and workload_status.py both import from quota_discovery at
  module top — no more 'pylint: disable=import-outside-toplevel'.
- Rename args._quota_cfg -> args.quota_cfg (drops the protected-access
  pylint disable too).

Supporting change:
- kubectl_common.KubernetesCondition gains a 'reason' field
  (parse_kubernetes_status populates it). Lets workload_status use the
  shared helper end-to-end without falling back to dict access.

Verified locally: 502 tests pass, pyink/mypy/pylint clean (mypy 0
errors; pylint score unchanged from main), goldens unchanged.