xpk: add xpk workload status for team-routed workload diagnostics #1181

Draft

ultrons wants to merge 10 commits into AI-Hypercomputer:main from ultrons:workload-status-pr

Conversation

ultrons commented Apr 30, 2026

Stacked on #1180. Per @jamOne-'s feedback on #1180 (thread), splitting the focused-workload-diagnosis surface out of the team-routing PR so each can be reviewed in isolation.

What this adds

A new `xpk workload status` subcommand that answers "why is my job stuck?" with a short, plain-English diagnosis:

$ xpk workload status --cluster=foo --team=ml-perf --workload=conv-run-42
Workload : conv-run-42  ->  jobset-conv-run-42-a3f1
Age      : 12m
Status   : QUEUED — waiting for quota
Position : 3rd in line  (2 workload(s) ahead: foo-bench-01, bar-eval-09)
Team quota (poc-ml-perf):
  Quota   : 4096 chips nominal  +0 borrow  = 4096 max
  Running : 4096 chips (2 workload(s) admitted)
  Queued  : 3 workload(s) waiting for quota
Diagnosis: Things look normal — waiting behind other workloads.

This differs from `xpk inspector`: `inspector` is the kitchen-sink debug dump for SREs investigating cluster-wide bugs, while `workload status` is the focused, user-facing answer.

States the diagnosis recognises (a classification sketch follows the list):

  • RUNNING (admitted)
  • QUEUED (with queue position + workloads ahead)
  • STUCK (quota reserved but admission check failed; surfaces the underlying error and a specific fix when the failure is the well-known "slice name > 49 chars" one)
  • FINISHED (success / failure with the controller's reason)
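
For reference, a minimal sketch of such a classifier over Kueue's standard Workload condition types (`QuotaReserved`, `Admitted`, `Finished`); the `Condition` shape and the precedence here are illustrative, not this PR's actual code:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    type: str
    status: str  # "True" / "False" / "Unknown"
    reason: str = ""
    message: str = ""

def classify(conditions: list[Condition]) -> str:
    """Map Kueue Workload conditions onto the four user-facing states.

    QuotaReserved / Admitted / Finished are standard Kueue condition
    types; the precedence below is an assumption.
    """
    by_type = {c.type: c for c in conditions}
    finished = by_type.get("Finished")
    if finished and finished.status == "True":
        return f"FINISHED ({finished.reason or 'no reason given'})"
    admitted = by_type.get("Admitted")
    if admitted and admitted.status == "True":
        return "RUNNING"
    quota = by_type.get("QuotaReserved")
    if quota and quota.status == "True":
        # Quota is held but admission has not completed: surface the error.
        return f"STUCK: {quota.message or 'admission check failed'}"
    return "QUEUED"
```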

Implementation notes

Single new file `src/xpk/commands/workload_status.py` plus the parser hookup in `parser/workload.py`. Reuses the team-routing infrastructure that #1180 introduces:

  • `resolve_team_for_args(args)` to find namespace + ClusterQueue
  • `kubectl_common.parse_kubernetes_status` for typed condition handling
  • `run_command_for_value` for kubectl invocations (consistent with the rest of xpk)
  • The `team-quota-config` ConfigMap as the source of truth for the team list
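
As a rough, self-contained illustration of the last point, fetching the ConfigMap's data block could look like this (xpk routes the call through `run_command_for_value`; plain `subprocess` is used here only to keep the sketch standalone, and the `kueue-system` namespace is taken from the commit history below):

```python
import json
import subprocess

def fetch_team_quota_config() -> dict | None:
    """Read the team-quota-config ConfigMap's data block (sketch)."""
    proc = subprocess.run(
        ["kubectl", "get", "configmap", "team-quota-config",
         "-n", "kueue-system", "-o", "json"],
        capture_output=True, text=True, check=False,
    )
    if proc.returncode != 0:
        return None
    data = json.loads(proc.stdout).get("data", {})
    return data if isinstance(data, dict) else None
```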

All round-2 cleanup feedback that applied to this file is already incorporated (typed helpers, no `subprocess.run`, no lazy imports, no pylint disables).

Reviewing this PR

Please wait for #1180 to land before reviewing. This branch is currently based on #1180's tip, so the diff against `main` includes #1180's work. Once #1180 merges, I'll rebase onto `main` and the diff will shrink to just the `workload_status` delta (one new file ~370 lines + the parser hookup ~40 lines + a couple of test additions).

Marking as draft until then.

ultrons added 10 commits April 28, 2026 17:20

Adds three flags to `xpk workload create`:
  --team                       routes the job to <namespace>/<lq>/<priorityClass>
  --value-class                job-class label for audit and priority ordering
  --declared-duration-minutes  honest p90 estimate, used by a time-limit controller

When --team is set:
  - Pod template is placed in the team's namespace (poc-<team>) and labeled with
    the team's LocalQueue and PriorityClass.
  - declared-duration-minutes is propagated into the pod template's metadata
    labels (Kueue does not copy arbitrary JobSet metadata to the Workload, but
    it does copy pod-template metadata into spec.podSets[*].template).

When --team is unset, behavior is identical to upstream (default namespace,
multislice-queue).

Team routing initially uses a hardcoded POC_TEAM_CONFIG dict in
core/kueue_manager.py; a follow-up commit replaces this with discovery from a
cluster-side ConfigMap.
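
(A hedged sketch of the routing labels described above. The two kueue.x-k8s.io keys are Kueue's standard label names; the duration label key is hypothetical:)

```python
def build_team_labels(local_queue: str, priority_class: str,
                      declared_duration_minutes: int | None) -> dict[str, str]:
    """Routing labels for the team's pod template (illustrative shape)."""
    labels = {
        "kueue.x-k8s.io/queue-name": local_queue,        # team's LocalQueue
        "kueue.x-k8s.io/priority-class": priority_class,
    }
    if declared_duration_minutes is not None:
        # Pod-template labels are copied by Kueue into the Workload's
        # podSets, which is why the duration rides here, not on the JobSet.
        labels["xpk.google.com/declared-duration-minutes"] = str(
            declared_duration_minutes)
    return labels
```
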
Adds `xpk workload status --team=<t> --workload=<name>` (or omit --workload to
list all). The command tells the user, with one diagnostic, whether their
workload is:
  - QUEUED (and what's ahead of them)
  - STUCK (quota reserved but admission failing)
  - RUNNING
  - FINISHED

For STUCK workloads the command parses recent Warning events and surfaces the
most common causes (e.g. slice name length over the super-slice 49-char
limit), with a copy-pasteable shorter-name suggestion.

Looks up workloads by the xpk.google.com/workload label so the user can pass
their full display name regardless of any internal name shortening.

Cluster credentials are fetched at runtime; the user no longer needs
compute/zone in their gcloud config for this command to work.
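
(The label-based lookup, sketched with plain kubectl/subprocess for self-containment; xpk's real code uses its own command helpers:)

```python
import json
import subprocess

def find_jobset_by_display_name(namespace: str, display_name: str) -> dict | None:
    """Resolve a user-facing workload name to its JobSet via the label."""
    proc = subprocess.run(
        ["kubectl", "get", "jobsets", "-n", namespace,
         "-l", f"xpk.google.com/workload={display_name}", "-o", "json"],
        capture_output=True, text=True, check=False,
    )
    if proc.returncode != 0:
        return None
    items = json.loads(proc.stdout).get("items", [])
    return items[0] if items else None
```
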
Three related changes that move team configuration from hardcoded dicts to
runtime discovery, eliminating xpk releases as the gating step for cluster
admin changes:

1. Decouple xpk --workload from the K8s JobSet name. xpk now derives a short,
   deterministic K8s name (`{ldap_prefix}-{hex4}`) that fits the super-slice
   admission controller's 49-char limit, while keeping the user-facing display
   name on the JobSet via the xpk.google.com/workload label. Users pass the
   same display name to `workload status / delete / list` regardless of length
   (see the sketch after this list).

2. Replace POC_TEAM_CONFIG and POC_TEAM_MAX_WORKLOAD_NAME dicts (and the two
   argparse choices=[...] lists) with discovery from a kueue-system/poc-team-config
   ConfigMap. New module core/poc_discovery.py: fetch_poc_config(),
   resolve_team(), max_k8s_workload_name_len(), available_teams(),
   available_value_classes(). When --team is unset there is no cluster call —
   upstream behavior is preserved bit-for-bit. When --team is set with a value
   not on the cluster, xpk now prints the live available list as the error.

3. Add an opportunistic local cache under ~/.xpk/poc-cache/<context>.json so
   neither tab completion (argcomplete completers on --cluster, --team,
   --value-class) nor did-you-mean suggestions (`Did you mean: ml-perf?`) need
   to call the cluster on every keystroke. The cache is refreshed any time a
   live ConfigMap fetch succeeds; if it is missing or stale the worst case is a
   slightly less useful error message — never a wrong-routing decision.
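
(Sketch of the point-1 derivation. The `{ldap_prefix}-{hex4}` shape and the 49-char budget are from the message above; the hash choice and truncation rule are assumptions:)

```python
import hashlib

def derive_k8s_workload_name(display_name: str, ldap_prefix: str,
                             max_len: int = 49) -> str:
    """Deterministic short K8s name of the form {ldap_prefix}-{hex4}."""
    hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
    return f"{ldap_prefix}-{hex4}"[:max_len]
```
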
When a PoC team is in use, args.workload is the user-friendly display
name (e.g. mlperf-conv-13) and the actual K8s JobSet name is a derived
short name (e.g. mlperf-d6cc). The "Follow your workload here:" and
"Follow your worker 0, slice 0 logs here:" URLs were hardcoding
namespace="default" and using args.workload for the pod prefix, so
they pointed nowhere — the workload lives in the team namespace and
its pods are named after the JobSet, not args.workload.

This commit:
- Initializes poc_namespace/k8s_name early in workload_create so the
  PoC values are available where the URLs are emitted.
- Substitutes those values into both the dashboard URL and the Cloud
  Logging filter (namespace_name and pod_name prefix).
- Falls back to the upstream behavior (namespace=default, prefix=
  args.workload) when no PoC team is set, so non-PoC users are
  unaffected.
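
(The substituted Cloud Logging filter might take roughly this shape; `resource.labels.namespace_name` / `pod_name` are the standard k8s_container labels and `:` is Cloud Logging's substring operator, but the exact filter xpk emits is not reproduced here:)

```python
def cloud_logging_filter(namespace: str, jobset_name: str) -> str:
    """Filter scoped to the team namespace and JobSet-derived pod prefix."""
    return (
        'resource.type="k8s_container" '
        f'resource.labels.namespace_name="{namespace}" '
        f'resource.labels.pod_name:"{jobset_name}"'
    )

# e.g. cloud_logging_filter("poc-ml-perf", "mlperf-d6cc")
```
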
Renamings:
- core/poc_discovery.py -> core/quota_discovery.py
- _resolve_poc_team -> _resolve_quota_team, _load_poc_cfg -> _load_quota_cfg,
  fetch_poc_config -> fetch_quota_config
- args._poc_cfg -> args._quota_cfg
- _build_poc_labels -> _build_team_labels (and pod-template variant)
- {poc_labels} / {poc_pod_template_labels} YAML keys -> {team_labels} /
  {team_pod_template_labels}
- Local cache dir: ~/.xpk/poc-cache -> ~/.xpk/quota-cache
- All "PoC" references in user-facing strings, docstrings and parser help
  text replaced with neutral team-quota wording.

Structure:
- Replace the ambiguous (str, str, str) tuple return of resolve_team() with
  a TeamRouting dataclass (namespace, local_queue, priority_class).
- Move workload_status() and its helpers from commands/workload.py to a
  dedicated commands/workload_status.py.
- In core/kueue_manager.py, move derive_k8s_workload_name() below the
  module-level constants block so the constants stay together at the top.
- Tightened "team is set" check to a real non-empty-string test (still
  passes existing MagicMock-based tests).

UX:
- Add --no-shorten-jobset-name flag so the JobSet-name shortening is
  opt-out for users whose workload names already fit the super-slice
  admission charLimit budget. Justify the shortening with an inline
  comment that points at quota_discovery.max_k8s_workload_name_len for
  the budget math.
- Only print the "workload -> JobSet name" message when the name was
  actually shortened.

Tests:
- New core/quota_discovery_test.py (17 tests): resolve_team /
  available_teams / available_value_classes / max_k8s_workload_name_len /
  suggest / fetch_quota_config (mocked).
- Append derive_k8s_workload_name tests to core/kueue_manager_test.py
  (5 tests): determinism, length cap, ldap-prefix shape, distinct inputs,
  short-input behavior.
- All 502 tests pass (+23 new).
- Goldens regenerated for the renamed YAML template variables.

- workload.py: drop unused run_command_for_value import; add explicit
  protected-access pylint pragmas for args._quota_cfg (using args as a
  namespace bag is intentional); hoist the TeamRouting import to module
  level instead of importing it inside _resolve_quota_team.
- workload_status.py: add check=False to subprocess.run; drop unused
  re import + unused regex match; the deferred .workload import (kept
  to avoid circular import) gets an explicit import-outside-toplevel
  pragma.
- local_cache.py: add check=False to both subprocess.run calls.
- kueue_manager_test.py: hoist derive_k8s_workload_name to the grouped
  imports block and use rsplit(maxsplit=1) for the suffix-length checks.
- quota_discovery_test.py: replace `== []` empty-list comparisons with
  `not <list>` to match pylint's use-implicit-booleaness preference.

Local pylint on the touched files now scores 10.00/10. All 502 tests
still pass. Goldens unchanged.

- Narrow json.loads return to dict in local_cache.read and
  quota_discovery.fetch_quota_config to satisfy strict no-any-return.
- Tighten _load_quota_cfg cache-hit type check to isinstance(cached, dict).
- Suppress 'Action has no attribute completer' on argcomplete .completer
  assignments in parser/workload.py (existing pattern from common.py).
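
(The narrowing pattern, in miniature; names here are illustrative:)

```python
import json
from pathlib import Path

def read_cache(path: Path) -> dict | None:
    """json.loads returns Any; the isinstance check gives mypy a dict."""
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    return data if isinstance(data, dict) else None
```
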
Pyink split the prior single-line assignment across three lines, which
moved the '# type: ignore[attr-defined]' off the actual attribute access
and left mypy flagging both the original error and the now-misplaced
ignore comment. Hoist the Action into a local so the assignment fits on
one line.

Bucket 1 — small, mechanical fixes:
- kueue_manager_test.py: drop the decorative banner comments.
- workload_status.py: type-annotate every inner helper.
- workload_status.py: replace direct subprocess.run for the gcloud-project
  lookup with run_command_for_value, matching the rest of the file.
- workload_status.py: condition handling now goes through
  kubectl_common.parse_kubernetes_status (typed access for
  type/status/lastTransitionTime/message/reason).
- parser/workload.py: drop every '# type: ignore[attr-defined]' on
  argcomplete .completer assignments. New _set_completer helper uses
  setattr so the assignment is explicit and the type-checker is not
  bypassed.
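
(Roughly the shape of that helper; a sketch, not the PR's exact code:)

```python
import argparse
from typing import Any, Callable

def _set_completer(action: argparse.Action,
                   completer: Callable[..., Any]) -> None:
    # argcomplete reads a dynamically attached `.completer` attribute that
    # argparse.Action never declares; setattr keeps that dynamic assignment
    # explicit instead of hiding it behind a type-ignore comment.
    setattr(action, "completer", completer)
```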

Bucket 2 — refactor to eliminate the workload_status -> workload lazy
import:
- Move load_quota_cfg + resolve_team_for_args from commands/workload.py
  to core/quota_discovery.py (their natural home). resolve_team_for_args
  now returns Optional[TeamRouting]; callers handle the unset case
  according to their own semantics.
- workload.py and workload_status.py both import from quota_discovery at
  module top — no more 'pylint: disable=import-outside-toplevel'.
- Rename args._quota_cfg -> args.quota_cfg (drops the protected-access
  pylint disable too).

Supporting change:
- kubectl_common.KubernetesCondition gains a 'reason' field
  (parse_kubernetes_status populates it). Lets workload_status use the
  shared helper end-to-end without falling back to dict access.

Verified locally: 502 tests pass, pyink/mypy/pylint clean (mypy 0
errors; pylint score unchanged from main), goldens unchanged.