xpk: add xpk workload status for team-routed workload diagnostics#1181
Draft
ultrons wants to merge 10 commits into
Conversation
Adds three flags to `xpk workload create`:
- `--team` routes the job to `<namespace>/<lq>/<priorityClass>`.
- `--value-class` sets a job-class label for audit and priority ordering.
- `--declared-duration-minutes` is an honest p90 duration estimate, consumed by a time-limit controller.
When --team is set:
- Pod template is placed in the team's namespace (poc-<team>) and labeled with
the team's LocalQueue and PriorityClass.
- declared-duration-minutes is propagated into the pod template's metadata
labels (Kueue does not copy arbitrary JobSet metadata to the Workload, but
it does copy pod-template metadata into spec.podSets[*].template).
When --team is unset, behavior is identical to upstream (default namespace,
multislice-queue).
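The routing rule above can be sketched as a small lookup with an upstream fallback. This is an illustrative assumption, not the actual xpk code: the `POC_TEAM_CONFIG` name comes from the PR, but the entries, the `routing_for` helper, and the empty-string priority-class default are invented here.

```python
from typing import Optional

# Hypothetical contents; the real dict lives in core/kueue_manager.py
# (and is later replaced by ConfigMap discovery).
POC_TEAM_CONFIG = {
    # team -> (namespace, LocalQueue, PriorityClass)
    "ml-perf": ("poc-ml-perf", "ml-perf-queue", "ml-perf-priority"),
}


def routing_for(team: Optional[str]) -> tuple[str, str, str]:
  """Return (namespace, local_queue, priority_class) for a workload."""
  if not team:
    # --team unset: identical to upstream (default namespace,
    # multislice-queue, no PriorityClass override).
    return ("default", "multislice-queue", "")
  return POC_TEAM_CONFIG[team]
```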
Team routing initially uses a hardcoded POC_TEAM_CONFIG dict in
core/kueue_manager.py; a follow-up commit replaces this with discovery from a
cluster-side ConfigMap.
Adds `xpk workload status --team=<t> --workload=<name>` (or omit --workload to list all). The command tells the user, with one diagnostic, whether their workload is: - QUEUED (and what's ahead of them) - STUCK (quota reserved but admission failing) - RUNNING - FINISHED For STUCK workloads the command parses recent Warning events and surfaces the most common causes (e.g. slice name length over the super-slice 49-char limit), with a copy-pasteable shorter-name suggestion. Looks up workloads by the xpk.google.com/workload label so the user can pass their full display name regardless of any internal name shortening. Cluster credentials are fetched at runtime; the user no longer needs compute/zone in their gcloud config for this command to work.
Three related changes that move team configuration from hardcoded dicts to
runtime discovery, eliminating xpk releases as the gating step for cluster
admin changes:
1. Decouple xpk --workload from the K8s JobSet name. xpk now derives a short,
deterministic K8s name (`{ldap_prefix}-{hex4}`) that fits the super-slice
admission controller's 49-char limit, while keeping the user-facing display
name on the JobSet via the xpk.google.com/workload label. Users pass the
same display name to `workload status / delete / list` regardless of length.
2. Replace POC_TEAM_CONFIG and POC_TEAM_MAX_WORKLOAD_NAME dicts (and the two
argparse choices=[...] lists) with discovery from a kueue-system/poc-team-config
ConfigMap. New module core/poc_discovery.py: fetch_poc_config(),
resolve_team(), max_k8s_workload_name_len(), available_teams(),
available_value_classes(). When --team is unset there is no cluster call —
upstream behavior is preserved bit-for-bit. When --team is set with a value
not on the cluster, xpk now prints the live available list as the error.
3. Add an opportunistic local cache under ~/.xpk/poc-cache/<context>.json so
neither tab completion (argcomplete completers on --cluster, --team,
--value-class) nor did-you-mean suggestions (`Did you mean: ml-perf?`) need
to call the cluster on every keystroke. The cache is refreshed any time a
live ConfigMap fetch succeeds; if it is missing or stale the worst case is a
slightly less useful error message — never a wrong-routing decision.
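The short-name derivation in point 1 can be sketched as follows. Only the `{ldap_prefix}-{hex4}` shape and the 49-char cap come from this PR; the choice of SHA-256 and the prefix-truncation rule are assumptions for illustration:

```python
import hashlib

MAX_K8S_NAME_LEN = 49  # super-slice admission controller limit


def derive_k8s_workload_name(display_name: str, ldap: str) -> str:
  """Deterministic short K8s name; display name stays on a label."""
  # Same display name always yields the same 4-hex-digit suffix.
  hex4 = hashlib.sha256(display_name.encode()).hexdigest()[:4]
  # Truncate the ldap prefix so "prefix-hex4" fits the budget.
  prefix_budget = MAX_K8S_NAME_LEN - 1 - len(hex4)
  return f"{ldap[:prefix_budget]}-{hex4}"
```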
When a PoC team is in use, `args.workload` is the user-friendly display name (e.g. `mlperf-conv-13`) and the actual K8s JobSet name is a derived short name (e.g. `mlperf-d6cc`). The "Follow your workload here:" and "Follow your worker 0, slice 0 logs here:" URLs were hardcoding `namespace="default"` and using `args.workload` for the pod prefix, so they pointed nowhere: the workload lives in the team namespace and its pods are named after the JobSet, not `args.workload`. This commit:
- Initializes `poc_namespace`/`k8s_name` early in `workload_create` so the PoC values are available where the URLs are emitted.
- Substitutes those values into both the dashboard URL and the Cloud Logging filter (`namespace_name` and `pod_name` prefix).
- Falls back to the upstream behavior (namespace `default`, prefix `args.workload`) when no PoC team is set, so non-PoC users are unaffected.
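The namespace/prefix selection with its upstream fallback reduces to a small helper. This is a sketch of the decision only; the function name and the None-based signaling are assumptions, not xpk's actual code:

```python
from typing import Optional, Tuple


def url_params(
    poc_namespace: Optional[str],
    k8s_name: Optional[str],
    workload: str,
) -> Tuple[str, str]:
  """Pick (namespace, pod-name prefix) for the dashboard/logging URLs."""
  if poc_namespace and k8s_name:
    # PoC team set: pods live in the team namespace and are named
    # after the derived JobSet name, not args.workload.
    return poc_namespace, k8s_name
  # No PoC team: upstream behavior, non-PoC users unaffected.
  return "default", workload
```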
Renamings:
- core/poc_discovery.py -> core/quota_discovery.py
- _resolve_poc_team -> _resolve_quota_team, _load_poc_cfg -> _load_quota_cfg,
fetch_poc_config -> fetch_quota_config
- args._poc_cfg -> args._quota_cfg
- _build_poc_labels -> _build_team_labels (and pod-template variant)
- {poc_labels} / {poc_pod_template_labels} YAML keys -> {team_labels} /
{team_pod_template_labels}
- Local cache dir: ~/.xpk/poc-cache -> ~/.xpk/quota-cache
- All "PoC" references in user-facing strings, docstrings and parser help
text replaced with neutral team-quota wording.
Structure:
- Replace the ambiguous (str, str, str) tuple return of resolve_team() with
  a @dataclass TeamRouting{namespace, local_queue, priority_class}.
- Move workload_status() and its helpers from commands/workload.py to a
dedicated commands/workload_status.py.
- In core/kueue_manager.py, move derive_k8s_workload_name() below the
module-level constants block so the constants stay together at the top.
- Tighten the "team is set" check to a real non-empty-string test (still
  passes existing MagicMock-based tests).
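The TeamRouting dataclass from the first bullet would look roughly like this; the field names come from the description above, while `frozen=True` is an added assumption (routing decisions read as value objects):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamRouting:
  """Replaces the ambiguous (str, str, str) tuple from resolve_team()."""
  namespace: str
  local_queue: str
  priority_class: str
```

Named fields make call sites self-documenting: `routing.local_queue` cannot be confused with `routing.priority_class`, whereas tuple indices could be silently swapped.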
UX:
- Add --no-shorten-jobset-name flag so the JobSet-name shortening is
opt-out for users whose workload names already fit the super-slice
admission charLimit budget. Justify the shortening with an inline
comment that points at quota_discovery.max_k8s_workload_name_len for
the budget math.
- Only print the "workload -> JobSet name" message when the name was
actually shortened.
Tests:
- New core/quota_discovery_test.py (17 tests): resolve_team /
available_teams / available_value_classes / max_k8s_workload_name_len /
suggest / fetch_quota_config (mocked).
- Append derive_k8s_workload_name tests to core/kueue_manager_test.py
(5 tests): determinism, length cap, ldap-prefix shape, distinct inputs,
short-input behavior.
- All 502 tests pass (+23 new).
- Goldens regenerated for the renamed YAML template variables.
- workload.py: drop the unused run_command_for_value import; add explicit protected-access pylint pragmas for args._quota_cfg (using args as a namespace bag is intentional); hoist the TeamRouting import to module level instead of importing it inside _resolve_quota_team.
- workload_status.py: add check=False to subprocess.run; drop the unused re import and unused regex match; give the deferred .workload import (kept to avoid a circular import) an explicit import-outside-toplevel pragma.
- local_cache.py: add check=False to both subprocess.run calls.
- kueue_manager_test.py: hoist derive_k8s_workload_name into the grouped imports block and use rsplit(maxsplit=1) for the suffix-length checks.
- quota_discovery_test.py: replace `== []` empty-list comparisons with `not <list>` to match pylint's use-implicit-booleaness preference.

Local pylint on the touched files now scores 10.00/10. All 502 tests still pass. Goldens unchanged.
- Narrow the json.loads return to dict in local_cache.read and quota_discovery.fetch_quota_config to satisfy strict no-any-return.
- Tighten the _load_quota_cfg cache-hit type check to isinstance(cached, dict).
- Suppress 'Action has no attribute completer' on argcomplete .completer assignments in parser/workload.py (existing pattern from common.py).
Pyink split the prior single-line assignment across three lines, which moved the '# type: ignore[attr-defined]' off the actual attribute access and left mypy flagging both the original error and the now-misplaced ignore comment. Hoist the Action into a local so the assignment fits on one line.
Bucket 1: small, mechanical fixes.
- kueue_manager_test.py: drop the decorative banner comments.
- workload_status.py: type-annotate every inner helper.
- workload_status.py: replace the direct subprocess.run for the gcloud-project lookup with run_command_for_value, matching the rest of the file.
- workload_status.py: condition handling now goes through kubectl_common.parse_kubernetes_status (typed access for type/status/lastTransitionTime/message/reason).
- parser/workload.py: drop every '# type: ignore[attr-defined]' on argcomplete .completer assignments. A new _set_completer helper uses setattr so the assignment is explicit and the type checker is not bypassed.

Bucket 2: refactor to eliminate the workload_status -> workload lazy import.
- Move load_quota_cfg and resolve_team_for_args from commands/workload.py to core/quota_discovery.py (their natural home). resolve_team_for_args now returns Optional[TeamRouting]; callers handle the unset case according to their own semantics.
- workload.py and workload_status.py both import from quota_discovery at module top, so no more 'pylint: disable=import-outside-toplevel'.
- Rename args._quota_cfg -> args.quota_cfg (which drops the protected-access pylint disable too).

Supporting change:
- kubectl_common.KubernetesCondition gains a 'reason' field (populated by parse_kubernetes_status). This lets workload_status use the shared helper end-to-end without falling back to dict access.

Verified locally: 502 tests pass; pyink/mypy/pylint clean (mypy 0 errors, pylint score unchanged from main); goldens unchanged.
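The supporting change could look roughly like the sketch below. Only the class name, the 'reason' field, and the raw-dict keys (type/status/lastTransitionTime/message/reason) are grounded in this PR; the other field names, defaults, and the `parse_condition` helper are assumptions modeled on the standard Kubernetes condition schema:

```python
from dataclasses import dataclass


@dataclass
class KubernetesCondition:
  type: str
  status: str
  last_transition_time: str
  message: str
  reason: str = ""  # newly added; populated by parse_kubernetes_status


def parse_condition(raw: dict) -> KubernetesCondition:
  """Typed access to one condition from a kubectl -o json status blob."""
  return KubernetesCondition(
      type=raw.get("type", ""),
      status=raw.get("status", ""),
      last_transition_time=raw.get("lastTransitionTime", ""),
      message=raw.get("message", ""),
      reason=raw.get("reason", ""),
  )
```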
Stacked on #1180. Per @jamOne-'s feedback on #1180 (thread), splitting the focused-workload-diagnosis surface out of the team-routing PR so each can be reviewed in isolation.
What this adds
A new `xpk workload status` subcommand answering "why is my job stuck?" with a 3-line plain-English diagnosis.

This differs from `xpk inspector`, which is an everything-and-the-kitchen-sink debug dump for SREs investigating cluster-wide bugs; `workload status` is the focused, user-facing answer.

States the diagnosis recognises: QUEUED, STUCK, RUNNING, FINISHED.
Implementation notes
Single new file `src/xpk/commands/workload_status.py` plus the parser hookup in `parser/workload.py`. Reuses the team-routing infrastructure that #1180 introduces:
All round-2 cleanup feedback that applied to this file is already incorporated (typed helpers, no `subprocess.run`, no lazy imports, no pylint disables).
Reviewing this PR
Please wait for #1180 to land before reviewing. This branch is currently based on #1180's tip, so the diff against `main` includes #1180's work. Once #1180 merges, I'll rebase onto `main` and the diff will shrink to just the `workload_status` delta (one new file ~370 lines + the parser hookup ~40 lines + a couple of test additions).
Marking as draft until then.